Project location: node2:/home/disk1/xukaituo/expriments/ngram-2016-11/
Step 1. Convert the encoding from GBK to UTF-8
iconv -f gbk//IGNORE -t utf-8//IGNORE filename > new_format_file
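Where iconv is not available, the same conversion can be sketched in Python. This is only a sketch, not the command the pipeline above ran: the function name `convert_gbk_to_utf8` is made up for illustration, and `errors="ignore"` is assumed to play the role of `//IGNORE` by dropping bytes that cannot be decoded.

```python
import io


def convert_gbk_to_utf8(src_path, dst_path):
    """Re-encode a GBK file as UTF-8, dropping bytes that are not valid GBK."""
    # newline="" leaves the original line endings untouched.
    with io.open(src_path, "r", encoding="gbk", errors="ignore", newline="") as src, \
         io.open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)
```

Calling `convert_gbk_to_utf8("filename", "new_format_file")` would correspond to the iconv line above.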
Step 2. Remove non-Chinese characters
#!/usr/bin/env python
# coding: utf-8
import codecs
import re
import sys


def remove_non_Chinese_word(input_file, output_file):
    # Match every run of characters outside the range U+4E00..U+9FA5.
    re_non_chinese = re.compile(u"[^\u4e00-\u9fa5]+")
    with codecs.open(input_file, 'r', 'utf-8') as inputf:
        with codecs.open(output_file, 'w', 'utf-8') as outputf:
            for line in inputf:
                # The trailing newline is non-Chinese as well, so it is removed
                # here and written back explicitly below.
                new_line = re_non_chinese.sub(u"", line)
                # To separate the characters with spaces instead:
                # new_line = u" ".join(new_line)
                outputf.write(new_line + u'\n')


if __name__ == '__main__':
    if len(sys.argv) < 3:
        # Single-argument print(...) behaves the same under Python 2 and 3.
        print("Usage: python 0-filter_non_chinese.py input-file output-file")
        sys.exit(1)
    remove_non_Chinese_word(sys.argv[1], sys.argv[2])
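To see what the filter keeps, the same pattern can be applied to a single string. A small sketch follows; the helper name `strip_non_chinese` is hypothetical. The range `\u4e00-\u9fa5` covers the commonly used CJK ideographs, so punctuation, digits, Latin letters, and whitespace are all removed, and so is any rarer ideograph outside that range.

```python
# -*- coding: utf-8 -*-
import re

# Same character class as in the script above.
NON_CHINESE = re.compile(u"[^\u4e00-\u9fa5]+")


def strip_non_chinese(text):
    """Return text with every character outside U+4E00..U+9FA5 removed."""
    return NON_CHINESE.sub(u"", text)


print(strip_non_chinese(u"你好, world! 2016年11月"))  # 你好年月
```

Because whole lines of non-Chinese text collapse to empty strings, the output can contain blank lines.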
Step 3. Delete blank lines
sed -i '/^$/d' filename
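When the file should not be edited in place, the same clean-up can be sketched in Python. The function name `drop_empty_lines` is an assumption for illustration. Like the `/^$/` pattern, it removes only completely empty lines; lines that contain spaces or tabs are kept.

```python
def drop_empty_lines(src_path, dst_path):
    """Copy src_path to dst_path, skipping lines that hold nothing but a newline."""
    # Binary mode splits on b"\n" only, which matches how sed sees lines.
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        for line in src:
            if line != b"\n":
                dst.write(line)
```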
Step 4. Word segmentation
Use the LTP word segmentation tool.
[1] GitHub: https://github.com/HIT-SCIR/ltp
[2] Documentation: http://ltp.readthedocs.io/zh_CN/latest/api.html#id2
[3] Models: https://pan.baidu.com/share/link?shareid=1988562907&uk=2738088569
Part of the bash script:
cd /home/disk1/xukaituo/expriments/ngram-2016-11/utils
CWSTOOL=/home/disk1/xukaituo/projects/Chinese-word-segmentation
1-Chinese-word-segmentor/cws ${CWSTOOL}/ltp_data/cws.model $2 $3
The segmentation program that calls the LTP API:
// cws.cc
// Copyright 2016 ASLP(Author: Kaituo Xu)
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include "segment_dll.h"

int main(int argc, char *argv[])
{
    try {
        if (argc < 4) {
            std::cerr << "cws [model path] [input file path] [output file path]" << std::endl;
            return 1;
        }
        void *engine = segmentor_create_segmentor(argv[1]);
        std::ifstream input(argv[2]);
        std::ofstream output(argv[3], std::ofstream::app);
        if (!engine || !input || !output) {
            return -1;
        }
        std::string line;
        while (getline(input, line)) {
            std::vector<std::string> words;
            int len = segmentor_segment(engine, line, words);
            for (int i = 0; i < len; ++i) {
                output << words[i] << " ";
            }
            output << std::endl;
        }
        segmentor_release_segmentor(engine);
        return 0;
    } catch(const std::exception &e) {
        std::cerr << e.what();
        return -1;
    }
}
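For batch runs it can be convenient to drive the `cws` binary from Python. The sketch below assumes only the command-line interface printed by the program above (`cws [model path] [input file path] [output file path]`); the helper names `build_cws_command` and `run_cws` are placeholders, not part of LTP. Because `cws` opens its output file in append mode, the wrapper can delete an existing output file first so that a rerun does not duplicate lines.

```python
import os
import subprocess


def build_cws_command(cws_bin, model_path, input_path, output_path):
    """Return the argument list in the order cws expects."""
    return [cws_bin, model_path, input_path, output_path]


def run_cws(cws_bin, model_path, input_path, output_path, overwrite=True):
    """Run the segmenter; raises CalledProcessError on a non-zero exit code."""
    # cws appends to its output, so start from a clean file when asked to.
    if overwrite and os.path.exists(output_path):
        os.remove(output_path)
    subprocess.check_call(
        build_cws_command(cws_bin, model_path, input_path, output_path))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.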
Step 5. Compress data that is not needed for now to save disk space
# Compress a file with `gzip`
gzip <filename>
# Decompress
gzip -d <filename>.gz
After compression the original file is removed and, by default, `.gz` is appended to <filename>; after decompression the `.gz` file is removed.
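The same round trip can be done from Python with the standard `gzip` and `shutil` modules. This is a minimal sketch: the function names `gzip_file` and `gunzip_file` are illustrative, and the helpers delete the source file explicitly to mimic what the `gzip` command does.

```python
import gzip
import os
import shutil


def gzip_file(path):
    """Compress path to path + '.gz' and remove the original."""
    gz_path = path + ".gz"
    with open(path, "rb") as src, gzip.open(gz_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(path)
    return gz_path


def gunzip_file(gz_path):
    """Decompress a '.gz' file next to itself and remove the archive."""
    if not gz_path.endswith(".gz"):
        raise ValueError("expected a path ending in .gz: %r" % gz_path)
    path = gz_path[:-len(".gz")]
    with gzip.open(gz_path, "rb") as src, open(path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    os.remove(gz_path)
    return path
```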