Programming environment:
anaconda + python3.7
The complete code and data have been uploaded to GitHub; feel free to fork. GitHub link
Statement: this took effort to create; no copying or reprinting without authorization.
Chinese word segmentation tools
1. Jieba (the main focus): three segmentation modes plus a custom-dictionary feature (see the sketch after this list)
2. SnowNLP
3. THULAC
4. NLPIR: https://github.com/tsroten/pynlpir
5. NLPIR: https://blog.csdn.net/weixin_34613450/article/details/78695166
6. StanfordCoreNLP
7. HanLP (requires Microsoft Visual C++ 14.0; see the installation tutorial)
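As mentioned for item 1, here is a minimal sketch of jieba's custom-dictionary feature; the file name userdict.txt and the sample words are placeholder examples, not files from this project.

import jieba

# A user dictionary is a plain-text file with one entry per line:
# word [frequency] [POS tag]; the file name here is a placeholder.
jieba.load_userdict('userdict.txt')

# Single entries can also be added at runtime.
jieba.add_word('自定義詞典')

print('/ '.join(jieba.cut('結巴分詞支持自定義詞典功能', cut_all=False)))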
English tokenization tools
1. NLTK (a short usage sketch follows the links below):
http://www.nltk.org/index.html
https://github.com/nltk/nltk
http://www.reibang.com/p/9d232e4a3c28
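A minimal NLTK sketch; it assumes the 'punkt' tokenizer models have not been downloaded yet, since nltk.word_tokenize depends on them.

import nltk

# word_tokenize relies on the 'punkt' tokenizer models, so fetch them once.
nltk.download('punkt')

print(nltk.word_tokenize("NLTK splits this sentence into word tokens."))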
Main task:
Using the given Chinese and English text sequences (see Chinese.txt and English.txt), tokenize them with each of the Chinese and English segmentation tools listed above, and briefly compare and analyze the results produced by the different tools.
一智听、在python環(huán)境中安裝好各種工具包:
主要利用到pip命令羽杰,一般都能快速成功安裝渡紫。其中有幾個(gè)需注意的地方:
(1)第一是Stanfordcorenlp的安裝過(guò)程中:
? ? ? ?首先要配置好java環(huán)境,下載安裝1.8版本以上的JDK;配置好Java的環(huán)境變量Path和java_home等考赛,需要注意最好重啟讓系統(tǒng)環(huán)境變量生效惕澎;否則可能遇到如下報(bào)錯(cuò):
? ? ? ?而后需要下載外接文件包,注意python的版本下載對(duì)應(yīng)的包颜骤,而后進(jìn)行解壓唧喉,對(duì)中文進(jìn)行處理時(shí)需要再另外下載一個(gè)對(duì)應(yīng)的Chinese的jar包,放入之前的解壓文件夾中忍抽。
(2) During the installation of spacy you may get an insufficient-permissions error.
Option 1: start the command line in administrator mode.
Option 2: use nlp = spacy.load('en_core_web_sm') instead of the original nlp = spacy.load('en') (see the sketch below).
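A minimal spaCy sketch for option 2, assuming the small English model has already been downloaded (for example with python -m spacy download en_core_web_sm); the sample sentence is a placeholder.

import spacy

# Load the small English model by its full package name instead of the 'en' shortcut.
nlp = spacy.load('en_core_web_sm')
doc = nlp("SpaCy splits this sentence into tokens.")
print([token.text for token in doc])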
(3) Pay attention to text encoding:
Some toolkits expect Unicode input and may otherwise misbehave, so it is safest to read the corpus files with an explicit encoding, as sketched below.
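A minimal sketch, assuming the corpus files are saved as UTF-8; reading them with an explicit encoding guarantees the toolkits receive Unicode text.

# Read the corpus with an explicit encoding so every toolkit sees Unicode text.
with open('Chinese.txt', encoding='utf-8') as f:
    document = f.read()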
2. Write the code and run the tests:
# -*- coding: utf-8 -*-
"""
Created on Tue Mar 18 09:43:47 2019
@author: Mr.relu
"""
import time
import jieba
from snownlp import SnowNLP
import thulac
import pynlpir
from stanfordcorenlp import StanfordCoreNLP
import nltk
import spacy
spacy_nlp = spacy.load('en_core_web_sm')
f = open('Chinese.txt', encoding='utf-8')  # the corpus is assumed to be UTF-8 encoded
document = f.read()
f.close()
print(document)
"""
Test the jieba toolkit
"""
print(">>>>>jieba tokenization start...")
start = time.process_time()
seg_list = jieba.cut(str(document), cut_all=True)
elapsed = (time.process_time() - start)
print("jieba 01 Time used:",elapsed)
print("《《jieba Full Mode》》: \n" + "/ ".join(seg_list)) # 全模式
start = time.process_time()
seg_list = jieba.cut(document, cut_all=False)
elapsed = (time.process_time() - start)
print("jieba 02 Time used:",elapsed)
print("《《jieba Default Mode》》: \n" + "/ ".join(seg_list)) # 精確模式
start = time.process_time()
seg_list = jieba.cut_for_search(document) # search-engine mode
elapsed = (time.process_time() - start)
print("jieba 03 Time used:",elapsed)
print("《《jieba Search Model》》: \n" + "/ ".join(seg_list))
"""
Test the SnowNLP toolkit
"""
print(">>>>>SnowNLP tokenization start...")
start = time.process_time()
s = SnowNLP(document)
result = s.words # [u'這個', u'東西', u'真心',
elapsed = (time.process_time() - start)
print("SnowNLP Time used:",elapsed)
print("《《SnowNLP》》: \n" + "/ ".join(result)) # u'很', u'贊']
#result = s.tags # [(u'這個', u'r'), (u'東西', u'n'),
#print(result) # (u'真心', u'd'), (u'很', u'd'),
# # (u'贊', u'Vg')]
#result = s.sentiments # 0.9769663402895832  probability that the text is positive
#print(result)
#result = s.pinyin # [u'zhe', u'ge', u'dong', u'xi',
#print(result) # u'zhen', u'xin', u'hen', u'zan']
#s = SnowNLP(u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。')
#
#s.han # u'「繁體字」「繁體中文」的叫法在臺灣亦很常見。'
"""
Test the thulac toolkit
"""
print(">>>>>thulac tokenization start...")
start = time.process_time()
thu1 = thulac.thulac(seg_only=True) # segmentation only, no POS tagging
text = thu1.cut(document, text=True) # segment the whole text; text=True returns one space-separated string
elapsed = (time.process_time() - start)
print("thulac Time used:",elapsed)
print("《《thulac》》: \n" + "/ ".join(text))
#thu1 = thulac.thulac(seg_only=True) # segmentation only, no POS tagging
#thu1.cut_f("Chinese.txt", "output.txt") # segment the contents of Chinese.txt and write the result to output.txt
"""
Test the pynlpir toolkit
"""
print(">>>>>pynlpir tokenization start...")
start = time.process_time()
pynlpir.open()
resu = pynlpir.segment(document,pos_tagging=False)
elapsed = (time.process_time() - start)
print("pynlpir Time used:",elapsed)
print("《《pynlpir》》: \n" + "/ ".join(resu))
"""
Segmentation: pynlpir.segment(s, pos_tagging=True, pos_names='parent', pos_english=True)
    s: the sentence to segment
    pos_tagging: whether to perform POS tagging
    pos_names: show the parent POS category, the child category, or all of them
    pos_english: show POS tags in English or in Chinese
Keyword extraction: pynlpir.get_key_words(s, max_words=50, weighted=False)
    s: the sentence
    max_words: maximum number of keywords to return
    weighted: whether to return the keyword weights
"""
"""
Test the StanfordCoreNLP toolkit (Chinese)
"""
print(">>>>>StanfordCoreNLP tokenization start...")
start = time.process_time()
nlp = StanfordCoreNLP(r'D:\anaconda\Lib\stanford-corenlp-full-2018-02-27',lang = 'zh')
outWords = nlp.word_tokenize(document)
elapsed = (time.process_time() - start)
print("StanfordCoreNLP Time used:",elapsed)
print("《《StanfordCoreNLP》》: \n" + "/ ".join(outWords))
#print('Part of Speech:', nlp.pos_tag(document))
#print('Named Entities:', nlp.ner(document))
#print('Constituency Parsing:', nlp.parse(document))
#print('Dependency Parsing:', nlp.dependency_parse(document))
nlp.close() # Do not forget to close! The backend server consumes a lot of memory.
"""
English tokenization with NLTK
"""
f = open('English.txt', encoding='utf-8')  # the English corpus is assumed to be UTF-8 encoded
doc = f.read()
f.close()
print(doc)
print(">>>>>NLTK tokenization start...")
start = time.process_time()
tokens = nltk.word_tokenize(doc)
elapsed = (time.process_time() - start)
print("NLTK Time used:",elapsed)
print("《《NLTK》》: \n" + "/ ".join(tokens))
"""
English tokenization with spaCy
"""
print(">>>>>spacy tokenization start...")
start = time.process_time()
s_doc = spacy_nlp(doc)
elapsed = (time.process_time() - start)
print("spacy Time used:",elapsed)
token_doc = [str(token) for token in s_doc]  # collect the token texts
print("《《Spacy》》: \n" + "/ ".join(token_doc))
"""
English tokenization with StanfordCoreNLP
"""
print(">>>>>StanfordCoreNLP tokenization start...")
start = time.process_time()
nlp2 = StanfordCoreNLP(r'D:\anaconda\Lib\stanford-corenlp-full-2018-02-27')
outWords = nlp2.word_tokenize(doc)
elapsed = (time.process_time() - start)
print("StanfordCoreNLP Time used:",elapsed)
print("《《StanfordCoreNLP>>: \n" + "/ ".join(outWords))
nlp2.close() # Do not forget to close! The backend server consumes a lot of memory.