This article introduces text analysis with NLTK in Python. Let's first look at the overall text-analysis workflow:
Raw text → tokenization → POS tagging → lemmatization → stop-word removal → special-character removal → case conversion → text analysis
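Note: depending on how NLTK was installed, the data packages used by the steps below (tokenizer model, tagger model, WordNet, stop-word list) may need to be downloaded once first. A one-time setup sketch; the exact set of packages can differ slightly between NLTK versions:
import nltk

nltk.download('punkt')                       # tokenizer models used by word_tokenize
nltk.download('averaged_perceptron_tagger')  # tagger model used by pos_tag
nltk.download('wordnet')                     # dictionary used by WordNetLemmatizer
nltk.download('stopwords')                   # English stop-word list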
1. Tokenization
We use an English description of the DBSCAN clustering algorithm as the example text:
from nltk import word_tokenize
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density "
token_words = word_tokenize(sentence)
print(token_words)
Tokenization output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'samples', 'of', 'high', 'density', 'and', 'expands', 'clusters', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'clusters', 'of', 'similar', 'density']
2. POS Tagging
Why do we need POS tagging? Let's first look at what happens if we skip it and normalize the word forms directly from the tokenization result of step 1.
There are two common approaches to word-form normalization: stemming and lemmatization.
2.1 Stemming
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
words_stemmer = [lancaster_stemmer.stem(token_word) for token_word in token_words]
print(words_stemmer)
Output:
['dbscan', '-', 'density-based', 'spat', 'clust', 'of', 'apply', 'with', 'nois', '.', 'find', 'cor', 'sampl', 'of', 'high', 'dens', 'and', 'expand', 'clust', 'from', 'them', '.', 'good', 'for', 'dat', 'which', 'contain', 'clust', 'of', 'simil', 'dens']
Explanation: stemming cuts each word down to a stem, which often produces tokens that are not real words, e.g. "Spatial" becomes "spat" and "Noise" becomes "nois". That is of little use for ordinary text analysis, although the method can be appropriate for information retrieval.
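If stemming is still desired but Lancaster feels too aggressive, NLTK also ships the milder Porter stemmer; a quick comparison sketch, reusing the token_words list from step 1:
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
# Porter trims suffixes more conservatively than Lancaster, so fewer words get mangled
print([porter_stemmer.stem(token_word) for token_word in token_words])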
2.2 Lemmatization (restoring inflected word forms)
from nltk.stem import WordNetLemmatizer
wordnet_lematizer = WordNetLemmatizer()
words_lematizer = [wordnet_lematizer.lemmatize(token_word) for token_word in token_words]
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expands', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contains', 'cluster', 'of', 'similar', 'density']
Explanation: this approach restores inflected forms (past tense, third-person singular, plurals, etc.) to the base word and does not produce meaningless stems. However, some words are still not restored: "Finds", "expands" and "contains" remain in their third-person form. The reason is that wordnet_lematizer.lemmatize treats every word as a noun by default and therefore assumes these forms are already the lemma; if we specify the verb part of speech when calling the function, "contains" becomes "contain". So we first need POS tagging to obtain each word's part of speech (details below).
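A quick check of this behaviour with the lemmatizer created above (the values in the comments follow the results shown in this article):
# With the default pos='n' the word is treated as a noun and left unchanged
print(wordnet_lematizer.lemmatize('contains'))           # contains
# Declaring it a verb restores the base form
print(wordnet_lematizer.lemmatize('contains', pos='v'))  # contain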
2.3 POS tagging
Tokenize first, then tag the parts of speech:
from nltk import word_tokenize,pos_tag
sentence = "DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density"
token_word = word_tokenize(sentence)  # tokenization
token_words = pos_tag(token_word)     # POS tagging
print(token_words)
Output:
[('DBSCAN', 'NNP'), ('-', ':'), ('Density-Based', 'JJ'), ('Spatial', 'NNP'), ('Clustering', 'NNP'), ('of', 'IN'), ('Applications', 'NNP'), ('with', 'IN'), ('Noise', 'NNP'), ('.', '.'), ('Finds', 'NNP'), ('core', 'NN'), ('samples', 'NNS'), ('of', 'IN'), ('high', 'JJ'), ('density', 'NN'), ('and', 'CC'), ('expands', 'VBZ'), ('clusters', 'NNS'), ('from', 'IN'), ('them', 'PRP'), ('.', '.'), ('Good', 'JJ'), ('for', 'IN'), ('data', 'NNS'), ('which', 'WDT'), ('contains', 'VBZ'), ('clusters', 'NNS'), ('of', 'IN'), ('similar', 'JJ'), ('density', 'NN')]
Explanation: the second element of each tuple is the word's POS tag. To see what each tag means, run nltk.help.upenn_tagset() or consult the POS tag documentation.
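For example, to look up a single tag (the tag documentation lives in a separate NLTK data package, assumed here to be downloadable as 'tagsets'):
import nltk

nltk.download('tagsets')       # documentation for the tag sets
nltk.help.upenn_tagset('VBZ')  # explain one tag, e.g. VBZ: verb, present tense, 3rd person singular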
3. Lemmatization with POS Specified
from nltk.stem import WordNetLemmatizer
words_lematizer = []
wordnet_lematizer = WordNetLemmatizer()
for word, tag in token_words:
    if tag.startswith('NN'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='n')  # n = noun
    elif tag.startswith('VB'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='v')  # v = verb
    elif tag.startswith('JJ'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='a')  # a = adjective
    elif tag.startswith('R'):
        word_lematizer = wordnet_lematizer.lemmatize(word, pos='r')  # r = adverb
    else:
        word_lematizer = wordnet_lematizer.lemmatize(word)
    words_lematizer.append(word_lematizer)
print(words_lematizer)
Output:
['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
Explanation: the inflected verb forms have now been restored to their base forms, e.g. "expands" and "contains" have become "expand" and "contain". ("Finds" is left unchanged because the tagger labelled it NNP, a proper noun, at the start of its sentence, so it was lemmatized as a noun.)
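The same tag-to-POS mapping can also be wrapped in a small helper function; this is just a refactoring sketch of the loop above, not a different algorithm:
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def get_wordnet_pos(treebank_tag):
    # Map a Penn Treebank tag prefix to the corresponding WordNet POS constant
    if treebank_tag.startswith('NN'):
        return wordnet.NOUN
    elif treebank_tag.startswith('VB'):
        return wordnet.VERB
    elif treebank_tag.startswith('JJ'):
        return wordnet.ADJ
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # fall back to the lemmatizer's default (noun)

wordnet_lematizer = WordNetLemmatizer()
words_lematizer = [wordnet_lematizer.lemmatize(word, pos=get_wordnet_pos(tag))
                   for word, tag in token_words]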
4. Removing Stop Words
After tokenization and lemmatization we have the base form of every word, but the list still contains prepositions, conjunctions and other function words that carry little weight in text analysis (these are called stop words), and they need to be removed.
from nltk.corpus import stopwords
cleaned_words = [word for word in words_lematizer if word not in stopwords.words('english')]
print('Original words:', words_lematizer)
print('After removing stop words:', cleaned_words)
Output:
Original words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'of', 'Applications', 'with', 'Noise', '.', 'Finds', 'core', 'sample', 'of', 'high', 'density', 'and', 'expand', 'cluster', 'from', 'them', '.', 'Good', 'for', 'data', 'which', 'contain', 'cluster', 'of', 'similar', 'density']
After removing stop words: ['DBSCAN', '-', 'Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', '.', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', '.', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Explanation: stop words such as of, for and and have been removed.
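If certain domain-specific words should also be dropped, the stop-word list can be extended; a sketch (the extra words added below are only hypothetical examples, and a set makes the membership test faster):
from nltk.corpus import stopwords

custom_stop_words = set(stopwords.words('english')) | {'good', 'data'}  # hypothetical extra stop words
cleaned_words_custom = [word for word in words_lematizer if word.lower() not in custom_stop_words]
print(cleaned_words_custom)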
5. Removing Special Characters
Punctuation marks are also of no use in text analysis, so we remove them as well. Here we filter the list against a custom list of characters, which you can define yourself; specific words to be dropped can be added to the same list. For example, we remove "DBSCAN" together with the punctuation.
characters = [',', '.','DBSCAN', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%','-','...','^','{','}']
words_list = [word for word in cleaned_words if word not in characters]
print(words_list)
Output:
['Density-Based', 'Spatial', 'Clustering', 'Applications', 'Noise', 'Finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'Good', 'data', 'contain', 'cluster', 'similar', 'density']
Explanation: the processed word list no longer contains special characters such as "-" and ".".
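Instead of maintaining the character list by hand, tokens made up entirely of punctuation can also be filtered with string.punctuation; a sketch (note that, unlike the list above, this keeps "DBSCAN" because it is an ordinary word):
import string

words_list_alt = [word for word in cleaned_words
                  if not all(ch in string.punctuation for ch in word)]
print(words_list_alt)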
6. Case Conversion
To prevent the same word from being counted as two different words because of capitalization, we also unify the case (here everything is converted to lowercase).
words_lists = [x.lower() for x in words_list ]
print(words_lists)
Output:
['density-based', 'spatial', 'clustering', 'applications', 'noise', 'finds', 'core', 'sample', 'high', 'density', 'expand', 'cluster', 'good', 'data', 'contain', 'cluster', 'similar', 'density']
7. Text Analysis
After the six preprocessing steps above we have a clean word list, ready for text analysis or text mining (it can also be converted into a DataFrame first and analysed from there).
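For example, a minimal sketch of putting the cleaned word list into a pandas DataFrame (assuming pandas is available), so the usual value_counts/groupby style analysis can be applied:
import pandas as pd

df = pd.DataFrame({'word': words_lists})
print(df['word'].value_counts().head(10))  # frequency of the 10 most common words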
Counting word frequencies (frequency counting serves as our text-analysis example here):
from nltk import FreqDist
freq = FreqDist(words_lists)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
Output:
density-based:1
spatial:1
clustering:1
applications:1
noise:1
finds:1
core:1
sample:1
high:1
density:2
expand:1
cluster:2
good:1
data:1
contain:1
similar:1
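FreqDist also lets us query the top words directly, which is handy before plotting:
print(freq.most_common(5))  # e.g. [('density', 2), ('cluster', 2), ...]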
Visualization (line chart):
freq.plot(20, cumulative=False)  # line chart of the 20 most frequent words
Visualization (word cloud):
To draw a word cloud, the word list first has to be joined into a single string:
words = ' '.join(words_lists)
words
Output:
'density-based spatial clustering applications noise finds core sample high density expand cluster good data contain cluster similar density'
Draw the word cloud:
from wordcloud import WordCloud
from imageio import imread
import matplotlib.pyplot as plt
pic = imread('./picture/china.jpg')  # local image used as the mask shape for the word cloud
wc = WordCloud(mask=pic, background_color='white', width=800, height=600)
wwc = wc.generate(words)
plt.figure(figsize=(10,10))
plt.imshow(wwc)
plt.axis("off")
plt.show()
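Optionally, the generated word cloud can also be written to an image file (the output path below is just an example):
wc.to_file('./picture/wordcloud.png')  # hypothetical output path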
Text-analysis conclusion: from the line chart or the word cloud we can see at a glance that "density" and "cluster" occur most often; the more frequent a word, the larger it appears in the word cloud.