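This post walks through a word-frequency count in Python.
It relies on Python's jieba module.
jieba is a well-established module for Chinese word segmentation.
First, we read the text of the novel and split it into words with jieba's lcut function.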
import jieba

# Read the full text of the novel (紅樓夢, Dream of the Red Chamber)
with open('紅樓夢.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Segment the text into a list of words with jieba
words = jieba.lcut(txt)
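Next, we count how often each character's name appears.
To decide which words are names, we create a text file that lists the character names and use it as a lookup list.
Some characters are referred to in more than one way (for example 寶玉 and 賈寶玉), so we map those variants to a single name before counting.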
# Dictionary that holds the number of occurrences of each name
counts = {}
# Load the character names; the file separates names with '、'
with open('人名.txt', 'r', encoding='utf-8') as f:
    names = f.read().strip().split('、')
# Filter the segmented words: skip single characters, fold each group of
# names below into one canonical name, and keep only words found in names
for word in words:
    if len(word) == 1:
        continue
    elif word == '賈母' or word == '老太太':
        word = '賈母'
    elif word in ('賈珍', '尤氏'):
        word = '賈珍'
    elif word in ('賈蓉', '秦可卿'):
        word = '賈蓉'
    elif word in ('賈赦', '邢夫人'):
        word = '賈赦'
    elif word in ('賈政', '王夫人'):
        word = '賈政'
    elif word in ('襲人', '蕊珠'):
        word = '襲人'
    elif word in ('賈璉', '王熙鳳'):
        word = '賈璉'
    elif word in ('紫鵑', '鸚哥'):
        word = '紫鵑'
    elif word in ('翠縷', '縷兒'):
        word = '翠縷'
    elif word in ('香菱', '甄英蓮'):
        word = '香菱'
    elif word in ('豆官', '豆童'):
        word = '豆官'
    elif word in ('薛蝌', '邢岫煙'):
        word = '薛蝌'
    elif word in ('薛蟠', '夏金桂'):
        word = '薛蟠'
    elif word in ('賈寶玉', '寶玉'):
        word = '賈寶玉'
    elif word in ('林黛玉', '林姑娘', '黛玉'):
        word = '林黛玉'
    # Ignore anything that is not a known character name
    if word not in names:
        continue
    counts[word] = counts.get(word, 0) + 1
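The long elif chain can also be written as a lookup table. The following is only a minimal sketch of that idea, not the code used in this post: the ALIASES table, the count_names helper, and the sample word list are all hypothetical, and the sample tokens are hand-written stand-ins for what jieba.lcut would return.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to the name it is counted under.
ALIASES = {
    '老太太': '賈母',
    '寶玉': '賈寶玉',
    '林姑娘': '林黛玉',
    '黛玉': '林黛玉',
}


def count_names(words, names, aliases=ALIASES):
    """Count canonical names in an already-segmented list of words."""
    name_set = set(names)  # a set gives fast membership checks
    counts = Counter()
    for word in words:
        if len(word) == 1:  # skip single characters
            continue
        word = aliases.get(word, word)  # normalise a variant, else keep the word
        if word in name_set:
            counts[word] += 1
    return counts


# Hand-written sample tokens standing in for jieba.lcut output (assumption)
sample_words = ['寶玉', '笑', '道', '林姑娘', '黛玉', '老太太', '賈母', '今日', '賈寶玉']
sample_names = '賈母、賈寶玉、林黛玉'.split('、')
result = count_names(sample_words, sample_names)
print(result)
```

Adding another alias then only means adding one entry to the table, and there is no separator string to keep in sync with a split call.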
Finally, we sort the results.
# Turn the dictionary into a list of (name, count) pairs
items = list(counts.items())
# Sort by count, in descending order
items.sort(key=lambda x: x[1], reverse=True)
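The same ranking can be produced with collections.Counter from the standard library. This is a small sketch for comparison, not part of the original code; the counts below are made-up illustration values, not results from the novel.

```python
from collections import Counter

# Made-up counts for illustration only (assumption, not real results)
counts = {'賈寶玉': 5, '林黛玉': 3, '賈母': 3, '襲人': 1}

# most_common() returns (name, count) pairs sorted by count, highest first
ranked = Counter(counts).most_common()
# Passing a number keeps only that many of the highest-ranked entries
top_two = Counter(counts).most_common(2)

for name, n in ranked:
    print(name, n)
```

Names with equal counts keep the order in which they were first inserted, which matches what the sort call with reverse=True does.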
The complete code is as follows:
import jieba

# Read the full text of the novel (紅樓夢, Dream of the Red Chamber)
with open('紅樓夢.txt', 'r', encoding='utf-8') as f:
    txt = f.read()
# Segment the text into a list of words with jieba
words = jieba.lcut(txt)
# Dictionary that holds the number of occurrences of each name
counts = {}
# Load the character names; the file separates names with '、'
with open('人名.txt', 'r', encoding='utf-8') as f:
    names = f.read().strip().split('、')
# Filter the segmented words: skip single characters, fold each group of
# names below into one canonical name, and keep only words found in names
for word in words:
    if len(word) == 1:
        continue
    elif word == '賈母' or word == '老太太':
        word = '賈母'
    elif word in ('賈珍', '尤氏'):
        word = '賈珍'
    elif word in ('賈蓉', '秦可卿'):
        word = '賈蓉'
    elif word in ('賈赦', '邢夫人'):
        word = '賈赦'
    elif word in ('賈政', '王夫人'):
        word = '賈政'
    elif word in ('襲人', '蕊珠'):
        word = '襲人'
    elif word in ('賈璉', '王熙鳳'):
        word = '賈璉'
    elif word in ('紫鵑', '鸚哥'):
        word = '紫鵑'
    elif word in ('翠縷', '縷兒'):
        word = '翠縷'
    elif word in ('香菱', '甄英蓮'):
        word = '香菱'
    elif word in ('豆官', '豆童'):
        word = '豆官'
    elif word in ('薛蝌', '邢岫煙'):
        word = '薛蝌'
    elif word in ('薛蟠', '夏金桂'):
        word = '薛蟠'
    elif word in ('賈寶玉', '寶玉'):
        word = '賈寶玉'
    elif word in ('林黛玉', '林姑娘', '黛玉'):
        word = '林黛玉'
    # Ignore anything that is not a known character name
    if word not in names:
        continue
    counts[word] = counts.get(word, 0) + 1
# Turn the dictionary into a list of (name, count) pairs
items = list(counts.items())
# Sort by count, in descending order
items.sort(key=lambda x: x[1], reverse=True)
# print(items)
print('Most frequent:', items[0][0], 'appears', items[0][1], 'times')
print('Least frequent:', items[-1][0], 'appears', items[-1][1], 'times')
for name, count in items:
    print(name, 'appears', count, 'times')
The output looks like this: