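While reading 《孔乙己》 (Kong Yiji), I wrote a short script that uses jieba word segmentation to pull out every phrase of three or more characters in the story.
Kong Yiji crawler
The code is as follows: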
import jieba  # jieba ("結巴") Chinese word-segmentation module

# Read the full text of the story
with open('kongyiji.txt', 'r', encoding='utf-8') as f:
    kongyiji = f.read()

# Segment the text into a list of words
words = list(jieba.cut(kongyiji))

# Count how many times each word occurs
d = {}
for w in words:
    d[w] = d.get(w, 0) + 1

# Convert the counts into [word, count] pairs
word_list = [[k, v] for k, v in d.items()]

# Named max_by_count so the built-in max() is not shadowed
def max_by_count(array):
    m = array[0]
    for i in array:
        if m[1] < i[1]:
            m = i
    return m

# Selection sort, highest count first; works on a copy of the input
def sort_by_count(array):
    remaining = list(array)
    result = []
    while remaining:
        m = max_by_count(remaining)
        result.append(m)
        remaining.remove(m)
    return result

# Keep only the words of three or more characters
def filter_long_words(array):
    return [w for w in array if len(w[0]) >= 3]

sorted_words = sort_by_count(word_list)
result = filter_long_words(sorted_words)
for w in result:
    print(w)
The result (the number after each phrase is how many times it occurs in the text):
['孔乙己', 33]
['茴香豆', 5]
['十九個', 4]
['不耐煩', 2]
['掌柜的', 2]
['之乎者也', 2]
['怎么樣', 2]
['半懂不懂', 2]
['端出去', 1]
['睜大眼睛', 1]
['自此以后', 1]
['免不了', 1]
['嘆一口氣', 1]
['十多年', 1]
['伸出頭', 1]
['這時候', 1]
['不一會', 1]
['壞脾氣', 1]
['第二年', 1]
['背地里', 1]
['做點(diǎn)事', 1]
['漲紅了臉', 1]
['大半夜', 1]
['一九一九年', 1]
['努著嘴', 1]
['兩三天', 1]
['多不多', 1]
['二十多年', 1]
['亂蓬蓬', 1]
['君子固窮', 1]
['十二歲', 1]
['嘮嘮叨叨', 1]
['趕熱鬧', 1]
['曲尺形', 1]
['說笑聲', 1]
['對柜里', 1]
['看一看', 1]
['讀書人', 1]
['替人家', 1]
['干不了', 1]
['纏夾不清', 1]
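For comparison, the same count, sort and filter steps can probably be written more compactly with the standard library's collections.Counter. This is only a sketch: the function name count_long_words, the min_len parameter and the small hand-made token list are my own assumptions for illustration; in the real script the tokens would come from list(jieba.cut(kongyiji)).

```python
from collections import Counter

def count_long_words(tokens, min_len=3):
    """Return [word, count] pairs for tokens of at least min_len characters,
    most frequent first."""
    counts = Counter(t for t in tokens if len(t) >= min_len)
    # most_common() sorts by count, descending; equal counts keep first-seen order
    return [[word, n] for word, n in counts.most_common()]

# Hypothetical hand-made token list standing in for list(jieba.cut(kongyiji))
tokens = ['孔乙己', '是', '孔乙己', '茴香豆', '的', '茴香豆', '孔乙己', '讀書人']

for pair in count_long_words(tokens):
    print(pair)
```

Filtering before counting gives the same pairs as counting first and filtering afterwards, since each word's count does not depend on the other words.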
Of course, if you want to look up phrases by other criteria, you can add more conditions freely; technically this is fairly easy to do.
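As one possible way to add such criteria, each condition can be written as a small function and passed to a generic selector. This is a sketch under my own assumptions: select, four_chars and repeated are hypothetical names, and the sample pairs are a few entries copied from the result above.

```python
def select(pairs, *conditions):
    """Keep the [word, count] pairs that satisfy every condition."""
    return [p for p in pairs if all(cond(p[0], p[1]) for cond in conditions)]

def four_chars(word, count):
    # Matches four-character phrases only
    return len(word) == 4

def repeated(word, count):
    # Matches phrases that occur at least twice
    return count >= 2

# A few [word, count] pairs taken from the result above
pairs = [['孔乙己', 33], ['茴香豆', 5], ['之乎者也', 2], ['君子固窮', 1], ['讀書人', 1]]

print(select(pairs, four_chars))
print(select(pairs, four_chars, repeated))
```

New criteria then only need one more small function; the selection loop itself stays unchanged.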