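I recently found that Python's wordcloud package can build word clouds and lets you shape them with a custom image. It is a lot of fun, so here is a record of how I used it.
Official site: the API reference is available there
github
The steps are described below.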
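Installation
pip install wordcloud
If the download is slow, you can point pip at the Douban mirror (through pip's -i / --index-url option) to speed it up.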
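Data
Any text data will do. I scraped the first 10 pages of short reviews of Wolf Warrior from the web myself. The scraper's login and captcha handling are still being worked on, so I am not posting that code here.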
Word segmentation
Segmentation uses jieba (github). To install it, run:
pip install jieba
Ways to generate a word cloud
word_cloud can generate a word cloud in two ways: from text and from frequencies. Each way has corresponding functions:
generate(text) Generate wordcloud from text.
generate_from_text(text) Generate wordcloud from text.
generate_from_frequencies Create a word_cloud from words and frequencies.
fit_words Create a word_cloud from words and frequencies.
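The difference between the two input forms can be seen without wordcloud itself. The following is only a minimal sketch using the standard library; the sample text and the weights are made up for illustration.

```python
from collections import Counter

# Hypothetical, already-segmented sample text: words separated by spaces,
# which is the form that generate / generate_from_text take.
text = "movie great action movie hero action movie"

# Counting the words gives a {word: frequency} mapping, which is the form
# that generate_from_frequencies / fit_words take.
frequencies = dict(Counter(text.split()))
print(frequencies)

# A list of (word, weight) pairs, like the keyword list built in the
# script below, converts to the same dict shape with dict().
pairs = [("movie", 0.5), ("action", 0.3), ("hero", 0.2)]  # made-up weights
weights = dict(pairs)
print(weights)
```

Chinese text has no spaces between words, so for the text-based functions it has to be segmented and joined with a delimiter first. That is why the script below goes through jieba before calling wordcloud.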
Code example
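import re
import jieba.analyse as analyse
from wordcloud import WordCloud, ImageColorGenerator
import matplotlib.pyplot as plt
# scipy.misc.imread has been removed from recent SciPy releases,
# so matplotlib's imread (same name) is used to load the image instead
from matplotlib.pyplot import imread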
def filterComments():
    # Read every review line, strip surrounding whitespace, and join into one string
    with open('zhanlang.txt', 'r', encoding='utf8') as f:
        text = ''.join(line.strip() for line in f)
    # Keep only runs of Chinese characters (U+4E00 to U+9FA5);
    # punctuation, digits, and Latin letters are dropped
    pattern = re.compile(r'[\u4e00-\u9fa5]+')
    return ''.join(pattern.findall(text))

cleaned_comments = filterComments()
analyse.set_stop_words("stopwords.txt")
# With withWeight=True each keyword is returned with its weight (a TF-IDF score),
# in the form [('aa', 0.23), ('a', 0.11)]
# Without this argument, or with it set to False, the result is a plain list: ['a', 'b']
words_df = analyse.extract_tags(cleaned_comments, topK=100, withWeight=True)
back_coloring = imread('./love.jpg')
wordcloud = WordCloud(
    font_path=r'C:\Windows\Fonts\simhei.ttf',  # font; raw string keeps the backslashes literal
    background_color="black",  # background color
    max_words=2000,  # maximum number of words shown in the cloud
    mask=back_coloring,  # image used as the mask
    max_font_size=100,  # largest font size
    random_state=42,
)
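# Build color values from the background image
image_colors = ImageColorGenerator(back_coloring)
# generate and generate_from_text take text, e.g. 'a b c d':
# a string of words separated by a delimiter such as a space.
# The join below needs the plain word list (withWeight=False), not the weighted pairs.
# words = " ".join(words_df)
# wordcloud.generate(words)
# wordcloud.generate_from_text(words)
# When frequencies are available, use generate_from_frequencies or fit_words.
# generate_from_frequencies takes a dict (it must be a dict),
# so the weighted jieba result can be passed in after dict()
# wordcloud.generate_from_frequencies(dict(words_df))
wordcloud.fit_words(dict(words_df))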
#plt.imshow(wordcloud.recolor(color_func=image_colors))
plt.imshow(wordcloud)
plt.axis("off")
plt.show()
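The cleaning step in the script can be tried on its own, without zhanlang.txt. This is only a sketch: the two sample lines are invented, and io.StringIO stands in for the real file.

```python
import io
import re

# Hypothetical sample lines standing in for zhanlang.txt
sample = io.StringIO("太好看了!!! 10/10\nGood movie, 燃爆了~\n")

# Same character class as in the script: Chinese characters U+4E00 to U+9FA5
pattern = re.compile(r'[\u4e00-\u9fa5]+')

# Strip each line, join everything, then keep only the Chinese runs
text = ''.join(line.strip() for line in sample)
cleaned = ''.join(pattern.findall(text))
print(cleaned)
```

Note that joining the matches removes all boundaries between reviews. That is acceptable here because jieba's keyword extraction re-segments the text afterwards.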