When we want to gauge whether a TV series or film is any good, we usually check its rating and reviews on Douban. This article scrapes the Douban comments for one TV series, runs a SnowNLP sentiment analysis on them, and finally generates a word cloud to give an intuitive overall impression.
1. Scraping the comments
Take the recently popular series Game of Hunting (《獵場(chǎng)》) as an example. Because Douban has anti-scraping measures, requests must carry the cookies of a logged-in session; these are saved in a cookie.txt file. The code and results are as follows:
import requests, codecs
import time
import random
from lxml import html

header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:54.0) Gecko/20100101 Firefox/54.0'}

# Parse the "name=value; name=value; ..." cookie string copied from the browser
cookies = {}
with open('cookie.txt', 'r') as f_cookies:
    for line in f_cookies.read().split(';'):
        name, value = line.strip().split('=', 1)
        cookies[name] = value

# Each comment page holds 20 comments; walk the pagination via the start parameter
for num in range(0, 500, 20):
    url = 'https://movie.douban.com/subject/26322642/comments?start=' + str(
        num) + '&limit=20&sort=new_score&status=P&percent_type='
    with codecs.open('comment.txt', 'a', encoding='utf-8') as f:
        try:
            r = requests.get(url, headers=header, cookies=cookies)
            result = html.fromstring(r.text)
            comment = result.xpath("//div[@class='comment']/p/text()")
            for i in comment:
                f.write(i.strip() + '\r\n')
        except Exception as e:
            print(e)
    # Random delay between requests to stay under the anti-scraping radar
    time.sleep(1 + float(random.randint(1, 100)) / 20)
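The cookie-parsing step above assumes the file holds one `name=value; name=value; ...` string, as produced by copying the Cookie header from the browser's developer tools. That step can be isolated and sanity-checked on its own (the sample string below is made up for illustration; real values come from cookie.txt):

```python
def parse_cookie_string(raw):
    """Turn a 'name=value; name=value' cookie header into a dict."""
    cookies = {}
    for pair in raw.split(';'):
        name, value = pair.strip().split('=', 1)
        cookies[name] = value
    return cookies

# Hypothetical cookie string for illustration only
sample = 'bid=abc123; dbcl2="1234:xyz"; ck=AbCd'
print(parse_cookie_string(sample))
```

Note that `split('=', 1)` splits only on the first `=`, so values that themselves contain `=` (common in session tokens) survive intact.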
[Screenshot: comment-scraping results]
2. Sentiment analysis
SnowNLP is a Python library for processing Chinese text; it supports word segmentation, part-of-speech tagging, sentiment analysis, and more. Its sentiment analysis simply classifies text into two classes, positive and negative, and returns the probability of the positive class: the closer to 1, the more positive; the closer to 0, the more negative. The code is as follows:
import numpy as np
from snownlp import SnowNLP
import matplotlib.pyplot as plt

# Score each comment's sentiment (0 = negative, 1 = positive)
sentimentslist = []
with open('comment.txt', 'r', encoding='UTF-8') as f:
    for line in f.readlines():
        s = SnowNLP(line)
        sentimentslist.append(s.sentiments)

# Plot the distribution of sentiment scores
plt.hist(sentimentslist, bins=np.arange(0, 1, 0.01), facecolor='g')
plt.xlabel('Sentiments Probability')
plt.ylabel('Quantity')
plt.title('Analysis of Sentiments')
plt.show()
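Besides the histogram, a single summary number is often handy when comparing shows. Assuming scores at or above 0.5 count as positive (a common but arbitrary cutoff, not part of SnowNLP itself), the share of positive comments can be computed with plain Python:

```python
def positive_share(scores, threshold=0.5):
    """Fraction of sentiment scores at or above the threshold."""
    if not scores:
        return 0.0
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Hypothetical scores for illustration; real values come from sentimentslist
print(positive_share([0.9, 0.8, 0.3, 0.6]))  # → 0.75
```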
[Screenshot: sentiment-analysis results]
3. Generating the word cloud
Generating the word cloud mainly relies on jieba and wordcloud: the former is a Chinese word-segmentation library, and the latter builds a customizable word cloud from the segmentation results. The full code is as follows:
# coding=utf-8
import matplotlib.pyplot as plt
import numpy as np
from PIL import Image  # scipy.misc.imread was removed from SciPy; load the mask with Pillow instead
from wordcloud import WordCloud
import jieba, codecs
from collections import Counter

text = codecs.open('comment.txt', 'r', encoding='utf-8').read()
text_jieba = list(jieba.cut(text))  # segment the Chinese text
c = Counter(text_jieba)             # count word frequencies
word = c.most_common(800)           # keep the 800 most frequent words
bg_pic = np.array(Image.open('src.jpg'))

wc = WordCloud(
    font_path=r'C:\Windows\Fonts\SIMYOU.TTF',  # a Chinese font, required to render Chinese words
    background_color='white',  # background color
    max_words=2000,            # maximum number of words displayed
    mask=bg_pic,               # shape mask image
    max_font_size=200,         # maximum font size
    random_state=20            # number of random color states, i.e. color schemes
)
wc.generate_from_frequencies(dict(word))  # build the cloud from the frequency dict
wc.to_file('result.jpg')

# show the word cloud and the original mask image
plt.imshow(wc)
plt.axis("off")
plt.figure()
plt.imshow(bg_pic, cmap=plt.cm.gray)
plt.axis("off")
plt.show()
[Images: original picture and the generated word cloud]
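The frequency-counting step that feeds `generate_from_frequencies` works on any token list, not just jieba output, so it can be illustrated with the standard library alone (the tokens below are made up for illustration):

```python
from collections import Counter

# Hypothetical tokens standing in for jieba.cut() output
tokens = ['good', 'plot', 'good', 'acting', 'slow', 'good']
c = Counter(tokens)
top = c.most_common(2)  # ties keep first-encountered order, per the Counter docs
freq = dict(top)        # the dict form that generate_from_frequencies expects
print(top)   # → [('good', 3), ('plot', 1)]
```

Capping the list with `most_common(n)` (800 in the script above) keeps the word cloud from being dominated by a long tail of one-off words.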