A Douban short-review scraper written in Python: try it on your favorite movie

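Preface

This post implements scraping and visual analysis of the (popular) short reviews of any movie. In other words, you only need to provide the link and some basic information, and it can do the scraping and analysis for you.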

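Analysis

For a Douban scraper, what should we consider, and how do we analyze the site? Start from the Douban Movie home page.

The first step is simply to try it. Open any movie; here we use Jiang Ziya (姜子牙) as the example. You will find that its page is not dynamically rendered: it uses traditional server-side rendering, so requesting the URL directly returns the data. But as you page through the reviews you will notice that users who are not logged in can only view a limited number of pages, and only logged-in users can access the later pages.

So the workflow should be: log in -> scrape -> store -> visual analysis.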

A note on the environment and the required packages: the code targets Python 3 and runs successfully on Windows and Linux. If it fails on macOS or Linux because of garbled Chinese fonts, message me privately. The pip packages used are listed below. Install them from the Tsinghua mirror, otherwise the downloads are very slow.

pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install matplotlib -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install numpy -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install xlrd -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install xlwt -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install bs4 -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install lxml -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install wordcloud -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install jieba -i https://pypi.tuna.tsinghua.edu.cn/simple


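Login

Douban's login page

Once there, you will see a password-login tab. We need to analyze what happens during login. Opening the F12 developer console alone is not enough; we also use Fiddler to capture the traffic.

Open the F12 console and click the login button. After several attempts, the login endpoint turns out to be quite simple:

Inspecting the request parameters shows an ordinary request with no encryption. You can of course capture it with Fiddler; here I ran a quick test using a wrong password. If login fails for you, try logging in manually in the browser, logging out, and then running the program again.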
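To make "an ordinary request with no encryption" concrete: the form body is nothing more than URL-encoded key/value pairs. The following is a minimal sketch using only the standard library. The field names are the ones used in this article's login code; `encode_login_form` and `FORM_TEMPLATE` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Field names follow the article's login form; the names below are illustrative.
FORM_TEMPLATE = {"ck": "", "name": "", "password": "", "remember": "false", "ticket": ""}


def encode_login_form(username, password):
    """Return the URL-encoded form body without modifying the shared template."""
    form = dict(FORM_TEMPLATE, name=username, password=password)
    return urlencode(form)


body = encode_login_form("user@example.com", "p&ss word")
print(body)
```

Because the template is copied on every call, the function can be called repeatedly with different credentials, and special characters such as `@`, `&`, and spaces are escaped rather than breaking the form.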

The login module can then be written like this:

import urllib.parse

import requests

url = 'https://accounts.douban.com/j/mobile/login/basic'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
    'Origin': 'https://accounts.douban.com',
    'content-Type': 'application/x-www-form-urlencoded',
    'x-requested-with': 'XMLHttpRequest',
    'accept': 'application/json',
    'accept-encoding': 'gzip, deflate, br',
    'accept-language': 'zh-CN,zh;q=0.9',
    'connection': 'keep-alive',
    'Host': 'accounts.douban.com',
}
# Form fields sent with the login request
data = {
    'ck': '',
    'name': '',
    'password': '',
    'remember': 'false',
    'ticket': '',
}

def login(username, password):
    # Copy the template so repeated calls do not overwrite the module-level dict
    form = dict(data, name=username, password=password)
    body = urllib.parse.urlencode(form)
    # verify=False disables TLS certificate verification (kept from the original);
    # remove it unless a capture proxy requires it
    req = requests.post(url, headers=header, data=body, verify=False)
    # Convert the cookie jar into a plain dict for later requests
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies


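Once this part is sorted out, the overall execution flow is roughly: post the account name and password to the login endpoint, then reuse the returned cookies in later requests.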

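Scraping

After logging in successfully, we can carry the login information with our requests and scrape the site freely. Although the page uses traditional rendering, you will notice an AJAX request every time you switch pages.

This endpoint returns the comment data directly, so there is no need to request the whole page and then extract that part. The URL pattern is the same as analyzed earlier: only the start parameter, the offset of the first comment on the page, changes, so we can simply build the URLs ourselves.

That is, keep building URLs in a loop until a request no longer succeeds.

https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&(other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=20&(other parameters omitted)
https://movie.douban.com/subject/25907124/comments?percent_type=&start=40&(other parameters omitted)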

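The URL pattern above can be sketched as a small helper. This is only an illustration: `comment_page_url` is a hypothetical name, the query parameters are copied from the request URL used in this article's code, and which of them the server actually requires has not been verified here.

```python
from urllib.parse import urlencode

BASE = "https://movie.douban.com/subject/{movie_id}/comments"


def comment_page_url(movie_id, start, limit=20):
    """Build the URL of one page of short reviews; only `start` changes per page."""
    query = urlencode({
        "start": start,
        "limit": limit,
        "sort": "new_score",
        "status": "P",
        "comments_only": 1,
    })
    return BASE.format(movie_id=movie_id) + "?" + query


# First three pages for the example movie id used in this article
page_urls = [comment_page_url(25907124, start) for start in range(0, 60, 20)]
for u in page_urls:
    print(u)
```

Stepping `start` by the page size (20) yields one URL per page, which is exactly the pattern shown in the three example URLs.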

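How do we extract the information after requesting each URL? We filter the data with CSS selectors: every comment shares the same styling, so in the HTML the comments look like the elements of a list.

The data returned by that AJAX endpoint is exactly the comment-list block of the page, so we can search by class to split it into individual items and process each one.

In the implementation, we use requests to send the request and BeautifulSoup to parse the returned HTML. The fields we need are easy to locate.

The code is: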

import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/subject/25907124/comments?percent_type=&start=0&limit=20&status=P&sort=new_score&comments_only=1&ck=C7di'
header = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
}
req = requests.get(url, headers=header, verify=False)
res = req.json()  # the response is JSON
res = res['html']  # the comment markup is under the 'html' key
soup = BeautifulSoup(res, 'lxml')
node = soup.select('.comment-item')  # one node per comment
for va in node:
    name = va.a.get('title')  # reviewer name
    # The second-to-last character of the rating span's first class is the star count
    star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]
    comment = va.select_one('.short').text  # comment text
    votes = va.select_one('.votes').text  # number of votes
    print(name, star, votes, comment)


Running this test prints the reviewer name, star rating, vote count, and comment text of each review.
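The star rating is the least obvious field, because it is read out of a CSS class name by character position. The sketch below shows a more explicit variant. It assumes the rating class has the form `allstarN0` (for example `allstar40` for four stars), which is what the indexing in the code above relies on; `star_from_classes` is a hypothetical helper, not part of the original code.

```python
import re

# Assumed class-name pattern for a rated comment, e.g. "allstar40" for 4 stars.
STAR_CLASS = re.compile(r"^allstar([1-5])0$")


def star_from_classes(classes):
    """Return the star count (1-5) found in a list of CSS classes, or None."""
    for name in classes:
        match = STAR_CLASS.match(name)
        if match:
            return int(match.group(1))
    return None


print(star_from_classes(["allstar40", "rating"]))
print(star_from_classes(["comment-time"]))
```

Returning `None` for a review without a rating avoids storing a stray character that later has to be filtered out when the ratings are counted.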

Storage

Once the data is scraped we need to store it; we save it to an Excel (.xls) file.

We use xlwt to write the data to the Excel file. A basic xlwt example:

import xlwt

# Create a writable workbook object
workbook = xlwt.Workbook(encoding='utf-8')
# Create a worksheet
worksheet = workbook.add_sheet('sheet1')
# Write a cell: first argument is the row, second the column, third the value
worksheet.write(0, 0, 'bigsai')
# Save as test.xls (xlwt writes the legacy .xls format, not .xlsx)
workbook.save('test.xls')


We use xlrd to read the Excel file. A basic xlrd example for this case:

import xlrd

# Open the file named test.xls
workbook = xlrd.open_workbook('test.xls')
# Get the first sheet
table = workbook.sheets()[0]
# Number of rows in the sheet
nrows = table.nrows
for i in range(nrows):
    print(table.row_values(i))  # each row is returned as a list of cell values

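If a plain-text format is preferable, the same five columns could also be written with Python's standard csv module. This is a swapped-in alternative, not what the article's code does (the article uses xlwt and xlrd); the column names and `rows_to_csv` are hypothetical and chosen only for illustration.

```python
import csv
import io

# Illustrative column names matching the five fields stored per comment
COLUMNS = ["index", "name", "star", "votes", "comment"]


def rows_to_csv(rows):
    """Serialize comment rows to CSV text, quoting fields that contain commas."""
    buffer = io.StringIO(newline="")
    writer = csv.writer(buffer)
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buffer.getvalue()


sample_rows = [
    [1, "alice", "4", "12", "good, really good"],
    [2, "bob", "5", "3", "classic"],
]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

To write to disk instead, pass a file opened with `newline=""` to `csv.writer`; opening it with `encoding="utf-8-sig"` helps Excel detect the encoding of Chinese text.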

At this point, combining the login, scraping, and storage modules lets us save the data locally. The integrated code is:

import urllib.parse

import requests
import xlwt
from bs4 import BeautifulSoup

# Log in with an account name and password and return the session cookies
def login(username, password):
    url = 'https://accounts.douban.com/j/mobile/login/basic'
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
        'Referer': 'https://accounts.douban.com/passport/login_popup?login_source=anony',
        'Origin': 'https://accounts.douban.com',
        'content-Type': 'application/x-www-form-urlencoded',
        'x-requested-with': 'XMLHttpRequest',
        'accept': 'application/json',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'zh-CN,zh;q=0.9',
        'connection': 'keep-alive',
        'Host': 'accounts.douban.com',
    }
    # Form fields that the login request must carry
    data = {
        'ck': '',
        'name': username,
        'password': password,
        'remember': 'false',
        'ticket': '',
    }
    body = urllib.parse.urlencode(data)
    # verify=False disables TLS certificate verification (kept from the original);
    # remove it unless a capture proxy requires it
    req = requests.post(url, headers=header, data=body, verify=False)
    cookies = requests.utils.dict_from_cookiejar(req.cookies)
    print(cookies)
    return cookies

def getcomment(cookies, mvid):
    # cookies: cookies from a successful login (the server identifies the user by them)
    # mvid: the id of the movie
    start = 0
    w = xlwt.Workbook(encoding='utf-8')  # create a writable workbook object
    ws = w.add_sheet('sheet1')  # create a worksheet
    index = 1  # current row number in the xls file
    # Browser-like request header
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    }
    while True:
        # Any error (for example a non-JSON response) is treated as the end of the data
        try:
            # Build the URL; start grows by 20 for each page
            url = 'https://movie.douban.com/subject/' + str(mvid) + '/comments?start=' + str(
                start) + '&limit=20&sort=new_score&status=P&comments_only=1'
            start += 20
            # Send the request
            req = requests.get(url, cookies=cookies, headers=header)
            # The response is a JSON string; req.json() parses it
            res = req.json()
            res = res['html']  # the data we need is under the 'html' key
            soup = BeautifulSoup(res, 'lxml')  # parse the HTML fragment to extract the fields
            node = soup.select('.comment-item')  # one node per comment, up to 20 per URL
            if not node:  # an empty page means there is nothing left to fetch
                break
            for va in node:  # iterate over the comments
                name = va.a.get('title')  # reviewer name
                star = va.select_one('.comment-info').select('span')[1].get('class')[0][-2]  # star rating
                votes = va.select_one('.votes').text  # number of votes
                comment = va.select_one('.short').text  # comment text
                print(name, star, votes, comment)
                ws.write(index, 0, index)  # row index, column 0: running number
                ws.write(index, 1, name)  # row index, column 1: reviewer
                ws.write(index, 2, star)  # row index, column 2: stars
                ws.write(index, 3, votes)  # row index, column 3: votes
                ws.write(index, 4, comment)  # row index, column 4: comment text
                index += 1
        except Exception as e:  # stop on any exception
            print(e)
            break
    w.save('test.xls')  # save as test.xls

if __name__ == '__main__':
    username = input('Account: ')
    password = input('Password: ')
    cookies = login(username, password)
    mvid = input('Movie id: ')
    getcomment(cookies, mvid)


After running it, the data is stored successfully in test.xls.
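The paging logic can also be looked at separately from the network code. The sketch below is a hedged illustration, not the article's implementation: `fetch_page` is a hypothetical callable standing in for "request one URL and return the parsed comments", and `max_pages` is an added safety cap. The loop stops on the first empty page, and a plain dict fakes the server so the example runs offline.

```python
def collect_pages(fetch_page, page_size=20, max_pages=50):
    """Call fetch_page(start) for start = 0, 20, 40, ... until a page comes back empty."""
    rows = []
    for page in range(max_pages):
        items = fetch_page(page * page_size)
        if not items:  # an empty page means there is nothing left to fetch
            break
        rows.extend(items)
    return rows


# Fake data source: two non-empty pages, then nothing
fake_pages = {0: ["comment a", "comment b"], 20: ["comment c"]}
requested = []


def fake_fetch(start):
    requested.append(start)
    return fake_pages.get(start, [])


all_rows = collect_pages(fake_fetch)
print(all_rows)
print(requested)
```

Passing the fetch function in as an argument keeps the stop condition testable without a Douban account; in real use it would wrap the request and parsing steps.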

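Visual analysis

We want statistics on the ratings and on word frequency, plus a word cloud. The corresponding libraries are matplotlib and WordCloud.

The logic: read the .xls file, segment the comments into words and count word frequencies, then turn the most frequent words into a bar chart and a word cloud. The star ratings are shown as a pie chart. The main code is commented; the full code is: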

from collections import Counter

import jieba
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import xlrd
from wordcloud import WordCloud

# Use a font that can display Chinese; some Linux systems have font problems here
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['axes.unicode_minus'] = False

# comment is a list of rows such as
# [['1', 'name', 'stars', 'votes', 'comment text'], ['2', 'name', 'stars', 'votes', 'comment text']]
def anylasescore(comment):
    score = [0, 0, 0, 0, 0, 0]  # number of ratings with 0, 1, 2, 3, 4 and 5 stars
    count = 0  # total number of ratings
    for va in comment:  # each row: ['1', 'name', 'stars', 'votes', 'comment text']
        try:
            score[int(va[2])] += 1  # the third column holds the stars; convert it to int
            count += 1
        except Exception:
            continue  # rows without a numeric rating are skipped
    print(score)
    if count == 0:  # nothing to plot, and avoids dividing by zero
        print('no ratings found')
        return
    label = '1 star', '2 stars', '3 stars', '4 stars', '5 stars'
    color = 'blue', 'orange', 'yellow', 'green', 'red'  # color of each wedge
    size = [0, 0, 0, 0, 0]  # percentages that add up to 100
    explode = [0, 0, 0, 0, 0]  # offset of each wedge from the center
    for i in range(5):  # wedge i corresponds to a rating of i + 1 stars
        size[i] = score[i + 1] * 100 / count
        explode[i] = score[i + 1] / count / 10
    pie = plt.pie(size, colors=color, explode=explode, labels=label, shadow=True, autopct='%1.1f%%')
    for font in pie[1]:
        font.set_size(8)
    for digit in pie[2]:
        digit.set_size(8)
    plt.axis('equal')  # equal aspect ratio so the pie is a circle
    plt.title('Share of each rating', fontsize=12)  # title
    plt.legend(loc=0, bbox_to_anchor=(0.82, 1))  # legend
    # Set the legend font size
    leg = plt.gca().get_legend()
    ltext = leg.get_texts()
    plt.setp(ltext, fontsize=6)
    plt.savefig("score.png")
    # Show the figure
    plt.show()

def getzhifang(counter):  # bar chart: needs x (words) and y (counts)
    x = []
    y = []
    for k, v in counter.most_common(15):  # the 15 most frequent words
        x.append(k)
        y.append(v)
    Xi = np.array(x)  # convert to numpy arrays
    Yi = np.array(y)
    width = 0.6
    plt.rcParams['font.sans-serif'] = ['SimHei']  # needed to display Chinese tick labels
    plt.figure(figsize=(8, 6))  # figure size ratio 8:6
    plt.bar(Xi, Yi, width, color='blue', label='Top word frequency', alpha=0.8)
    plt.xlabel("Word")
    plt.ylabel("Count")
    plt.savefig('zhifang.png')
    plt.show()

def getciyun_most(counter):  # build the word clouds
    # x holds the words, y the corresponding counts
    x = []
    y = []
    for k, v in counter.most_common(300):  # the 300 most frequent words
        x.append(k)
        y.append(v)
    xi = x[0:150]  # take the first 150
    xi = ' '.join(xi)  # join with spaces, the format the word cloud expects
    print(xi)
    # backgroud_Image = plt.imread('')  # for a custom-shaped word cloud
    # Basic settings: size, font and so on.
    # font_path must point to a font with Chinese glyphs, otherwise only small boxes are drawn
    wc = WordCloud(background_color="white",
                   width=1500, height=1200,
                   # min_font_size=40,
                   # mask=backgroud_Image,
                   font_path="simhei.ttf",  # SimHei
                   max_font_size=150,  # largest font size
                   random_state=50,  # seed for the random layout and colors
                   )
    my_wordcloud = wc.generate(xi)  # words for the cloud: the first 150
    plt.imshow(my_wordcloud)  # display
    my_wordcloud.to_file("img.jpg")  # save
    xi = ' '.join(x[150:300])  # the next 150 words go into a second word cloud
    my_wordcloud = wc.generate(xi)
    my_wordcloud.to_file("img2.jpg")
    plt.axis("off")

def anylaseword(comment):
    # Stop words: terms that carry no meaning for the analysis and are filtered out.
    # They must be in the same script (Simplified/Traditional) as the scraped comments.
    stopwords = ['這個', '一個', '不少', '起來', '沒有', '就是', '不是', '那個', '還是', '劇情', '這樣', '那樣', '這種', '那種', '故事', '人物', '什么']
    print(stopwords)
    c = Counter()  # dict subclass that counts occurrences
    for va in comment:
        seg_list = jieba.cut(va[4], cut_all=False)  # segment the comment text with jieba
        for x in seg_list:
            if len(x) > 1 and x != '\r\n':  # skip single characters and line breaks
                c[x] += 1  # one more occurrence of this word
    for (k, v) in c.most_common():  # drop words seen fewer than 5 times and stop words
        if v < 5 or k in stopwords:
            c.pop(k)
    print(len(c), c)
    getzhifang(c)  # draw the bar chart from these counts
    getciyun_most(c)  # word cloud

def anylase():
    data = xlrd.open_workbook('test.xls')  # open the xls file
    table = data.sheets()[0]  # first sheet
    nrows = table.nrows  # number of rows
    comment = []
    for i in range(nrows):
        comment.append(table.row_values(i))  # append the row to the list
    anylasescore(comment)
    anylaseword(comment)

if __name__ == '__main__':
    anylase()

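The two calculations at the heart of the analysis, rating percentages and word filtering, can be checked without any plotting library. The sketch below is an illustration under stated assumptions rather than the article's code: `rating_percentages` and `filter_words` are hypothetical helpers, and the tokens are passed in already segmented (in the article that is jieba's job).

```python
from collections import Counter


def rating_percentages(stars):
    """Return the percentage of 1..5-star ratings; non-numeric entries are skipped."""
    counts = [0] * 5
    for value in stars:
        try:
            star = int(value)
        except (TypeError, ValueError):
            continue
        if 1 <= star <= 5:
            counts[star - 1] += 1  # index 0 holds 1-star ratings, index 4 holds 5-star
    total = sum(counts)
    if total == 0:
        return [0.0] * 5
    return [c * 100 / total for c in counts]


def filter_words(tokens, stopwords, min_count=5):
    """Count tokens longer than one character, then drop rare words and stop words."""
    counter = Counter(t for t in tokens if len(t) > 1)
    kept = {w: n for w, n in counter.items() if n >= min_count and w not in stopwords}
    return Counter(kept)


percentages = rating_percentages(["5", "5", "4", "m", ""])
tokens = ["宮崎駿"] * 6 + ["故事"] * 7 + ["白龍"] * 2 + ["好"] * 9
words = filter_words(tokens, {"故事"})
print(percentages)
print(words.most_common(15))
```

Keeping the list index and the star value one apart (`counts[star - 1]`) is the detail to watch: the five pie wedges are labeled 1 to 5 stars, so each wedge must read the count for its own rating.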

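Let's look at the results:

Here I picked data for two films, Jiang Ziya and Spirited Away. The rating distributions of the two compare as follows:

The ratings clearly show that Spirited Away is better received: most people are willing to give it five stars. It is basically one of the best animated films. Next, the word-frequency bar charts:

Clearly the creator of Spirited Away is better known and highly influential, so much so that reviewers keep mentioning him. Now the word clouds of the two films:

Hayao Miyazaki, Haku, Granny: truly full of memories. That's all from me; if you have anything to add, feel free to discuss!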
