A few days ago I saw a WeChat official-account post about scraping reviews of Wolf Warrior, so today I decided to give it a try myself.
The film I picked is 《殺破狼》.
Open the film's short-comments page, and in the page source you can find the comment-item blocks — each one is a review.
Now we've found what we want, but this is only the first page, and there are more than 6,000 comments in total. So how do we get the rest? Scroll to the bottom and click next page (后頁), and the address bar changes to a URL like this:
https://movie.douban.com/subject/26826398/comments?start=20&limit=20&sort=new_score&status=P
So limit is the number of records per page, and start is the offset of the first record on that page. Knowing this, we can build the full list of page URLs:
url_list = ['https://movie.douban.com/subject/26826398/comments?'
            'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]
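As a quick sanity check (these prints are mine, not part of the original script): range(0, 6317, 20) steps through the offsets 0, 20, ..., 6300, so the list should hold 316 page URLs.

print(len(url_list))   # 316 pages: 6317 comments at 20 per page, rounded up
print(url_list[-1])    # ...start=6300&limit=20&sort=new_score&status=P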
The scraping itself is just a matter of using bs4 to pull out what we want:
response = requests.get(url=url, headers=header)
response.encoding = 'utf-8'
html = BeautifulSoup(response.text, 'html.parser')
comment_items = html.select('div.comment-item')
for item in comment_items:
    comment = item.find('p')
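One caveat worth adding (my addition, not in the original post): if a page fails to load or a comment block has no <p> tag, item.find('p') returns None and a later get_text() call would crash. A minimal guard:

for item in comment_items:
    comment = item.find('p')
    if comment is not None:  # skip items without a comment paragraph
        text = comment.get_text().strip()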
The scraped text then gets written into a txt file, to be used later for the analysis.
For the analysis, first grab a stop-word list from the web, then use jieba to extract keywords. The code is as follows (I also borrowed from 羅羅攀's article: http://www.reibang.com/p/b277199346ae):
def fenci():
    path = '/Users/mocokoo/Documents/shapolang.txt'
    with open(path, mode='r', encoding='utf-8') as f:
        content = f.read()
    analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
    tags = analyse.extract_tags(content, topK=100, withWeight=True)
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))
Finally, use this site to turn the output into a word cloud (the tab-separated word/weight pairs printed by fenci() can be pasted into its import box):
https://wordart.com/create
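If you'd rather stay in Python instead of using the website, the same word/weight pairs can be rendered locally with the third-party wordcloud package. A minimal sketch, assuming pip install wordcloud and a font file with CJK glyphs (the PingFang path below is a macOS guess; adjust to your system):

from wordcloud import WordCloud
import jieba.analyse as analyse

with open('/Users/mocokoo/Documents/shapolang.txt', encoding='utf-8') as f:
    content = f.read()
analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
# build a {word: weight} dict for generate_from_frequencies
freq = dict(analyse.extract_tags(content, topK=100, withWeight=True))
wc = WordCloud(font_path='/System/Library/Fonts/PingFang.ttc',  # needs a CJK-capable font
               width=800, height=600, background_color='white')
wc.generate_from_frequencies(freq).to_file('shapolang_wordcloud.png')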
Here is the complete code:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup
import jieba.analyse as analyse

header = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Connection': 'keep-alive',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

url_list = ['https://movie.douban.com/subject/26826398/comments?'
            'start={}&limit=20&sort=new_score&status=P'.format(x) for x in range(0, 6317, 20)]


# Scrape every short-comment page and write the comments to a file
def get_comments():
    with open(file='/Users/mocokoo/Documents/shapolang.txt', mode='w', encoding='utf-8') as f:
        i = 1
        for url in url_list:
            print('Crawling page %d of the 殺破狼 comments' % i)
            response = requests.get(url=url, headers=header)
            response.encoding = 'utf-8'
            html = BeautifulSoup(response.text, 'html.parser')
            comment_items = html.select('div.comment-item')
            for item in comment_items:
                comment = item.find('p')
                f.write(comment.get_text().strip() + '\n')
            print('Page %d done' % i)
            i += 1


# Keyword extraction with jieba
def fenci():
    path = '/Users/mocokoo/Documents/shapolang.txt'
    with open(path, mode='r', encoding='utf-8') as f:
        content = f.read()
    analyse.set_stop_words('/Users/mocokoo/Documents/tycibiao.txt')
    tags = analyse.extract_tags(content, topK=100, withWeight=True)
    for item in tags:
        print(item[0] + '\t' + str(int(item[1] * 1000)))


if __name__ == '__main__':
    get_comments()  # write the comments to the file
    # fenci()
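One last practical note: firing 316 requests back to back is a good way to get rate-limited by Douban. A small variant of the fetch loop with a delay and a status check (the 1-second pause is my own arbitrary choice, not from the original):

import time

for url in url_list:
    response = requests.get(url=url, headers=header)
    response.raise_for_status()  # stop early on 403/429 instead of writing junk
    response.encoding = 'utf-8'
    # ... parse and write exactly as in get_comments() ...
    time.sleep(1)  # pause between pages to stay polite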