Preface:
Tools: Python 3.7, VS Code, Chrome.
Install beautifulsoup4, jieba, and wordcloud with pip (pip install <package>); urllib ships with Python 3.
一关拒、分析豆瓣頁(yè)面
首先我們先觀察豆瓣的搜索頁(yè)面
我們可以看到左側(cè)的導(dǎo)航欄,結(jié)合url我們會(huì)發(fā)現(xiàn)cat后面的值和q后面的書名電影名影響著搜索的變化庸娱,可以找出如下規(guī)律:
讀書 1001
電影1002
音樂1003
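For example, searching the Books category for 看見 produces a URL like https://www.douban.com/search?cat=1001&q=看見.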
Inspecting the page source (F12) shows that everything we need sits under <a> tags. Thanks to Douban's solid ranking algorithm, we can simply take the first search result as our target; all we need from it is the sid, and the rest is left to the crawler for the target page.
The source code:
import ssl
import string
import urllib
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup


def create_url(keyword: str, kind: str) -> str:
    '''
    Create a search url from a keyword.
    Args:
        keyword: the keyword you want to search
        kind: a string indicating the kind of search result
            type: 讀書; num: 1001
            type: 電影; num: 1002
            type: 音樂; num: 1003
    Returns: url
    '''
    num = ''
    if kind == '讀書':
        num = 1001
    elif kind == '電影':
        num = 1002
    elif kind == '音樂':
        num = 1003
    url = 'https://www.douban.com/search?cat=' + \
        str(num) + '&q=' + keyword
    return url


def get_html(url: str) -> str:
    '''Send a request and return the page html.'''
    headers = {
        # 'Cookie': your cookie,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Connection': 'keep-alive'
    }
    # skip certificate verification so https requests don't fail locally
    ssl._create_default_https_context = ssl._create_unverified_context
    # quote the url; safe marks the characters to leave unescaped
    s = urllib.parse.quote(url, safe=string.printable)
    req = urllib.request.Request(url=s, headers=headers)
    resp = urllib.request.urlopen(req)
    content = resp.read().decode('utf-8')
    return content


def get_content(keyword: str, kind: str) -> str:
    '''
    Search for a keyword and return the first result's <h3> block.
    Args:
        keyword: the keyword you want to search
        kind: a string indicating the kind of search result
            type: 讀書; num: 1001
            type: 電影; num: 1002
            type: 音樂; num: 1003
    Returns: the html of the first <h3> on the result page
    '''
    url = create_url(keyword=keyword, kind=kind)
    html = get_html(url)
    soup_content = BeautifulSoup(html, 'html.parser')
    contents = soup_content.find_all('h3', limit=1)
    result = str(contents[0])
    return result


def find_sid(raw_str: str) -> str:
    '''
    Find the sid in raw_str.
    Args:
        raw_str: an html info string containing the sid
    Returns:
        sid
    '''
    assert isinstance(raw_str, str), \
        'the type of raw_str must be str'
    start_index = raw_str.find('sid:')
    sid = raw_str[start_index + 5: start_index + 13]
    # strip() returns a new string, so the result must be reassigned
    sid = sid.strip(', ')
    return sid


if __name__ == "__main__":
    raw_str = get_content('看見', '讀書')
    print(find_sid(raw_str))
Now we have the sid that uniquely identifies the book (or movie).
Next, let's look at the page we want to scrape and inspect its source (F12).
It's easy to see that the comments we need all live under <span class="short"> tags, while the author, date, and star rating we also want to scrape each hide under a few other child tags. The code:
comments = soupComment.findAll('span', 'short')
time = soupComment.select('.comment-item > div > h3 > .comment-info > span:nth-of-type(2)')
name = soupComment.select('.comment-item > div > h3 > .comment-info > a')
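Putting those selectors together, here is a minimal sketch that pairs each comment with its author and date. It assumes html already holds one fetched comment page (fetching is covered in Part 2) and that the three lists line up one entry per comment block:

from bs4 import BeautifulSoup

soupComment = BeautifulSoup(html, 'html.parser')
comments = soupComment.findAll('span', 'short')
times = soupComment.select('.comment-item > div > h3 > .comment-info > span:nth-of-type(2)')
names = soupComment.select('.comment-item > div > h3 > .comment-info > a')
# assumes one author/date per comment; comments missing a rating may shift the date span
for name, posted, comment in zip(names, times, comments):
    print(name.getText(), posted.getText().strip(), comment.getText())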
Comments page 1: https://book.douban.com/subject/20427187/comments/hot?p=1
Comments page 2: https://book.douban.com/subject/20427187/comments/hot?p=2
...
Comments page n: https://book.douban.com/subject/20427187/comments/hot?p=n
Paging through the comments reveals the URL pattern: only the variable after p changes.
Part 2: Scraping the Douban comment data
We need to disguise the crawler with header information to get past the site's anti-scraping measures:
headers = {
    # 'Cookie': your cookie,
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
    'Referer': 'https://movie.douban.com/subject/20427187/comments?status=P',
    'Connection': 'keep-alive'
}
For the cookie, first log in to your Douban account in the browser, then look under F12 -> Network -> All -> Headers.
The crawler code:
import time
import urllib.parse
import urllib.request
from bs4 import BeautifulSoup

import crawler_tools  # the module from Part 1


def creat_url(num):
    '''Build the comment-page urls for the subject with the given sid.'''
    urls = []
    for page in range(1, 20):
        url = 'https://book.douban.com/subject/' + \
            str(num) + '/comments/hot?p=' + str(page)
        urls.append(url)
    print(urls)
    return urls


def get_html(urls):
    '''Fetch every page and return the html of each as a list.'''
    headers = {
        # 'Cookie': your cookie,
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36 OPR/26.0.1656.60',
        'Connection': 'keep-alive'
    }
    contents = []
    for url in urls:
        print('Crawling: ' + url)
        req = urllib.request.Request(url=url, headers=headers)
        resp = urllib.request.urlopen(req)
        # collect every page; returning inside the loop would keep only one
        contents.append(resp.read().decode('utf-8'))
        time.sleep(10)  # pause between requests to avoid getting blocked
    return contents


def get_comment(num):
    urls = creat_url(num)
    htmls = get_html(urls)
    f = open('數(shù)據(jù).txt', 'a', encoding='utf-8')
    for html in htmls:
        soupComment = BeautifulSoup(html, 'html.parser')
        comments = soupComment.findAll('span', 'short')
        onePageComments = []
        for comment in comments:
            onePageComments.append(comment.getText() + '\n')
        print(onePageComments)
        for sentence in onePageComments:
            f.write(sentence)
    f.close()


raw_str = crawler_tools.get_content('看見', '讀書')
sid = crawler_tools.find_sid(raw_str)
print('sid:' + sid)
get_comment(sid)
Part 3: Data cleaning, feature extraction, and word-cloud display
First we tokenize with the jieba library and weight the tokens using its built-in TF-IDF algorithm.
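As a quick illustration of the weighting, extract_tags can also return the TF-IDF weight of each token (a minimal sketch; the sentence is just sample input):

import jieba.analyse

sample = '這是一本好書,一本值得反復(fù)閱讀的好書'
# withWeight=True returns (word, weight) pairs instead of bare words
for word, weight in jieba.analyse.extract_tags(sample, topK=5, withWeight=True):
    print(word, weight)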
Then we generate the word cloud with the wordcloud library, with the following settings:
font_path='FZQiTi-S14S.TTF',  # font to render with
max_words=66,                 # maximum number of words shown
max_font_size=600,            # largest font size
random_state=666,             # fixed random seed for a reproducible layout
width=1400, height=900,       # image size
background_color='black',     # background color
stopwords=stopWords_list      # stopword list (loaded in the script below)
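Assembled into a constructor call, the settings above read as follows (stopWords_list is the stopword list loaded in the script below; the font path must point at a font that actually contains CJK glyphs, or the cloud renders as boxes):

import wordcloud

w = wordcloud.WordCloud(
    font_path='FZQiTi-S14S.TTF',  # assumed to be available locally
    max_words=66,
    max_font_size=600,
    random_state=666,
    width=1400, height=900,
    background_color='black',
    stopwords=stopWords_list,
)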
Compare the word cloud built from the cleaned data with a plain one:
The data-processing code:
import jieba
import jieba.analyse
import wordcloud

# read the scraped comments
f = open('/Users/money666/Desktop/The_new_crawler/看見.txt',
         'r', encoding='utf-8')
contents = f.read()
f.close()

# read the stopwords from a file
stopWords_dic = open(
    '/Users/money666/Desktop/stopwords.txt', 'r', encoding='gb18030')
stopWords_content = stopWords_dic.read()
stopWords_list = stopWords_content.splitlines()  # keep as a list for later
stopWords_dic.close()

# pick the top 75 keywords by TF-IDF weight
keywords = jieba.analyse.extract_tags(
    contents, topK=75, withWeight=False)
print(keywords)

w = wordcloud.WordCloud(background_color="black",
                        font_path='/Users/money666/Desktop/字體/粗黑.TTF',
                        width=1400, height=900, stopwords=stopWords_list)
txt = ' '.join(keywords)
w.generate(txt)
w.to_file("/Users/money666/Desktop/The_new_crawler/看見.png")
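If you'd rather preview the cloud than write a file, matplotlib can render the generated WordCloud object directly; a small optional addition:

import matplotlib.pyplot as plt

plt.imshow(w, interpolation='bilinear')  # the WordCloud object renders as an image
plt.axis('off')
plt.show()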
Part 4: Problems and solutions
1. pip timeout
Option 1: create or edit the pip.conf configuration file:
$ sudo vi ~/.pip/pip.conf
[global]
timeout = 500  # pip timeout in seconds
Option 2: use a domestic mirror
Replace the official index with a mirror, as follows:
1. pip install redis -i https://pypi.douban.com/simple
   -i: specifies the index (mirror) URL
2. Create or edit pip.conf to point at the mirror:
[global]
timeout = 6000
index-url = http://pypi.douban.com/simple/
[install]
use-mirrors = true
mirrors = http://pypi.douban.com/simple/
trusted-host = pypi.douban.com
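On pip 10 and later, the same settings can also be written from the command line instead of editing the file by hand:

$ pip config set global.index-url https://pypi.douban.com/simple
$ pip config set global.timeout 6000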
Note: pip.conf can live in several locations; create it if it does not exist. Its path can also be overridden with the PIP_CONFIG_FILE environment variable.
Linux: /etc/pip.conf
       ~/.pip/pip.conf
       ~/.config/pip/pip.conf
Windows: %APPDATA%\pip\pip.ini
         %HOME%\pip\pip.ini
         C:\Documents and Settings\All Users\Application Data\PyPA\pip\pip.conf (Windows XP)
         C:\ProgramData\PyPA\pip\pip.conf (Windows 7 and later)
Mac OS X: ~/Library/Application Support/pip/pip.conf
          ~/.pip/pip.conf
          /Library/Application Support/pip/pip.conf