關(guān)鍵詞:
爬蟲 urllib3 BeautifulSoup4
思路:
之前用過python寫爬蟲,用的urllib疤祭,看了看現(xiàn)在還有urllib3,API更簡單饵婆,性能可能更好勺馆,然后分析網(wǎng)頁還是之前用過的BeautifuleSoup4
過程:
1. First, try urllib3 to fetch Douyu's category page
pip install urllib3
import urllib3

# Create a pool manager and request the Douyu category page
http = urllib3.PoolManager()
r = http.request('GET', "https://www.douyu.com/directory")
plain_text = r.data.decode("utf-8")

# Save the raw HTML locally so it can be inspected in a browser
with open("content.html", "w", encoding='utf-8') as file:
    file.write(plain_text)
content.html was generated, and opening it in Chrome shows the page content, so this part works. Once the crawler really starts running regularly, Douyu may block my IP; I'll figure that out if and when it happens.
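One cheap precaution worth noting now (just a sketch, not a real fix; an actual ban would call for rate limiting or proxies) is to send a browser-like User-Agent, since urllib3's request() accepts a headers dict. The header string below is only an example value:

import urllib3

http = urllib3.PoolManager()
# Example User-Agent string; any recent browser UA would do here
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
r = http.request('GET', "https://www.douyu.com/directory", headers=headers)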
2. Get the category information
Open https://www.douyu.com/directory in Chrome and press F12 to bring up the source, then look for the block of markup that holds the categories.
Use BeautifulSoup to pull that part out:
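Note that besides urllib3, the code below needs two more packages: bs4 is installed as beautifulsoup4, and the "html5lib" parser name passed to BeautifulSoup refers to a separate package.

pip install beautifulsoup4 html5lib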
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def getClassify():
    r = http.request('GET', "https://www.douyu.com/directory")
    plain_text = r.data.decode("utf-8")

    # Keep a local copy of the page for inspection
    with open("content.html", "w", encoding='utf-8') as file:
        file.write(plain_text)

    soup = BeautifulSoup(plain_text, "html5lib")
    # Every category tile carries the class 'layout-Classify-item'
    classify_list = soup.findAll(attrs={'class': 'layout-Classify-item'})
    for classify in classify_list:
        link_info = classify.find('a')
        link = link_info.get('href')
        name_info = classify.find('strong')
        classify_name = name_info.text
        print(classify_name + ":" + link)

getClassify()
The category info prints out, but the first few entries are blank. A look at the source shows the recommended slots use the same layout-Classify-item class, and recommendations are empty when not logged in. Not a big deal; I'll handle it later.
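For reference, a minimal way to skip those blank recommended entries would be to filter on the extracted name, e.g. by replacing the loop in getClassify() with:

for classify in classify_list:
    name_info = classify.find('strong')
    # Recommended slots share the class but have no name when logged out; skip them
    if name_info is None or not name_info.text.strip():
        continue
    link = classify.find('a').get('href')
    print(name_info.text.strip() + ":" + link)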
3. Get the streamers under the League of Legends category
Open https://www.douyu.com/g_LOL in Chrome and look at how the streamer info is laid out.
Again, use bs4 to pull out the relevant pieces:
import urllib3
from bs4 import BeautifulSoup

http = urllib3.PoolManager()

def getLOL():
    r = http.request('GET', "https://www.douyu.com/g_LOL")
    plain_text = r.data.decode("utf-8")

    # Keep a local copy of the page for inspection
    with open("content_lol.html", "w", encoding='utf-8') as file:
        file.write(plain_text)

    soup = BeautifulSoup(plain_text, "html5lib")
    # Each live-room card is an <li> with the class 'layout-Cover-item'
    classify_list = soup.findAll('li', {'class': 'layout-Cover-item'})
    for classify in classify_list:
        link_info = classify.find('a')
        link = link_info.get('href')
        # The streamer's name sits in an element with class 'DyListCover-user'
        name_info = classify.find(attrs={'class': 'DyListCover-user'})
        user_name = name_info.text
        print(user_name + ":" + link)

getLOL()
The streamer info shows up, but only the first page.
I'll handle page navigation next time.
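As a placeholder for next time, the usual shape of that loop is sketched below. The page-URL pattern here ("/g_LOL/" plus a page number) is purely an assumption for illustration; Douyu's listing may well paginate through JavaScript or an internal API, in which case the real request URLs would first have to be dug out of the F12 network tab.

import urllib3

http = urllib3.PoolManager()

def getLOLPages(max_pages=3):
    # HYPOTHETICAL URL pattern, shown only for the loop structure;
    # the real pagination mechanism still needs to be confirmed via F12
    for page in range(1, max_pages + 1):
        url = "https://www.douyu.com/g_LOL/" + str(page)
        r = http.request('GET', url)
        plain_text = r.data.decode("utf-8")
        # ...then parse plain_text with BeautifulSoup exactly as in getLOL()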