Goal: scrape the popularity ranking of the streamers in one category on Panda TV.
Analyzing the site structure
- Use the Chrome browser.
- Press F12 to view the HTML; Ctrl+Shift+C lets you pick an element with the mouse and jump straight to its corresponding HTML.
- Analyze the text and pull out the information with regular expressions.
Search engines, and apps such as Toutiao, are essentially built on crawler technology.
Design steps:
- Be clear about the goal
- Find the web page that holds the data
- Analyze the page structure and locate the tags that contain the data
- Simulate an HTTP request, send it to the server, and get back the HTML the server returns (see the sketch after this list)
- Extract the data we want with regular expressions
- Process the data
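A minimal sketch of the request-and-fetch step on its own (assuming the category page at https://www.panda.tv/cate/lol is still reachable and served as UTF-8):

from urllib import request

url = 'https://www.panda.tv/cate/lol'
r = request.urlopen(url)               # send the HTTP GET request
htmls = r.read()                       # read the response body as bytes
htmls = str(htmls, encoding='utf-8')   # decode the bytes into a str
print(len(htmls))                      # sanity check that we actually got a page back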
Basic principles of HTML structure analysis
1. Find tags or identifiers that let you pin down the information you want to scrape
- Prefer tags that are unique on the page
- Prefer the tags closest to the data
2. Treat several related values as one group of data and look for a tag around that group
- Prefer a properly closed (parent) tag that wraps all the data you need.
Extraction order: htmls -> video-info -> video-nickname / video-number -> extract the data
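Each streamer card on the category page looks roughly like the snippet below. This is a simplified sketch inferred from the class names above and the regex patterns that follow, not the exact page source:

<div class="video-info">
    <span class="video-nickname"><i class="icon"></i>SomeStreamer</span>
    <span class="video-number"><i class="ricon ricon-eye"></i>3.2萬</span>
</div>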
Regex matching
Match the parent tag
root_pattern = r'<div class="video-info">([\s\S]*?)</div>'  # must be non-greedy here, otherwise the match runs on to the last </div> on the page
Match the nickname and the viewer count
name_pattern = r'</i>([\s\S]*?)</span>'
number_pattern = r'<i class="ricon ricon-eye"></i>([\s\S]*?)</span>'
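A quick sketch of how the three patterns behave together, run against a made-up two-card snippet (the markup is assumed, not the real page source):

import re

sample = (
    '<div class="video-info">'
    '<span class="video-nickname"><i class="icon"></i>Alice</span>'
    '<span class="video-number"><i class="ricon ricon-eye"></i>3.2萬</span>'
    '</div>'
    '<div class="video-info">'
    '<span class="video-nickname"><i class="icon"></i>Bob</span>'
    '<span class="video-number"><i class="ricon ricon-eye"></i>980</span>'
    '</div>'
)

root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
name_pattern = r'</i>([\s\S]*?)</span>'
number_pattern = r'<i class="ricon ricon-eye"></i>([\s\S]*?)</span>'

cards = re.findall(root_pattern, sample)   # non-greedy: one match per card
# a greedy ([\s\S]*) would swallow everything up to the last </div> in a single match
for card in cards:
    name = re.findall(name_pattern, card)[0]     # the first </i>...</span> capture is the nickname
    number = re.findall(number_pattern, card)[0]
    print(name, number)                          # Alice 3.2萬, then Bob 980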
Data refinement
def __refine(self, anchors):
    l = lambda anchor: {
        'name': anchor['name'][0].strip(),   # strip leading/trailing whitespace
        'number': anchor['number'][0]        # take the single string out of the list
    }
    return map(l, anchors)
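Note that in Python 3 map() returns a lazy iterator rather than a list, which is why go() later wraps the result in list(). A quick illustration on a hand-made anchor dict:

raw = [{'name': ['\n  Alice  \n'], 'number': ['3.2萬']}]
refined = map(lambda a: {'name': a['name'][0].strip(), 'number': a['number'][0]}, raw)
print(list(refined))   # [{'name': 'Alice', 'number': '3.2萬'}]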
Sorting the data
def __sort_seed(self, anchor):
    r = re.findall(r'\d*', anchor['number'])   # pull the leading digits out of the string
    number = float(r[0])
    if '萬' in anchor['number']:               # handle the '萬' (ten-thousand) suffix
        number *= 10000
    return number

def __sort(self, anchors):
    anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
    # key decides which part of each dict is used for comparison
    # sorted() is ascending by default; reverse=True gives descending order
    return anchors
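As a concrete check of the sort key: for '3.2萬' the r'\d*' pattern only captures the leading '3' (digits after the decimal point are dropped), so the seed becomes 3 * 10000 = 30000, which still ranks it above a plain '980'. A standalone restatement of the same key for quick experiments:

import re

anchors = [{'name': 'Bob', 'number': '980'}, {'name': 'Alice', 'number': '3.2萬'}]
seed = lambda a: float(re.findall(r'\d*', a['number'])[0]) * (10000 if '萬' in a['number'] else 1)
print(sorted(anchors, key=seed, reverse=True))   # Alice (seed 30000) comes before Bob (seed 980)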
Debugging the code in VSCode
Breakpoint debugging: F5 to start (and to run on to the next breakpoint), F10 to step over, F11 to step into
The complete code:
from urllib import request
import re


class Spider():
    url = 'https://www.panda.tv/cate/lol'
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<i class="ricon ricon-eye"></i>([\s\S]*?)</span>'

    def __fetch_content(self):
        r = request.urlopen(Spider.url)        # open the category page
        htmls = r.read()                       # read the response body (bytes)
        htmls = str(htmls, encoding='utf-8')   # decode the bytes into a str
        return htmls

    def __analysis(self, htmls):
        root_html = re.findall(Spider.root_pattern, htmls)   # one chunk per streamer card
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        l = lambda anchor: {
            'name': anchor['name'][0].strip(),   # strip leading/trailing whitespace
            'number': anchor['number'][0]        # take the single string out of the list
        }
        return map(l, anchors)

    def __sort_seed(self, anchor):
        r = re.findall(r'\d*', anchor['number'])   # pull the leading digits out of the string
        number = float(r[0])
        if '萬' in anchor['number']:               # handle the '萬' (ten-thousand) suffix
            number *= 10000
        return number

    def __sort(self, anchors):
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        # key decides which part of each dict is used for comparison
        # sorted() is ascending by default; reverse=True gives descending order
        return anchors

    def __show(self, anchors):
        for rank in range(0, len(anchors)):
            print('rank ' + str(rank + 1) + ': ' + anchors[rank]['name'] +
                  '————' + anchors[rank]['number'])

    def go(self):
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))   # map() is lazy, so materialize it
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()