How to Improve Crawler Performance
If you have used the crawler framework scrapy, you have probably marveled at its concurrency and efficiency.
In scrapy you can easily tune a crawler's concurrency through the settings file, thanks to scrapy's underlying Twisted asynchronous framework.
Asynchrony often works wonders in crawler development, because it keeps a single request from blocking the whole crawler.
"Non-blocking" can be understood as: while thread A is waiting for a response, thread B can send its request, and thread C can get on with processing data.
To keep individual crawl tasks from blocking one another, the Python libraries you can reach for include:
- threading
- gevent
- asyncio
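As a toy illustration of what "non-blocking" buys, the sketch below uses `time.sleep` to stand in for network latency (`fake_request`, `run_sequential`, and `run_threaded` are hypothetical names, not from any crawler library): five 0.2s waits take about 1s when serialized, but only about 0.2s when the waits overlap in threads.

```python
import time
from threading import Thread

def fake_request(results, i):
    time.sleep(0.2)          # stand-in for waiting on a network response
    results.append(i)

def run_sequential(n):
    results = []
    start = time.time()
    for i in range(n):
        fake_request(results, i)
    return time.time() - start, results

def run_threaded(n):
    results = []
    start = time.time()
    threads = [Thread(target=fake_request, args=(results, i)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:        # wait for every thread to finish
        t.join()
    return time.time() - start, results

seq_time, seq_results = run_sequential(5)   # roughly 5 * 0.2s
thr_time, thr_results = run_threaded(5)     # roughly 0.2s: the waits overlap
```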
A conventional blocking crawler
The code below implements a crawler for the Maoyan movies Top 100 board. The site's anti-crawling measures are weak; a User-Agent header is enough.
We also give the crawler a decorator that records its run time.
```python
import requests
import time
from lxml import etree
from functools import cmp_to_key

# Comparator used to sort the results by rank
def sortRule(x, y):
    for i in x.keys():
        c1 = int(i)
    for i in y.keys():
        c2 = int(i)
    if c1 > c2:
        return 1
    elif c1 < c2:
        return -1
    else:
        return 0

# Decorator that reports a function's run time
def caltime(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("costtime: ", time.time() - start)
        return result
    return wrapper

# Fetch a page
def getPage(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36',
    }
    try:
        resp = requests.get(url=url, headers=headers)
        if resp.status_code == 200:
            return resp.text
        return None
    except Exception as e:
        print(e)
        return None

# Parse the records from a single page
def parsePage(page):
    if not page:
        return
    data = etree.HTML(page).xpath('.//dl/dd')
    for d in data:
        rank = d.xpath("./i/text()")[0]
        title = d.xpath(".//p[@class='name']/a/text()")[0]
        yield {
            rank: title
        }

# Scheduling
def schedule(url, f):
    page = getPage(url)
    for data in parsePage(page):
        f.append(data)

# Display the data
def show(f):
    f.sort(key=cmp_to_key(sortRule))
    for x in f:
        print(x)

@caltime
def main():
    urls = ['https://maoyan.com/board/4?offset={offset}'.format(offset=i) for i in range(0, 100, 10)]
    f = []
    for url in urls:
        schedule(url, f)
    show(f)

if __name__ == '__main__':
    main()
```
A full Top 100 crawl takes about 2.8s on average.
The program issues 10 small crawl requests, each fetching 10 records. While one request is still waiting for its response, everything after it is blocked.
That is why this crawler is slow: the requests run in a fixed order, and a later request cannot be sent until the one before it has finished.
Breaking the ordering with threading?
Next we use multiple threads to break this fixed order. Modify the main function:
```python
@caltime
def main():
    urls = ['https://maoyan.com/board/4?offset={offset}'.format(offset=i) for i in range(0, 100, 10)]
    threads = []
    f = []
    for url in urls:
        # schedule(url, f)
        t = Thread(target=schedule, args=(url, f))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()
    show(f)
```
Remember to import Thread from the threading module:

```python
from threading import Thread
```
Run it and the time drops to about 0.4s; the performance gain is considerable.
threading works by starting multiple threads that all compete for the GIL. As soon as a thread acquires the GIL and sends its request, it releases the GIL again and goes back to waiting for the response.
The released GIL is immediately grabbed by another thread, and so on. Every thread is now equal, with no fixed order, so they appear to run simultaneously (they don't really, because of the GIL).
Since the waiting now overlaps, efficiency improves dramatically.
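The hand-rolled Thread list above can also be expressed with the standard library's `concurrent.futures` thread pool. A sketch, with a simulated `schedule` (the real one fetches and parses a page) so it runs without network access:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def schedule(url, f):
    # Simulated stand-in for the article's schedule(): fetch + parse
    time.sleep(0.1)
    f.append({url: 'ok'})

urls = ['https://maoyan.com/board/4?offset={}'.format(i) for i in range(0, 100, 10)]
f = []
with ThreadPoolExecutor(max_workers=10) as pool:
    for url in urls:
        pool.submit(schedule, url, f)
# leaving the with-block waits for every submitted task to finish
```

Compared with managing Thread objects by hand, the pool caps the number of live threads via max_workers and handles the join for you.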
How about gevent's asynchronous coroutines?
gevent is an excellent asynchronous networking library that makes highly concurrent network access easy. Let's try adding gevent to the blocking crawler:
```python
@caltime
def main():
    threads = []
    urls = ['https://maoyan.com/board/4?offset={offset}'.format(offset=i) for i in range(0, 100, 10)]
    f = []
    for url in urls:
        threads.append(gevent.spawn(schedule, url, f))
    gevent.joinall(threads)
    show(f)
```
You also need to import gevent here. Note that monkey.patch_all() should run as early as possible, before requests is imported, so the standard socket module is patched to be cooperative:

```python
import gevent
from gevent import monkey
monkey.patch_all()
```
Run it: about 0.45s on average, roughly on par with multithreading.
What about the newer async library asyncio?
asyncio is Python's recently introduced coroutine-based asynchronous library, billed as its most ambitious. To make asyncio work with our program, getPage needs some changes:
requests does not support async, so we swap in the aiohttp library and use it to reimplement getPage.
```python
# Asynchronous replacement for the requests-based getPage
async def getPage(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36'}
    async with aiohttp.ClientSession() as session:
        async with session.get(url, headers=headers) as resp:
            return await resp.text()
```
The main function also needs changes, and schedule must become a coroutine so it can await getPage:
```python
# schedule must also be declared async so it can await getPage
async def schedule(url, f):
    page = await getPage(url)
    for data in parsePage(page):
        f.append(data)

@caltime
def main():
    urls = ['https://maoyan.com/board/4?offset={offset}'.format(offset=i) for i in range(0, 100, 10)]
    loop = asyncio.get_event_loop()
    f = []
    tasks = []
    for url in urls:
        tasks.append(schedule(url, f))
    loop.run_until_complete(asyncio.wait(tasks))
    show(f)
```
Remember to import the relevant libraries:

```python
import asyncio
import aiohttp
```

Run it: about 0.35s on average, slightly better than both multithreading and gevent.
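On Python 3.7+, asyncio.run and asyncio.gather replace the manual event-loop management above. A minimal sketch, with asyncio.sleep (and the hypothetical names fake_get/crawl) standing in for the aiohttp request so it runs offline:

```python
import asyncio

async def fake_get(url):
    await asyncio.sleep(0.1)   # stand-in for awaiting resp.text()
    return url

async def crawl(urls):
    # gather schedules all coroutines concurrently and preserves result order
    return await asyncio.gather(*(fake_get(u) for u in urls))

results = asyncio.run(crawl(['page1', 'page2', 'page3']))
```

asyncio.run creates and closes the loop for you, which avoids the deprecated get_event_loop pattern in newer Python versions.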
Conclusion
In crawling, some of the newer developments are worth a look, for example:
- Better concurrency: asyncio, aiohttp
- Dynamic rendering: pyppeteer (the Python port of puppeteer, with async support)
- Captcha solving: machine learning, model training
On the data-parsing side, tool performance is roughly:
- re > lxml > bs4
- But even for the same parsing method, performance differs between implementations. For example, with the same XPath, lxml performs slightly better than parsel (the parsing tool developed by the scrapy team, which supports css, re, and xpath).
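As a rough (not rigorous) sanity check of the re > lxml ordering, the snippet below times both approaches on a synthetic page shaped loosely like the Maoyan list markup; absolute numbers will vary by machine:

```python
import re
import timeit
from lxml import etree

# Synthetic HTML resembling the board structure used earlier
html = '<dl>' + ''.join(
    '<dd><i>{0}</i><p class="name"><a>movie{0}</a></p></dd>'.format(i)
    for i in range(100)) + '</dl>'

def with_re():
    return re.findall(r'<a>(.*?)</a>', html)

def with_lxml():
    return etree.HTML(html).xpath('//p[@class="name"]/a/text()')

t_re = timeit.timeit(with_re, number=200)
t_lxml = timeit.timeit(with_lxml, number=200)
# re avoids building a DOM on every call, so it is typically the faster of the two
```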