Python has many parsing libraries for web scraping. We picked four of them, lxml, bs4, re, and pyquery, and benchmarked them.
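- bs4: a document-tree parsing library written in pure Python. It can run on top of several parsers (lxml, html.parser, html5lib); we test it with lxml. Elements can be located by tag or by CSS selector.
- pyquery: a Python document-tree parsing library modeled on the front-end library jQuery. It feels very similar to jQuery and locates elements with CSS selector syntax.
- xpath: lxml is a parsing library implemented in C and called from Python; it locates elements with XPath syntax.
- re: Python's regular expression library.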
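To make the four locating styles concrete, here is a minimal sketch that pulls the text of every `h2` out of a small, made-up HTML snippet. The snippet, the helper names (`by_bs4`, `by_pyquery`, `by_xpath`, `by_re`) and `run_all` are illustrative assumptions and are not part of the benchmark itself; any library that is not installed is simply skipped.

```python
import re

# Made-up sample markup, used only for illustration.
HTML = """
<html><body>
  <h2 class="title">First</h2>
  <h2 class="title">Second</h2>
</body></html>
"""


def by_bs4(html):
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")  # built-in parser backend
    return [h2.get_text() for h2 in soup.select("h2.title")]  # CSS selector


def by_pyquery(html):
    from pyquery import PyQuery as pq
    return [h2.text() for h2 in pq(html)("h2.title").items()]  # CSS selector


def by_xpath(html):
    from lxml import etree
    return etree.HTML(html).xpath('//h2[@class="title"]/text()')  # XPath


def by_re(html):
    return re.findall(r'<h2 class="title">(.*?)</h2>', html)  # regex


def run_all(html=HTML):
    """Run every extractor whose library is available."""
    results = {}
    for fn in (by_bs4, by_pyquery, by_xpath, by_re):
        try:
            results[fn.__name__] = list(fn(html))
        except ImportError:
            continue  # library not installed, skip it
    return results


if __name__ == "__main__":
    for name, titles in run_all().items():
        print(name, titles)
```

Each available extractor should return the same two headings, which is a quick way to check that the selectors are equivalent before timing them.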
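Each of the four libraries has its strengths and weaknesses:
- bs4 is mostly used for parsing the text of script tags, because it is simply too slow for heavier work.
- re is meant for matching against unstructured text.
- lxml is implemented in C underneath, so its speed is not in doubt, and it is also easy to use.
- pyquery is more flexible than xpath and bs4: a PyQuery object can directly parse an HTML file, a URL (the page is requested through urllib and the response is parsed), or a document string.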
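The remark that `re` suits unstructured text can be illustrated with a small sketch: a regex sees HTML as a flat string, so greedy quantifiers and nested tags trip it up easily. The sample string and the patterns below are assumptions chosen for illustration only.

```python
import re

# Made-up markup: two headings on one line, the second with a nested tag.
html = '<h2 class="a">One</h2><h2 class="b"><span>Two</span></h2>'

# Greedy: the first .* swallows as much as it can, so the whole line
# becomes a single match and the captured group ends up empty.
greedy = re.findall(r'<h2 .*>(.*)</h2>', html)

# Non-greedy: each quantifier stops at the first possible closing point,
# giving one match per heading (nested markup is still included).
lazy = re.findall(r'<h2 .*?>(.*?)</h2>', html)

# Stripping the remaining tags leaves plain text.
texts = [re.sub(r'<[^>]+>', '', fragment) for fragment in lazy]

print(greedy)  # ['']
print(lazy)    # ['One', '<span>Two</span>']
print(texts)   # ['One', 'Two']
```

A tree parser handles the nested `span` without extra work, which is why regexes are usually kept for text that has no reliable structure.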
The code is as follows:
"""
@Author: Jonescyna
@Created: 2020/12/28
"""
import re
import time

import requests
from bs4 import BeautifulSoup
from lxml import etree
from pyquery import PyQuery as pq


def cal_time(func):
    """Print how long the decorated function takes to run."""
    def inner(*args, **kwargs):
        start = time.time()
        ret = func(*args, **kwargs)
        print(f'{func.__name__}:{time.time() - start}s')
        return ret
    return inner


base_url = 'https://www.amazon.cn/b/ref=s9_acss_bw_cg_pccateg_2a1_w?node=106200071&pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=PQNKPPABQXAWCTZSNFXA&pf_rd_t=101&pf_rd_p=cdcd9a0d-d7cf-4dab-80db-2b7d63266973&pf_rd_i=42689071'


def get(url):
    """Download the page once; the download is not part of the timing."""
    headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:82.0) Gecko/20100101 Firefox/82.0'}
    resp = requests.get(url, headers=headers)
    return resp.text


# Every parse_by_* function parses the page 50 times and reads each h2.

@cal_time
def parse_by_pq(html):
    for _ in range(50):
        doc = pq(html)
        h2_list = doc('h2').items()
        for h2 in h2_list:
            h2.text()


@cal_time
def parse_by_xpath(html):
    for _ in range(50):
        doc = etree.HTML(html)
        h2_list = doc.xpath('//h2')
        for h2 in h2_list:
            texts = h2.xpath('./text()')
            # An h2 may have no direct text node, so avoid an IndexError.
            title = texts[0] if texts else ''


@cal_time
def parse_by_bs4(html):
    for _ in range(50):
        soup = BeautifulSoup(html, 'lxml')
        h2_list = soup.find_all('h2')
        for h2 in h2_list:
            title = h2.text


@cal_time
def parse_by_re(html):
    for _ in range(50):
        # Relies on this page's layout: the title sits on its own line.
        h2_list = re.findall(r'<h2 .*>\n(.*)\n<', html)
        for h2 in h2_list:
            title = h2


if __name__ == '__main__':
    html = get(base_url)
    parse_by_pq(html)
    parse_by_xpath(html)
    parse_by_bs4(html)
    parse_by_re(html)
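Results

Test environment: I ran the test on a desktop PC running Windows 10 with an i5 CPU and 16 GB of RAM (DDR3). The absolute numbers depend on the hardware, and the page fetched over the network determines how much HTML there is to parse; in the same environment the timings do not vary much.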
parse_by_pq:0.9650003910064697s
parse_by_xpath:0.761019229888916s
parse_by_bs4:2.878000020980835s
parse_by_re:0.01597905158996582s
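A single wall-clock measurement can be noisy, and the `re` figure of roughly 16 ms is about the granularity that `time.time()` has typically had on Windows. One hedged alternative is to repeat each measurement with the standard `timeit` module and keep the best round. The sketch below is an assumption about how this could be done, not the method behind the numbers above; `bench`, the compiled `pattern` and the generated sample HTML are hypothetical.

```python
import re
import timeit

# Made-up page: 200 headings, one per line.
html = "\n".join(f"<h2 class='t'>Title {i}</h2>" for i in range(200))
pattern = re.compile(r"<h2 [^>]*>(.*?)</h2>")


def parse_by_re(text):
    return pattern.findall(text)


def bench(func, arg, number=50, repeat=5):
    """Return the best per-call time in seconds over `repeat` rounds."""
    # timeit uses time.perf_counter by default, a high-resolution clock.
    timer = timeit.Timer(lambda: func(arg))
    rounds = timer.repeat(repeat=repeat, number=number)
    return min(rounds) / number


best = bench(parse_by_re, html)
print(f"parse_by_re: {best * 1e6:.1f} us per call")
```

Taking the minimum rather than the mean is a common choice here, since slower rounds mostly reflect interference from other processes rather than the code being measured.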