Conventional pyppeteer middleware
Although pyppeteer is an asyncio-based framework, the conventional middleware calls it synchronously, so the advantages of the asynchronous framework are lost: every render blocks Scrapy, effectively dropping total concurrency to 1. See the reference project (https://github.com/Python3WebSpider/ScrapyPyppeteer.git).
import asyncio
from concurrent.futures._base import TimeoutError
from logging import getLogger

import pyppeteer
import websockets
from scrapy.http import HtmlResponse


class PyppeteerMiddleware():
    # self.browser is assumed to be a pyppeteer Browser launched
    # elsewhere (see the reference project above)
    def render(self, url, timeout=8.0, **kwargs):
        async def async_render(url, **kwargs):
            page = None
            try:
                page = await self.browser.newPage()
                response = await page.goto(url, options={'timeout': int(timeout * 1000)})
                content = await page.content()
                return content, response.status
            except TimeoutError:
                return None, 500
            finally:
                if page is not None and not page.isClosed():
                    await page.close()

        # Run the coroutine to completion on the (otherwise idle) asyncio
        # loop; this blocks the Twisted reactor until rendering finishes,
        # which is exactly why total concurrency drops to 1
        loop = asyncio.get_event_loop()
        content, status = loop.run_until_complete(async_render(url, **kwargs))
        return content, status

    def process_request(self, request, spider):
        if request.meta.get('render') == 'pyppeteer':
            try:
                html, status = self.render(request.url)
                return HtmlResponse(url=request.url, body=html, request=request,
                                    encoding='utf-8', status=status)
            except websockets.exceptions.ConnectionClosed:
                pass
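To trigger the middleware, a request only needs to carry the render flag in its meta, since process_request checks request.meta.get('render'). A minimal sketch of such a spider; the spider name and URL are illustrative, not from the reference project:

import scrapy


class RenderSpider(scrapy.Spider):
    # Hypothetical spider: name and start URL are placeholders
    name = 'render_spider'

    def start_requests(self):
        # The meta flag below is what PyppeteerMiddleware looks for
        yield scrapy.Request('https://example.com',
                             meta={'render': 'pyppeteer'},
                             callback=self.parse)

    def parse(self, response):
        self.logger.info('Rendered page title: %s',
                         response.css('title::text').get())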
Asynchronous pyppeteer middleware
Making the pyppeteer middleware asynchronous takes two steps:
- In process_request, invoke the pyppeteer request coroutine asynchronously and use Deferred.fromFuture to wrap the asyncio Future in a Twisted Deferred, as in the snippet below.
import asyncio

from twisted.internet.defer import Deferred
from scrapy.http import HtmlResponse


def as_deferred(f):
    """Transform an asyncio Future into a Twisted Deferred"""
    return Deferred.fromFuture(asyncio.ensure_future(f))


class PuppeteerMiddleware:
    async def _process_request(self, request, spider):
        """Handle the request using Puppeteer"""
        page = await self.browser.newPage()
        ......
        return HtmlResponse(
            page.url,
            status=response.status,
            headers=response.headers,
            body=body,
            encoding='utf-8',
            request=request
        )

    def process_request(self, request, spider):
        """Check if the Request should be handled by Puppeteer"""
        if request.meta.get('render') == 'pyppeteer':
            return as_deferred(self._process_request(request, spider))
        return None
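The snippet above assumes self.browser already holds a pyppeteer Browser. One way to provide it, reusing the same as_deferred helper, is to launch the browser from the spider_opened signal, whose handler may return a Deferred that Scrapy waits on. This is only a hedged sketch: from_crawler, spider_opened and spider_closed are standard Scrapy hooks, while _launch_browser is a hypothetical helper name.

from pyppeteer import launch
from scrapy import signals


class PuppeteerMiddleware:
    # Continuation of the class above; as_deferred is the helper defined earlier

    @classmethod
    def from_crawler(cls, crawler):
        middleware = cls()
        # Tie the browser lifecycle to the spider lifecycle
        crawler.signals.connect(middleware.spider_opened, signals.spider_opened)
        crawler.signals.connect(middleware.spider_closed, signals.spider_closed)
        return middleware

    async def _launch_browser(self):
        # One headless Chromium instance shared by all rendered requests
        self.browser = await launch(headless=True)

    def spider_opened(self, spider):
        # Returning a Deferred makes Scrapy wait until the browser is up
        return as_deferred(self._launch_browser())

    def spider_closed(self, spider):
        return as_deferred(self.browser.close())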
- Since Scrapy is built on Twisted while pyppeteer is built on asyncio, the two reactors need to interoperate.
Twisted ships a solution for running on top of asyncio: asyncioreactor. Just make sure it is installed before Scrapy is imported or anything else runs; for example, settle the reactor question before importing execute:
import asyncio
from twisted.internet import asyncioreactor
asyncioreactor.install(asyncio.get_event_loop())
'''
The three lines above must come before importing scrapy;
otherwise Twisted cannot be hooked up to asyncio.
'''
from scrapy.cmdline import execute
execute("scrapy crawl spider_name".split())
See the reference project (https://github.com/clemfromspace/scrapy-puppeteer.git).
With this in place, the middleware is compatible with Scrapy's concurrency settings.
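For completeness, a sketch of the settings that would enable the middleware and exercise that concurrency; the module path and priority value are assumptions, so adjust them to your project layout:

# settings.py (path and values are illustrative)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.PuppeteerMiddleware': 800,
}
# With the async middleware, these limits are actually honored
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 8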