This package is an unofficial Python port of puppeteer and provides much the same functionality.
https://github.com/miyakogi/pyppeteer
Installation
pip install pyppeteer
Downloading and using Chromium
The default download host is DEFAULT_DOWNLOAD_HOST = 'https://storage.googleapis.com', which cannot be reached from mainland China without a proxy. Below is a way to set up Chromium without one.
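If you would rather keep the automatic download, newer pyppeteer versions also read the download host from an environment variable, so pointing it at a reachable mirror may work too. A minimal sketch, assuming your installed pyppeteer.chromium_downloader honors PYPPETEER_DOWNLOAD_HOST and that the mirror below carries the chromium-browser-snapshots tree:

import os

# Assumption: this overrides DEFAULT_DOWNLOAD_HOST. Set it before importing
# pyppeteer, since the downloader module reads it at import time.
# The mirror URL is a placeholder; substitute one you trust.
os.environ['PYPPETEER_DOWNLOAD_HOST'] = 'https://npm.taobao.org/mirrors'

import pyppeteer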
I installed puppeteer with npm, which downloads a copy of Chromium. npm puts it at
F:\program_nodejs\testpuppeteer\node_modules\puppeteer\.local-chromium\win64-588429\chrome-win32.
By default, pyppeteer stores its copy under pyppeteer_home = C:\Users\Administrator\AppData\Local\pyppeteer\pyppeteer\.
Reading the pyppeteer source shows the directory layout it expects: local-chromium / REVISION / 'chrome-win32' / 'chrome.exe', where REVISION is the revision number 588429 above. Create that directory tree and copy the npm-downloaded Chromium into it.
Note that the Chromium revision must be specified at runtime: os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
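To double-check where the copy has to land, here is a small sketch that rebuilds the expected path on Windows. pyppeteer_home and the revision are the values above; resolving the directory through LOCALAPPDATA is an assumption about a default install:

import os

# <pyppeteer_home>/local-chromium/<REVISION>/chrome-win32/chrome.exe
pyppeteer_home = os.path.join(os.environ['LOCALAPPDATA'], 'pyppeteer', 'pyppeteer')
revision = '588429'
chrome_exe = os.path.join(pyppeteer_home, 'local-chromium', revision,
                          'chrome-win32', 'chrome.exe')
print(chrome_exe, os.path.exists(chrome_exe))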
Scraping a page
import asyncio
import os

# Set the revision before importing pyppeteer;
# pyppeteer reads PYPPETEER_CHROMIUM_REVISION at import time.
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'

import pyppeteer

pyppeteer.DEBUG = True


async def main():
    print("in main ")
    print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    content = await page.content()
    cookies = await page.cookies()
    # await page.screenshot({'path': 'example.png'})
    await browser.close()
    return {'content': content, 'cookies': cookies}


loop = asyncio.get_event_loop()
task = asyncio.ensure_future(main())
loop.run_until_complete(task)
print(task.result())
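As an alternative to staging Chromium in pyppeteer's cache directory, launch() also accepts an executablePath option, so you can point it straight at the npm-downloaded binary. A sketch, assuming chrome.exe sits inside the chrome-win32 directory shown earlier:

import asyncio
import pyppeteer

async def main():
    # Assumption: executablePath bypasses the revision lookup and uses this binary.
    browser = await pyppeteer.launch(
        executablePath=r'F:\program_nodejs\testpuppeteer\node_modules\puppeteer'
                       r'\.local-chromium\win64-588429\chrome-win32\chrome.exe')
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    print(await page.title())
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())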
Note the use of the asyncio package, and how the page content is retrieved.
Other APIs are documented at https://miyakogi.github.io/pyppeteer/reference.html
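For instance, the page object mirrors most of puppeteer's API. A short sketch of a few common calls beyond content() and cookies(); the selectors here are illustrative, check the reference above for exact signatures:

import asyncio
import pyppeteer

async def demo():
    browser = await pyppeteer.launch()
    page = await browser.newPage()
    await page.goto('http://www.baidu.com')
    await page.waitForSelector('#kw')              # wait until the search box exists
    title = await page.evaluate('document.title')  # run JS in the page context
    links = await page.querySelectorAll('a')       # list of ElementHandle (alias: page.JJ)
    await page.screenshot({'path': 'baidu.png'})   # capture the viewport
    print(title, len(links))
    await browser.close()

asyncio.get_event_loop().run_until_complete(demo())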
Integration with Scrapy
Add a downloader middleware:
from scrapy import signals
from scrapy.http import HtmlResponse
import asyncio
import os

# Set the revision before importing pyppeteer; it is read at import time.
os.environ['PYPPETEER_CHROMIUM_REVISION'] = '588429'
import pyppeteer

pyppeteer.DEBUG = False


class FundscrapyDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    def __init__(self):
        print("Init downloaderMiddleware use pyppeteer.")
        print(os.environ.get('PYPPETEER_CHROMIUM_REVISION'))
        # Launch one shared browser and page up front and reuse them for every request.
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.getbrowser())
        loop.run_until_complete(task)
        print(self.browser)
        print(self.page)

    async def getbrowser(self):
        self.browser = await pyppeteer.launch()
        self.page = await self.browser.newPage()

    async def getnewpage(self):
        return await self.browser.newPage()

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        loop = asyncio.get_event_loop()
        task = asyncio.ensure_future(self.usePypuppeteer(request))
        loop.run_until_complete(task)
        # Hand the rendered HTML back to Scrapy as a ready-made response.
        return HtmlResponse(url=request.url, body=task.result(),
                            encoding="utf-8", request=request)

    async def usePypuppeteer(self, request):
        print(request.url)
        await self.page.goto(request.url)
        content = await self.page.content()
        return content

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
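To activate the middleware, register it in settings.py. The module path below assumes a project named fundscrapy with the class in middlewares.py; adjust it to your layout:

DOWNLOADER_MIDDLEWARES = {
    'fundscrapy.middlewares.FundscrapyDownloaderMiddleware': 543,
}

Note that the middleware drives a single shared page, so requests are rendered one at a time, and the browser is never closed; connecting a spider_closed handler that runs self.browser.close() on the event loop would be a natural addition.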