scrapy-splash is a third-party library (package) that works with Scrapy to crawl pages whose content is rendered dynamically with JavaScript.
Installation
pip install scrapy-splash
Usage
This goes down best together with the previous post on installing Docker; I'll assume you've already read that article on installing and using Docker.
With Docker set up, pull the Splash image (this runs on the Docker host, not inside a container):
docker pull scrapinghub/splash
Then start the Splash service, which listens on port 8050:
docker run -p 8050:8050 scrapinghub/splash
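To sanity-check that Splash is up, you can hit its render.html HTTP endpoint directly. A minimal sketch using requests (the target URL here is just an example):

import requests

# Ask Splash to render a page and return the resulting HTML.
# 'wait' gives the page time to execute its JavaScript first.
resp = requests.get(
    'http://localhost:8050/render.html',
    params={'url': 'http://example.com', 'wait': 0.5},
)
print(resp.status_code)  # 200 means Splash rendered the page
print(resp.text[:200])   # first bytes of the rendered HTML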
Configure the Splash service (all of the following goes in settings.py):
1) Add the Splash server address:
SPLASH_URL = 'http://localhost:8050'
2) Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
3) Enable SplashDeduplicateArgsMiddleware:
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
4) Set a custom DUPEFILTER_CLASS:
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
5) Set a custom cache storage backend:
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
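For reference, here is the scrapy-splash portion of settings.py with all five steps collected in one place:

# settings.py -- everything scrapy-splash needs, from steps 1-5 above
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'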
Using it in a Scrapy spider:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Selector
from scrapy_splash import SplashRequest

class DmozSpider(scrapy.Spider):
    name = "bcch"
    # allowed_domains entries are bare domain names, without the scheme
    allowed_domains = ["bcch.ahnw.gov.cn"]
    start_urls = [
        "http://bcch.ahnw.gov.cn/default.aspx",
    ]

    def start_requests(self):
        for url in self.start_urls:
            # 'wait' gives the page 0.5s to finish executing its JavaScript
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # the response body is the rendered HTML, so XPath sees the JS-built DOM
        resp_sel = Selector(response)
        self.log(resp_sel.xpath('//title/text()').extract_first())
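When a fixed wait isn't enough, SplashRequest can also run a Splash Lua script through the 'execute' endpoint, which gives you finer control over rendering. A minimal sketch; the spider name and the script below are illustrative, not part of the example above:

import scrapy
from scrapy_splash import SplashRequest

# A minimal Splash Lua script: load the page, wait, return the rendered HTML.
LUA_SCRIPT = """
function main(splash, args)
    splash:go(args.url)
    splash:wait(1.0)
    return splash:html()
end
"""

class ExecuteSpider(scrapy.Spider):
    name = "execute_example"  # illustrative name
    start_urls = ["http://bcch.ahnw.gov.cn/default.aspx"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                self.parse,
                endpoint='execute',              # run the Lua script instead of render.html
                args={'lua_source': LUA_SCRIPT},
            )

    def parse(self, response):
        # response.text is whatever splash:html() returned
        self.log(response.url)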
It's easy to use, but for anyone who hasn't worked with Docker before, it's still a little bit of a hassle.