Before getting to scrapy-splash, let's create a plain Scrapy project and print the page it fetches, just to show off how good scrapy-splash is by comparison. Hehe.
The target site for this scrapy-splash exercise looks like this:
If we use plain Scrapy alone, the printed page looks like this:
So on its own, plain Scrapy gets us nowhere here.
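For reference, the plain-Scrapy spider used for that comparison could be as small as the sketch below (the class and spider names here are just placeholders, not from the project):

import scrapy

class PlainSpider(scrapy.Spider):
    name = 'plain'
    start_urls = ['http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001']

    def parse(self, response):
        # Without JavaScript rendering, the notice list never appears in the HTML,
        # so this prints a mostly empty page shell.
        print(response.text)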
Ta-da, time for today's star to take the stage.
Restart Docker:
sudo service docker start
Run the Docker container in detached mode, so the Splash service keeps running even after you disconnect from the remote server:
docker run -d -p 8050:8050 scrapinghub/splash
If the command runs without errors, you are usually fine; you can also open localhost:8050 in a browser and check that it looks like this:
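Besides the browser check, you can hit Splash's render.html HTTP endpoint directly; here is a quick sketch using requests (example.com is just a test URL, and localhost:8050 assumes Splash runs locally on the default port):

import requests

# A 200 response containing HTML means the Splash service is up and rendering.
resp = requests.get('http://localhost:8050/render.html',
                    params={'url': 'http://example.com', 'wait': 2})
print(resp.status_code)
print(resp.text[:200])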
Then edit the settings.py file and add the following:
# scrapy-splash configuration:
# URL of the Splash rendering service
SPLASH_URL = 'http://localhost:8050'
# downloader middlewares
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
# spider middlewares
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
# Splash-aware duplicate-request filter
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
# Splash-aware HTTP cache
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
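Note that SPLASH_URL does not have to be localhost; if Splash runs on another machine, point it at that host instead (the address below is just a placeholder):

# settings.py, when Splash runs on a remote server:
SPLASH_URL = 'http://192.168.0.10:8050'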
The spider file zfcaigou.py:
# -*- coding: utf-8 -*-
import scrapy
from scrapy_splash import SplashRequest
from caigou.items import CaigouItem
# from caigou.items import ZfcaigouItemLoad, CaigouItem
class ZfcaigouSpider(scrapy.Spider):
    name = 'zfcaigou'
    allowed_domains = ['www.zjzfcg.gov.cn']
    start_urls = ['http://www.zjzfcg.gov.cn/purchaseNotice/index.html?categoryId=3001']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse,
                                args={'wait': 10}, endpoint='render.html')

    def parse(self, response):
        # print(response.body.decode("utf-8"))
        infodata = response.css(".items p")
        for infoline in infodata:
            caigouitem = CaigouItem()
            caigouitem['city'] = infoline.css(".warning::text").extract()[0].replace("[", "").replace("·", "").strip()
            caigouitem['issuescate'] = infoline.css(".warning .limit::text").extract()[0]
            caigouitem['title'] = infoline.css("a .underline::text").extract()[0].replace("]", "")
            caigouitem['publish_date'] = infoline.css(".time::text").extract()[0].replace("[", "").replace("]", "")
            yield caigouitem
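One fragile spot: the chained extract()[0] calls raise IndexError as soon as a selector matches nothing. A slightly more defensive parse() (a sketch with the same fields, using get() with a default) would look like this:

    def parse(self, response):
        for infoline in response.css(".items p"):
            caigouitem = CaigouItem()
            # get(default="") returns an empty string instead of raising IndexError.
            city = infoline.css(".warning::text").get(default="")
            caigouitem['city'] = city.replace("[", "").replace("·", "").strip()
            caigouitem['issuescate'] = infoline.css(".warning .limit::text").get(default="")
            title = infoline.css("a .underline::text").get(default="")
            caigouitem['title'] = title.replace("]", "")
            date = infoline.css(".time::text").get(default="")
            caigouitem['publish_date'] = date.replace("[", "").replace("]", "")
            yield caigouitem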
The items.py code:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy
class CaigouItem(scrapy.Item):
    city = scrapy.Field()
    issuescate = scrapy.Field()
    title = scrapy.Field()
    publish_date = scrapy.Field()
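You can start the crawl with scrapy crawl zfcaigou from the project root, or launch it from a plain Python script like the sketch below (run.py is a hypothetical file name; it assumes the script sits in the project root so the project settings module can be found):

# run.py
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

from caigou.spiders.zfcaigou import ZfcaigouSpider

process = CrawlerProcess(get_project_settings())
process.crawl(ZfcaigouSpider)
process.start()  # blocks until the crawl finishes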
Aside from the network being a bit laggy, the content did get scraped down.
項(xiàng)目地址:https://github.com/hfxjd9527/caigou