The earlier post on the Scrapy framework and its middlewares went over the data flow through the middleware layers. Crawling through proxies happens to touch most of that machinery, so this is a good place to pull the two topics together.
My rule of thumb for writing a middleware: start from the relevant built-in one and override its process_request / process_response / process_exception methods to fit your needs.
The downloader middlewares that ship with Scrapy should already cover most use cases. The documentation lists them all with their default order, and the middleware docs also spell out which settings enable each one:
{
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,              # robots.txt handling
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,                # HTTP authentication
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,  # compression (Accept-Encoding: gzip, deflate)
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,                # 301/302 redirects
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,              # proxy support
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,              # low-level HTTP cache
}
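Most of these built-ins can be switched on, off, or tuned straight from settings.py. A minimal sketch of the settings I reach for most often (these are all standard Scrapy settings, but the values below are only illustrative, not the defaults I necessarily recommend):

# settings.py -- toggles for some of the built-in downloader middlewares
ROBOTSTXT_OBEY = False                        # RobotsTxtMiddleware: respect robots.txt or not
RETRY_ENABLED = True                          # RetryMiddleware
RETRY_TIMES = 2                               # retries per request, on top of the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]  # statuses that trigger a retry
REDIRECT_ENABLED = True                       # RedirectMiddleware
COOKIES_ENABLED = True                        # CookiesMiddleware
HTTPCACHE_ENABLED = False                     # HttpCacheMiddleware
DOWNLOAD_TIMEOUT = 180                        # DownloadTimeoutMiddleware (180 s is the default)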
And the spider middlewares:
{
    'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50,   # filters out responses with non-2xx status
    'scrapy.spidermiddlewares.offsite.OffsiteMiddleware': 500,      # filters out requests outside allowed_domains
    'scrapy.spidermiddlewares.referer.RefererMiddleware': 700,      # sets the Referer request header from the response that generated the request
    'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware': 800,  # limits the length of crawled URLs
    'scrapy.spidermiddlewares.depth.DepthMiddleware': 900,          # limits the crawl depth
}
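These are driven by ordinary settings as well; a few of the relevant knobs (again, standard Scrapy settings with illustrative values):

# settings.py -- knobs for the spider middlewares above
HTTPERROR_ALLOWED_CODES = [404]  # HttpErrorMiddleware: let these non-2xx responses through to the spider
URLLENGTH_LIMIT = 2083           # UrlLengthMiddleware
DEPTH_LIMIT = 3                  # DepthMiddleware: 0 means no limit
REFERER_ENABLED = True           # RefererMiddleware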
This time I want to crawl the Baidu homepage through proxies. The basic flow I have in mind:
1. Load the IP list (IPOOL) in settings, and set DOWNLOAD_TIMEOUT = 3; the default of 180 seconds (3 minutes) is far too long.
2. Override HttpProxyMiddleware so that every request goes out through the first proxy currently in the list held in settings.
3. Override RetryMiddleware: when a timeout or similar error occurs (handled in process_exception), or the IP gets banned and something like a 503 comes back (handled in process_response), delete that IP, write the trimmed list back into settings, and close the spider once the list is empty.
middlewares.py:
from scrapy.utils.project import get_project_settings
from scrapy.downloadermiddlewares.retry import RetryMiddleware
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.utils.response import response_status_message
import time
import random
import logging

logger = logging.getLogger(__name__)


class MyProxyMiddleware(HttpProxyMiddleware):

    def process_request(self, request, spider):
        # Always route the request through the first proxy currently in IPOOL.
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if proxies:
            logger.debug('now ip is ' + proxies[0])
            request.meta['proxy'] = proxies[0]


class MyRetryMiddleware(RetryMiddleware):

    def delete_proxy(self, spider):
        # Drop the proxy that was just used/failed and write the shrunken list
        # back; once the pool is empty, close the spider.
        settings = get_project_settings()
        proxies = settings.get('IPOOL')
        if proxies:
            proxies.pop(0)
            settings.set('IPOOL', proxies)
        else:
            spider.crawler.engine.close_spider(spider, 'response msg error , job done!')

    def process_exception(self, request, exception, spider):
        # Timeouts, connection errors etc.: discard the proxy and retry the request.
        if isinstance(exception, self.EXCEPTIONS_TO_RETRY) \
                and not request.meta.get('dont_retry', False):
            self.delete_proxy(spider)
            time.sleep(random.randint(3, 5))
            return self._retry(request, exception, spider)

    def process_response(self, request, response, spider):
        if request.meta.get('dont_retry', False):
            return response
        if response.status == 200:
            # Rotate: each proxy serves one successful request, then gets dropped.
            self.delete_proxy(spider)
            return response
        if response.status in self.retry_http_codes:
            # Banned or server error: discard the proxy and retry.
            reason = response_status_message(response.status)
            self.delete_proxy(spider)
            time.sleep(random.randint(3, 5))
            return self._retry(request, reason, spider) or response
        return response
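A side note on get_project_settings(): it builds a fresh Settings object on every call, and the pool only survives between calls because the IPOOL list is mutated in place, which is fragile. A more idiomatic variant, just a minimal sketch of my own (the class name, method names and the IgnoreRequest handling are mine, not part of the project code above), keeps the pool on the middleware instance and reads it once through from_crawler:

from scrapy.exceptions import IgnoreRequest


class PooledProxyMiddleware:
    # Sketch only: holds the proxy pool on the instance instead of in settings.

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # Read IPOOL once from the crawler's settings instead of get_project_settings().
        return cls(crawler.settings.getlist('IPOOL'))

    def process_request(self, request, spider):
        if not self.proxies:
            spider.crawler.engine.close_spider(spider, 'proxy pool exhausted')
            raise IgnoreRequest('no proxies left')
        request.meta['proxy'] = self.proxies[0]

    def drop_current(self):
        # To be called when the current proxy fails (e.g. from a retry middleware).
        if self.proxies:
            self.proxies.pop(0)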
settings.py:
import pandas as pd

# Load the proxy pool from a local CSV of checked proxies.
df = pd.read_csv('F:\\pycharm project\\pachong\\vpn.csv')
IPOOL = df['address'][df['status'] == 'yes'].tolist()

DOWNLOADER_MIDDLEWARES = {
    # 'mytset.middlewares.MytsetDownloaderMiddleware': 543,
    # disable the built-ins we are replacing, so they don't run alongside our subclasses
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'mytset.middlewares.MyRetryMiddleware': 550,
    'mytset.middlewares.MyProxyMiddleware': 750,
}

DOWNLOAD_TIMEOUT = 3  # default is 180 s, far too long for flaky proxies
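One thing to check: request.meta['proxy'] expects a full URL with a scheme, something like http://host:port, so the 'address' column of vpn.csv has to hold values in that shape. For illustration only (the addresses below are made up, this is just the expected result of the CSV load, not something to paste into settings.py):

# expected shape of the loaded pool
IPOOL = [
    'http://123.45.67.89:8080',  # hypothetical entry, just showing the URL format
    'http://98.76.54.32:3128',
]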
spider:
import scrapy
from pyquery import PyQuery as pq


class BaiduSpider(scrapy.Spider):
    name = 'baidu'
    allowed_domains = ['www.baidu.com']

    def start_requests(self):
        # Fire the same request 30 times; dont_filter bypasses the dupe filter.
        for _ in range(30):
            yield scrapy.Request(url='http://www.baidu.com/', callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Print which proxy served the response, plus the page title.
        res = pq(response.body)
        proxy = response.meta['proxy']
        print(proxy)
        print(res('title').text())
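That's it. Run scrapy crawl baidu from the project root: each of the 30 responses prints the proxy that served it followed by the Baidu page title, and the spider shuts itself down once the proxy pool runs dry.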