Scrapy provides two kinds of middleware: downloader middleware (Downloader Middleware) and spider middleware (Spider Middleware).
Downloader Middleware
Downloader middleware is a hook that Scrapy provides for modifying Requests and Responses during a crawl and for extending Scrapy's functionality. For example:
- adding certain headers to a request before it is downloaded;
- post-processing a response after the request completes, such as decompressing it.
How to enable a downloader middleware
Configure key-value pairs in the DOWNLOADER_MIDDLEWARES setting in settings.py: the key is the middleware to enable and the value is a number giving its priority; the lower the value, the higher the priority. Scrapy also has a built-in downloader middleware setting, DOWNLOADER_MIDDLEWARES_BASE (which should not be overridden). On startup Scrapy merges DOWNLOADER_MIDDLEWARES_BASE with DOWNLOADER_MIDDLEWARES; to disable a middleware that Scrapy enables by default, set its value to None in DOWNLOADER_MIDDLEWARES.
DOWNLOADER_MIDDLEWARES_BASE = {
    'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100,
    'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware': 300,
    'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware': 350,
    'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware': 400,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': 500,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 550,
    'scrapy.downloadermiddlewares.ajaxcrawl.AjaxCrawlMiddleware': 560,
    'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware': 580,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 590,
    'scrapy.downloadermiddlewares.redirect.RedirectMiddleware': 600,
    'scrapy.downloadermiddlewares.cookies.CookiesMiddleware': 700,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
    'scrapy.downloadermiddlewares.stats.DownloaderStats': 850,
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 900,
}
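For example (a minimal sketch; the project name myproject and the middleware MyCustomMiddleware are made up, not from this article), enabling a custom middleware and disabling one of the defaults in settings.py could look like this:
#settings.py
DOWNLOADER_MIDDLEWARES = {
    # enable a hypothetical custom middleware with priority 543
    'myproject.middlewares.MyCustomMiddleware': 543,
    # disable a middleware that DOWNLOADER_MIDDLEWARES_BASE enables by default
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}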
How to write a downloader middleware
class scrapy.downloadermiddlewares.DownloaderMiddleware
process_request(request, spider)
Called for each Request that passes through the downloader middleware; the higher a middleware's priority, the earlier it is called. This method should return one of the following: None, a Response object, a Request object, or raise an IgnoreRequest exception.
- Return None: Scrapy continues to run the corresponding methods of the other middleware;
- Return a Response object: Scrapy will not call the process_request methods of the other middleware and will not perform the download; the Response object is returned directly;
- Return a Request object: Scrapy will not call the process_request() methods of the other middleware; the Request is placed in the scheduler to be downloaded later;
- Raise IgnoreRequest: the process_exception() methods of the installed middleware are called; if none of them handles the exception, Request.errback is called; if that does not handle it either, the request is ignored and not logged.
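As a minimal sketch (not from the article; the class and header names are made up), a process_request that only adds a request header and then lets the request keep flowing through the remaining middleware:
class CustomHeaderMiddleware:
    def process_request(self, request, spider):
        # add a header before the request reaches the downloader
        request.headers.setdefault(b'X-Custom-Header', b'some-value')
        # returning None lets Scrapy continue with the other middleware and the download
        return None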
process_response(request, response, spider)
Called for each Response that passes through the downloader middleware; the higher a middleware's priority, the later it is called, the opposite order of process_request(). This method should return one of the following: a Response object, a Request object, or raise an IgnoreRequest exception.
- Return a Response object: Scrapy continues to call the process_response methods of the other middleware;
- Return a Request object: the middleware chain stops, and the Request is placed in the scheduler to be downloaded later;
- Raise IgnoreRequest: Request.errback is called to handle it; if it is not handled, the response is ignored and not logged.
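A minimal sketch (not part of the article; the class name is made up) of a process_response that reschedules the request when the server answers with a 503 and otherwise passes the response on:
class RetryOn503Middleware:
    def process_response(self, request, response, spider):
        if response.status == 503:
            # returning a Request stops the middleware chain and sends it back to the scheduler
            return request.replace(dont_filter=True)
        # returning the Response lets the remaining middleware process it
        return response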
process_exception(request, exception, spider)
Called when a download handler or a process_request() raises an exception (including an IgnoreRequest exception); it should return one of the following: None, a Response object, or a Request object.
- Return None: Scrapy continues to call the process_exception() methods of the other middleware;
- Return a Response object: the process_response() chain of the installed middleware starts, and Scrapy will not call any other middleware's process_exception();
- Return a Request object: the process_exception() calls stop, and the returned Request is placed in the scheduler to be downloaded later.
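A minimal sketch (an assumption of mine, not from the article; the class name is made up, and Scrapy's built-in RetryMiddleware already covers this case) of a process_exception that retries on a download timeout and leaves every other exception to the remaining middleware:
from twisted.internet.error import TimeoutError

class RetryOnTimeoutMiddleware:
    def process_exception(self, request, exception, spider):
        if isinstance(exception, TimeoutError):
            # returning a Request stops further process_exception calls and reschedules it
            return request.replace(dont_filter=True)
        # returning None lets the other middleware's process_exception run
        return None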
from_crawler(cls, crawler)
If this method is present, from_crawler is called with the crawler to create the middleware instance, and it must return a middleware instance. Through the crawler you can access all of Scrapy's core components, such as settings and signals.
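A minimal sketch of from_crawler reading settings and connecting signals through the crawler (the class name and the ANNOUNCE_ENABLED setting are made up for illustration; BOT_NAME is a standard Scrapy setting):
from scrapy import signals
from scrapy.exceptions import NotConfigured

class AnnounceMiddleware:
    def __init__(self, bot_name):
        self.bot_name = bot_name

    @classmethod
    def from_crawler(cls, crawler):
        # read settings through the crawler; raising NotConfigured disables the middleware
        if not crawler.settings.getbool('ANNOUNCE_ENABLED', True):
            raise NotConfigured
        o = cls(crawler.settings.get('BOT_NAME'))
        # signals are also reachable through the crawler
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        spider.logger.info('spider %s opened, bot name %s', spider.name, self.bot_name)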
Some downloader middlewares provided by Scrapy
The following covers a few commonly used downloader middlewares; for more, see the documentation and the source code.
HttpProxyMiddleware
scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
Used to set a proxy server. The proxy is set through Request.meta['proxy'], which the middleware populates from the environment variables http_proxy, https_proxy, and no_proxy in turn. Let's test it with the response from http://httpbin.org/ip:
# shell command
export http_proxy='http://193.112.216.55:1234'
# -*- coding: utf-8 -*-
import scrapy
class ProxySpider(scrapy.Spider):
    name = 'proxy'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/ip']

    def parse(self, response):
        print(response.text)
Run scrapy crawl proxy --nolog and you get the following result:
{"origin":"111.231.115.150, 193.112.216.55"}
It returns the proxy IP address we set.
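Besides the environment variables, the proxy can also be set per request through Request.meta['proxy'], which HttpProxyMiddleware leaves untouched. A minimal sketch (the spider name is made up; the proxy address is reused from the example above):
# -*- coding: utf-8 -*-
import scrapy

class MetaProxySpider(scrapy.Spider):
    name = 'meta_proxy'
    allowed_domains = ['httpbin.org']

    def start_requests(self):
        # a proxy already present in meta takes precedence over the environment variables
        yield scrapy.Request('http://httpbin.org/ip',
                             meta={'proxy': 'http://193.112.216.55:1234'})

    def parse(self, response):
        print(response.text)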
UserAgentMiddleware
scrapy.downloadermiddlewares.useragent.UserAgentMiddleware
Sets the user agent through the USER_AGENT setting. Let's test it with the response from http://httpbin.org/headers:
settings.py
#...
#UserAgentMiddleware is enabled by default
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'
#...
# -*- coding: utf-8 -*-
import scrapy
class UserAgentSpider(scrapy.Spider):
    name = 'user_agent'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']

    def parse(self, response):
        print(response.text)
Run scrapy crawl user_agent --nolog and you get the following result:
{"headers":{"Accept":"text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8","Accept-Encoding":"gzip,deflate","Accept-Language":"en","Connection":"close","Host":"httpbin.org","User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36"}}
It returns the user agent we set.
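UserAgentMiddleware also reads a user_agent attribute from the spider when it opens, so a single spider can override the global USER_AGENT setting. A minimal sketch (the spider name and agent string are made up):
# -*- coding: utf-8 -*-
import scrapy

class CustomUASpider(scrapy.Spider):
    name = 'custom_ua'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/headers']
    user_agent = 'my-crawler/1.0'  # per-spider user agent, overrides USER_AGENT

    def parse(self, response):
        print(response.text)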
Using random user agents and proxy IPs
Some websites detect the visiting IP and User-Agent as an anti-crawling measure. If a large number of requests from the same IP is detected, the site decides that the IP is crawling and rejects its requests; some sites also check the User-Agent. We can crawl with multiple proxy IPs and different User-Agents to avoid getting our IP banned.
We can subclass HttpProxyMiddleware and UserAgentMiddleware and modify them so that Scrapy handles the proxy and user agent the way we want. For HttpProxyMiddleware and UserAgentMiddleware, see httpproxy.py and useragent.py. The code is as follows:
#middlewares.py
# -*- coding: utf-8 -*-
# Define here the models for your spider middleware
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
from scrapy import signals
from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware
from scrapy.exceptions import NotConfigured
from collections import defaultdict
from urllib.parse import urlparse
from faker import Faker  # Faker library; install with pip install faker
import random
class RandomHttpProxyMiddleware(HttpProxyMiddleware):

    def __init__(self, auth_encoding='latin-1', proxy_list=None):
        if not proxy_list:
            raise NotConfigured
        self.auth_encoding = auth_encoding  # keep the attribute the parent class expects
        self.proxies = defaultdict(list)
        for proxy in proxy_list:
            parse = urlparse(proxy)
            self.proxies[parse.scheme].append(proxy)  # dict keyed by scheme, value is a list of proxies

    @classmethod
    def from_crawler(cls, crawler):
        if not crawler.settings.get('HTTP_PROXY_LIST'):
            raise NotConfigured
        http_proxy_list = crawler.settings.get('HTTP_PROXY_LIST')  # read from settings
        auth_encoding = crawler.settings.get('HTTPPROXY_AUTH_ENCODING', 'latin-1')
        return cls(auth_encoding, http_proxy_list)

    def _set_proxy(self, request, scheme):
        proxy = random.choice(self.proxies[scheme])  # randomly pick a proxy for the request's scheme
        request.meta['proxy'] = proxy


class RandomUserAgentMiddleware(object):

    def __init__(self):
        self.faker = Faker(locale='zh_CN')
        self.user_agent = ''

    @classmethod
    def from_crawler(cls, crawler):
        o = cls()
        crawler.signals.connect(o.spider_opened, signal=signals.spider_opened)
        return o

    def spider_opened(self, spider):
        self.user_agent = getattr(spider, 'user_agent', self.user_agent)

    def process_request(self, request, spider):
        self.user_agent = self.faker.user_agent()  # get a random user agent
        request.headers.setdefault(b'User-Agent', self.user_agent)
#settings.py
#...
DOWNLOADER_MIDDLEWARES = {
    'newproject.middlewares.RandomHttpProxyMiddleware': 543,
    'newproject.middlewares.RandomUserAgentMiddleware': 550,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': None,
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}
HTTP_PROXY_LIST = [
    'http://193.112.216.55:1234',
    'http://118.24.172.34:1234',
]
#...
#anything.py
# -*- coding: utf-8 -*-
import scrapy
import json
import pprint
class AnythingSpider(scrapy.Spider):
    name = 'anything'
    allowed_domains = ['httpbin.org']
    start_urls = ['http://httpbin.org/anything']

    def parse(self, response):
        ret = json.loads(response.text)
        pprint.pprint(ret)
The code above uses the faker library, a very handy library for generating fake data. We request http://httpbin.org/anything to see the content of our own request; the results are as follows:
#scrapy crawl anything --nolog
{'args': {},
'data': '',
'files': {},
'form': {},
'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip,deflate',
'Accept-Language': 'en',
'Cache-Control': 'max-age=259200',
'Connection': 'close',
'Host': 'httpbin.org',
'User-Agent': 'Opera/8.85.(Windows NT 5.2; sc-IT) Presto/2.9.177 '
'Version/10.00'},
'json': None,
'method': 'GET',
'origin': '193.112.216.55',
'url': 'http://httpbin.org/anything'}
#scrapy crawl anything --nolog
{'args': {},
'data': '',
'files': {},
'form': {},
'headers': {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip,deflate',
'Accept-Language': 'en',
'Cache-Control': 'max-age=259200',
'Connection': 'close',
'Host': 'httpbin.org',
'User-Agent': 'Mozilla/5.0 (Macintosh; PPC Mac OS X 10_12_3) '
'AppleWebKit/5342 (KHTML, like Gecko) '
'Chrome/40.0.810.0 Safari/5342'},
'json': None,
'method': 'GET',
'origin': '118.24.172.34',
'url': 'http://httpbin.org/anything'}
As you can see, through the downloader middleware our spider keeps changing its IP and User-Agent.
Summary
This article explained what downloader middleware is and how to write and enable custom downloader middleware, and finished by putting a custom downloader middleware into practice. Next we will study the other kind of middleware: spider middleware.