Anti-anti-crawler mechanisms
(Some websites use rules of varying complexity to block crawlers. Bypassing these rules can be difficult and complex, and sometimes requires special configuration.)
Common anti-crawling measures
1. Request-header based
   Set the User-Agent dynamically (switch the User-Agent at random to simulate different users' browsers)
2. Cookie-based
   Disable cookies (only if the target site does not require cookie parameters)
   (cookie pool; store cookies in files or in a database)
   (How to obtain cookies, how to validate them, how to simulate login: use requests and add them manually, or use selenium; a login sketch follows this list)
3. IP-based
   - Proxies: how does a proxy work? Paid proxies, free proxies, proxy pools
4. Dynamically loaded pages
   - ajax
   - js
   - jq
   - (use selenium)
   - Headless browsers vs. headed browsers, selenium's methods
5. Data encryption (usually implemented in the JS code)
   - app
   - web pages
   - use selenium
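As a concrete illustration of the cookie point above, here is a minimal sketch of logging in once with requests and handing the captured cookies to a Scrapy request. The login URL, form fields, and spider are hypothetical placeholders.
import requests
import scrapy

def login_and_get_cookies():
    # Log in once with requests and capture the session cookies.
    # URL and form fields are placeholders for illustration only.
    session = requests.Session()
    session.post('https://example.com/login',
                 data={'username': 'user', 'password': 'pass'})
    return session.cookies.get_dict()

class LoginDemoSpider(scrapy.Spider):
    name = 'login_demo'
    start_urls = ['https://example.com/profile']

    def start_requests(self):
        cookies = login_and_get_cookies()
        for url in self.start_urls:
            # Attach the captured cookies explicitly to each request
            yield scrapy.Request(url, cookies=cookies, callback=self.parse)

    def parse(self, response):
        self.logger.info('Logged-in page length: %d', len(response.text))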
Countermeasures against these anti-crawling techniques
Option 1
Use Crawlera (a proxy component built specifically for crawlers). With the downloader middleware correctly configured, every request in the project is sent out through Crawlera. Official site: https://scrapinghub.com/crawlera  Reference: https://www.aliyun.com/jiaocheng/481939.html
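A minimal configuration sketch, assuming the scrapy-crawlera plugin is installed (pip install scrapy-crawlera); the API key below is a placeholder:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
CRAWLERA_ENABLED = True
CRAWLERA_APIKEY = '<your API key>'  # placeholder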
Option 2
Custom downloader middleware (Downloader Middlewares)
Downloader middleware is a layer of components sitting between the engine (crawler.engine) and the downloader (crawler.engine.download()); it can be used to modify both Requests and Responses.
While the engine passes a Request to the downloader, downloader middleware can process the request (for example, add HTTP headers such as User-Agent, or attach a proxy);
while the downloader finishes the HTTP request and passes the Response back to the engine, downloader middleware can process the response.
To activate a downloader middleware component, add it to the DOWNLOADER_MIDDLEWARES setting. The setting is a dict whose keys are the middleware class paths and whose values are the middleware priorities; the lower the value, the higher the priority.
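For example (the project name and priority value here are illustrative; setting a value to None disables a built-in middleware):
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.MyDownloaderMiddleware': 543,
    # disable the built-in User-Agent middleware when rolling your own
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}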
To write a custom middleware you need to know the downloader-middleware methods in middlewares.py.
Each middleware component is a class that defines one or more of the following methods:
class DownloadwareDownloaderMiddleware(object):
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # Read values from settings here.
        # This method is used by Scrapy to create your middlewares.
        s = cls()
        # Connect to the spider_opened signal
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Every request passes through this method before it is handed to the downloader.
        # Called for each request that goes through the downloader middleware.
        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.
        # Every response passes through this method.
        # Must either:
        # - return a Response object (a response result)
        # - return a Request object (the request goes back to the scheduler to be rescheduled)
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.
        # Handle errors here.
        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
from_crawler(cls, crawler)
If this method exists, from_crawler is called with the crawler to create the middleware object, and it must return a middleware instance. Through the crawler you can reach all of its core components, such as settings and signals.
def process_request(self, request, spider)
Called for every request that passes through the downloader middleware.
process_request() must return one of the following:
1. None
   - If it returns None, Scrapy continues processing the request and runs the corresponding methods of the other middlewares.
2. A Response object
   - If it returns a Response object, Scrapy does not call any other middleware's process_request() or the download function; it returns that response directly. The process_response() methods of the enabled middlewares are still called for every response that is returned.
3. A Request object
   - If it returns a Request object, Scrapy stops calling the other middlewares' process_request methods and puts the returned request back into the scheduler to wait for download.
4. IgnoreRequest
   - If IgnoreRequest is raised, the process_exception() methods of the enabled downloader middlewares are called. If none of them handles the exception, the errback set on the request (Request.errback) is called. If no errback is set either, the exception is ignored and not logged.
process_request() takes two parameters:
- request (Request object): the request being processed
- spider (Spider object): the spider this request belongs to
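A short sketch of return paths 1 and 2, using a hypothetical blocklist; returning an HtmlResponse from process_request() skips the real download entirely:
from scrapy.http import HtmlResponse

class BlocklistMiddleware(object):
    # Hypothetical example: serve an empty page for blocked URLs instead of
    # downloading them, and let every other request pass through untouched.
    BLOCKED = ('example.com/ads',)  # placeholder URL patterns

    def process_request(self, request, spider):
        if any(pattern in request.url for pattern in self.BLOCKED):
            # Returning a Response short-circuits the download: no other
            # process_request() runs and no real HTTP request is made.
            return HtmlResponse(url=request.url, status=200, body=b'',
                                request=request)
        # Returning None lets Scrapy continue with the other middlewares.
        return None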
process_response(self, request, response, spider)
Called when the downloader has finished the HTTP request and passes the Response back to the engine.
Middlewares with a higher priority are called later, the opposite order of process_request().
process_response() must return one of the following:
1. A Response object
   If a Response object is returned, the process_response methods of the lower-priority downloader middlewares continue to be called to process that Response
   (Scrapy keeps calling the other middlewares' process_response methods).
2. A Request object
   If a Request object is returned, the process_response methods of the lower-priority downloader middlewares are not called; the Request is put back into the scheduler's task queue to wait for scheduling, effectively as a new Request
   (the middleware chain stops and the request is scheduled for download again).
3. IgnoreRequest
   If IgnoreRequest is raised, the errback set on the request (Request.errback) is called to handle it.
   If the exception is not handled there, it is ignored and not logged.
process_response() takes three parameters:
- request (Request object): the request that produced this response
- response (Response object): the response being processed
- spider (Spider object): the spider this response belongs to
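As a sketch of these return paths, a middleware might retry a request when the site answers with a block page. The status codes and the meta key used for counting retries are illustrative assumptions:
class RetryOnBlockMiddleware(object):
    # Hypothetical example: if the site answers 403/503 (often a block page),
    # send the request back to the scheduler a limited number of times.
    MAX_RETRIES = 2  # illustrative limit

    def process_response(self, request, response, spider):
        if response.status in (403, 503):
            retries = request.meta.get('block_retries', 0)
            if retries < self.MAX_RETRIES:
                # dont_filter=True bypasses the duplicate filter on the retry
                new_request = request.replace(dont_filter=True)
                new_request.meta['block_retries'] = retries + 1
                # Returning a Request stops the process_response chain and
                # puts the request back into the scheduler.
                return new_request
        # Returning the Response lets the remaining middlewares process it.
        return response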
process_exception(request, exception, spider)
Called when a download handler or process_request() (of another downloader middleware) raises an exception. It should return one of: None, a Response object, or a Request object.
If it returns None: Scrapy continues calling the other middlewares' process_exception();
if it returns a Response object: the process_response() chain of the enabled middlewares starts, and no other middleware's process_exception() is called;
if it returns a Request object: the process_exception() chain stops and the request is put back into the scheduler to be scheduled for download.
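For instance, a middleware could react to connection errors by retrying the request through a different proxy. The PROXIES setting mirrors the one used later in this note and is an assumption here:
import random
from twisted.internet.error import ConnectionRefusedError, TimeoutError

class ProxyRetryMiddleware(object):
    # Hypothetical example: on a connection error, pick another proxy
    # from the PROXIES setting and reschedule the request.
    def process_exception(self, request, exception, spider):
        if isinstance(exception, (ConnectionRefusedError, TimeoutError)):
            proxies = spider.settings.get('PROXIES', [])
            if proxies:
                request.meta['proxy'] = random.choice(proxies)['ip']
                # Returning a Request stops the process_exception() chain
                # and sends the request back to the scheduler.
                return request
        # Returning None lets the other middlewares handle the exception.
        return None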
User-Agent
class UserAgentDown(object):
    def process_request(self, request, spider):
        # Use the third-party fake_useragent library
        from fake_useragent import UserAgent
        userAgent = UserAgent()
        random_ua = userAgent.random
        if random_ua:
            print('Passed through the downloader middleware', random_ua)
            request.headers["User-Agent"] = random_ua
---------------------------------------------------------------
USER_AGENTS = [
"Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Win64; x64; Trident/5.0; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 2.0.50727; Media Center PC 6.0)",
"Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; .NET CLR 1.0.3705; .NET CLR 1.1.4322)",
"Mozilla/4.0 (compatible; MSIE 7.0b; Windows NT 5.2; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.2; .NET CLR 3.0.04506.30)",
"Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN) AppleWebKit/523.15 (KHTML, like Gecko, Safari/419.3) Arora/0.3 (Change: 287 c9dfb30)",
"Mozilla/5.0 (X11; U; Linux; en-US) AppleWebKit/527+ (KHTML, like Gecko, Safari/419.3) Arora/0.6",
]
class UserAgentDown(object):
    def __init__(self, User_Agents):
        self.User_agents = User_Agents

    @classmethod
    def from_crawler(cls, crawler):
        # USER_AGENTS must already be defined in settings.py
        User_Agent = crawler.settings['USER_AGENTS']
        return cls(User_Agent)

    def process_request(self, request, spider):
        import random
        # spider.settings and crawler.settings give the same result:
        # User_Agent = spider.settings['USER_AGENTS']
        random_ua = random.choice(self.User_agents)
        if random_ua:
            print('Passed through the downloader middleware', random_ua)
            request.headers["User-Agent"] = random_ua
Proxies
Why HTTP proxy authentication uses base64 encoding:
The principle of an HTTP proxy is simple: the client connects to the proxy server over HTTP, and the protocol message carries the IP and port of the remote host to connect to, plus authorization information if authentication is required. The proxy first verifies the credentials, then establishes the connection to the remote host, and returns 200 to the client once the connection succeeds, indicating that verification passed. A sketch of what this looks like follows.
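A minimal sketch of building the Proxy-Authorization header for an authenticating proxy; "user:password", the host, and the port are placeholders:
import base64

credentials = 'user:password'  # placeholder credentials
token = base64.b64encode(credentials.encode('utf-8')).decode('utf-8')
proxy_auth_header = 'Basic ' + token
print(proxy_auth_header)  # -> Basic dXNlcjpwYXNzd29yZA==
# The proxy then receives a message roughly like:
#   CONNECT www.example.com:443 HTTP/1.1
#   Proxy-Authorization: Basic dXNlcjpwYXNzd29yZA==
# and answers "HTTP/1.1 200 Connection established" once verification passes.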
In settings.py we define:
PROXIES = [
{"ip":"127.0.0.1:6379","pwd":"ljk123456"},
{"ip":"127.0.0.1:6379","pwd":None},
{"ip":"127.0.0.1:6379","pwd":None},
{"ip":"127.0.0.1:6379","pwd":None},
]
------------------------------------------------------
class ProxyDownloadMiddlerware(object):
    def process_request(self, request, spider):
        proxies = spider.settings['PROXIES']
        import random
        proxy_rm = random.choice(proxies)
        if proxy_rm['pwd']:
            # Proxy that requires credentials:
            # base64-encode them for the Proxy-Authorization header
            import base64
            base64_pwd = base64.b64encode(proxy_rm['pwd'].encode('utf-8')).decode('utf-8')
            # Matches the Proxy-Authorization field of the proxy message format
            request.headers['Proxy-Authorization'] = 'Basic ' + base64_pwd
            # Set the proxy address
            request.meta['proxy'] = proxy_rm['ip']
        else:
            request.meta['proxy'] = proxy_rm['ip']
Cookie
Likewise, set up the cookies in settings.py first.
class CookiesDownloadMiddlerware(object):
    def process_request(self, request, spider):
        Cookies = spider.settings['COOKIES']
        import random
        cookie_rm = random.choice(Cookies)
        if cookie_rm:
            request.cookies = cookie_rm
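One plausible shape for the COOKIES setting is a list of cookie dicts, one per account; the keys and values below are placeholders:
# settings.py (sketch)
COOKIES = [
    {'sessionid': 'abc123', 'csrftoken': 'token1'},
    {'sessionid': 'def456', 'csrftoken': 'token2'},
]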
Scrapy itself does not support dynamically loaded content (JavaScript rendering).
Therefore:
Set up a Selenium middleware.
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from scrapy.http import HtmlResponse

class SeleniumDownloadMiddlerware(object):
    def __init__(self):
        self.driver = webdriver.Firefox(
            executable_path='E://Firefox/geckodriver'
        )
        # Set the page-load timeout
        self.driver.set_page_load_timeout(10)

    def process_request(self, request, spider):
        if spider.name == 'test':
            # Render the URL in the real browser
            url = request.url
            if url:
                try:
                    self.driver.get(url)
                    page = self.driver.page_source
                    if page:
                        return HtmlResponse(url=url, status=200, body=page.encode('utf-8'), request=request)
                except TimeoutException as err:
                    print("Request timed out", url)
                    return HtmlResponse(url=url, status=408, body=b"", request=request)
You will notice that the Selenium browser stays open the whole time and is never closed when the crawl ends.
In scrapy.Spider there is:
def _set_crawler(self, crawler):
    self.crawler = crawler
    self.settings = crawler.settings
    # Listens for the end of the crawl via the spider_closed signal
    crawler.signals.connect(self.close, signals.spider_closed)
After the improvement:
The spider file
import scrapy
from selenium import webdriver

class TestDemoSpider(scrapy.Spider):
    name = 'test_demo'
    allowed_domains = ['baidu.com']
    start_urls = ['http://www.baidu.com/']
    # The driver lives on the spider so the middleware can reuse and close it
    driver = webdriver.Firefox(
        executable_path='E://Firefox/geckodriver'
    )
    # Set the page-load timeout
    driver.set_page_load_timeout(10)
------------------------------------------------------
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium.common.exceptions import TimeoutException

class SeleniumDownloadMiddlerware(object):
    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your middlewares.
        s = cls()
        # Connect to the spider_closed signal so the browser gets shut down
        crawler.signals.connect(s.close, signal=signals.spider_closed)
        return s

    def close(self, spider):
        # The driver is defined on the spider itself
        spider.driver.close()

    def process_request(self, request, spider):
        if spider.name == 'test_demo':
            # Render the URL in the spider's browser
            url = request.url
            if url:
                try:
                    spider.driver.get(url)
                    page = spider.driver.page_source
                    if page:
                        # HtmlResponse(url, status=200, headers=None, body=b'', flags=None, request=None)
                        # body carries the rendered page source
                        return HtmlResponse(url=url, status=200, body=page.encode('utf-8'), request=request)
                except TimeoutException as err:
                    print("Request timed out", url)
                    return HtmlResponse(url=url, status=408, body=b"", request=request)
Remember
Activate the middleware
DOWNLOADER_MIDDLEWARES = {
    'downloadware.middlewares.DownloadwareDownloaderMiddleware': 543,
}
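If the custom middlewares above all live in downloadware/middlewares.py, a combined activation might look like this; the class paths and priority numbers are illustrative choices:
# settings.py (sketch)
DOWNLOADER_MIDDLEWARES = {
    'downloadware.middlewares.UserAgentDown': 543,
    'downloadware.middlewares.ProxyDownloadMiddlerware': 544,
    'downloadware.middlewares.CookiesDownloadMiddlerware': 545,
    'downloadware.middlewares.SeleniumDownloadMiddlerware': 546,
}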