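- Why add random request headers: to imitate a real browser more convincingly and reduce the chance of being banned.
- How to switch to a different user_agent on every request: in Scrapy, a middleware is all it takes.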
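Scrapy has two kinds of middleware. Spider middleware is a framework of hooks into Scrapy's spider processing mechanism, where you can add code to process the responses sent to spiders and the items and requests that spiders produce. Downloader middleware hooks into the requests and responses passing between the engine and the downloader. User-Agent is a request header, so the middleware written here is a downloader middleware and is registered under DOWNLOADER_MIDDLEWARES.
Official documentation: Spider Middleware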
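To make the hook idea concrete, here is a minimal sketch of a downloader middleware. The class name `RequestRecorderMiddleware` and the stand-in request and response objects are assumptions made for illustration; only the method names `process_request` and `process_response` come from Scrapy's middleware interface.

```python
from types import SimpleNamespace


class RequestRecorderMiddleware:
    """Hypothetical downloader middleware that records what passes through."""

    def __init__(self):
        self.seen = []  # (method, url) pairs observed so far

    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        # Returning None lets Scrapy continue with the next middleware.
        self.seen.append((request.method, request.url))
        return None

    def process_response(self, request, response, spider):
        # Called for each response on its way back toward the spider.
        # Returning the response unchanged passes it along.
        return response


# Stand-in objects so the sketch runs without Scrapy installed
mw = RequestRecorderMiddleware()
request = SimpleNamespace(method="GET", url="http://example.com")
response = SimpleNamespace(status=200)
mw.process_request(request, spider=None)
returned = mw.process_response(request, response, spider=None)
print(mw.seen)
```

In a real project the class would receive Scrapy's own Request and Response objects instead of these stand-ins.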
- Steps to add a middleware:
1) Create a middleware (RandomAgentMiddleware) that sets a random user_agent on each request.
2) Configure settings.py to activate the middleware.
Most articles online simply repost the same piece of code for this.
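Questions that code leaves open:
1) Which directory should a self-written middleware go in?
2) How is the MIDDLEWARES path in settings.py determined?

Answers:
1) Put your own middleware in the same directory as items.py and settings.py.
2) The MIDDLEWARES path in settings.py should be:
yourproject.middlewares(file name).middleware class
If your middleware class and its file are both named RandomUserAgentMiddleware, the path should be written as:
xiaozhu.RandomUserAgentMiddleware.RandomUserAgentMiddleware
For comparison, look at how your own pipelines are referenced. The only difference is that the Scrapy framework already creates a pipelines.py for us.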
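The path rule is easier to see once you notice that the string is just a dotted Python import path: everything before the last dot is the module, and the last part is the class. The helper below is a simplified stand-in written for this post, not Scrapy's own loader, and it uses a standard-library class so it runs without a Scrapy project.

```python
from importlib import import_module


def load_dotted(path):
    """Resolve 'package.module.ClassName' to the object it names."""
    module_path, _, attr = path.rpartition(".")
    module = import_module(module_path)  # e.g. xiaozhu.RandomUserAgentMiddleware
    return getattr(module, attr)         # e.g. the RandomUserAgentMiddleware class


# Standard-library example in place of a project middleware
cls = load_dotted("collections.OrderedDict")
print(cls.__name__)  # OrderedDict
```

If the file name or the class name in the string is wrong, the import or the attribute lookup fails, which is why a misspelled middleware path stops the crawl at startup.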
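3) Import USER_AGENT_LIST from settings into the middleware
I am on a Mac. Because settings.py and the RandomUserAgentMiddleware file sit in the same directory, the import was written as:
from settings import USER_AGENT_LIST
This bare form relies on Python 2's implicit relative imports. On Python 3, use the package-qualified form instead: from xiaozhu.settings import USER_AGENT_LIST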
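An alternative that avoids the import path question altogether is to let Scrapy hand the settings to the middleware through the `from_crawler` hook. This is a sketch under assumptions: the class name `SettingsRandomUserAgentMiddleware` is made up for illustration, and it expects `USER_AGENT_LIST` to be defined in settings.py as in this post.

```python
import random


class SettingsRandomUserAgentMiddleware:
    """Hypothetical variant that reads the list from the crawler settings."""

    def __init__(self, user_agents):
        self.user_agents = list(user_agents)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook when it builds the middleware,
        # so no direct import of settings.py is needed.
        return cls(crawler.settings.getlist("USER_AGENT_LIST"))

    def process_request(self, request, spider):
        # Skip quietly when the setting is missing or empty
        if self.user_agents:
            request.headers.setdefault("User-Agent", random.choice(self.user_agents))
```

It is registered in DOWNLOADER_MIDDLEWARES the same way as the import-based version.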
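Complete code for adding a random user_agent in Scrapy:
import random

# Package-qualified import; the bare "from settings import ..." relies on Python 2.
from xiaozhu.settings import USER_AGENT_LIST


class RandomUserAgentMiddleware(object):
    """Downloader middleware that assigns a random User-Agent to each request."""

    def process_request(self, request, spider):
        ua = random.choice(USER_AGENT_LIST)
        if ua:
            # setdefault leaves a User-Agent already set on the request untouched
            request.headers.setdefault('User-Agent', ua)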
In settings.py:
USER_AGENT_LIST=[
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
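DOWNLOADER_MIDDLEWARES = {
    # project.module.ClassName: here the class lives in xiaozhu/user_agent_middleware.py
    'xiaozhu.user_agent_middleware.RandomUserAgentMiddleware': 400,
    # Disable the built-in User-Agent middleware. Scrapy 1.0+ uses this path;
    # older releases used 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware'.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
}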
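One detail worth checking is what `setdefault` does in the middleware: it fills in User-Agent only when the request does not carry one yet. The sketch below shows the same semantics on a plain dict. Scrapy's headers object is dict-like, so this is an analogy rather than Scrapy code, and the `UA-A`/`UA-B` values and the function name are placeholders.

```python
import random

USER_AGENTS = ["UA-A", "UA-B"]  # placeholder values, not real browser strings


def apply_random_user_agent(headers, user_agents):
    """Mimic the middleware: fill User-Agent only when it is missing."""
    headers.setdefault("User-Agent", random.choice(user_agents))
    return headers


fresh = apply_random_user_agent({}, USER_AGENTS)
preset = apply_random_user_agent({"User-Agent": "custom"}, USER_AGENTS)
print(fresh["User-Agent"] in USER_AGENTS)  # True
print(preset["User-Agent"])                # custom
```

So a spider that sets its own User-Agent on a particular request keeps it, while every other request gets a random entry from the list.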
Code on GitHub: https://github.com/ppy2790/xiaozhu