Analysis:
1. A Scrapy distributed crawler means several machines jointly executing one set of crawl tasks. That first requires every machine to run the Scrapy framework, and each Scrapy installation carries its own set of the five core components: engine, scheduler, downloader, spider, and item pipeline. Because each machine's scheduler is independent, there is no way to share the task queue, so Scrapy alone cannot crawl in a distributed fashion.
2. Suppose the Scrapy scheduler could be shared: would that make distributed crawling possible? No. Sharing the scheduler shares the tasks, but the item pipelines of the separate installations remain independent, so once the tasks are downloaded, the scraped data still cannot all be stored in one designated place. A distributed crawler therefore requires sharing both the scheduler and the item pipeline.
Implementation: a distributed crawler built on scrapy-redis.
scrapy-redis implements a shared scheduler and a shared item pipeline internally, which is exactly what a distributed crawler needs.
I. Distributed RedisCrawlSpider operation on a Redis database
Case study: a distributed crawler scraping the topic text of the entire chouti.com site.
Preparing Redis:
1. Edit the Redis configuration file:
- Comment out the line bind 127.0.0.1, so that hosts at other IPs can reach Redis.
- Change protected-mode from yes to no, so that clients at other IPs can issue commands to Redis.
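After the edit, the two relevant lines of redis.conf look like this:

# bind 127.0.0.1        <- commented out: accept connections from other hosts
protected-mode no       # other hosts may now issue commands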
2. Start Redis:
mac/linux: redis-server redis.conf
windows: redis-server.exe redis.windows.conf

Steps to set up the distributed crawler:
1. Edit the Redis configuration file: set protected-mode no and comment out bind 127.0.0.1 (as above).
2. Install scrapy-redis:
pip3 install scrapy-redis
3. Create a project:
scrapy startproject project_name
4. Create a CrawlSpider-based spider file:
cd project_name
scrapy genspider -t crawl spider_name example.com
5. Import the RedisCrawlSpider class:
from scrapy_redis.spiders import RedisCrawlSpider
6. Build the link extraction and parsing logic on top of the generated code:
class RidesdemoSpider(RedisCrawlSpider):
    redis_key = "redisQueue"
7. Pack the parsed values into an item, then submit the item object to the pipeline that ships with the scrapy-redis component (the project's own pipeline is no longer useful and can simply be deleted; the shared pipeline from scrapy_redis.pipelines is used instead):
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
8. The pipeline writes the values into the designated Redis database (specify the Redis host IP in the settings file):
REDIS_HOST = '192.168.137.76'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password': '123456'}
9. Use the shared scheduler packaged by scrapy-redis in the current project (configured in the settings file):
# Use the scrapy-redis dedupe queue (request filtering)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler (the core piece: the shared scheduler)
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist the Redis queues on shutdown so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
10. Start the Redis server:
windows: redis-server redis.windows.conf
mac/linux: redis-server redis.conf
11. Start redis-cli:
redis-cli
12. Run the spider file:
scrapy runspider spider_file.py
13. Push a start URL onto the queue, executed inside redis-cli:
lpush <value of redis_key> <start URL>
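For this project the queue name is the redis_key defined in the spider ("redisQueue"), and a natural start URL is the hot-list page matched by link1, so the concrete command is:

lpush redisQueue https://dig.chouti.com/all/hot/recent/1

By default the RedisPipeline then serializes every scraped item into the Redis list named <spider name>:items, here redisDemo:items, which can be inspected with lrange redisDemo:items 0 -1.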
spider.py:
# -*- coding: utf-8 -*-
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy_redis.spiders import RedisCrawlSpider
from redisScrapyPro.items import RedisscrapyproItem


class RidesdemoSpider(RedisCrawlSpider):
    name = 'redisDemo'
    # Name of the scrapy_redis scheduler queue; this is the key we lpush the start URL to
    redis_key = "redisQueue"
    link = LinkExtractor(allow=r'https://dig.chouti.com/.*?/.*?/.*?/\d+')
    link1 = LinkExtractor(allow=r'https://dig.chouti.com/all/hot/recent/1')
    rules = (
        Rule(link, callback='parse_item', follow=True),
        Rule(link1, callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        div_list = response.xpath('//*[@id="content-list"]/div')
        for div in div_list:
            content = div.xpath('string(./div[@class="news-content"]/div[1]/a[1])').extract_first().strip().replace("\t", "")
            print(content)
            item = RedisscrapyproItem()
            item['content'] = content
            yield item
settings.py
BOT_NAME = 'redisScrapyPro'
SPIDER_MODULES = ['redisScrapyPro.spiders']
NEWSPIDER_MODULE = 'redisScrapyPro.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 400,
}
REDIS_HOST = '192.168.137.76'
REDIS_PORT = 6379
REDIS_ENCODING = 'utf-8'
# REDIS_PARAMS = {'password': '123456'}
# Use the scrapy-redis dedupe queue
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
# Use scrapy-redis's own scheduler
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
# Persist the Redis queues on shutdown so the crawl can be paused and resumed
SCHEDULER_PERSIST = True
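One piece the post leaves out is items.py; a minimal sketch matching the single field the spider fills in (the class name follows the import in spider.py):

import scrapy

class RedisscrapyproItem(scrapy.Item):
    # the only field the spider populates
    content = scrapy.Field()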
More project code:
https://github.com/wangjifei121/FB-RedisCrawlSpider
II. Distributed RedisSpider operation on a Redis database
Case study: a distributed crawler for NetEase News (four sections: domestic, international, military, and aviation).
Additional techniques covered:
- Plugging selenium into the Scrapy framework
- Using a UA (user-agent) pool
- Using a proxy IP pool
Setting up a distributed RedisSpider follows the same steps as the RedisCrawlSpider setup above, so refer to those steps to build the project.
Now for the additional techniques:
1. Plugging selenium into the Scrapy framework
First, look at the two methods added to the spider class:
def __init__(self):
    pass

def closed(self, spider):
    pass
Analyzing the NetEase News pages shows that the site loads its data dynamically as an anti-scraping measure, so getting the full data requires selenium's webdriver class.
The first step with webdriver is to instantiate a webdriver object, and it only needs to be instantiated once for the entire crawl, which naturally points to the class's __init__ method.
The webdriver object must be closed once we are done with it, and the spider class conveniently provides the closed method for exactly that. These two methods together realize the idea.
So where and how do we use the instantiated webdriver object?
Without selenium, the page data we want is simply not in the response; or, put another way, we do get data, just not the data we want. We therefore need to rebuild the response into the one we want, and since every request needs the same treatment, experience points to one word: middleware. But which middleware hook should do the work?
Clearly the selenium work belongs in process_response. That hook intercepts the response object (the one the downloader hands to the Spider) and, by processing and forging the response, yields the data we actually want.
Parameters of process_response:
request: the request object that the response corresponds to
response: the intercepted response object
spider: the instance of the spider class from the spider file
What exactly does the middleware do?
Send the request through the instantiated browser object --> execute the relevant js --> grab the resulting page data --> forge the response object --> return the response.
2. Writing the UA pool
As above, the UA pool code also lives in the middleware file: import a class to inherit from, then build the UA class on top of it:
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
3. Writing the proxy IP pool
The thinking is exactly the same, so it is skipped here. One point deserves special attention: requests carry either the http or the https scheme, and the code must branch on which one it is.
Finally, the middlewares absolutely must be registered in settings!!! Register them!!! Register them!!! A sketch of that registration follows.
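The registration itself is not shown in the post's settings.py; a minimal sketch, assuming the project is named wangyiPro (matching the imports in the code below) — the priority numbers are illustrative, not prescribed:

DOWNLOADER_MIDDLEWARES = {
    'wangyiPro.middlewares.RandomUserAgent': 542,
    'wangyiPro.middlewares.Proxy': 543,
    'wangyiPro.middlewares.WangyiproDownloaderMiddleware': 544,
}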
To finish, here are spider.py and middlewares.py:
spider.py
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from wangyiPro.items import WangyiproItem
from scrapy_redis.spiders import RedisSpider


class WangyiSpider(RedisSpider):
    name = 'wangyi'
    # allowed_domains = ['www.xxxx.com']
    # start_urls = ['https://news.163.com']
    redis_key = 'wangyi'

    def __init__(self):
        # Instantiate one browser object for the whole crawl (instantiated once)
        self.bro = webdriver.Chrome(executable_path='/Users/wangjifei/my_useful_file/chromedriver')

    # The browser must be closed when the whole crawl ends
    def closed(self, spider):
        print('crawl finished')
        self.bro.quit()

    def parse(self, response):
        lis = response.xpath('//div[@class="ns_area list"]/ul/li')
        indexs = [3, 4, 6, 7]
        li_list = []  # the li tags for the four sections: domestic, international, military, aviation
        for index in indexs:
            li_list.append(lis[index])
        # Grab the link and headline of each of the four sections
        for li in li_list:
            url = li.xpath('./a/@href').extract_first()
            title = li.xpath('./a/text()').extract_first()
            # Request each section's url to get its page data (headline, thumbnail, keywords, publish time, url)
            yield scrapy.Request(url=url, callback=self.parseSecond, meta={'title': title})

    def parseSecond(self, response):
        div_list = response.xpath('//div[@class="data_row news_article clearfix "]')
        # print(len(div_list))
        for div in div_list:
            head = div.xpath('.//div[@class="news_title"]/h3/a/text()').extract_first()
            url = div.xpath('.//div[@class="news_title"]/h3/a/@href').extract_first()
            imgUrl = div.xpath('./a/img/@src').extract_first()
            tag = div.xpath('.//div[@class="news_tag"]//text()').extract()
            tags = []
            for t in tag:
                t = t.strip(' \n \t')
                tags.append(t)
            tag = "".join(tags)
            # Retrieve the title passed in via meta
            title = response.meta['title']
            # Instantiate an item object and store the parsed values in it
            item = WangyiproItem()
            item['head'] = head
            item['url'] = url
            item['imgUrl'] = imgUrl
            item['tag'] = tag
            item['title'] = title
            # Request the article url to get the news body stored on that page
            yield scrapy.Request(url=url, callback=self.getContent, meta={'item': item})
            # print(head + ":" + url + ":" + imgUrl + ":" + tag)

    def getContent(self, response):
        # Retrieve the item passed in via meta
        item = response.meta['item']
        # Parse the news body stored on the current page
        content_list = response.xpath('//div[@class="post_text"]/p/text()').extract()
        content = "".join(content_list)
        item['content'] = content
        yield item
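(Aside: wangyiPro's items.py is not included in the post either; a minimal sketch with exactly the fields the spider fills in:)

import scrapy

class WangyiproItem(scrapy.Item):
    head = scrapy.Field()     # article headline
    url = scrapy.Field()      # article url
    imgUrl = scrapy.Field()   # thumbnail url
    tag = scrapy.Field()      # joined keyword tags
    title = scrapy.Field()    # section title passed via meta
    content = scrapy.Field()  # article body text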
middlewares.py
from scrapy.http import HtmlResponse
import time
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random


# UA pool (a dedicated downloader-middleware class just for the UA pool)
# 1. import the UserAgentMiddleware class to inherit from
class RandomUserAgent(UserAgentMiddleware):
    def process_request(self, request, spider):
        # Pick a random ua value from the list
        ua = random.choice(user_agent_list)
        # Write the chosen ua into the intercepted request's headers
        request.headers.setdefault('User-Agent', ua)


# Swap the ip of intercepted requests in bulk
class Proxy(object):
    def process_request(self, request, spider):
        # Check the scheme of the intercepted request's url (http or https?)
        # request.url looks like: http://www.xxx.com
        h = request.url.split(':')[0]  # the request's scheme
        if h == 'https':
            ip = random.choice(PROXY_https)
            request.meta['proxy'] = 'https://' + ip
        else:
            ip = random.choice(PROXY_http)
            request.meta['proxy'] = 'http://' + ip


class WangyiproDownloaderMiddleware(object):
    # Intercepts the response object (the one the downloader hands to the Spider)
    # request: the request object the response corresponds to
    # response: the intercepted response object
    # spider: the instance of the spider class from the spider file
    def process_response(self, request, response, spider):
        # Forge the page data stored in the response object
        if request.url in ['http://news.163.com/domestic/', 'http://news.163.com/world/', 'http://news.163.com/air/', 'http://war.163.com/']:
            spider.bro.get(url=request.url)
            js = 'window.scrollTo(0,document.body.scrollHeight)'
            spider.bro.execute_script(js)
            time.sleep(2)  # give the browser some time to load the dynamic data
            # The page source now contains the dynamically loaded news data
            page_text = spider.bro.page_source
            # Forge the response object
            return HtmlResponse(url=spider.bro.current_url, body=page_text, encoding='utf-8', request=request)
        else:
            return response
PROXY_http = [
'153.180.102.104:80',
'195.208.131.189:56055',
]
PROXY_https = [
'120.83.49.90:9000',
'95.189.112.214:35508',
]
user_agent_list = [
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
"Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 "
"(KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 "
"(KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 "
"(KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 "
"(KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
"Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 "
"(KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
"Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 "
"(KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
]
Author: SlashBoyMr_wang
Link: http://www.reibang.com/p/5baa1d5eb6d9
Source: Jianshu (简书)