66.3 - Douban Books Crawler with a Proxy

When what you have is already lost, be brave enough to let it go!

Summary:

  1. Link extractor: a Rule binds a LinkExtractor to a callback. Extracted links are wrapped into Requests; after download, the callback processes the returned response, and whether links inside that response get extracted again is controlled by follow (see the sketch below):
    Rule(LinkExtractor(allow=r'Items/'),   # what to extract
         callback='parse_item',            # callback that parses the response
         follow=True)                      # whether to keep following links
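
To make the three parts concrete, here is a minimal, self-contained sketch of a CrawlSpider using such a rule; the spider name, domain and the r'Items/' pattern are placeholders for illustration, not part of the actual project below:

# minimal illustrative sketch -- placeholder names, not the real project
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class ExampleSpider(CrawlSpider):
    name = 'example'                      # hypothetical spider name
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/']

    rules = (
        # extract links matching r'Items/', parse each with parse_item,
        # and keep extracting links from the pages those links lead to
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        # response is the downloaded page of one extracted link
        yield {'url': response.url}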

Scrapy hands-on case: crawling Douban Books

Scrapy (specifically the CrawlSpider) solves the problem of crawling paginated URLs.

Requirement: crawl Douban Books and extract the follow-up links, adding them to the queue of URLs to crawl.

1. Link analysis

https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T

Only the value of the start parameter changes (increasing by 20 per page).
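
Since only start changes, a single regular expression on that parameter is enough to recognise every paging link; a quick standalone check (plain Python, independent of Scrapy):

# standalone check: the paging URLs all share the start=<number> pattern
import re

urls = [
    'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T',
    'https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T',
]
pattern = re.compile(r'start=\d+')      # same pattern used later in the Rule
for url in urls:
    print(pattern.search(url).group())  # -> start=0, start=20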

2. Project development

2.1 Create the project and configure settings.py
# create multiple Scrapy projects under the same directory (note the trailing dot)
(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy startproject firstpro .

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy startproject firstpro
New Scrapy project 'firstpro', using template directory 'F:\Pyenv\conda3.8\lib\site-packages\scrapy\templates\project', created in:
    F:\Projects\spider\firstpro

You can start your first spider with:
    cd firstpro
    scrapy genspider example example.com
# settings.py
from fake_useragent import UserAgent

BOT_NAME = 'mspider'

SPIDER_MODULES = ['mspider.spiders']
NEWSPIDER_MODULE = 'mspider.spiders'

USER_AGENT = UserAgent().random
ROBOTSTXT_OBEY = False
COOKIES_ENABLED = False   
(The book spider created in section 2.3 will appear under mspider/spiders.)
2.2 Write the item
import scrapy

class BookItem(scrapy.Item):   # named BookItem so the spider can import it via ..items
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    rate = scrapy.Field()

2.3 Create the spider

scrapy genspider -t template <name> <domain>

Templates
-t template: this option creates the spider class from a template; the commonly used templates are basic and crawl.

The Rule class used by the link extractor
class scrapy.spiders.Rule(link_extractor, callback = None, cb_kwargs = None, follow = None, process_links = None, process_request = None)


link_extractor: a LinkExtractor object that defines which links to extract.
callback: the callback that parses the downloaded content, i.e. which callback should run for URLs matching this rule. Because CrawlSpider uses parse internally, do not name your own callback parse.
follow: whether links extracted from the response under this rule should themselves be followed (extracted again).
process_links: a function that receives the links extracted by link_extractor and filters out links that should not be crawled (see the sketch below).
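
process_links is described above but not used in this chapter's spider; a small hedged sketch of how such a filter could be wired up (the spider and the filtering condition are made up for illustration):

# sketch: filtering extracted links with process_links (hypothetical example)
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class FilteredBookSpider(CrawlSpider):
    name = 'filtered_example'             # hypothetical, not part of mspider
    allowed_domains = ['example.com']
    start_urls = ['https://example.com/list']

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'),
             callback='parse_item',
             follow=True,
             process_links='drop_unwanted'),   # string name of a spider method
    )

    def drop_unwanted(self, links):
        # links is a list of scrapy.link.Link objects; return only the ones to keep
        return [link for link in links if 'type=T' in link.url]

    def parse_item(self, response):
        yield {'url': response.url}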

# create the spider from the crawl template and specify the domain
(blog) F:\Projects\scrapy\mspider>scrapy genspider -t crawl book douban.com
Created spider 'book' using template 'crawl' in module:
  mspider.spiders.book

# book.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class BookSpider(CrawlSpider):
    name = 'book'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i

scrapy.spiders.crawl.CrawlSpider is a subclass of scrapy.spiders.Spider with extra functionality; inside it you can use LinkExtractor and Rule.

Defining rules

  1. The rules tuple holds one or more Rule objects, which make it easy to follow links.
  2. LinkExtractor extracts links from the response.
    allow takes a regular expression (or an iterable of them) describing which links to match; only <a> tags are considered.
  3. callback names the callback executed for matched links; take care not to use the name parse. It returns a list of Item or Request objects.
    See scrapy.spiders.crawl.CrawlSpider#_parse_response (a simplified sketch follows this list).
  4. follow: whether to keep following links.
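
For intuition only, here is a heavily simplified sketch of what CrawlSpider does with each downloaded response; the real logic lives in scrapy.spiders.crawl.CrawlSpider#_parse_response and also de-duplicates requests and dispatches per rule:

# conceptual sketch, not the actual Scrapy source
import scrapy

def handle_response(response, rule):
    if rule.callback:
        # run the user's callback (e.g. parse_item) and pass its items/requests on
        yield from rule.callback(response)
    if rule.follow:
        # extract the links matching this rule from the response
        # and schedule them as new requests, which repeat this cycle
        for link in rule.link_extractor.extract_links(response):
            yield scrapy.Request(link.url)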

This yields the rule for this example, as follows:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=False),
    )
    # rule = ()

    def parse_item(self, response:HtmlResponse):
        print(response.url)
        print('-'*30)
        i = {}

        #i['domain_id'] = response.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = response.xpath('//div[@id="name"]').extract()
        #i['description'] = response.xpath('//div[@id="description"]').extract()
        return i
#-------------------------------------------------------------------------------------------------
F:\Projects\scrapy\mspider>scrapy crawl book1 --nolog
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=60&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=40&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=20&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1440&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1420&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=160&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=140&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=120&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=100&type=T
------------------------------
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=80&type=T
------------------------------

You can see that what gets extracted are the page-number links displayed on the page.

(figure: the pagination bar of the listing page)

Building on this, the full spider for this example is as follows:

(blog) F:\Projects\scrapy\mspider>scrapy crawl book1 --nolog

# book.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B']

    custom_settings = {
        'filename':'./book2.json'
    }

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=False),
    )

    def parse_item(self, response:HtmlResponse):
        print(response.url)

        subjects = response.xpath('//li[@class="subject-item"]')

        for subject in subjects:
            title = "".join((x.strip() for x in subject.xpath('.//h2/a//text()').extract()))
            rate = subject.css('span.rating_nums').xpath('./text()').extract()  # extract_first()/extract()[0]

            item = BookItem()
            item['title'] = title
            item['rate'] = rate[0] if rate else '0'

            yield item
# -----------------------------------------------------
<BookItem {'title': '圖解HTTP', 'rate': '8.1'}> --------------
<BookItem {'title': 'SQL必知必會: (第4版)', 'rate': '8.5'}> --------------
==========160
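
The inline comment above mentions extract_first(); as a side note, the same extraction can be written with get()/getall(), which avoids indexing into an empty list when a book has no rating. A sketch of an alternative parse_item method (it assumes the same BookItem imported from ..items):

# alternative version of parse_item using get()/getall() (sketch only)
def parse_item(self, response):
    for subject in response.xpath('//li[@class="subject-item"]'):
        title = "".join(x.strip() for x in subject.xpath('.//h2/a//text()').getall())
        # get() returns the default instead of raising when the rating span is missing
        rate = subject.css('span.rating_nums::text').get(default='0')
        item = BookItem()
        item['title'] = title
        item['rate'] = rate
        yield item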


# settings.py
from fake_useragent import UserAgent

BOT_NAME = 'mspider'

SPIDER_MODULES = ['mspider.spiders']
NEWSPIDER_MODULE = 'mspider.spiders'

USER_AGENT = UserAgent().random
ROBOTSTXT_OBEY = False
CONCURRENT_REQUESTS = 4      # keep concurrency low
DOWNLOAD_DELAY = 1           # 1-second download delay
COOKIES_ENABLED = False   

ITEM_PIPELINES = {
   'mspider.pipelines.MspiderPipeline': 300,
}


# pipelines.py

import simplejson
from scrapy import Spider

class MspiderPipeline(object):
    def __init__(self):   # instantiation hook; not strictly needed here
        print('~~~~~init~~~~~')

    def open_spider(self, spider:Spider):
        self.count = 0
        print(spider.name)
        filename = spider.settings['filename']
        self.file = open(filename, 'w', encoding='utf-8')
        self.file.write('[\n')
        self.file.flush()

    def process_item(self, item, spider):
        print(item, '--------------')
        self.count += 1
        self.file.write(simplejson.dumps(dict(item)) + ',\n')   # dict

        return item

    def close_spider(self, spider):
        print('=========={}'.format(self.count))
        self.file.write(']')
        self.file.close()
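
One caveat with the pipeline above: each record is written with a trailing ',\n', so the final file ends with ',\n]' and is not strictly valid JSON. A hedged alternative sketch using Scrapy's built-in JsonItemExporter, which manages the brackets and commas itself (it reuses the filename custom setting from the spider):

# pipelines.py -- alternative sketch based on scrapy.exporters.JsonItemExporter
from scrapy.exporters import JsonItemExporter

class JsonExportPipeline:
    def open_spider(self, spider):
        # the exporter expects a binary file object
        self.file = open(spider.settings['filename'], 'wb')
        self.exporter = JsonItemExporter(self.file, encoding='utf-8')
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()

To switch to it, point ITEM_PIPELINES at 'mspider.pipelines.JsonExportPipeline' instead of MspiderPipeline.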

The crawler first fetches start_urls. Following the Rule, it analyses each page, extracts the matching links, and issues requests for them; whenever one of those requests gets a response, the callback parse_item is executed. Inside the callback, response is the HTML returned for the extracted link, and it can be analysed directly with XPath or CSS selectors.
follow decides whether links are extracted again from the response handled by the callback.

1. Crawling

Change follow=True, and the link extractor will extract every page-number link it finds on each visited page, covering the whole listing rather than just the first page.

2.4 Proxies (dealing with anti-crawling)

During crawling, Douban applies anti-crawling measures, and requests may start coming back blocked or redirected to a verification page.

This effectively bans the IP (and logging in would only get the account banned), so a proxy can be used during crawling instead.

Idea: before an HTTP request is sent it passes through the downloader middlewares, so define a custom downloader middleware that picks up a proxy address on the fly and attaches it to the request before it goes out.

Proxy test

'mspider.middlewares.ProxyDownloaderMiddleware': 150,
'mspider.middlewares.After': 600,

If ProxyDownloaderMiddleware in middlewares.py returns a Response from process_request, the later middleware class After(object) and the process_exception methods are bypassed for that request.

Create test.py:

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy genspider -t basic test httpbin.org
Created spider 'test' using template 'basic' in module:
  mspider.spiders.test

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl test
#----------------------------------------------------------------------------------
 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://myip.ipip.net/> (referer: None)
http://www.magedu.com/user?id=1000 ++++++++++++++



# test.py
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['ipip.net']
    start_urls = ['http://myip.ipip.net/']

    def parse(self, response):
        print(response.url,'++++++++++++++')


# middlewares.py
from scrapy import signals
from scrapy.http.response.html import HtmlResponse

class MspiderSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

class MspiderDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)

# the added part
class ProxyDownloaderMiddleware:

    proxies = []

    def process_request(self, request, spider):
        # return a fabricated response instead of actually downloading anything
        return HtmlResponse('http://www.magedu.com/user?id=1000')

# check how the fabricated response gets handled
class After(object):
    def process_request(self, request, spider):
        print('After ~~~~~')

        return None


# settings.py: add ProxyDownloaderMiddleware and mspider.middlewares.After to the downloader middlewares;
# temporarily disable the item pipeline
DOWNLOADER_MIDDLEWARES = {
   # 'mspider.middlewares.MspiderDownloaderMiddleware': 543,
   'mspider.middlewares.ProxyDownloaderMiddleware':150,
   'mspider.middlewares.After':600,
}

# ITEM_PIPELINES = {
#    'mspider.pipelines.MspiderPipeline': 300,
# }
  1. Downloader middleware
    Following the downloader middleware template in middlewares.py, write process_request and return None.
    Reference: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
from scrapy import signals
from scrapy.http.response.html import HtmlResponse
from scrapy.http.request import Request

import random
class ProxyDownloaderMiddleware:

    proxy_ip = '117.44.10.234'
    proxy_port = 36410
    proxies = [
        'http://{}:{}'.format(proxy_ip, proxy_port)
    ]

    def process_request(self, request: Request, spider):

        proxy = random.choice(self.proxies)
        request.meta['proxy'] = proxy   # route this request through the proxy
        print(request.url, request.meta['proxy'])
        # returning None lets the request continue through the remaining middlewares
#----------------------------------------------------
(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl test --nolog
http://myip.ipip.net/ http://223.215.13.40:894
Current IP: 223.215.13.40, from: China, Anhui, Wuhu, China Telecom (not the local machine's IP)
 ++++++++++++++

2蚕甥、配置
在settings.py中

DOWNLOADER_MIDDLEWARES = {
   # 'mspider.middlewares.MspiderDownloaderMiddleware': 543,
   'mspider.middlewares.ProxyDownloaderMiddleware':150,    # give it a fairly high priority (a low order number runs early)
}
The IP proxy works.

Crawl 50 pages of Douban Books

# book.py        follow=True
    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    )

# settings
ITEM_PIPELINES = {         # enable the pipeline
   'mspider.pipelines.MspiderPipeline': 300,
}

# CONCURRENT_REQUESTS = 4
# DOWNLOAD_DELAY = 1

(F:\Pyenv\conda3.8) F:\Projects\spider>scrapy crawl book1 --nolog

{'rate': '8.8', 'title': '黑客與畫家: 來自計算機時代的高見'} ~----~~~~~~~~~
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1060&type=T http://223.215.13.40:894
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1040&type=T
https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=1060&type=T
==========1000

Exactly 50 pages were crawled: 50 pages × 20 books per page = 1000 items, success.

Reference:

# book.py (final version)
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.http.response.html import HtmlResponse
from ..items import BookItem

class BookSpider(CrawlSpider):
    name = 'book1'
    allowed_domains = ['douban.com']
    start_urls = ['https://book.douban.com/tag/%E7%BC%96%E7%A8%8B?start=0&type=T']

    rules = (
        Rule(LinkExtractor(allow=r'start=\d+'), callback='parse_item', follow=True),
    )

    custom_settings = {
        'filename':'./books2.json'
    }

    def parse_item(self, response:HtmlResponse):
        print(response.url)

        subjects = response.xpath('//li[@class="subject-item"]')

        for subject in subjects:
            title = "".join((x.strip() for x in subject.xpath('.//h2/a//text()').extract()))
            rate = subject.css('span.rating_nums').xpath('./text()').extract()  # extract_first()/extract()[0]

            item = BookItem()
            item['title'] = title
            item['rate'] = rate[0] if rate else '0'
            print(item)

            yield item


