The first spider
For my first Scrapy spider I'll use the first example from the official documentation: scraping http://quotes.toscrape.com. I couldn't find a Chinese translation of the Scrapy 1.5 docs, so parts of what follows are my own translation of the official documentation (plug: if you need a translator, contact me; I have three published English-book translations to my name, two of them done solo, LOL). The concrete steps are:
- In CMD, go to the directory where you want to keep your code and run `scrapy startproject myspiders`, where `myspiders` can be whatever name you want for the project.
- Scrapy will automatically create a directory named `myspiders` and initialize some files inside it.
- Go into the `myspiders/myspiders/spiders` directory, create a file named `quotestoscrape.py`, and add the following code:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
```
After saving, switch back to CMD and run `scrapy crawl quotestoscrape`. Before showing the results, I'd like to briefly explain this code:
- First, from my own testing, the `start_requests(self)` method isn't strictly required; at the very least it can be replaced by a list named `start_urls` (see the sketch after this list). Still, I think it's better to follow the standard pattern. When present, according to the docs, it must return an iterable of Requests (either a list of requests or a generator function), which defines the URL or URLs the spider starts crawling from; it is also used to generate the follow-up requests.
- Each request downloads some content from the server, and the `parse()` method is what handles that content. The `response` parameter contains the whole page, which you can then process further with other methods.
- The `yield` keyword reflects another Python feature: generators. It just occurred to me that I've apparently never written about them, even though I know what they are. I'll write them up properly when I get the chance; a two-line taste follows below.
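As promised in the first bullet, here is the `start_urls` variant: a minimal sketch that, as far as my testing goes, behaves the same as the `start_requests()` version above:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'
    # With a start_urls list, Scrapy builds the initial requests itself
    # and sends each response to parse() by default.
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
```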
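And the two-line taste of generators promised just above: a function containing `yield` returns an iterator that produces its values lazily, one at a time:

```python
def count_up_to(n):
    # Execution pauses at each yield and resumes on the next iteration.
    for i in range(1, n + 1):
        yield i


print(list(count_up_to(3)))  # prints [1, 2, 3]
```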
Once the command runs, it spits out a big pile of logs, most of them not hard to understand. I'll only quote the part we actually want to see: the first half is the scraped results, and the tail is a set of statistics:
```
....
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider opened
2018-04-19 15:56:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-19 15:56:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-19 15:56:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
'item_scraped_count': 10,
'log_count/DEBUG': 13,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951)}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider closed (finished)
```
If I get the time, I'll write a separate post walking through these logs in detail.
error: No module named win32api
When you finally run the crawl, you may hit an error about win32api not being found; installing the following module fixes it: `pip install pypiwin32`.
Processing the response further
Coming to scraping for the first time, you may find the `response.css()`, `quote.css()`, `quote.xpath()`, and `extract_first()` calls in the code above unfamiliar. These are the so-called methods for further processing the response.
This part takes some HTML/CSS knowledge: you need to know what kind of expression will pull the content you want out of what's returned. Since a web page's code is a tree structure, in theory the right expression can fetch any content we're after. There are usually two ways to work the expression out:
- The first is your browser's element inspector.
- The second is the command-line shell that Scrapy provides.
CSS selectors
In the code above, `response.css('div.quote')` and `quote.css('span.text::text')` are both CSS selectors. If we open the element inspector on that page, we get a view like the following:
The way I see it, the workflow goes roughly like this: use the tag breadcrumbs along the bottom of the inspector to pin down an approximate location first, say `quote = response.css('div.quote')` to land on the blue box in the screenshot, and then narrow down from there. I haven't found documentation on how that narrowing should be done, so here is a bit of my own experience: separate plain HTML tags with a space; if a tag carries a class, attach it with a `.`; and finally append `::text` to select the text inside the tag rather than the tag markup itself.
Throughout the whole process we can test things in Scrapy's shell. In your CMD, enter `scrapy shell "http://quotes.toscrape.com/"`; a big pile of logs follows, together with the available objects and shortcuts:
```
D:\OneDrive\Documents\Python和數(shù)據(jù)挖掘\code\blogspider>scrapy shell "http://quotes.toscrape.com/"
.............(omitted).............
2018-04-19 18:28:19 [scrapy.core.engine] INFO: Spider opened
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000029D0C61AC50>
[s] item {}
[s] request <GET http://quotes.toscrape.com/>
[s] response <200 http://quotes.toscrape.com/>
[s] settings <scrapy.settings.Settings object at 0x0000029D0ED439B0>
[s] spider <DefaultSpider 'default' at 0x29d0efecc18>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
```
The object we mainly use is `response`, and with it we can start experimenting, like so:
```
# Locate the site's title; extract() is used to get the data inside
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']
# Locate the author info; this is the most explicit way to write it
>>> response.css("div.quote span small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# It can also be simpler
>>> response.css("div span small::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Or split into separate calls
>>> response.css("div.quote").css("span").css("small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Only need the first item?
>>> response.css("div.quote").css("span").css("small.author::text")[0].extract()
'Albert Einstein'
>>> response.css("div.quote").css("span").css("small.author::text").extract_first()
'Albert Einstein'
```
If you've ever written CSS for a site of your own, all of this is easy to follow, because the logic underneath is the same; fiddle with it in the shell and you'll pick it up quickly. Look closely and you'll notice these calls actually return a list, which comes in handy when writing code.
XPath selectors
The other approach is the XPath selector, as in the code above: `quote.xpath('span/small/text()')`. According to the docs, XPath is what Scrapy is actually built on; in fact, even CSS selectors are converted to XPath under the hood. Where XPath is more powerful than CSS selectors is that it can also operate on the content of what it selects; for instance, it can do things like select the link whose text reads "Next page". The official docs recommend three XPath resources: using XPath with Scrapy Selectors, learn XPath through examples, and how to think in XPath.
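To illustrate that content-based selection, here is my own quick attempt in the same shell session (the next-page link on this site reads "Next →", which is what the text match below assumes):

```
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'
```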
Saving the data
This is a one-command affair. Say I want to write what the spider above scraped into a JSON file; all I have to run in CMD is:
`scrapy crawl quotestoscrape -o data.json`
The `-o` should stand for output, much like in Linux commands, so it's not hard to remember. Other file formats work too; the official docs recommend one called JSON Lines, a format I didn't know at the time (a short note follows).
All the supported export formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
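For what it's worth, JSON Lines turns out to be nothing exotic: one JSON object per line, which makes appending and streaming easier than with one big JSON array. Running `scrapy crawl quotestoscrape -o data.jl` should produce lines roughly like these (abridged by me):

```
{"text": "“The world as we have created it is a process of our thinking. ...”", "author": "Albert Einstein"}
{"text": "“It is our choices, Harry, that show what we truly are, ...”", "author": "J.K. Rowling"}
```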
Scraping the next page
A site like http://quotes.toscrape.com is split across several pages. By parsing the link behind the page's "Next" button, we can crawl the next page, and the next page's next page, and so on until there is no next page. The code isn't hard to follow, so here it is directly:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
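As an aside, Scrapy 1.4 and later also provide `response.follow()`, which accepts relative URLs directly, so the tail of `parse()` can be written a little more compactly (a variation on the above, not what I actually ran):

```python
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            # response.follow resolves the relative URL against this response
            yield response.follow(next_page, callback=self.parse)
```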
Scraping my own blog
Enough talk; time for something practical. I want to scrape all the posts and their publication dates from my own blog. The code:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'ethanshub'
    start_urls = [
        'https://journal.ethanshub.com/archive',
    ]

    def parse(self, response):
        yearlists = response.css('ul.listing')
        for i in range(len(yearlists)):
            lists = yearlists[i]
            for j in range(len(lists.css("li.listing_item")) // 2):
                yield {
                    'date': lists.css("li.listing_item::text")[j * 2].extract(),
                    'title': lists.css("li.listing_item a::text")[j].extract(),
                }
```
The only real thing to watch here is not to end up scraping just one year of posts: you must find the smallest structure that contains all the articles. The rest is simple plumbing. One more thing worth mentioning: my blog runs on Bitcron, its CSS is rendered server-side and I wrote mine following its syntax rules, yet while analysing the page I found that `lists.css("li.listing_item")` picks up one extra blank field per item, which made the number of dates come out at exactly twice the number of titles. The silver lining is that this guarantees the date count is even, so a slight tweak to the code handles it (an alternative sketch follows right below).
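For what it's worth, instead of the index arithmetic above, one could also drop the whitespace-only text nodes and pair dates with titles directly; this is a sketch of my own against the same page structure, untested on Bitcron:

```python
    def parse(self, response):
        for lists in response.css('ul.listing'):
            # Keep only the text nodes that aren't pure whitespace
            dates = [d for d in lists.css('li.listing_item::text').extract()
                     if d.strip()]
            titles = lists.css('li.listing_item a::text').extract()
            for date, title in zip(dates, titles):
                yield {'date': date, 'title': title}
```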
After running `scrapy crawl ethanshub -o data.json`, the contents of the captured data.json look like this:
```
[
{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"},
{"date": "[2017-12-15]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e00\uff09"},
{"date": "[2017-12-13]\n", "title": "\u7528Python\u5411Kindle\u63a8\u9001\u7535\u5b50\u4e66"},
{"date": "[2017-12-12]\n", "title": "GUI\u7f16\u7a0b\uff0cTkinter\u5e93\u548c\u5e03\u5c40"},
{"date": "[2017-12-12]\n", "title": "Python3\u7684\u6b63\u5219\u8868\u8fbe\u5f0f"},
{"date": "[2017-12-10]\n", "title": "Python\u901f\u89c8[7]"},
{"date": "[2017-12-09]\n", "title": "Python\u901f\u89c8[6]"},
....
{"date": "[2013-09-16]\n", "title": "How to split a string in C"},
{"date": "[2012-11-28]\n", "title": "Common Filters for Wireshark"}
]
```
Everything is OK. Sequences like `\u722c` are just Unicode escapes for Chinese characters; it's purely an encoding matter, so I won't go into it here.
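One last aside: if you would rather see the characters themselves in the exported file, Scrapy 1.2 and later expose a `FEED_EXPORT_ENCODING` setting, so one line in the project's settings.py should be enough (I haven't re-run the export to verify, but this is the documented switch):

```python
# In settings.py: write feed exports as UTF-8 instead of ASCII escapes
FEED_EXPORT_ENCODING = 'utf-8'
```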