The first spider
For my first Scrapy spider I'll use the first example from the official documentation: scraping http://quotes.toscrape.com. I couldn't find a Chinese translation of the Scrapy 1.5 docs, so parts of what follows are my own translation of the official documentation (plug: if you need a translator, contact me; I have three published English-book translations to my name, two of them done solo, LOL). The concrete steps are:
- In CMD, go to the directory where you want to keep your code and run `scrapy startproject myspiders`, where `myspiders` can be whatever name you want for the project.
- Scrapy will automatically create a directory named `myspiders` and initialize some files inside it.
- Go into the `myspiders/myspiders/spiders` directory, create a file named `quotestoscrape.py`, and add the following code:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
```
After saving, switch back to CMD and run `scrapy crawl quotestoscrape`. Before showing the results, I'd like to briefly explain this code:
- First, from my own testing, the `start_requests(self)` method isn't strictly required; at the very least it can be replaced by a list named `start_urls` (see the sketch after this list). Still, I think it's better to follow the standard pattern. When present, according to the docs, it must return an iterable of Requests (either a list of requests or a generator function), which defines the URL or URLs the spider starts crawling from; it is also used to generate the follow-up requests.
- Each request downloads some content from the server, and the `parse()` method is what handles that content. The `response` parameter contains the whole page, which you can then process further with other methods.
- The `yield` keyword reflects another Python feature: generators. It just occurred to me that I've apparently never written about them, even though I know what they are. I'll write them up properly when I get the chance; a two-line taste follows below.
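As promised in the first bullet, here is the `start_urls` variant: a minimal sketch that, as far as my testing goes, behaves the same as the `start_requests()` version above:

```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'
    # With a start_urls list, Scrapy builds the initial requests itself
    # and sends each response to parse() by default.
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
```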
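And the two-line taste of generators promised just above: a function containing `yield` returns an iterator that produces its values lazily, one at a time:

```python
def count_up_to(n):
    # Execution pauses at each yield and resumes on the next iteration.
    for i in range(1, n + 1):
        yield i


print(list(count_up_to(3)))  # prints [1, 2, 3]
```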
Once the command runs, it spits out a big pile of logs, most of them not hard to understand. I'll only quote the part we actually want to see: the first half is the scraped results, and the tail is a set of statistics:
```
....
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider opened
2018-04-19 15:56:07 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-04-19 15:56:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 15:56:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is our choices, Harry, that show what we truly are, far more than our abilities.”', 'author': 'J.K. Rowling'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”', 'author': 'Jane Austen'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”", 'author': 'Marilyn Monroe'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“Try not to become a man of success. Rather become a man of value.”', 'author': 'Albert Einstein'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“It is better to be hated for what you are than to be loved for what you are not.”', 'author': 'André Gide'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“I have not failed. I've just found 10,000 ways that won't work.”", 'author': 'Thomas A. Edison'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”", 'author': 'Eleanor Roosevelt'}
2018-04-19 15:56:07 [scrapy.core.scraper] DEBUG: Scraped from <200 http://quotes.toscrape.com/>
{'text': '“A day without sunshine is like, you know, night.”', 'author': 'Steve Martin'}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Closing spider (finished)
2018-04-19 15:56:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 908603),
'item_scraped_count': 10,
'log_count/DEBUG': 13,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2018, 4, 19, 19, 56, 7, 400951)}
2018-04-19 15:56:07 [scrapy.core.engine] INFO: Spider closed (finished)
```
If I get the time, I'll write a separate post walking through these logs in detail.
error: No module named win32api
When you finally run the crawl, you may hit an error about win32api not being found; installing the following module fixes it: `pip install pypiwin32`.
Processing the response further
Coming to scraping for the first time, you may find the `response.css()`, `quote.css()`, `quote.xpath()`, and `extract_first()` calls in the code above unfamiliar. These are the so-called methods for further processing the response.
This part takes some HTML/CSS knowledge: you need to know what kind of expression will pull the content you want out of what's returned. Since a web page's code is a tree structure, in theory the right expression can fetch any content we're after. There are usually two ways to work the expression out:
- The first is your browser's element inspector.
- The second is the command-line shell that Scrapy provides.
CSS selectors
In the code above, `response.css('div.quote')` and `quote.css('span.text::text')` are both CSS selectors. If we open the element inspector on that page, we get a view like the following:
The way I see it, the workflow goes roughly like this: use the tag breadcrumbs along the bottom of the inspector to pin down an approximate location first, say `quote = response.css('div.quote')` to land on the blue box in the screenshot, and then narrow down from there. I haven't found documentation on how that narrowing should be done, so here is a bit of my own experience: separate plain HTML tags with a space; if a tag carries a class, attach it with a `.`; and finally append `::text` to select the text inside the tag rather than the tag markup itself.
Throughout the whole process we can test things in Scrapy's shell. In your CMD, enter `scrapy shell "http://quotes.toscrape.com/"`; a big pile of logs follows, together with the available objects and shortcuts:
```
D:\OneDrive\Documents\Python和數(shù)據(jù)挖掘\code\blogspider>scrapy shell "http://quotes.toscrape.com/"
.............(omitted).............
2018-04-19 18:28:19 [scrapy.core.engine] INFO: Spider opened
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2018-04-19 18:28:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x0000029D0C61AC50>
[s] item {}
[s] request <GET http://quotes.toscrape.com/>
[s] response <200 http://quotes.toscrape.com/>
[s] settings <scrapy.settings.Settings object at 0x0000029D0ED439B0>
[s] spider <DefaultSpider 'default' at 0x29d0efecc18>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>>
```
The object we mainly use is `response`, and with it we can start experimenting, like so:
```
# Locate the site's title; extract() is used to get the data inside
>>> response.css('title::text')
[<Selector xpath='descendant-or-self::title/text()' data='Quotes to Scrape'>]
>>> response.css('title::text').extract()
['Quotes to Scrape']
# Locate the author info; this is the most explicit way to write it
>>> response.css("div.quote span small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# It can also be simpler
>>> response.css("div span small::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Or split into separate calls
>>> response.css("div.quote").css("span").css("small.author::text").extract()
['Albert Einstein', 'J.K. Rowling', 'Albert Einstein', 'Jane Austen', 'Marilyn Monroe', 'Albert Einstein', 'André Gide', 'Thomas A. Edison', 'Eleanor Roosevelt', 'Steve Martin']
# Only need the first item?
>>> response.css("div.quote").css("span").css("small.author::text")[0].extract()
'Albert Einstein'
>>> response.css("div.quote").css("span").css("small.author::text").extract_first()
'Albert Einstein'
```
If you've ever written CSS for a site of your own, all of this is easy to follow, because the logic underneath is the same; fiddle with it in the shell and you'll pick it up quickly. Look closely and you'll notice these calls actually return a list, which comes in handy when writing code.
XPath selectors
The other approach is the XPath selector, as in the code above: `quote.xpath('span/small/text()')`. According to the docs, XPath is what Scrapy is actually built on; in fact, even CSS selectors are converted to XPath under the hood. Where XPath is more powerful than CSS selectors is that it can also operate on the content of what it selects; for instance, it can do things like select the link whose text reads "Next page". The official docs recommend three XPath resources: using XPath with Scrapy Selectors, learn XPath through examples, and how to think in XPath.
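To illustrate that content-based selection, here is my own quick attempt in the same shell session (the next-page link on this site reads "Next →", which is what the text match below assumes):

```
>>> response.xpath('//a[contains(text(), "Next")]/@href').extract_first()
'/page/2/'
```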
Saving the data
This is a one-command affair. Say I want to write what the spider above scraped into a JSON file; all I have to run in CMD is:
`scrapy crawl quotestoscrape -o data.json`
The `-o` should stand for output, much like in Linux commands, so it's not hard to remember. Other file formats work too; the official docs recommend one called JSON Lines, a format I didn't know at the time (a short note follows).
All the supported export formats are: 'json', 'jsonlines', 'jl', 'csv', 'xml', 'marshal', 'pickle'.
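For what it's worth, JSON Lines turns out to be nothing exotic: one JSON object per line, which makes appending and streaming easier than with one big JSON array. Running `scrapy crawl quotestoscrape -o data.jl` should produce lines roughly like these (abridged by me):

```
{"text": "“The world as we have created it is a process of our thinking. ...”", "author": "Albert Einstein"}
{"text": "“It is our choices, Harry, that show what we truly are, ...”", "author": "J.K. Rowling"}
```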
Scraping the next page
A site like http://quotes.toscrape.com is split across several pages. By parsing the link behind the page's "Next" button, we can crawl the next page, and the next page's next page, and so on until there is no next page. The code isn't hard to follow, so here it is directly:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'quotestoscrape'

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/'
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)
```
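As an aside, Scrapy 1.4 and later also provide `response.follow()`, which accepts relative URLs directly, so the tail of `parse()` can be written a little more compactly (a variation on the above, not what I actually ran):

```python
    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.xpath('span/small/text()').extract_first(),
            }
        next_page = response.css('li.next a::attr("href")').extract_first()
        if next_page is not None:
            # response.follow resolves the relative URL against this response
            yield response.follow(next_page, callback=self.parse)
```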
Scraping my own blog
Enough talk; time for something practical. I want to scrape all the posts and their publication dates from my own blog. The code:
```python
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'ethanshub'
    start_urls = [
        'https://journal.ethanshub.com/archive',
    ]

    def parse(self, response):
        yearlists = response.css('ul.listing')
        for i in range(len(yearlists)):
            lists = yearlists[i]
            for j in range(len(lists.css("li.listing_item")) // 2):
                yield {
                    'date': lists.css("li.listing_item::text")[j * 2].extract(),
                    'title': lists.css("li.listing_item a::text")[j].extract(),
                }
```
The only real thing to watch here is not to end up scraping just one year of posts: you must find the smallest structure that contains all the articles. The rest is simple plumbing. One more thing worth mentioning: my blog runs on Bitcron, its CSS is rendered server-side and I wrote mine following its syntax rules, yet while analysing the page I found that `lists.css("li.listing_item")` picks up one extra blank field per item, which made the number of dates come out at exactly twice the number of titles. The silver lining is that this guarantees the date count is even, so a slight tweak to the code handles it (an alternative sketch follows right below).
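For what it's worth, instead of the index arithmetic above, one could also drop the whitespace-only text nodes and pair dates with titles directly; this is a sketch of my own against the same page structure, untested on Bitcron:

```python
    def parse(self, response):
        for lists in response.css('ul.listing'):
            # Keep only the text nodes that aren't pure whitespace
            dates = [d for d in lists.css('li.listing_item::text').extract()
                     if d.strip()]
            titles = lists.css('li.listing_item a::text').extract()
            for date, title in zip(dates, titles):
                yield {'date': date, 'title': title}
```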
After running `scrapy crawl ethanshub -o data.json`, the contents of the captured data.json look like this:
```
[
{"date": "[2017-12-16]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e8c\uff09"},
{"date": "[2017-12-15]\n", "title": "Python3 \u722c\u866b\u5165\u95e8\uff08\u4e00\uff09"},
{"date": "[2017-12-13]\n", "title": "\u7528Python\u5411Kindle\u63a8\u9001\u7535\u5b50\u4e66"},
{"date": "[2017-12-12]\n", "title": "GUI\u7f16\u7a0b\uff0cTkinter\u5e93\u548c\u5e03\u5c40"},
{"date": "[2017-12-12]\n", "title": "Python3\u7684\u6b63\u5219\u8868\u8fbe\u5f0f"},
{"date": "[2017-12-10]\n", "title": "Python\u901f\u89c8[7]"},
{"date": "[2017-12-09]\n", "title": "Python\u901f\u89c8[6]"},
....
{"date": "[2013-09-16]\n", "title": "How to split a string in C"},
{"date": "[2012-11-28]\n", "title": "Common Filters for Wireshark"}
]
```
Everything is OK. Sequences like `\u722c` are just Unicode escapes for Chinese characters; it's purely an encoding matter, so I won't go into it here.
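One last aside: if you would rather see the characters themselves in the exported file, Scrapy 1.2 and later expose a `FEED_EXPORT_ENCODING` setting, so one line in the project's settings.py should be enough (I haven't re-run the export to verify, but this is the documented switch):

```python
# In settings.py: write feed exports as UTF-8 instead of ASCII escapes
FEED_EXPORT_ENCODING = 'utf-8'
```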