This article introduces the Scrapy command-line shell. For more, see: Python Learning Guide
Scrapy Shell
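The Scrapy shell is an interactive console where we can try out and debug code without starting a spider. It can also be used to test XPath or CSS expressions and see how they behave, which makes it easier to extract data from crawled pages.
If IPython is installed, the Scrapy shell uses IPython instead of the standard Python console. The IPython console is more powerful than the alternatives, offering smart auto-completion, highlighted output, and other features. (Installing IPython is recommended.)
Starting the Scrapy Shell
Go to the project's root directory and run the following command to start the shell: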
scrapy shell 'http://www.cnblogs.com/miqi1992/'
[root@centos chapter04]# scrapy shell 'http://www.cnblogs.com/miqi1992'
2017-12-25 16:19:58 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: scrapybot)
2017-12-25 16:19:58 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2017-12-25 16:19:58 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2017-12-25 16:19:58 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2017-12-25 16:19:58 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2017-12-25 16:19:58 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2017-12-25 16:19:58 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-12-25 16:19:58 [scrapy.core.engine] INFO: Spider opened
2017-12-25 16:19:58 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.cnblogs.com/miqi1992> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x2ea9b10>
[s] item {}
[s] request <GET http://www.cnblogs.com/miqi1992>
[s] response <200 http://www.cnblogs.com/miqi1992>
[s] settings <scrapy.settings.Settings object at 0x2ea9990>
[s] spider <DefaultSpider 'default' at 0x32a6f10>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Based on the downloaded page, the Scrapy shell automatically creates some convenient objects, such as a Response object and a Selector object (for HTML and XML content).
- Once the shell has loaded, you get a local variable named response that holds the response data. Entering response.body prints the response body, and entering response.headers shows the response headers.
- Entering response.selector returns a Selector object initialized from the response. You can then query the response with response.selector.xpath() or response.selector.css().
- Scrapy also provides shortcuts: response.xpath() and response.css() work the same way (as in the earlier examples).
Selectors
Scrapy selectors have built-in support for XPath and CSS selector expressions.
A Selector has four basic methods, of which xpath() is the most commonly used:
- xpath(): takes an XPath expression and returns a selector list of all nodes matching it
- extract(): serializes the matched nodes to Unicode strings and returns them as a list
- css(): takes a CSS expression and returns a selector list of all nodes matching it; the syntax is the same as the CSS selector syntax in BeautifulSoup4
- re(): extracts data using the given regular expression and returns a list of Unicode strings
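Examples of XPath expressions and what they mean:
/html/head/title: selects the <title> element inside the <head> of the HTML document
/html/head/title/text(): selects the text of the <title> element above
//td: selects all <td> elements
//div[@class="mine"]: selects all div elements that have the attribute class="mine"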
Trying out selectors
We use the Tencent recruitment site http://hr.tencent.com/position.php?&start=0#a
as an example:
response.xpath('//title')
[<Selector xpath='//title' data=u'<title>\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058</title'>]
# extract() returns a list of Unicode strings
response.xpath('//title').extract()
[u'<title>\u804c\u4f4d\u641c\u7d22 | \u793e\u4f1a\u62db\u8058 | Tencent \u817e\u8baf\u62db\u8058</title>']
## Print the first element of the list, rendered in the terminal's encoding
print(response.xpath('//title').extract()[0])
response.xpath('//*[@class="even"]')
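From now on, when extracting data, you can test your expressions in the Scrapy shell first and move them into your code once they work. The Scrapy shell can do more than this, but that is outside the focus of this course, so we will not cover it in detail.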