Scrapy getting-started example
Tutorial followed:
the Chinese translation of the Scrapy 0.24.1 documentation
Environment:
- Python 2.7.12
- Scrapy 0.24.1
- Ubuntu 16.04
Installation steps:
pip install scrapy==0.24.1
pip install service_identity==17.0.0
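To confirm that the pinned version was actually installed (a quick sanity check, assuming the two pip commands above succeeded), print the installed version:
scrapy version
python -c "import scrapy; print(scrapy.__version__)"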
Creating a project
scrapy startproject tutorial
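This creates a project skeleton; with Scrapy 0.24 the generated layout looks roughly like the following sketch (newer releases add a few extra files):
tutorial/
    scrapy.cfg            # deploy configuration file
    tutorial/             # the project's Python module
        __init__.py
        items.py          # item definitions (used below)
        pipelines.py
        settings.py
        spiders/          # spiders go in this directory
            __init__.py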
Our first Spider
This is the code for our first Spider. Save it in a file named quotes_spider.py under the tutorial/spiders directory in your project:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"

    def start_requests(self):
        urls = [
            'http://quotes.toscrape.com/page/1/',
            'http://quotes.toscrape.com/page/2/',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
        self.log('Saved file %s' % filename)
How to run our spider
scrapy crawl quotes
A shortcut to the start_requests method
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'quotes-%s.html' % page
        with open(filename, 'wb') as f:
            f.write(response.body)
The two versions above are equivalent: start_urls is simply a shorthand for writing start_requests yourself.
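The shorthand works because the scrapy.Spider base class already ships a start_requests that iterates over start_urls. A simplified sketch of what it does (not the exact library source) looks like this:
# Simplified sketch of the default start_requests provided by scrapy.Spider:
# every entry of start_urls becomes a Request whose callback defaults to
# self.parse.
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)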
To create a Spider you must subclass scrapy.Spider and define the following three attributes:
- name: identifies the Spider. It must be unique; you may not give two different Spiders the same name.
- start_urls: the list of URLs the Spider crawls when it starts. The first pages fetched will come from this list; subsequent URLs are extracted from the data of those initial pages.
- parse(): a method of the spider. It is called with the Response object produced for each start URL as its only argument, and it is responsible for parsing the response data, extracting data (generating items), and generating Request objects for URLs that need further processing.
Introduction to Selectors
A Selector has four basic methods:
- xpath(): takes an XPath expression and returns a selector list of all nodes matching it.
- css(): takes a CSS expression and returns a selector list of all nodes matching it.
- extract(): serializes the selected nodes to unicode strings and returns them as a list.
- re(): extracts data with the given regular expression and returns a list of unicode strings.
Extracting data
The best way to learn how to extract data with Scrapy is to try selectors in the Scrapy shell. Run:
scrapy shell 'http://quotes.toscrape.com/page/1/'
Once the shell has loaded, you get a local response variable holding the fetched page. Typing response.body prints the response body, and response.headers shows the response headers.
You can extract data with response.selector.xpath() and response.selector.css(), with the shortcuts response.xpath() and response.css(), or even with sel.xpath() and sel.css(); they are all equivalent.
# Try these css() calls and see what they print
response.css('title')
response.css('title').extract()
response.css('title::text')
response.css('title::text').extract()
# response.css('title::text').extract_first()  # extract_first() is not available in 0.24.1
response.css('title::text')[0].extract()
response.css('title::text').re(r'Quotes.*')
response.css('title::text').re(r'Q\w+')
response.css('title::text').re(r'(\w+) to (\w+)')
# Try these xpath() calls and see what they print
response.xpath('//title')
response.xpath('//title').extract()
response.xpath('//title/text()')
response.xpath('//title/text()').extract()
# response.xpath('//title/text()').extract_first()  # extract_first() is not available in 0.24.1
response.xpath('//title/text()')[0].extract()
response.xpath('//title/text()').re(r'Quotes.*')
response.xpath('//title/text()').re(r'Q\w+')
response.xpath('//title/text()').re(r'(\w+) to (\w+)')
The css and xpath examples above line up one to one; only the expressions differ, and apart from a few cases their output is the same.
Extracting data in our spider
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            # No trailing commas here, or each field would become a one-element tuple.
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            # extract() without [0] keeps all of the quote's tags, not just the first.
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
The Spider returns the scraped data as Item objects, so we also need to define QuotesItem in items.py:
import scrapy


class QuotesItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
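Items behave much like dicts whose keys are restricted to the declared fields. A quick illustration (run from the project root so tutorial.items is importable; the values are made up):
from tutorial.items import QuotesItem

item = QuotesItem(title='A quote', author='Somebody')
item['tags'] = ['life', 'books']
print(item['author'])   # Somebody
print(dict(item))       # plain dict holding the populated fields
# Assigning a field that was not declared raises KeyError:
# item['born'] = '1900'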
Run the spider and check the log output to make sure there are no errors; otherwise go back and fix the code above.
scrapy crawl quotes
Storing the scraped data
The simplest way to store the scraped data is by using Feed exports, with the following command:
scrapy crawl quotes -o quotes.json
You can also use other formats, like JSON Lines:
scrapy crawl quotes -o quotes.jl
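A quick way to compare the two output formats (a sketch; it assumes the commands above left quotes.json and quotes.jl in the current directory):
import json

# quotes.json is a single JSON array of items.
with open('quotes.json') as f:
    items = json.load(f)
print(len(items), items[0]['author'])

# quotes.jl is JSON Lines: one JSON object per line, handy for streaming.
with open('quotes.jl') as f:
    items = [json.loads(line) for line in f]
print(len(items), items[0]['author'])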
Following links
import scrapy
from tutorial.items import QuotesItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = QuotesItem()
            item['title'] = quote.css('span.text::text')[0].extract()
            item['author'] = quote.css('small.author::text')[0].extract()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item
        # The last page has no "next" link, so check the selector list before
        # indexing into it (extract_first() is not available in 0.24.1).
        next_page = response.css('li.next a::attr(href)')
        if next_page:
            next_page = 'http://quotes.toscrape.com' + next_page[0].extract()
            yield scrapy.Request(next_page, callback=self.parse)
The key step is response.css('li.next a::attr(href)')[0].extract(), which returns '/page/2/'; yielding a scrapy.Request with parse as its callback then crawls '/page/2/' in turn, and that is how link following is achieved.
Newer versions provide response.urljoin so the URL no longer has to be concatenated by hand,
and the latest versions also add response.follow, which folds the response.urljoin + scrapy.Request pair into a single call.
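For reference, on a newer Scrapy (1.4 or later, where extract_first(), response.urljoin and response.follow are all available) the same spider could be sketched roughly like this:
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # Newer Scrapy accepts plain dicts as scraped items.
            yield {
                'title': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }
        next_page = response.css('li.next a::attr(href)').extract_first()
        if next_page is not None:
            # response.urljoin would resolve '/page/2/' against the response URL:
            #     yield scrapy.Request(response.urljoin(next_page), callback=self.parse)
            # response.follow does the join and builds the Request in one step:
            yield response.follow(next_page, callback=self.parse)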