Python 3 Crawler Notes 13: A First Look at Scrapy

1. Installing Scrapy

On Windows, in an Anaconda environment, type conda install scrapy in a command window, confirm with y, and wait for the installation to finish. Once it is done, run scrapy version; if a version number is printed, Scrapy is ready to use.


2. Scrapy commands

Typing scrapy -h lists the available commands; the command line is covered in more detail later.

Scrapy 1.3.3 - project: quotetutorial

Usage:
  scrapy <command> [options] [args]

Available commands:
  bench         Run quick benchmark test
  check         Check spider contracts
  commands
  crawl         Run a spider
  edit          Edit spider
  fetch         Fetch a URL using the Scrapy downloader
  genspider     Generate new spider using pre-defined templates
  list          List available spiders
  parse         Parse URL (using its spider) and print the results
  runspider     Run a self-contained spider (without creating a project)
  settings      Get settings values
  shell         Interactive scraping console
  startproject  Create new project
  version       Print Scrapy version
  view          Open URL in browser, as seen by Scrapy

Use "scrapy <command> -h" to see more info about a command

3. Creating a new project

We will crawl the site built for practicing Scrapy: http://quotes.toscrape.com/
Goal: extract each quote's text, author, and tags.

網(wǎng)站樣式

1. In a command window, cd into the folder where you want the project to live.
2. Run scrapy startproject <project-name>; here that is scrapy startproject quotetutorial.
The output suggests the next two commands: cd quotetutorial and scrapy genspider example example.com (that is, cd <your project folder>, then scrapy genspider <spider name> <target domain>). Follow those hints.

C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\quotetutorial

You can start your first spider with:
    cd quotetutorial
    scrapy genspider example example.com

3灭贷、cd quotetutorial移動到創(chuàng)建好的文件夾中

C:\Users\m1812>cd quotetutorial

4. scrapy genspider quotes quotes.toscrape.com generates a spider file quotes.py, with quotes as the spider name and quotes.toscrape.com as the allowed domain.

C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
  quotetutorial.spiders.quotes
Open the project in PyCharm to see the generated layout.

4逃延、Scrapy初窺

1. Modify the parse method in quotes.py to print the page's HTML. For this page, print(response.text) raises an encoding error in the Windows console, so print the raw bytes response.body instead. parse is the spider's default callback: it is invoked for each response downloaded from start_urls, and response holds the result of the page request.


2. Run the spider from a command window with scrapy crawl quotes. Besides the page's HTML, Scrapy prints plenty of other information.

C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n    <link rel="stylesheet" href="/static/bo
otstrap.min.css">\n    <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n    <div class="container">\n        <div class="row header-
box">\n            <div class="col-md-8">\n                <h1>\n                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n
                </h1>\n            </div>\n            <div class="col-md-4">\n                <p>\n                \n                    <a href="/l
ogin">Login</a>\n                \n                </p>\n            </div>\n        </div>\n    \n\n<div class="row">\n    <div class="col-md-8">\n\
n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0The world as we have
 created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1</span>\n        <span>by <small class="author"
itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n        </span>\n        <div class="tags">\n
      Tags:\n            <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" /    > \n            \n
<a class="tag" href="/tag/change/page/1/">change</a>\n            \n            <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n
           \n            <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n            \n            <a class="tag" href="/tag/world/page/1/"
>world</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span cl
ass="text" itemprop="text">\xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1</span>\n        <span>
by <small class="author" itemprop="author">J.K. Rowling</small>\n        <a href="/author/J-K-Rowling">(about)</a>\n        </span>\n        <div cla
ss="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="abilities,choices" /    > \n            \n
<a class="tag" href="/tag/abilities/page/1/">abilities</a>\n            \n            <a class="tag" href="/tag/choices/page/1/">choices</a>\n
     \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop
="text">\xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1
\xb1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\
n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="inspirational,life,l
ive,miracle,miracles" /    > \n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n            \n
  <a class="tag" href="/tag/life/page/1/">life</a>\n            \n            <a class="tag" href="/tag/live/page/1/">live</a>\n            \n
     <a class="tag" href="/tag/miracle/page/1/">miracle</a>\n            \n            <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n
        \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemp
rop="text">\xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1</span>\n        <sp
an>by <small class="author" itemprop="author">Jane Austen</small>\n        <a href="/author/Jane-Austen">(about)</a>\n        </span>\n        <div c
lass="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" /    > \n
\n            <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n            \n            <a class="tag" href="/tag/books/page/1/">books</a
>\n            \n            <a class="tag" href="/tag/classic/page/1/">classic</a>\n            \n            <a class="tag" href="/tag/humor/page/1
/">humor</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span
class="text" itemprop="text">\xa1\xb0Imperfection is beauty, madness is genius and it&#39;s better to be absolutely ridiculous than absolutely boring
.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n        <a href="/author/Marilyn-Monroe">(about)</
a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="be-yourself,inspi
rational" /    > \n            \n            <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n            \n            <a class="tag"
 href="/tag/inspirational/page/1/">inspirational</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemtype="http://s
chema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb
1</span>\n        <span>by <small class="author" itemprop="author">Albert Einstein</small>\n        <a href="/author/Albert-Einstein">(about)</a>\n
      </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="adulthood,success,value
" /    > \n            \n            <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n            \n            <a class="tag" href="/tag/
success/page/1/">success</a>\n            \n            <a class="tag" href="/tag/value/page/1/">value</a>\n            \n        </div>\n    </div>\
n\n    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0It is better to be
 hated for what you are than to be loved for what you are not.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Andr\xa8\xa6
Gide</small>\n        <a href="/author/Andre-Gide">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cla
ss="keywords" itemprop="keywords" content="life,love" /    > \n            \n            <a class="tag" href="/tag/life/page/1/">life</a>\n
  \n            <a class="tag" href="/tag/love/page/1/">love</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemty
pe="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0I have not failed. I&#39;ve just found 10,000 ways that won&
#39;t work.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n        <a href="/author/Thomas-A-Edis
on">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="edis
on,failure,inspirational,paraphrased" /    > \n            \n            <a class="tag" href="/tag/edison/page/1/">edison</a>\n            \n
    <a class="tag" href="/tag/failure/page/1/">failure</a>\n            \n            <a class="tag" href="/tag/inspirational/page/1/">inspirational<
/a>\n            \n            <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n            \n        </div>\n    </div>\n\n    <div c
lass="quote" itemscope itemtype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A woman is like a tea bag; you
never know how strong it is until it&#39;s in hot water.\xa1\xb1</span>\n        <span>by <small class="author" itemprop="author">Eleanor Roosevelt</
small>\n        <a href="/author/Eleanor-Roosevelt">(about)</a>\n        </span>\n        <div class="tags">\n            Tags:\n            <meta cl
ass="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" /    > \n            \n            <a class="tag" href="/tag/misattribut
ed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n            \n        </div>\n    </div>\n\n    <div class="quote" itemscope itemt
ype="http://schema.org/CreativeWork">\n        <span class="text" itemprop="text">\xa1\xb0A day without sunshine is like, you know, night.\xa1\xb1</s
pan>\n        <span>by <small class="author" itemprop="author">Steve Martin</small>\n        <a href="/author/Steve-Martin">(about)</a>\n        </sp
an>\n        <div class="tags">\n            Tags:\n            <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" /    > \n
          \n            <a class="tag" href="/tag/humor/page/1/">humor</a>\n            \n            <a class="tag" href="/tag/obvious/page/1/">obvi
ous</a>\n            \n            <a class="tag" href="/tag/simile/page/1/">simile</a>\n            \n        </div>\n    </div>\n\n    <nav>\n
   <ul class="pager">\n            \n            \n            <li class="next">\n                <a href="/page/2/">Next <span aria-hidden="true">&r
arr;</span></a>\n            </li>\n            \n        </ul>\n    </nav>\n    </div>\n    <div class="col-md-4 tags-box">\n        \n            <
h2>Top Ten tags</h2>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 28px" href="/tag/love/">love</a
>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" href="/tag/inspirationa
l/">inspirational</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 26px" hre
f="/tag/life/">life</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 24px" h
ref="/tag/humor/">humor</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size: 22p
x" href="/tag/books/">books</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="font-size:
 14px" href="/tag/reading/">reading</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="tag" style="fo
nt-size: 10px" href="/tag/friendship/">friendship</a>\n            </span>\n            \n            <span class="tag-item">\n            <a class="
tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n            </span>\n            \n            <span class="tag-item">\n            <a
class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n            </span>\n            \n            <span class="tag-item">\n
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n            </span>\n            \n        \n    </div>\n</div>\n\n    </div>\n
    <footer class="footer">\n        <div class="container">\n            <p class="text-muted">\n                Quotes by: <a href="https://www.goo
dreads.com/quotes">GoodReads.com</a>\n            </p>\n            <p class="copyright">\n                Made with <span class=\'sh-red\'>\x817\xc5
8</span> by <a >Scrapinghub</a>\n            </p>\n        </div>\n    </footer>\n</body>\n</html>'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)

5. Extracting the data

First, inspect the HTML structure around the target information:


1. Edit items.py and declare the three fields to be extracted in the required format:


2. Edit quotes.py to add the extraction rules, mapped onto the fields configured in items.py in step 1.

# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

This uses Scrapy's built-in CSS selectors.
The shell command opens an interactive console for experimenting: scrapy shell "quotes.toscrape.com" (note the double quotes). It makes it easy to see what the CSS selectors above match, and how extract_first() differs from extract().

C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"

2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x000002B3C7410748>
[s]   item       {}
[s]   request    <GET http://quotes.toscrape.com>
[s]   response   <200 http://quotes.toscrape.com>
[s]   settings   <scrapy.settings.Settings object at 0x000002B3C74D97B8>
[s]   spider     <DefaultSpider 'default' at 0x2b3c90e5ba8>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: 

In [1]: response
Out[1]: <200 http://quotes.toscrape.com>

In [2]: quotes = response.css('.quote')

In [3]: quotes
Out[3]: 
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
 <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>]

In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" i
temscope itemtype="h'>

In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" i
temprop="text">“The '>]

In [6]:  quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world a
s we have created it is a pr'>]

In [7]:  quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']

In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing ou
r thinking.”</span>']

In [9]:  quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'

In [10]:  quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']

In [11]: exit()

Now run the spider again with scrapy crawl quotes;
the scraped items appear in the terminal output.


3. Single-page extraction done; next comes pagination. The page URL changes as shown below, and the next page's address can be read from the href attribute of the Next button.


Modify the code in quotes.py:

# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):

        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()   # an Item behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page is not None:  # the last page has no Next link
            url = response.urljoin(next_page)  # build the absolute URL
            yield scrapy.Request(url=url, callback=self.parse)  # same callback: crawl pages recursively
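response.urljoin resolves the relative href from the Next link against the current page's URL; the standard library's urljoin shows the same behavior:

```python
from urllib.parse import urljoin

# The Next link on page 1 is a root-relative href
current_page = 'http://quotes.toscrape.com/'
next_href = '/page/2/'

next_url = urljoin(current_page, next_href)
print(next_url)  # http://quotes.toscrape.com/page/2/
```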

4、保存數(shù)據(jù)
命令行scrapy crawl quotes -o quotes.json卓起,保存為json格式


Or save as JSON Lines: scrapy crawl quotes -o quotes.jl

Or as CSV: scrapy crawl quotes -o quotes.csv
Or as XML: scrapy crawl quotes -o quotes.xml
Or as pickle: scrapy crawl quotes -o quotes.pickle
Or as marshal: scrapy crawl quotes -o quotes.marshal
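The .jl output is JSON Lines: one JSON object per line, convenient for large exports because each line can be parsed on its own. A small sketch with made-up sample lines:

```python
import json

# Hypothetical contents of quotes.jl: one JSON object per line
sample = (
    '{"text": "...", "author": "Albert Einstein", "tags": ["change"]}\n'
    '{"text": "...", "author": "J.K. Rowling", "tags": ["abilities"]}\n'
)

# Parse each non-empty line independently
items = [json.loads(line) for line in sample.splitlines() if line]
print(items[0]['author'])  # Albert Einstein
```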

5. Dropping unwanted items, and saving to a database
Both are handled by modifying pipelines.py.


這里限定了字符最大為50個字符,超過的部分在后面添加...
同時定義了mongodb的保存函數(shù)

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo

from scrapy.exceptions import DropItem

# 這里限定了最大字符個數(shù)為50,超過用省略號代替
class QuotetutorialPipeline(object):

    def __init__(self):
        self.limit = 50

    # 只能返回兩種值币喧,item和DropItem
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
                return item
        else:
            return DropItem('Missing Text')

class MongoPipeline(object):

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))  # insert_one replaces the deprecated insert()
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()
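The truncation rule in QuotetutorialPipeline can be checked in isolation with plain Python:

```python
# Same rule as QuotetutorialPipeline.process_item, applied to one string
limit = 50
text = 'The world as we have created it is a process of our thinking.'

if len(text) > limit:
    # keep the first 50 chars, drop trailing whitespace, mark the cut
    text = text[0:limit].rstrip() + '...'

print(text)
```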

This also requires changes to settings.py.

Uncomment the ITEM_PIPELINES section in settings.py; the numbers are priorities, and lower numbers run earlier.
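A sketch of the relevant settings.py section; the pipeline paths follow the project layout above, while the MONGO_URI and MONGO_DB values are assumptions to adapt to your setup:

```python
# settings.py (excerpt)
# Lower number = higher priority: truncate first, then store in MongoDB
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

# Assumed local MongoDB instance and database name
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'
```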

Re-run the spider: scrapy crawl quotes.

The saved data now also shows up in MongoDB.


Adapted from blogger Cui Qingcai (崔慶才)'s Scrapy tutorial.
