1. Installing Scrapy
On Windows, in an Anaconda environment, open a command window and run conda install scrapy, then type y to confirm and wait for the installation to finish. Once it is done, run scrapy version in the window; if a version number is printed, Scrapy is ready to use.
2. Scrapy commands
Run scrapy -h to list the available commands; the command line will be covered in more detail later.
Scrapy 1.3.3 - project: quotetutorial
Usage:
scrapy <command> [options] [args]
Available commands:
bench Run quick benchmark test
check Check spider contracts
commands
crawl Run a spider
edit Edit spider
fetch Fetch a URL using the Scrapy downloader
genspider Generate new spider using pre-defined templates
list List available spiders
parse Parse URL (using its spider) and print the results
runspider Run a self-contained spider (without creating a project)
settings Get settings values
shell Interactive scraping console
startproject Create new project
version Print Scrapy version
view Open URL in browser, as seen by Scrapy
Use "scrapy <command> -h" to see more info about a command
3. Creating a new project
The site to crawl is the one provided for testing Scrapy: http://quotes.toscrape.com/
Goal: extract each quote's text, author, and tags.
1. In a command window, use cd to move to the folder where you want to keep the project.
2. Run scrapy startproject <project-name>; here the project is created with scrapy startproject quotetutorial.
The output shows two hints, cd quotetutorial and scrapy genspider example example.com (that is, cd <your project folder>, then scrapy genspider <spider-name> <target-domain>); follow these hints to continue.
C:\Users\m1812>scrapy startproject quotetutorial
New Scrapy project 'quotetutorial', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
C:\Users\m1812\quotetutorial
You can start your first spider with:
cd quotetutorial
scrapy genspider example example.com
3. Run cd quotetutorial to move into the newly created project folder.
C:\Users\m1812>cd quotetutorial
4. Run scrapy genspider quotes quotes.toscrape.com to generate a spider file named quotes.py whose target domain is quotes.toscrape.com.
C:\Users\m1812\quotetutorial>scrapy genspider quotes quotes.toscrape.com
Created spider 'quotes' using template 'basic' in module:
quotetutorial.spiders.quotes
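For reference, the quotes.py produced by the built-in 'basic' template looks roughly like this (a sketch; the exact template varies slightly between Scrapy versions):

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # default callback, invoked with the downloaded response
        pass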
4. A first look at Scrapy
1. Modify the parse function in quotes.py so that it prints the page's HTML. For this page, calling print(response.text) directly raises an encoding error, so the raw response.body is printed instead. parse is the callback that runs once the request has been downloaded; response here is the result returned by the page request.
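A minimal sketch of that change, printing the raw bytes (which is what produces the b'...' output further below):

# -*- coding: utf-8 -*-
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # response.text needs the page's encoding to decode cleanly in this console,
        # so print the raw bytes instead
        print(response.body)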
2. In the command window, run
scrapy crawl quotes
to start the spider. Besides the page's HTML, Scrapy prints a lot of other information.
C:\Users\m1812\quotetutorial>scrapy crawl quotes
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 19:50:11 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'SPIDER_MODULES': ['quotetutorial.spi
ders'], 'BOT_NAME': 'quotetutorial', 'ROBOTSTXT_OBEY': True}
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.logstats.LogStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 19:50:11 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 19:50:11 [scrapy.core.engine] INFO: Spider opened
2019-04-05 19:50:11 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-05 19:50:11 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 19:50:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
b'<!DOCTYPE html>\n<html lang="en">\n<head>\n\t<meta charset="UTF-8">\n\t<title>Quotes to Scrape</title>\n <link rel="stylesheet" href="/static/bo
otstrap.min.css">\n <link rel="stylesheet" href="/static/main.css">\n</head>\n<body>\n <div class="container">\n <div class="row header-
box">\n <div class="col-md-8">\n <h1>\n <a href="/" style="text-decoration: none">Quotes to Scrape</a>\n
</h1>\n </div>\n <div class="col-md-4">\n <p>\n \n <a href="/l
ogin">Login</a>\n \n </p>\n </div>\n </div>\n \n\n<div class="row">\n <div class="col-md-8">\n\
n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0The world as we have
created it is a process of our thinking. It cannot be changed without changing our thinking.\xa1\xb1</span>\n <span>by <small class="author"
itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n </span>\n <div class="tags">\n
Tags:\n <meta class="keywords" itemprop="keywords" content="change,deep-thoughts,thinking,world" / > \n \n
<a class="tag" href="/tag/change/page/1/">change</a>\n \n <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>\n
\n <a class="tag" href="/tag/thinking/page/1/">thinking</a>\n \n <a class="tag" href="/tag/world/page/1/"
>world</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span cl
ass="text" itemprop="text">\xa1\xb0It is our choices, Harry, that show what we truly are, far more than our abilities.\xa1\xb1</span>\n <span>
by <small class="author" itemprop="author">J.K. Rowling</small>\n <a href="/author/J-K-Rowling">(about)</a>\n </span>\n <div cla
ss="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="abilities,choices" / > \n \n
<a class="tag" href="/tag/abilities/page/1/">abilities</a>\n \n <a class="tag" href="/tag/choices/page/1/">choices</a>\n
\n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop
="text">\xa1\xb0There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.\xa1
\xb1</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\
n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="inspirational,life,l
ive,miracle,miracles" / > \n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>\n \n
<a class="tag" href="/tag/life/page/1/">life</a>\n \n <a class="tag" href="/tag/live/page/1/">live</a>\n \n
<a class="tag" href="/tag/miracle/page/1/">miracle</a>\n \n <a class="tag" href="/tag/miracles/page/1/">miracles</a>\n
\n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemp
rop="text">\xa1\xb0The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.\xa1\xb1</span>\n <sp
an>by <small class="author" itemprop="author">Jane Austen</small>\n <a href="/author/Jane-Austen">(about)</a>\n </span>\n <div c
lass="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="aliteracy,books,classic,humor" / > \n
\n <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>\n \n <a class="tag" href="/tag/books/page/1/">books</a
>\n \n <a class="tag" href="/tag/classic/page/1/">classic</a>\n \n <a class="tag" href="/tag/humor/page/1
/">humor</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span
class="text" itemprop="text">\xa1\xb0Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring
.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Marilyn Monroe</small>\n <a href="/author/Marilyn-Monroe">(about)</
a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="be-yourself,inspi
rational" / > \n \n <a class="tag" href="/tag/be-yourself/page/1/">be-yourself</a>\n \n <a class="tag"
href="/tag/inspirational/page/1/">inspirational</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemtype="http://s
chema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0Try not to become a man of success. Rather become a man of value.\xa1\xb
1</span>\n <span>by <small class="author" itemprop="author">Albert Einstein</small>\n <a href="/author/Albert-Einstein">(about)</a>\n
</span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="adulthood,success,value
" / > \n \n <a class="tag" href="/tag/adulthood/page/1/">adulthood</a>\n \n <a class="tag" href="/tag/
success/page/1/">success</a>\n \n <a class="tag" href="/tag/value/page/1/">value</a>\n \n </div>\n </div>\
n\n <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0It is better to be
hated for what you are than to be loved for what you are not.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Andr\xa8\xa6
Gide</small>\n <a href="/author/Andre-Gide">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta cla
ss="keywords" itemprop="keywords" content="life,love" / > \n \n <a class="tag" href="/tag/life/page/1/">life</a>\n
\n <a class="tag" href="/tag/love/page/1/">love</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemty
pe="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0I have not failed. I've just found 10,000 ways that won&
#39;t work.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Thomas A. Edison</small>\n <a href="/author/Thomas-A-Edis
on">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="edis
on,failure,inspirational,paraphrased" / > \n \n <a class="tag" href="/tag/edison/page/1/">edison</a>\n \n
<a class="tag" href="/tag/failure/page/1/">failure</a>\n \n <a class="tag" href="/tag/inspirational/page/1/">inspirational<
/a>\n \n <a class="tag" href="/tag/paraphrased/page/1/">paraphrased</a>\n \n </div>\n </div>\n\n <div c
lass="quote" itemscope itemtype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0A woman is like a tea bag; you
never know how strong it is until it's in hot water.\xa1\xb1</span>\n <span>by <small class="author" itemprop="author">Eleanor Roosevelt</
small>\n <a href="/author/Eleanor-Roosevelt">(about)</a>\n </span>\n <div class="tags">\n Tags:\n <meta cl
ass="keywords" itemprop="keywords" content="misattributed-eleanor-roosevelt" / > \n \n <a class="tag" href="/tag/misattribut
ed-eleanor-roosevelt/page/1/">misattributed-eleanor-roosevelt</a>\n \n </div>\n </div>\n\n <div class="quote" itemscope itemt
ype="http://schema.org/CreativeWork">\n <span class="text" itemprop="text">\xa1\xb0A day without sunshine is like, you know, night.\xa1\xb1</s
pan>\n <span>by <small class="author" itemprop="author">Steve Martin</small>\n <a href="/author/Steve-Martin">(about)</a>\n </sp
an>\n <div class="tags">\n Tags:\n <meta class="keywords" itemprop="keywords" content="humor,obvious,simile" / > \n
\n <a class="tag" href="/tag/humor/page/1/">humor</a>\n \n <a class="tag" href="/tag/obvious/page/1/">obvi
ous</a>\n \n <a class="tag" href="/tag/simile/page/1/">simile</a>\n \n </div>\n </div>\n\n <nav>\n
<ul class="pager">\n \n \n <li class="next">\n <a href="/page/2/">Next <span aria-hidden="true">&r
arr;</span></a>\n </li>\n \n </ul>\n </nav>\n </div>\n <div class="col-md-4 tags-box">\n \n <
h2>Top Ten tags</h2>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 28px" href="/tag/love/">love</a
>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" href="/tag/inspirationa
l/">inspirational</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 26px" hre
f="/tag/life/">life</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 24px" h
ref="/tag/humor/">humor</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size: 22p
x" href="/tag/books/">books</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="font-size:
14px" href="/tag/reading/">reading</a>\n </span>\n \n <span class="tag-item">\n <a class="tag" style="fo
nt-size: 10px" href="/tag/friendship/">friendship</a>\n </span>\n \n <span class="tag-item">\n <a class="
tag" style="font-size: 8px" href="/tag/friends/">friends</a>\n </span>\n \n <span class="tag-item">\n <a
class="tag" style="font-size: 8px" href="/tag/truth/">truth</a>\n </span>\n \n <span class="tag-item">\n
<a class="tag" style="font-size: 6px" href="/tag/simile/">simile</a>\n </span>\n \n \n </div>\n</div>\n\n </div>\n
<footer class="footer">\n <div class="container">\n <p class="text-muted">\n Quotes by: <a href="https://www.goo
dreads.com/quotes">GoodReads.com</a>\n </p>\n <p class="copyright">\n Made with <span class=\'sh-red\'>\x817\xc5
8</span> by <a >Scrapinghub</a>\n </p>\n </div>\n </footer>\n</body>\n</html>'
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-05 19:50:12 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 4, 5, 11, 50, 12, 560342),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 4, 5, 11, 50, 11, 713697)}
2019-04-05 19:50:12 [scrapy.core.engine] INFO: Spider closed (finished)
5. Starting to crawl
First, inspect the HTML structure of the target information:
1. Edit items.py and declare the three fields to extract in the required format:
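A minimal items.py declaring these three fields would look like the following (the class name QuotetutorialItem matches the import used in quotes.py below):

# -*- coding: utf-8 -*-
import scrapy


class QuotetutorialItem(scrapy.Item):
    # the three pieces of information extracted from each quote block
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()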
2. Edit quotes.py, add the extraction rules, and map the fields to the items.py definition from step 1.
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
This uses Scrapy's built-in CSS selectors.
In the command window, the shell command provides an interactive console for experimenting: scrapy shell "quotes.toscrape.com" (note the double quotes). From there we can see what the CSS selectors above actually match and what the difference between extract_first() and extract() is.
C:\Users\m1812\quotetutorial>scrapy shell "quotes.toscrape.com"
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-05 20:08:39 [scrapy.utils.log] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilte
r', 'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'quotetutorial.spiders'}
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-05 20:08:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-05 20:08:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-05 20:08:39 [scrapy.core.engine] INFO: Spider opened
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-05 20:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
2019-04-05 20:08:46 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x000002B3C7410748>
[s] item {}
[s] request <GET http://quotes.toscrape.com>
[s] response <200 http://quotes.toscrape.com>
[s] settings <scrapy.settings.Settings object at 0x000002B3C74D97B8>
[s] spider <DefaultSpider 'default' at 0x2b3c90e5ba8>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
In [1]:
In [1]: response
Out[1]: <200 http://quotes.toscrape.com>
In [2]: quotes = response.css('.quote')
In [3]: quotes
Out[3]:
[<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>,
<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscop
e itemtype="h'>]
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" i
temscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" i
temprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world a
s we have created it is a pr'>]
In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text').extract()
Out[8]: ['<span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing ou
r thinking.”</span>']
In [9]: quotes[0].css('.text::text').extract_first()
Out[9]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
In [11]: exit()
Now run the spider again with scrapy crawl quotes; the scraped items appear among the output in the terminal.
3. The single page is now crawled; next comes pagination. The URL changes from page to page, and the next page's URL can be obtained from the href attribute of the Next button.
Modify the code in quotes.py:
# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["quotes.toscrape.com"]
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()  # behaves like a dict
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        next = response.css('.pager .next a::attr(href)').extract_first()
        url = response.urljoin(next)  # build the absolute URL
        yield scrapy.Request(url=url, callback=self.parse)  # request the next page with the same callback
4. Saving the data
On the command line, scrapy crawl quotes -o quotes.json saves the items in JSON format.
Or save as JSON Lines: scrapy crawl quotes -o quotes.jl
Or as CSV: scrapy crawl quotes -o quotes.csv
Or as XML: scrapy crawl quotes -o quotes.xml
Or as pickle: scrapy crawl quotes -o quotes.pickle
Or as marshal: scrapy crawl quotes -o quotes.marshal
5. Dropping unwanted items or saving to a database
This is done by modifying the code in pipelines.py.
Here the quote text is limited to at most 50 characters; anything longer is truncated and ... is appended. A pipeline that saves items to MongoDB is also defined.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymongo
from scrapy.exceptions import DropItem


# Limits the text to at most 50 characters; longer text is truncated and an ellipsis appended
class QuotetutorialPipeline(object):
    def __init__(self):
        self.limit = 50

    # process_item must return the item, or raise DropItem to discard it
    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert(dict(item))  # insert_one in newer pymongo versions
        return item

    def close_spider(self, spider):
        # close the MongoDB connection
        self.client.close()
Finally, modify the related settings in settings.py.
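A sketch of the additions to settings.py, assuming a local MongoDB instance; MONGO_URI and MONGO_DB are the keys read in MongoPipeline.from_crawler, while the pipeline priorities and database name are illustrative values:

# settings.py (additions)

# register both pipelines; lower numbers run first,
# so the text-trimming pipeline runs before the MongoDB one
ITEM_PIPELINES = {
    'quotetutorial.pipelines.QuotetutorialPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

# example values for a local MongoDB instance
MONGO_URI = 'localhost'
MONGO_DB = 'quotetutorial'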
Rerun scrapy crawl quotes on the command line, and the saved data can also be seen in MongoDB.
Adapted from the Scrapy tutorial by 崔慶才 (Cui Qingcai).