1. Create a Scrapy project
scrapy startproject quotetutorial
2. Change into the quotetutorial project folder just created and generate a spider for the project
scrapy genspider quotes quotes.toscrape.com
A quotes.py file has now been generated in the quotetutorial/quotetutorial/spiders folder.
Its contents are as follows:
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'  # name of the spider, used with "scrapy crawl"
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']  # the URL given to genspider

    def parse(self, response):
        pass
The project files so far:
scrapy.cfg - specifies the settings module and the deployment configuration:
[settings]
default = quotetutorial.settings
[deploy]
#url = http://localhost:6800/
project = quotetutorial
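To make the two sections concrete, here is a minimal sketch that reads the same configuration with Python's standard configparser module. This is not how Scrapy itself loads the file; the content is inlined as a string (an assumption made so the sketch runs without a project on disk).

```python
import configparser

# Contents of scrapy.cfg as shown above, inlined so the sketch runs anywhere.
SCRAPY_CFG = """\
[settings]
default = quotetutorial.settings

[deploy]
#url = http://localhost:6800/
project = quotetutorial
"""

parser = configparser.ConfigParser()
parser.read_string(SCRAPY_CFG)

# [settings] default names the Python module that holds the project settings.
settings_module = parser.get("settings", "default")

# [deploy] project is the project name used when deploying.
deploy_project = parser.get("deploy", "project")

# The url line is commented out, so the option is absent and the fallback is used.
deploy_url = parser.get("deploy", "url", fallback=None)

print(settings_module)
print(deploy_project)
print(deploy_url)
```

The settings value quotetutorial.settings is a module path, which is why it points at settings.py inside the inner quotetutorial package.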
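1. items.py - defines the data structures to be saved
2. middlewares.py - middlewares for the crawler
3. pipelines.py - defines item pipelines
4. settings.py - configuration
All spiders are written in the spiders folder.
Let's add a print statement to the parse method: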
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        print(response.text)
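The parse method is called after a page has been fetched. Here its body is changed to print(response.text); the spider is then run as follows.
3. Run the spider
There is a second quotetutorial folder inside quotetutorial; run the command from the outer quotetutorial folder: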
scrapy crawl quotes
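The working directory matters because Scrapy identifies the project by its scrapy.cfg file. As far as I know, it looks in the current directory and then in its parents. The sketch below imitates that lookup with the standard library only; find_scrapy_cfg is a hypothetical helper, not Scrapy's own function, and the folders are created in a temporary directory.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_scrapy_cfg(start: Path) -> Optional[Path]:
    """Return the nearest scrapy.cfg in `start` or one of its parents."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / "scrapy.cfg"
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    # Recreate the outer/inner quotetutorial layout in a scratch directory.
    outer = Path(tmp) / "quotetutorial"
    inner = outer / "quotetutorial"
    inner.mkdir(parents=True)
    cfg = outer / "scrapy.cfg"
    cfg.write_text("[settings]\ndefault = quotetutorial.settings\n")

    expected = cfg.resolve()
    found_from_outer = find_scrapy_cfg(outer)
    found_from_inner = find_scrapy_cfg(inner)

# Both lookups end at the scrapy.cfg in the outer folder.
print(found_from_outer == expected)
print(found_from_inner == expected)
```

Under this assumption the outer folder is the natural place to run commands from, since that is where scrapy.cfg lives.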
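Running the command produces log output like the following. It reports what the Scrapy framework did: version information, system information, spider information, the enabled middlewares, and statistics on the crawled pages. The output of print(response.text) also appears in it.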
D:\study\bandwagon\repository\spider\scrapy\quotetutorial>scrapy crawl quotes
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotetutorial)
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0,
Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1,
Platform Windows-10-10.0.17134-SP0
2019-02-27 21:58:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotetutorial.spiders']}
2019-02-27 21:58:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-27 21:58:23 [scrapy.core.engine] INFO: Spider opened
2019-02-27 21:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-27 21:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-27 21:58:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-02-27 21:58:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
**... the response.text from above is printed here; omitted for brevity ...**
</html>
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-27 21:58:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 2701,
'downloader/response_count': 2,
'downloader/response_status_count/200': 1,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2019, 2, 27, 13, 58, 23, 304498)}
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Spider closed (finished)
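The stats dump at the end of the log is an ordinary Python dict, so its values can be worked with directly. The sketch below copies four entries from the dump above and derives the run time and the total number of responses. Note that the stats times read 13:58 while the log lines read 21:58; the eight-hour gap suggests the stats are recorded in UTC and the log lines in local time, which is an inference from this output.

```python
import datetime

# Entries copied from the stats dump above.
stats = {
    "start_time": datetime.datetime(2019, 2, 27, 13, 58, 23, 304498),
    "finish_time": datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
    "downloader/response_status_count/200": 1,
    "downloader/response_status_count/404": 1,
}

# Wall-clock duration of the crawl.
elapsed_seconds = (stats["finish_time"] - stats["start_time"]).total_seconds()

# Sum the per-status counters: the 404 is robots.txt, the 200 is the start page.
status_prefix = "downloader/response_status_count/"
total_responses = sum(
    count for key, count in stats.items() if key.startswith(status_prefix)
)

print(elapsed_seconds)
print(total_responses)
```

The total of 2 matches the 'downloader/response_count' entry in the dump.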
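4. Export the crawl results to files in different formats or to an FTP server
Pass the -o option followed by a file name; the file extension selects the output format. Any one of the following targets can be used:
scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.jl
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:passwd@ftp.xxx.com/path/quotes.json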
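Of these formats, .json and .jl are easy to confuse: .json holds a single JSON array, while .jl (JSON Lines) holds one JSON object per line. The sketch below illustrates the two layouts with the standard json module. It is not Scrapy's exporter code, and the items are made-up placeholders.

```python
import json

# Hypothetical scraped items, standing in for whatever the spider yields.
items = [
    {"text": "first quote", "tags": ["a", "b"]},
    {"text": "second quote", "tags": []},
]

# .json layout: one JSON array containing every item.
as_json = json.dumps(items)

# .jl layout: one JSON object per line, no enclosing array.
as_jl = "\n".join(json.dumps(item) for item in items)

# Reading back: the array is parsed in one go, the lines one at a time.
from_json = json.loads(as_json)
from_jl = [json.loads(line) for line in as_jl.splitlines()]

print(as_json)
print(as_jl)
```

Because each line of a .jl file is complete on its own, it can be appended to and processed line by line, which tends to suit large crawls.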
5. The scrapy shell interactive command-line mode
scrapy shell quotes.toscrape.com
In [1]: quotes = response.css('.quote')
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]
In Scrapy, a CSS selector can end in ::text to select the text content of the matched elements.
In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text::text').extract_first()
Out[8]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [9]: quotes[0].css('.tags .tag::text').extract_first()
Out[9]: 'change'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']
As the four inputs and outputs above show, extract_first() returns the first match, while extract() returns all matches as a list. So use extract_first() when a single result is expected and extract() when there are several.