Scrapy Basics Notes 1: Creating and Running a Project

1. Create a Scrapy project

scrapy startproject quotetutorial

2. Change into the quotetutorial project folder you just created and generate a spider for the project

scrapy genspider quotes quotes.toscrape.com

A quotes.py file has now been generated in the quotetutorial/quotetutorial/spiders folder.

Its contents are as follows:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'  # name of the spider, used by `scrapy crawl`
        allowed_domains = ['quotes.toscrape.com']
        start_urls = ['http://quotes.toscrape.com/']  # the URL specified above

        def parse(self, response):
            pass
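The allowed_domains list limits which links the spider will follow: Scrapy's offsite filtering drops requests whose host is neither a listed domain nor one of its subdomains. The function below is only a simplified standard-library model of that rule, and is_allowed is a hypothetical name, not part of Scrapy's API.

```python
from urllib.parse import urlparse

ALLOWED_DOMAINS = ["quotes.toscrape.com"]  # same value as in the generated spider


def is_allowed(url, allowed_domains=ALLOWED_DOMAINS):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Hypothetical helper that models the offsite rule; not part of Scrapy.
    """
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


print(is_allowed("http://quotes.toscrape.com/page/2/"))  # True
print(is_allowed("http://example.com/"))                 # False
```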

The file structure so far:

[image: screenshot of the project file structure]

scrapy.cfg specifies the settings module and the deployment configuration:

[settings]
default = quotetutorial.settings
[deploy]
#url = http://localhost:6800/
project = quotetutorial
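scrapy.cfg is an INI-style file, so its contents can be inspected with Python's standard configparser module. The sketch below parses the same text from a string (rather than from disk) to show that the default key in the [settings] section names the settings module. It is meant only to illustrate the file's structure.

```python
import configparser

# Same content as the scrapy.cfg shown above.
CFG_TEXT = """
[settings]
default = quotetutorial.settings

[deploy]
#url = http://localhost:6800/
project = quotetutorial
"""

parser = configparser.ConfigParser()
parser.read_string(CFG_TEXT)

settings_module = parser.get("settings", "default")
deploy_project = parser.get("deploy", "project")
has_deploy_url = parser.has_option("deploy", "url")  # commented out, so absent

print(settings_module)  # quotetutorial.settings
print(deploy_project)   # quotetutorial
print(has_deploy_url)   # False
```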

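1. items.py - defines the data structures for scraped items
2. middlewares.py - the project's middlewares (spider and downloader)
3. pipelines.py - defines item pipelines
4. settings.py - project configuration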
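As an illustration of what items.py is for, here is a minimal sketch of a data structure for one quote. The generated items.py uses scrapy.Item with scrapy.Field; this sketch swaps in a plain dataclass so that it runs without Scrapy installed (newer Scrapy releases, 2.2 and later, also accept dataclasses as items, which does not apply to the 1.5.1 used in these notes). The QuoteItem name and its two fields are assumptions made for illustration.

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class QuoteItem:
    """Hypothetical item holding one scraped quote."""
    text: str
    tags: List[str] = field(default_factory=list)  # fresh list per instance


item = QuoteItem(text="placeholder quote text", tags=["change", "world"])
item_dict = asdict(item)
print(item_dict)
```

Declaring the fields in one place gives every spider and pipeline in the project the same idea of what a scraped record looks like.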

All spiders live in the spiders folder.

Let's add a print call to the parse method:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # Print the raw HTML of the downloaded page
        print(response.text)

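Scrapy calls the parse method after the page has been downloaded. Here it is changed to print(response.text); the steps below run the spider.

3. Run the spider

Inside quotetutorial there is a second, nested quotetutorial folder. Run the following from the outer quotetutorial folder: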

scrapy crawl quotes
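The command has to be run from inside the project because Scrapy locates the project through its scrapy.cfg file. The sketch below models that lookup with the standard library: it walks from a starting directory up through its parents until it finds a scrapy.cfg. find_project_root is a hypothetical helper for illustration, not a Scrapy function.

```python
import tempfile
from pathlib import Path


def find_project_root(start):
    """Return the nearest directory at or above `start` containing scrapy.cfg, or None."""
    start = Path(start).resolve()
    for directory in [start, *start.parents]:
        if (directory / "scrapy.cfg").is_file():
            return directory
    return None


# Demo on a throwaway directory tree that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    outer = Path(tmp) / "quotetutorial"
    spiders = outer / "quotetutorial" / "spiders"
    spiders.mkdir(parents=True)
    (outer / "scrapy.cfg").write_text("[settings]\ndefault = quotetutorial.settings\n")

    found = find_project_root(spiders)
    found_name = found.name if found else None
    print(found_name)  # quotetutorial
```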

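Running the crawl prints log output like the following. It records what the Scrapy framework did: version information, system information, spider information, the middlewares in use, and data about the crawled pages. The output of print(response.text) is printed further down as well.

D:\study\bandwagon\repository\spider\scrapy\quotetutorial>scrapy crawl quotes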
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: quotetutorial)
2019-02-27 21:58:22 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0,
Python 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p  14 Aug 2018), cryptography 2.3.1,
Platform Windows-10-10.0.17134-SP0
2019-02-27 21:58:22 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'quotetutorial', 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['quotetutorial.spiders']}
2019-02-27 21:58:22 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-02-27 21:58:23 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-02-27 21:58:23 [scrapy.core.engine] INFO: Spider opened
2019-02-27 21:58:23 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-02-27 21:58:23 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-02-27 21:58:28 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-02-27 21:58:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com/> (referer: None)
<!DOCTYPE html>
<html lang="en">
**... the response.text output is printed here; omitted for brevity ...**
</html>
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Closing spider (finished)
2019-02-27 21:58:31 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 446,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 2, 27, 13, 58, 23, 304498)}
2019-02-27 21:58:31 [scrapy.core.engine] INFO: Spider closed (finished)
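The stats dump records start_time and finish_time as datetime objects, so the duration of the run can be worked out directly. The sketch below copies the relevant values from the log above. Note that the stats read 13:58 while the log lines read 21:58; this is consistent with the stats being recorded in UTC and the log timestamps in a UTC+8 local time, which is an inference from the values rather than something the log states.

```python
import datetime

# Values copied from the "Dumping Scrapy stats" output above.
stats = {
    "start_time": datetime.datetime(2019, 2, 27, 13, 58, 23, 304498),
    "finish_time": datetime.datetime(2019, 2, 27, 13, 58, 31, 73758),
    "downloader/request_count": 2,
    "downloader/response_status_count/200": 1,
    "downloader/response_status_count/404": 1,
}

elapsed = (stats["finish_time"] - stats["start_time"]).total_seconds()
print(f"crawl took {elapsed:.2f} s")  # crawl took 7.77 s
```

The two responses counted in the stats are the robots.txt request (404) and the start URL (200), matching the two "Crawled" lines in the log.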

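4. Export the scraped results to files in different formats, or to an FTP server

Pass the output target with the -o option. The format is chosen from the file extension:

scrapy crawl quotes -o quotes.json
scrapy crawl quotes -o quotes.csv
scrapy crawl quotes -o quotes.xml
scrapy crawl quotes -o quotes.pickle
scrapy crawl quotes -o quotes.jl
scrapy crawl quotes -o quotes.marshal
scrapy crawl quotes -o ftp://user:passwd@ftp.xxx.com/path/quotes.json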
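Scrapy only exports the items that parse yields; a parse method that merely prints the page produces no records. To show how three of the text formats differ, the sketch below writes the same stand-in records with the standard library alone: .json is a single JSON array, .jl (JSON Lines) is one JSON object per line, and .csv is a header row plus one row per record. It imitates the general shape of each format and does not call Scrapy's exporters; the records are hypothetical sample data.

```python
import csv
import io
import json

# Stand-in for items a spider would yield (hypothetical sample data).
records = [
    {"text": "first sample quote", "tags": "change,world"},
    {"text": "second sample quote", "tags": "thinking"},
]

# .json: one JSON array holding every record
as_json = json.dumps(records)

# .jl (JSON Lines): one JSON object per line
as_jl = "\n".join(json.dumps(r) for r in records)

# .csv: header row followed by one row per record
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["text", "tags"])
writer.writeheader()
writer.writerows(records)
as_csv = buf.getvalue()

print(as_json)
print(as_jl)
print(as_csv)
```

JSON Lines is convenient for large crawls because each line can be written and read independently, whereas a JSON array has to be parsed as a whole.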

5. The scrapy shell interactive mode

scrapy shell quotes.toscrape.com

In [1]: quotes = response.css('.quote')
In [4]: quotes[0]
Out[4]: <Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' quote ')]" data='<div class="quote" itemscope itemtype="h'>
In [5]: quotes[0].css('.text')
Out[5]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]" data='<span class="text" itemprop="text">“The '>]
In [6]: quotes[0].css('.text::text')
Out[6]: [<Selector xpath="descendant-or-self::*[@class and contains(concat(' ', normalize-space(@class), ' '), ' text ')]/text()" data='“The world as we have created it is a pr'>]

In Scrapy, a CSS selector can end with ::text to select an element's text

In [7]: quotes[0].css('.text::text').extract()
Out[7]: ['“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”']
In [8]: quotes[0].css('.text::text').extract_first()
Out[8]: '“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
In [9]: quotes[0].css('.tags .tag::text').extract_first()
Out[9]: 'change'
In [10]: quotes[0].css('.tags .tag::text').extract()
Out[10]: ['change', 'deep-thoughts', 'thinking', 'world']

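The four inputs and outputs above show that extract_first() returns the first match, while extract() returns all matches as a list. So extract_first() suits lookups that should have a single result, and extract() suits lookups that return many.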
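One practical difference the session above does not show is the no-match case. In Scrapy, extract() on an empty selection returns an empty list, and extract_first() returns None (or a supplied default) instead of raising an error. The plain-Python model below mimics that behaviour on ordinary lists; the two functions are hypothetical stand-ins, not the selector implementation.

```python
def extract(matches):
    """Model of extract(): every match, always as a list."""
    return list(matches)


def extract_first(matches, default=None):
    """Model of extract_first(): the first match, or `default` when there is none."""
    for match in matches:
        return match
    return default


tags = ["change", "deep-thoughts", "thinking", "world"]  # values from Out[10] above
no_match = []

print(extract_first(tags))             # change
print(extract(tags))                   # ['change', 'deep-thoughts', 'thinking', 'world']
print(extract_first(no_match))         # None
print(extract_first(no_match, "n/a"))  # n/a
print(extract(no_match))               # []
```

Because of this, extract_first() is the safer choice than indexing extract()[0] when an element may be missing from the page. Newer Scrapy versions also offer get() and getall() as names for the same two operations.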
