Python 3 Scraping Notes 14: Scrapy Commands

Command format: scrapy <command> [options] [args]

command       purpose                                                     scope
crawl         run a spider and start a crawl                              project only
check         run contract checks on the project's spiders                project only
list          list all spiders in the current project, one per line       project only
edit          open a spider for editing from the command line             project only
parse         fetch a URL and parse it with the specified spider method   project only
bench         run a quick crawl-speed benchmark                           global
fetch         fetch a URL using the Scrapy downloader                     global
genspider     generate a new spider from a pre-defined template           global
runspider     run a self-contained spider (without creating a project)    global
settings      get Scrapy settings values                                  global
shell         open an interactive shell for a URL                         global
startproject  create a new project                                        global
version       print the Scrapy version                                    global
view          open a URL in the browser, as Scrapy sees it                global

1滴劲、創(chuàng)建項目 startproject

scrapy startproject myproject [project_dir]
This creates a new scraping project named myproject under the project_dir path; if project_dir is omitted, the project directory is simply named myproject.

C:\Users\m1812>scrapy startproject mytestproject
New Scrapy project 'mytestproject', using template directory 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\templates\\project', created in:
    C:\Users\m1812\mytestproject

You can start your first spider with:
    cd mytestproject
    scrapy genspider example example.com
C:\Users\m1812>cd mytestproject

C:\Users\m1812\mytestproject>tree
Folder PATH listing
Volume serial number is 5680-D4D0
C:.
└─mytestproject
    ├─spiders
    │  └─__pycache__
    └─__pycache__

2跪楞、生成爬蟲 genspider

Inside the directory created above:
scrapy genspider mydomain mydomain.com

C:\Users\m1812\mytestproject>scrapy genspider baidu www.baidu.com
Created spider 'baidu' using template 'basic' in module:
  mytestproject.spiders.baidu
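
The generated mytestproject/spiders/baidu.py comes from the 'basic' template and looks roughly like this (a sketch; exact contents vary a little between Scrapy versions):

import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["www.baidu.com"]
    start_urls = ['http://www.baidu.com/']

    def parse(self, response):
        # callback invoked for each downloaded response
        pass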



A look at genspider's detailed usage:

C:\Users\m1812\mytestproject>scrapy genspider -h
Usage
=====
  scrapy genspider [options] <name> <domain>

Generate new spider using pre-defined templates

Options
=======
--help, -h              show this help message and exit
--list, -l              List available templates
--edit, -e              Edit spider after creating it
--dump=TEMPLATE, -d TEMPLATE
                        Dump template to standard output
--template=TEMPLATE, -t TEMPLATE
                        Uses a custom template.
--force                 If the spider already exists, overwrite it with the
                        template

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Using a template: -t TEMPLATE
Available template types:

C:\Users\m1812\mytestproject>scrapy genspider -l
Available templates:
  basic
  crawl
  csvfeed
  xmlfeed

Testing a template:

C:\Users\m1812\mytestproject>scrapy genspider -t crawl zhihu www.zhihu.com
Created spider 'zhihu' using template 'crawl' in module:
  mytestproject.spiders.zhihu
Compare this with the baidu spider generated earlier.
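
The 'crawl' template builds on CrawlSpider instead of the plain Spider; the generated zhihu.py looks roughly like this (a sketch; the rule's allow pattern is just the template's placeholder):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ZhihuSpider(CrawlSpider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://www.zhihu.com/']

    rules = (
        # follow links matching the pattern and parse them with parse_item
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        # populate item fields from the response here
        return item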

3. Run a spider: crawl

scrapy crawl <spider>
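
Besides the spider name, crawl also accepts -a NAME=VALUE to pass arguments to the spider's __init__ and -o FILE to dump the scraped items to a feed file (both flags exist in Scrapy 1.x), e.g. scrapy crawl zhihu -o items.json. A sample run: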

C:\Users\m1812\mytestproject>scrapy crawl zhihu
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: mytestproject)
2019-04-06 15:14:18 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'mytestproject.spiders', 'ROBOTSTXT_OBEY': True, 'BOT_NAME': 'mytestproject', 'SPIDER_MODULES': ['mytestproject.spiders']}
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:14:18 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:14:18 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:14:18 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:14:18 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:14:23 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/robots.txt> (referer: None)
2019-04-06 15:14:28 [scrapy.core.engine] DEBUG: Crawled (400) <GET http://www.zhihu.com/> (referer: None)
2019-04-06 15:14:28 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <400 http://www.zhihu.com/>: HTTP status code is not handled or not allowed
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:14:28 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 527,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 813,
 'downloader/response_count': 2,
 'downloader/response_status_count/400': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 14, 28, 947408),
 'log_count/DEBUG': 3,
 'log_count/INFO': 8,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 14, 18, 593508)}
2019-04-06 15:14:28 [scrapy.core.engine] INFO: Spider closed (finished)

Zhihu requires certain request headers (such as a browser User-Agent) before it will serve the page, which is why the status codes above show failures.
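
A minimal sketch of a fix, assuming a browser-like User-Agent is all the site wants: override USER_AGENT just for this spider via custom_settings (or project-wide in settings.py). The UA string below is only an example.

class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    # per-spider settings override; any browser UA string will do here
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    }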

4. Check the code: check

scrapy check [-l] <spider>

C:\Users\m1812\mytestproject>scrapy check

----------------------------------------------------------------------
Ran 0 contracts in 0.000s

OK
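
"Ran 0 contracts" because none are defined. What check actually runs are spider contracts: assertions written into a callback's docstring. A minimal sketch, using the quotes spider from the previous note:

def parse(self, response):
    """Contracts for scrapy check: fetch @url, then assert on the output.

    @url http://quotes.toscrape.com
    @returns items 1 16
    @returns requests 0 10
    """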

If the code is deliberately broken (here one quote mark was deleted from the URL string), running the command again catches the error.

C:\Users\m1812\mytestproject>scrapy check
Traceback (most recent call last):
  File "C:\Users\m1812\Anaconda3\Scripts\scrapy-script.py", line 5, in <module>
    sys.exit(scrapy.cmdline.execute())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\cmdline.py", line 141, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 238, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 129, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\crawler.py", line 325, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 45, in from_settings
    return cls(settings)
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 23, in __init__
    self._load_all_spiders()
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\spiderloader.py", line 32, in _load_all_spiders
    for module in walk_modules(name):
  File "C:\Users\m1812\Anaconda3\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "C:\Users\m1812\Anaconda3\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 986, in _gcd_import
  File "<frozen importlib._bootstrap>", line 969, in _find_and_load
  File "<frozen importlib._bootstrap>", line 958, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 673, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 661, in exec_module
  File "<frozen importlib._bootstrap_external>", line 767, in get_code
  File "<frozen importlib._bootstrap_external>", line 727, in source_to_code
  File "<frozen importlib._bootstrap>", line 222, in _call_with_frames_removed
  File "C:\Users\m1812\mytestproject\mytestproject\spiders\zhihu.py", line 10
    start_urls = [http://www.zhihu.com/']

This command sees little practical use.

5. List the spiders available in a project: list

scrapy list

C:\Users\m1812\mytestproject>scrapy list
baidu
zhihu

6统倒、編輯爬蟲 edit

scrapy edit <spider>
It opens the spider in the editor set by the EDITOR environment variable/setting. It does not really work on Windows and is rarely needed anyway; editing in an IDE such as PyCharm works fine.

7耸成、獲取URL fetch

This is a global command: scrapy fetch [options] <url>
Detailed usage:

C:\Users\m1812\mytestproject>scrapy fetch -h
Usage
=====
  scrapy fetch [options] <url>

Fetch a URL using the Scrapy downloader and print its content to stdout. You
may want to use --nolog to disable logging

Options
=======
--help, -h              show this help message and exit
--spider=SPIDER         use this spider
--headers               print response HTTP headers instead of body
--no-redirect           do not handle HTTP 3xx status codes and print response
                        as-is

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Fetching Baidu as a test. Note that the URL must include the http:// scheme. (The garbled characters in the HTML below are just the Windows console mis-decoding Baidu's UTF-8 page.)

C:\Users\m1812>scrapy fetch http://www.baidu.com
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 15:44:51 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 15:44:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider opened
2019-04-06 15:44:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 15:44:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 15:44:51 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 15:44:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 989960),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 7, 44, 51, 759268)}
2019-04-06 15:44:51 [scrapy.core.engine] INFO: Spider closed (finished)
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>鐧懼害涓€涓嬶紝浣犲氨鐭ラ亾</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=鐧懼害涓€涓?class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>鏂伴椈</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>鍦板浘</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>瑙嗛</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>璐村惂</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>鐧誨綍</a> </noscript> <script>document.write('<a + encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">鐧誨綍</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">鏇村浜у搧</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>鍏充簬鐧懼害</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>浣跨敤鐧懼害鍓嶅繀璇?/a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>鎰忚鍙嶉</a>&nbsp;浜琁CP璇?30173鍙?nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


Trying it with logging disabled:

C:\Users\m1812>scrapy fetch --nolog http://www.baidu.com
<!DOCTYPE html>
<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>鐧懼害涓€涓嬶紝浣犲氨鐭ラ亾</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=鐧懼害涓€涓?class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>鏂伴椈</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>鍦板浘</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>瑙嗛</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>璐村惂</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>鐧誨綍</a> </noscript> <script>document.write('<a + encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ '" name="tj_login" class="lb">鐧誨綍</a>');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">鏇村浜у搧</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>鍏充簬鐧懼害</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>浣跨敤鐧懼害鍓嶅繀璇?/a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>鎰忚鍙嶉</a>&nbsp;浜琁CP璇?30173鍙?nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>


Fetching the headers:

C:\Users\m1812>scrapy fetch --nolog --headers http://www.baidu.com
> User-Agent: Scrapy/1.3.3 (+http://scrapy.org)
> Accept-Language: en
> Accept-Encoding: gzip,deflate
> Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>
< Date: Sat, 06 Apr 2019 07:48:42 GMT
< Server: bfe/1.0.8.18
< Content-Type: text/html
< Last-Modified: Mon, 23 Jan 2017 13:28:12 GMT
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Pragma: no-cache
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

There are further options as well, such as --no-redirect, which disables handling of HTTP 3xx redirects.

8井氢、以Scrapy所見在瀏覽器中打開URL view

This is a global command: scrapy view [options] <url>
It opens the URL in a browser, showing the content exactly as Scrapy received it. Sometimes a spider sees a different page than a regular browser would, and this command lets you check whether what the spider sees matches what you expect.
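
The same check is available from inside a spider callback through scrapy.utils.response.open_in_browser, handy for inspecting a single response mid-crawl (a sketch):

from scrapy.utils.response import open_in_browser

def parse(self, response):
    # writes the response body to a temp file and opens it in the default browser
    open_in_browser(response)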

C:\Users\m1812>scrapy view http://www.baidu.com
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:01:45 [scrapy.utils.log] INFO: Overridden settings: {}
2019-04-06 16:01:45 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:01:46 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:01:46 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:01:46 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:01:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:01:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 211,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 1476,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 435330),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 1, 46, 78537)}
2019-04-06 16:01:46 [scrapy.core.engine] INFO: Spider closed (finished)

Testing it on Taobao, much of the page fails to load: Taobao fills the page in with Ajax, so a plain request cannot retrieve that content.


9烤宙、命令行交互窗口下訪問URL shell

This is a global command: scrapy shell [options] <url>

C:\Users\m1812>scrapy shell http://www.baidu.com
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:11:41 [scrapy.utils.log] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2019-04-06 16:11:41 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:11:42 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:11:42 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:11:42 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:11:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://www.baidu.com> (referer: None)
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
2019-04-06 16:11:42 [traitlets] DEBUG: Using default logger
[s] Available Scrapy objects:
[s]   scrapy     scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s]   crawler    <scrapy.crawler.Crawler object at 0x0000025F58BBA320>
[s]   item       {}
[s]   request    <GET http://www.baidu.com>
[s]   response   <200 http://www.baidu.com>
[s]   settings   <scrapy.settings.Settings object at 0x0000025F593EE6D8>
[s]   spider     <DefaultSpider 'default' at 0x25f5afd1470>
[s] Useful shortcuts:
[s]   fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s]   fetch(req)                  Fetch a scrapy.Request and update local objects
[s]   shelp()           Shell help (print this help)
[s]   view(response)    View response in a browser
In [1]: scrapy
Out[1]: <module 'scrapy' from 'C:\\Users\\m1812\\Anaconda3\\lib\\site-packages\\scrapy\\__init__.py'>

In [2]: request
Out[2]: <GET http://www.baidu.com>

In [3]: response
Out[3]: <200 http://www.baidu.com>

In [4]: view(response)
Out[4]: True

In [5]: response.text
Out[5]: '<!DOCTYPE html>\r\n<!--STATUS OK--><html> <head><meta http-equiv=content-type content=text/html;charset=utf-8><meta http-equiv=X-UA-Compatible content=IE=Edge><meta content=always name=referrer><link rel=stylesheet type=text/css href=http://s1.bdstatic.com/r/www/cache/bdorz/baidu.min.css><title>百度一下,你就知道</title></head> <body link=#0000cc> <div id=wrapper> <div id=head> <div class=head_wrapper> <div class=s_form> <div class=s_form_wrapper> <div id=lg> <img hidefocus=true src=//www.baidu.com/img/bd_logo1.png width=270 height=129> </div> <form id=form name=f action=//www.baidu.com/s class=fm> <input type=hidden name=bdorz_come value=1> <input type=hidden name=ie value=utf-8> <input type=hidden name=f value=8> <input type=hidden name=rsv_bp value=1> <input type=hidden name=rsv_idx value=1> <input type=hidden name=tn value=baidu><span class="bg s_ipt_wr"><input id=kw name=wd class=s_ipt value maxlength=255 autocomplete=off autofocus></span><span class="bg s_btn_wr"><input type=submit id=su value=百度一下 class="bg s_btn"></span> </form> </div> </div> <div id=u1> <a href=http://news.baidu.com name=tj_trnews class=mnav>新聞</a> <a href=http://www.hao123.com name=tj_trhao123 class=mnav>hao123</a> <a href=http://map.baidu.com name=tj_trmap class=mnav>地圖</a> <a href=http://v.baidu.com name=tj_trvideo class=mnav>視頻</a> <a href=http://tieba.baidu.com name=tj_trtieba class=mnav>貼吧</a> <noscript> <a href=http://www.baidu.com/bdorz/login.gif?login&amp;tpl=mn&amp;u=http%3A%2F%2Fwww.baidu.com%2f%3fbdorz_come%3d1 name=tj_login class=lb>登錄</a> </noscript> <script>document.write(\'<a + encodeURIComponent(window.location.href+ (window.location.search === "" ? "?" : "&")+ "bdorz_come=1")+ \'" name="tj_login" class="lb">登錄</a>\');</script> <a href=//www.baidu.com/more/ name=tj_briicon class=bri style="display: block;">更多產(chǎn)品</a> </div> </div> </div> <div id=ftCon> <div id=ftConw> <p id=lh> <a href=http://home.baidu.com>關于百度</a> <a href=http://ir.baidu.com>About Baidu</a> </p> <p id=cp>&copy;2017&nbsp;Baidu&nbsp;<a href=http://www.baidu.com/duty/>使用百度前必讀</a>&nbsp; <a href=http://jianyi.baidu.com/ class=cp-feedback>意見反饋</a>&nbsp;京ICP證030173號&nbsp; <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'

In [6]: response.headers
Out[6]:
{b'Cache-Control': b'private, no-cache, no-store, proxy-revalidate, no-transform',
 b'Content-Type': b'text/html',
 b'Date': b'Sat, 06 Apr 2019 08:11:42 GMT',
 b'Last-Modified': b'Mon, 23 Jan 2017 13:28:12 GMT',
 b'Pragma': b'no-cache',
 b'Server': b'bfe/1.0.8.18',
 b'Set-Cookie': b'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/'}

In [7]: response.css('title::text').extract_first()
Out[7]: '百度一下俭嘁,你就知道'

In [8]: exit()
In [4] shows its result by opening the response in a browser window.
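
A few more things commonly done in the shell (a sketch; the selectors assume the pages keep the markup shown above):

response.xpath('//title/text()').extract_first()   # the same title, via XPath
fetch('http://quotes.toscrape.com')                # point the shell at another URL
response.css('span.text::text').extract_first()    # first quote on the new page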

10. Parse a URL with a specified spider method: parse

scrapy parse <url> [options]
The spider from the previous note, which scrapes quotes.toscrape.com, is used for testing here.

C:\Users\m1812>cd quotetutorial

C:\Users\m1812\quotetutorial>scrapy parse http://quotes.toscrape.com -c parse
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: quotetutorial)
2019-04-06 16:24:23 [scrapy.utils.log] INFO: Overridden settings: {'SPIDER_MODULES': ['quotetutorial.spiders'], 'ROBOTSTXT_OBEY': True, 'NEWSPIDER_MODULE': 'quotetutorial.spiders', 'BOT_NAME': 'quotetutorial'}
2019-04-06 16:24:23 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.corestats.CoreStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:24:24 [scrapy.middleware] INFO: Enabled item pipelines:
['quotetutorial.pipelines.QuotetutorialPipeline',
 'quotetutorial.pipelines.MongoPipeline']
2019-04-06 16:24:24 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:24:24 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:24:24 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2019-04-06 16:24:24 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://quotes.toscrape.com/robots.txt> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://quotes.toscrape.com> (referer: None)
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Closing spider (finished)
2019-04-06 16:24:25 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 444,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 2701,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/404': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 24, 25, 485334),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2019, 4, 6, 8, 24, 24, 258282)}
2019-04-06 16:24:25 [scrapy.core.engine] INFO: Spider closed (finished)

>>> STATUS DEPTH LEVEL 1 <<<
# Scraped Items  ------------------------------------------------------------
[{'author': 'Albert Einstein',
  'tags': ['change', 'deep-thoughts', 'thinking', 'world'],
  'text': '“The world as we have created it is a process of our thinking. It '
          'cannot be changed without changing our thinking.”'},
 {'author': 'J.K. Rowling',
  'tags': ['abilities', 'choices'],
  'text': '“It is our choices, Harry, that show what we truly are, far more '
          'than our abilities.”'},
 {'author': 'Albert Einstein',
  'tags': ['inspirational', 'life', 'live', 'miracle', 'miracles'],
  'text': '“There are only two ways to live your life. One is as though '
          'nothing is a miracle. The other is as though everything is a '
          'miracle.”'},
 {'author': 'Jane Austen',
  'tags': ['aliteracy', 'books', 'classic', 'humor'],
  'text': '“The person, be it gentleman or lady, who has not pleasure in a '
          'good novel, must be intolerably stupid.”'},
 {'author': 'Marilyn Monroe',
  'tags': ['be-yourself', 'inspirational'],
  'text': "“Imperfection is beauty, madness is genius and it's better to be "
          'absolutely ridiculous than absolutely boring.”'},
 {'author': 'Albert Einstein',
  'tags': ['adulthood', 'success', 'value'],
  'text': '“Try not to become a man of success. Rather become a man of '
          'value.”'},
 {'author': 'André Gide',
  'tags': ['life', 'love'],
  'text': '“It is better to be hated for what you are than to be loved for '
          'what you are not.”'},
 {'author': 'Thomas A. Edison',
  'tags': ['edison', 'failure', 'inspirational', 'paraphrased'],
  'text': "“I have not failed. I've just found 10,000 ways that won't work.”"},
 {'author': 'Eleanor Roosevelt',
  'tags': ['misattributed-eleanor-roosevelt'],
  'text': '“A woman is like a tea bag; you never know how strong it is until '
          "it's in hot water.”"},
 {'author': 'Steve Martin',
  'tags': ['humor', 'obvious', 'simile'],
  'text': '“A day without sunshine is like, you know, night.”'}]

# Requests  -----------------------------------------------------------------
[<GET http://quotes.toscrape.com/page/2/>]

The command prints both the Scraped Items and the follow-up Requests.
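
parse has more flags worth knowing (Scrapy 1.x): --pipelines runs the scraped items through the project's item pipelines, -d DEPTH sets how many levels of follow-up requests get parsed, and --noitems / --nolinks trim the output.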

11. Get Scrapy settings values: settings

scrapy settings [options]

C:\Users\m1812\quotetutorial>scrapy settings -h
Usage
=====
  scrapy settings [options]

Get settings values

Options
=======
--help, -h              show this help message and exit
--get=SETTING           print raw setting value
--getbool=SETTING       print setting value, interpreted as a boolean
--getint=SETTING        print setting value, interpreted as an integer
--getfloat=SETTING      print setting value, interpreted as a float
--getlist=SETTING       print setting value, interpreted as a list

Global Options
--------------
--logfile=FILE          log file. if omitted stderr will be used
--loglevel=LEVEL, -L LEVEL
                        log level (default: DEBUG)
--nolog                 disable logging completely
--profile=FILE          write python cProfile stats to FILE
--pidfile=FILE          write process ID to FILE
--set=NAME=VALUE, -s NAME=VALUE
                        set/override setting (may be repeated)
--pdb                   enable pdb on failure

Test:

C:\Users\m1812\quotetutorial>scrapy settings --get MONGO_URI
localhost
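
The same values are reachable from code, since every spider carries the merged project settings (a minimal sketch):

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def parse(self, response):
        # self.settings is the project's Settings object
        mongo_uri = self.settings.get('MONGO_URI')   # 'localhost' in this project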

12罢猪、運行爬蟲 runspider

Unlike crawl, runspider takes a file name (xxx.py) rather than a spider name, so you first have to change into the directory containing the file.
scrapy runspider <spider_file.py>

C:\Users\m1812\quotetutorial>cd quotetutorial

C:\Users\m1812\quotetutorial\quotetutorial>dir
 Volume in drive C has no label.
 Volume Serial Number is 5680-D4D0

 Directory of C:\Users\m1812\quotetutorial\quotetutorial

2019/04/05  22:44    <DIR>          .
2019/04/05  22:44    <DIR>          ..
2019/04/05  20:04               364 items.py
2019/04/05  19:16             1,887 middlewares.py
2019/04/05  22:35             1,431 pipelines.py
2019/04/05  22:44             3,292 settings.py
2019/04/05  22:02    <DIR>          spiders
2017/03/10  23:31                 0 __init__.py
2019/04/06  14:33    <DIR>          __pycache__
               5 File(s)          6,974 bytes
               4 Dir(s)  28,533,673,984 bytes free

C:\Users\m1812\quotetutorial\quotetutorial>cd spiders

C:\Users\m1812\quotetutorial\quotetutorial\spiders>dir
 Volume in drive C has no label.
 Volume Serial Number is 5680-D4D0

 Directory of C:\Users\m1812\quotetutorial\quotetutorial\spiders

2019/04/05  22:02    <DIR>          .
2019/04/05  22:02    <DIR>          ..
2019/04/05  22:02               914 quotes.py
2017/03/10  23:31               161 __init__.py
2019/04/05  22:02    <DIR>          __pycache__
               2 File(s)          1,075 bytes
               3 Dir(s)  28,533,673,984 bytes free
C:\Users\m1812\quotetutorial\quotetutorial\spiders>scrapy runspider quotes.py

The result is the same as running it with crawl.
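
This is what makes runspider useful for one-file experiments: the file only has to contain a Spider subclass, with no project scaffolding. A minimal self-contained example (hypothetical file name standalone_quotes.py):

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {'text': quote.css('span.text::text').extract_first()}

Run it from the file's directory with: scrapy runspider standalone_quotes.py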

13攒磨、顯示版本 version

Shows Scrapy's version information; with -v it also lists the versions of the main dependencies.

C:\Users\m1812\quotetutorial>scrapy version -v
Scrapy    : 1.3.3
lxml      : 3.6.4.0
libxml2   : 2.9.4
cssselect : 1.0.1
parsel    : 1.2.0
w3lib     : 1.17.0
Twisted   : 17.5.0
Python    : 3.5.2 |Anaconda 4.2.0 (64-bit)| (default, Jul  5 2016, 11:41:13) [MSC v.1900 64 bit (AMD64)]
pyOpenSSL : 16.2.0 (OpenSSL 1.0.2j  26 Sep 2016)
Platform  : Windows-10-10.0.17134-SP0

14娩缰、測試爬行速度 bench

C:\Users\m1812>scrapy bench
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Scrapy 1.3.3 started (bot: scrapybot)
2019-04-06 16:43:34 [scrapy.utils.log] INFO: Overridden settings: {'CLOSESPIDER_TIMEOUT': 10, 'LOGSTATS_INTERVAL': 1, 'LOG_LEVEL': 'INFO'}
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.closespider.CloseSpider',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.logstats.LogStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-04-06 16:43:37 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2019-04-06 16:43:37 [scrapy.core.engine] INFO: Spider opened
2019-04-06 16:43:37 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:38 [scrapy.extensions.logstats] INFO: Crawled 61 pages (at 3660 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:39 [scrapy.extensions.logstats] INFO: Crawled 109 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:40 [scrapy.extensions.logstats] INFO: Crawled 157 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:41 [scrapy.extensions.logstats] INFO: Crawled 205 pages (at 2880 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:42 [scrapy.extensions.logstats] INFO: Crawled 245 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:43 [scrapy.extensions.logstats] INFO: Crawled 285 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:44 [scrapy.extensions.logstats] INFO: Crawled 317 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:45 [scrapy.extensions.logstats] INFO: Crawled 357 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:46 [scrapy.extensions.logstats] INFO: Crawled 389 pages (at 1920 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:47 [scrapy.core.engine] INFO: Closing spider (closespider_timeout)
2019-04-06 16:43:47 [scrapy.extensions.logstats] INFO: Crawled 429 pages (at 2400 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.extensions.logstats] INFO: Crawled 445 pages (at 960 pages/min), scraped 0 items (at 0 items/min)
2019-04-06 16:43:48 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 182101,
 'downloader/request_count': 445,
 'downloader/request_method_count/GET': 445,
 'downloader/response_bytes': 1209563,
 'downloader/response_count': 445,
 'downloader/response_status_count/200': 445,
 'finish_reason': 'closespider_timeout',
 'finish_time': datetime.datetime(2019, 4, 6, 8, 43, 48, 395684),
 'log_count/INFO': 18,
 'request_depth_max': 16,
 'response_received_count': 445,
 'scheduler/dequeued': 445,
 'scheduler/dequeued/memory': 445,
 'scheduler/enqueued': 8901,
 'scheduler/enqueued/memory': 8901,
 'start_time': datetime.datetime(2019, 4, 6, 8, 43, 37, 309871)}
2019-04-06 16:43:48 [scrapy.core.engine] INFO: Spider closed (closespider_timeout)

Roughly 2,000-3,000 pages per minute on this machine. bench crawls a locally generated dummy site and does no real parsing, so treat the figure as an upper bound on raw crawl throughput.
