Change to the project directory and run the following on the command line:
scrapy genspider -l
Output:
$ scrapy genspider -l
Available templates:
basic
crawl
csvfeed
xmlfeed
Use the same command to create another spider file for Ganji (ganji.com).
Creating a CrawlSpider
After cd-ing into the project directory, run the following command:
scrapy genspider -t crawl <new spider name> <new site domain>
For example:
scrapy genspider -t crawl ganji2 ganji.com
Output:
>>>Created spider 'ganji2' using template 'crawl' in module:
secondary_zufang.spiders.ganji2
A new file, ganji2.py, appears in the project directory.
In the generated file, change start_urls yourself to the page you actually want to crawl.
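Comparing the crawl and basic templates
Create a basic spider:
scrapy genspider -t basic tmp example.com
Once it has been created, a new file, tmp.py, appears in the project's spiders directory.
This spider is also created from the command line, but compared with the CrawlSpider above it is missing a few things, notably the rules attribute.
A CrawlSpider uses a parse_item method as its callback, and you must not override the parse method in it; doing so may cause errors.
According to the official documentation, CrawlSpider is meant for crawling sites whose links follow regular patterns.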
Debugging with the shell command
cd into the project folder and run scrapy shell <URL>.
For example:
scrapy shell http://bj.ganji.com/wblist/haidian/zufang/
Output:
2018-12-21 12:15:06 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: secondary_zufang)
2018-12-21 12:15:06 [scrapy.utils.log] INFO: Versions: lxml 4.2.4.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 3.6.3 (v3.6.3:2c5fed86e0, Oct 3 2017, 00:32:08) - [GCC 4.2.1 (Apple Inc. build 5666) (dot 3)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i 14 Aug 2018), cryptography 2.3.1, Platform Darwin-17.7.0-x86_64-i386-64bit
2018-12-21 12:15:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'secondary_zufang', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'LOGSTATS_INTERVAL': 0, 'NEWSPIDER_MODULE': 'secondary_zufang.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['secondary_zufang.spiders']}
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-12-21 12:15:06 [scrapy.middleware] INFO: Enabled item pipelines:
['secondary_zufang.pipelines.SecondaryZufangPipeline']
2018-12-21 12:15:06 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-12-21 12:15:06 [scrapy.core.engine] INFO: Spider opened
2018-12-21 12:15:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/robots.txt> (referer: None)
2018-12-21 12:15:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://bj.ganji.com/wblist/haidian/zufang/> (referer: None)
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x10d639860>
[s] item {}
[s] request <GET http://bj.ganji.com/wblist/haidian/zufang/>
[s] response <200 http://bj.ganji.com/wblist/haidian/zufang/>
[s] settings <scrapy.settings.Settings object at 0x10e74d710>
[s] spider <Ganji2Spider 'ganji2' at 0x10edaad30>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
Next, import the link extraction module:
from scrapy.linkextractors import LinkExtractor
Enter the following commands to extract every link on the page:
tmp = LinkExtractor(r'')  # an empty regex matches every link
tmp.extract_links(response)
[Link(url='http://bj.ganji.com/fang/', text='', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chuzu/', text='租房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/ershoufang/', text='二手房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/shangpucs/', text='商鋪', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zhaozu/', text='寫字樓', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/changfang/', text='廠房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/cangkucf/', text='倉庫', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/tudi/', text='土地', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/cheku/', text='車位', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/wblist/haidian/zufang/', text='出租房', fragment='', nofollow=True),
Link(url='http://post.58.com/fang/1/8/s5', text='免費發布信息', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/', text='北京趕集', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/', text='北京租房', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/?pagetype=area', text='區域\n \n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/sub/?pagetype=ditie', text='地鐵\n \n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chaoyang/zufang/', text='朝陽', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/', text='海淀', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/dongcheng/zufang/', text='東城', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/xicheng/zufang/', text='西城', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/chongwen/zufang/', text='崇文', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/xuanwu/zufang/', text='宣武', fragment='', nofollow=False),
... (some similar entries omitted here) ...
Link(url='http://bj.ganji.com/xiaoqu/huayuandonglu16hao/chuzuxq/', text='\n 花園東路16號院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0STWzx3zCRPYhCR2qu8NqUfyBP1ZKMCGY1mbpJGLQe4MLWCBtO3CV1GeEvZYetOdm79IubjBATd84ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjWK1hBvMP5Ogvh79ZdJK950gnTnuV4ut2oMzJts5psgWNQ37EDbog7g&pubid=53973391&apptype=10&psid=152852492202554173951819533&entinfo=36506725264296_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/fuxinglu11haoyuan/chuzuxq/', text='\n 復興路11號院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0Rjk7py_gFkXIyKx8C3feATyBP1ZKMCGY0_lHXr41EYvY6_kCEzXoV_eEvZYetOdm7tUgy8gGYBrIukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjJlshikCCdcyrXame0WrKfkgnTnuV4ut2oMzJts5psgWsF5zmZtTCDw&pubid=53952061&apptype=10&psid=152852492202554173951819533&entinfo=36506153352577_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RK-qBSOCO4X9Aqxb55Bt8ryBP1ZKMCGY2UE2j0rkCcdgL7Z5Dw3ipDeEvZYetOdm63BuhRypVvZ4ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjNQr1dZSEQZQ-CW_Mhoakx0gnTnuV4ut2oMzJts5psgWZGAfLB2zanA&pubid=53897059&apptype=10&psid=152852492202554173951819533&entinfo=36504407043592_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mudanyuandongli/chuzuxq/', text='\n 牡丹園東里...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0TMq88evYBaBSquYHHcYTCiyBP1ZKMCGY1mbpJGLQe4MAc5-aJwKmkPeEvZYetOdm7Aap9GwaXd64ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjENEfUaGPtcBsLmmwuR9Ix0gnTnuV4ut2oMzJts5psgWDhtNAesvD4A&pubid=53903799&apptype=10&psid=152852492202554173951819533&entinfo=36504585975840_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/longxianglu8hao/chuzuxq/', text='\n 龍翔路8號院...\n ', fragment='', nofollow=False),
Link(url='https://jxjump.58.com/service?target=FCADV8oV3os7xtAhI2suhvPnTEJt7VvwSrGZ89jJDSaNiZGPZpk1zEffDjpdRkNz3Q5xoKYl4Bi0ja0RClDRDo67SzOLvSyQZZ8HYyBP1ZKMCGY315cEFUaeoIMHPhud0MxuWeEvZYetOdm772LkDJkdp34ukfCbRGVaWhwAwIAsnVFVGVkJ-frjEcIsiu1SCX0XjmuTv-Y3xRZwiPZB47nGe9UgnTnuV4ut2oMzJts5psgUmMYZH-DENLw&pubid=53895758&apptype=10&psid=152852492202554173951819533&entinfo=36504374330393_0&cookie=|||&fzbref=1&key=&params=rank0830gspriceB2550^desc&gjcity=bj', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/309yiyuanjiashulou/chuzuxq/', text='\n 黑山扈路甲17號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/cuiweilu21haoyuan/chuzuxq/', text='\n 翠微路21號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/landehuating/chuzuxq/', text='\n 蘭德華庭...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/xiaoyingdonglu7haoyuan/chuzuxq/', text='\n 小營東路7號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/taipinglu34haoyuan/chuzuxq/', text='\n 太平路34號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zijinzhangan/chuzuxq/', text='\n 紫金長安...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yongwangjiayuansiqu/chuzuxq/', text='\n 永旺家園四區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/huarunxiangshuwanyiqi/chuzuxq/', text='\n 橡樹灣一期...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yiheshanzhuangbj/chuzuxq/', text='\n 頤和山莊...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/beiwujiayuanxili/chuzuxq/', text='\n 北塢嘉園西里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/tujingjiayuan/chuzuxq/', text='\n 圖景嘉園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yichengxishanhuafuxiyuan/chuzuxq/', text='\n 億城西山華府禧園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wanshoulu18haoyuan/chuzuxq/', text='\n 萬壽路18號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mashenmiaoxiaoqu/chuzuxq/', text='\n 馬神廟小區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zhangzhilu5hao/chuzuxq/', text='\n 財智會館...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zizaixiangshan/chuzuxq/', text='\n 永泰自在香山...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wanshuyuan/chuzuxq/', text='\n 萬樹園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/fenghuangxiaoqu1/chuzuxq/', text='\n 鳳凰小區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/wufulinglongju/chuzuxq/', text='\n 五福玲瓏居...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/lingnanlu30haoyuan/chuzuxq/', text='\n 科委宿舍...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/longxiangzhongqu/chuzuxq/', text='\n 龍鄉小區(中區)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/huangzhuangxiaoqubj/chuzuxq/', text='\n 中國科學院黃莊小區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/qiangyouqinghexincheng/chuzuxq/', text='\n 強佑清河新城...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yiyuanjuyiqi/chuzuxq/', text='\n 頤源居...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/jinyumeiheyuan/chuzuxq/', text='\n 美和園西區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/baoshengli/chuzuxq/', text='\n 寶盛里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/ruiheyuan/chuzuxq/', text='\n 金隅瑞和園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/jingouhelu12haoyuan/chuzuxq/', text='\n 金溝河路12號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/xuezhiyuan/chuzuxq/', text='\n 學知園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/mingguangbeili/chuzuxq/', text='\n 明光北里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/zizhuyuanjia3hao/chuzuxq/', text='\n 紫竹院路甲3號院...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/shanglinxi/chuzuxq/', text='\n 上林溪...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yongtaidongli/chuzuxq/', text='\n 永泰東里...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/beiyisanyuan/chuzuxq/', text='\n 北醫三院家屬區小區...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/taiyueyuan/chuzuxq/', text='\n 太月園(南區)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/yifengzhuangyuan/chuzuxq/', text='\n 頤豐莊園(西區)...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/xiaoqu/shangdijiayuanbj/chuzuxq/', text='\n 上地佳園...\n ', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn2/', text='2', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn3/', text='3', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/haidian/zufang/pn70/', text='70', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/hezu/', text='北京合租房', fragment='', nofollow=False),
Link(url='http://sh.ganji.com/zufang', text='上海租房網', fragment='', nofollow=False),
Link(url='http://zz.ganji.com/zufang', text='鄭州租房網', fragment='', nofollow=False),
Link(url='http://sy.ganji.com/zufang', text='沈陽租房網', fragment='', nofollow=False),
Link(url='http://sz.ganji.com/zufang', text='深圳租房網', fragment='', nofollow=False),
Link(url='http://cd.ganji.com/zufang', text='成都租房網', fragment='', nofollow=False),
Link(url='http://cq.ganji.com/zufang', text='重慶租房網', fragment='', nofollow=False),
Link(url='http://qd.ganji.com/zufang', text='青島租房網', fragment='', nofollow=False),
Link(url='http://wh.ganji.com/zufang', text='武漢租房網', fragment='', nofollow=False),
Link(url='http://tj.ganji.com/zufang', text='天津租房網', fragment='', nofollow=False),
Link(url='http://jn.ganji.com/zufang', text='濟南租房網', fragment='', nofollow=False),
Link(url='http://nj.ganji.com/zufang', text='南京租房網', fragment='', nofollow=False),
Link(url='http://gz.ganji.com/zufang', text='廣州租房網', fragment='', nofollow=False),
Link(url='http://xa.ganji.com/zufang', text='西安租房網', fragment='', nofollow=False),
Link(url='http://hf.ganji.com/zufang', text='合肥租房網', fragment='', nofollow=False),
Link(url='http://sjz.ganji.com/zufang', text='石家莊租房網', fragment='', nofollow=False),
Link(url='http://dl.ganji.com/zufang', text='大連租房網', fragment='', nofollow=False),
Link(url='http://hz.ganji.com/zufang', text='杭州租房網', fragment='', nofollow=False),
Link(url='http://kezhan.58.com/bj/qingnianlvshe/', text='北京青年旅社', fragment='', nofollow=False),
Link(url='http://bj.58.com/xiaoqu/shenggunanli/', text='勝古南里', fragment='', nofollow=False),
Link(url='http://bj.ganji.com/wblist/haidian/zufang/m.anjuke.com/bj/loupan/haidian/', text='海淀樓盤', fragment='', nofollow=False),
Link(url='http://m.anjuke.com/bj/loupan/249388/', text='京漢鉑寓', fragment='', nofollow=False),
Link(url='http://bj.zu.anjuke.com/fangyuan/haidian/', text='海淀租房', fragment='', nofollow=False),
Link(url='http://bj.58.com/pinpaigongyu/646228473643278336/', text='家樂美地', fragment='', nofollow=False),
Link(url='http://www.ganji.com/misc/abouts/index.php?act=about', text='關于Ganji', fragment='', nofollow=True),
Link(url='http://www.ganji.com/tuiguang/index/', text='趕集推廣', fragment='', nofollow=True),
Link(url='http://tuiguang.ganji.com/zhaoshang/agent.htm', text=' 渠道合作 ', fragment='', nofollow=True),
Link(url='http://help.ganji.com/', text='幫助中心', fragment='', nofollow=True),
Link(url='http://help.ganji.com/html/sjbmy/', text='手機號被冒用', fragment='', nofollow=True),
Link(url='http://www.ganji.com/misc/abouts/link.php?act=link', text='友情鏈接', fragment='', nofollow=True),
Link(url='http://www.ganji.com/misc/abouts/index.php?act=job', text='招賢納士', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/quxiandaohang/', text='區縣導航', fragment='', nofollow=False),
Link(url='http://mobile.ganji.com/', text='手機趕集', fragment='', nofollow=True),
Link(url='http://3g.ganji.com/bj_fang1/', text='租房觸屏版', fragment='', nofollow=False)]
Every link on the page was extracted, which is an alarming number.
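Filtering with a regular expression narrows this down to the links we need. A regular expression can only be designed once the link pattern is known.
Some URLs carry key-value pairs after a question mark. Ignore the part after the question mark for now and match only the part before it. The listing page URLs should have this form: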
r'http://bj.ganji.com/zufang/\d+x.shtml'
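Before handing a pattern to LinkExtractor, it can be checked with the standard re module. The sketch below uses the pattern exactly as written above, plus a stricter variant with the dots escaped. The strict variant, the shortened query string, and the last sample URL are assumptions added for illustration, not part of the original session.

```python
import re

# Pattern exactly as written in the text: the dots are unescaped,
# so each "." matches any single character.
LOOSE = re.compile(r"http://bj.ganji.com/zufang/\d+x.shtml")
# Stricter variant with literal dots (an assumption, not from the original).
STRICT = re.compile(r"http://bj\.ganji\.com/zufang/\d+x\.shtml")

SAMPLES = [
    "http://bj.ganji.com/zufang/36525781752728x.shtml",         # detail page
    "http://bj.ganji.com/zufang/36283837290114x.shtml?ding=1",  # query shortened for illustration
    "http://bj.ganji.com/xiaoqu/wanshuyuan/chuzuxq/",           # community page
    "http://bj.ganji.com/haidian/zufang/pn2/",                  # pagination link
    "http://bj.ganji.com/zufang/1xashtml",                      # made-up URL with no dot before "shtml"
]


def matching(pattern, urls):
    """Return the URLs in which the pattern is found anywhere (re.search)."""
    return [url for url in urls if pattern.search(url)]


loose_hits = matching(LOOSE, SAMPLES)
strict_hits = matching(STRICT, SAMPLES)

print(len(loose_hits), len(strict_hits))
```

Both patterns accept the detail pages with or without a query string, because the match does not have to reach the end of the URL. Only the loose pattern accepts the made-up URL, which is why escaping the dots is the safer habit.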
Input:
tmp = LinkExtractor(r'http://bj.ganji.com/zufang/\d+x.shtml')
tmp.extract_links(response)  # returns a list
The output now contains far fewer links:
[Link(url='http://bj.ganji.com/zufang/36525781752728x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36514856072971x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36364461048994x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36283837290114x.shtml?ding=https://short.58.com/zd_p/4d183517-ac8c-4370-a0c1-282763d4a987/?target=dc-16-xgk_hvimob_89368680324775q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36448703569416x.shtml?ding=https://short.58.com/zd_p/57e5e09c-29bc-4db5-b392-46106dfa6069/?target=dc-16-xgk_hvimob_89556048192579q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36452227627012x.shtml?ding=https://short.58.com/zd_p/59f9ea44-183d-40e8-8bcb-e3c7f13874c5/?target=dc-16-xgk_hvimob_89513330930473q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36414250778516x.shtml?ding=https://short.58.com/zd_p/a859ca84-d600-4400-9a95-38f3d762e828/?target=dc-16-xgk_hvimob_89575314006179q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36227844908548x.shtml?ding=https://short.58.com/zd_p/9bfed205-196b-4420-9087-e2e1a3269ddd/?target=dc-16-xgk_hvimob_89330655246156q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455527886345x.shtml?ding=https://short.58.com/zd_p/bce20dd9-651c-428f-a9b5-8c291fc7b376/?target=dc-16-xgk_hvimob_89511130669851q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455272312598x.shtml?ding=https://short.58.com/zd_p/2df7e63a-e90f-4534-9785-a951019f26df/?target=dc-16-xgk_hvimob_89511303873126q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36455485142684x.shtml?ding=https://short.58.com/zd_p/32d7b98e-3e39-40e2-ae4b-7d7c2c76b5b4/?target=dc-16-xgk_hvimob_89511561753965q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/35843472234064x.shtml?ding=https://short.58.com/zd_p/2fb60c53-a4f9-4b09-a4e7-e9674705065a/?target=dc-16-xgk_hvimob_81658503385495q-feykn&end=end', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36005967713804x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36514389880345x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36210224693127x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36515288107668x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36219286813961x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36226224648989x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36523006142221x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36518070380572x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36260836748417x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36496483775505x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36515848489356x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36453349478792x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36448512145167x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36440408016649x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36521935379723x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36364398864531x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36459015677700x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36522777615364x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36360244840342x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36430964540828x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36485009202206x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36470563064452x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36517073349672x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/36505985388960x.shtml', text='\n \n ', fragment='', nofollow=True),
Link(url='http://bj.ganji.com/zufang/34794182085293x.shtml', text='\n \n ', fragment='', nofollow=True)]
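Each of the links above was extracted by LinkExtractor in the shell. Once the same pattern is written into the spider's rules, the spider crawls the matching links directly.
Open the project in PyCharm and replace the regular expression in the Rule with
http://bj.ganji.com/zufang/\d+x.shtml
This step also requires a callback, set here to parse_item, which automatically receives the response for each matched link. To find out which URL a response came from, use response.url.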
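CrawlSpider notes
CrawlSpider inherits from the basic Spider class, so it has every method and attribute that Spider has.
What distinguishes CrawlSpider from Spider is the additional rules attribute, which defines how links are extracted: it quickly finds the URLs that match a regular expression and conveniently hands their responses to a callback function.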
A few notes
1. follow
A boolean that specifies whether the links extracted from a response by this rule should be followed. If callback is None, follow defaults to True; otherwise it defaults to False. When follow is True, the spider keeps following the links that match the rule in the responses of the pages it opens.
Note: if neither callback nor follow is given, follow defaults to True. The extracted links are opened and the rules are applied again to the links inside them; if one of those links matches a rule that has a callback, its response is passed to that callback function for extraction.
2. rules
A list of one or more Rule objects. Each Rule defines a particular behavior for crawling the site. If several rules match the same link, the first one, in the order in which they are defined in this attribute, is used.
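3. URL
The link extraction class LinkExtractor takes the following main parameters:
allow: URLs that match the regular expression (or list of regular expressions) are extracted; if empty, every URL matches.
deny: URLs that match this regular expression (or list of regular expressions) are never extracted.
allow_domains: the domains from which links are extracted.
deny_domains: the domains from which links are never extracted.
restrict_xpaths: XPath expressions that work together with allow to filter links. There is also a similar restrict_css parameter.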
Warning
When writing CrawlSpider rules, avoid using parse as the callback. CrawlSpider uses the parse method to implement its own logic, so if you override parse, the CrawlSpider will stop working.
A related example:
$ scrapy shell http://bj.ganji.com/fang1/
# ......
# (Scrapy log omitted)
>>> from scrapy.linkextractors import LinkExtractor
>>> tmp = LinkExtractor(r'')
>>> len(tmp.extract_links(response))
Out: 875
>>> get_links = LinkExtractor(r'http://bj.ganji.com/fang1/\d+x.htm')
>>> len(get_links.extract_links(response))
Out: 89
In practice, things are not this simple.
Also, if you need to convert the data to JSON, you can use the following statement: