This walkthrough uses 下廚房 (xiachufang.com) as the example site.

Create the spider (inheriting from the CrawlSpider class):

scrapy genspider -t crawl xcfCrawlSpider xiachufang.com

Open the generated spider file:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
# Each link extracted by the generic crawler is wrapped in a Link object
from scrapy.link import Link
from xiachufang.items import XiachufangCaiPuItem,XiachufangTagItem,XiachufangUserInfoItem
class XcfcrawlspiderSpider(CrawlSpider):
    # Spider name
    name = 'xcfCrawlSpider'
    # Domains the spider is allowed to crawl
    allowed_domains = ['xiachufang.com']
    # Start URL; commented out here because start URLs are fed from Redis
    # start_urls = ['http://www.xiachufang.com/category/']
    # redis_key is read by scrapy_redis (whose spiders inherit RedisCrawlSpider);
    # with a plain CrawlSpider, use start_urls instead
    redis_key = 'xcfCrawlSpider:start_urls'
    # Relevant LinkExtractor parameters
    """
    allow = (): regular expressions; every URL matching one of them
        is extracted. If empty, all URLs are extracted.
    restrict_xpaths = (): XPath expressions locating the tags (nodes)
        under which URLs are extracted.
    restrict_css = (): same, using CSS selectors.
    """
    rules = (
        # Category list pages, e.g.
        # http://www.xiachufang.com/category/40073/
        Rule(
            LinkExtractor(allow=r'.*?/category/\d+/'),
            follow=True,
            process_links='check_category_url'
        ),
        # Recipe detail pages, e.g.
        # http://www.xiachufang.com/recipe/1055105/
        Rule(
            LinkExtractor(
                allow=r'.*?/recipe/\d+/',
            ),
            callback='parse_caipu_detail',
            follow=True,
        ),
        # User profile pages, e.g.
        # http://www.xiachufang.com/cook/118870772/
        Rule(
            LinkExtractor(
                allow=r'.*?/cook/\d+/'
            ),
            callback='parse_userinfo_detail',
            follow=True,
        )
    )
    def parse_start_url(self, response):
        """
        Override this method to process the responses of the start URLs.
        """
        # The original `self.parse_item` was a bare attribute access and did
        # nothing; the method must actually be called and its result returned.
        return self.parse_item(response)

    def parse_item(self, response):
        pass
    def check_category_url(self, links):
        """
        Hook for filtering the Link objects built from the URLs the rule
        extracted. When the regex cannot match a complete URL, this method
        can also be used to fix the URL up, e.g. for pages that require an
        anchor (fragment) to be appended.
        :param links: list of extracted Link objects
        :return: the (possibly filtered) list of links
        """
        print('===================', links, '===================')
        return links
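The allow regexes in the rules above can be sanity-checked without running the crawler. The sketch below is a simplified stand-in for LinkExtractor's matching (LinkExtractor applies re.search with each allow pattern); the sample URLs mirror the ones in the comments:

```python
import re

# The three allow patterns from the rules, keyed by a descriptive name
RULE_PATTERNS = {
    "category": r".*?/category/\d+/",
    "recipe": r".*?/recipe/\d+/",
    "userinfo": r".*?/cook/\d+/",
}

def match_rule(url):
    # Return the name of the first rule whose allow pattern matches,
    # mimicking how a link would be dispatched to a Rule.
    for name, pattern in RULE_PATTERNS.items():
        if re.search(pattern, url):
            return name
    return None

print(match_rule("http://www.xiachufang.com/category/40073/"))   # → category
print(match_rule("http://www.xiachufang.com/recipe/1055105/"))   # → recipe
print(match_rule("http://www.xiachufang.com/cook/118870772/"))   # → userinfo
```

A URL matching none of the patterns (e.g. an about page) returns None, i.e. no rule would extract it.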
Commonly used Rule parameters:

LinkExtractor: the link-extraction rule (regex based)
    allow = (): target URLs that are allowed to be extracted
    deny = (): target URLs that must not be extracted (takes priority over allow)
    restrict_xpaths = (): extract target URLs only under the tags located by these XPath expressions
    restrict_css = (): extract target URLs only under the tags located by these CSS selectors
callback=None: the callback invoked for each matched response
follow=None: whether to keep following links from matched pages (e.g. a next-page link that satisfies the rule)
process_links: optional callback that intercepts the extracted Link objects before requests are built (useful when a URL cannot be taken directly from the tag, e.g. an anchor must be appended)
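A process_links hook receives the list of extracted links and may rewrite or drop them. A minimal sketch of the anchor-appending case, using a stand-in dataclass for scrapy.link.Link (which likewise carries a .url attribute); the "#recipes" fragment is a hypothetical example, not something xiachufang.com requires:

```python
from dataclasses import dataclass

@dataclass
class Link:
    # Minimal stand-in for scrapy.link.Link: only the .url attribute is used here.
    url: str

def append_anchor(links):
    # process_links-style hook: rewrite category URLs to carry an anchor
    # (hypothetical fragment), leave everything else untouched.
    for link in links:
        if "/category/" in link.url and "#" not in link.url:
            link.url = link.url + "#recipes"
    return links

links = [Link("http://www.xiachufang.com/category/40073/"),
         Link("http://www.xiachufang.com/recipe/1055105/")]
for link in append_anchor(links):
    print(link.url)
```

Returning a shorter list from the hook drops the omitted links entirely, which is how filtering is done.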
Note:
- A CrawlSpider must not override the parse method: CrawlSpider uses parse internally to drive rule matching, so overriding it breaks the rules. Use parse_start_url or the rule callbacks instead.