1. Spider templates
The template scrapy uses by default when creating a spider is the basic template. The command to create a spider file is:
scrapy genspider dribbble dribbble.com
The command to list the available spider templates is: scrapy genspider --list
The command to explicitly generate a spider from the crawl template inside a project is:
scrapy genspider -t crawl csdn www.csdn.net
This produces a spider like the following:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['www.csdn.net']
    start_urls = ['https://www.csdn.net/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}  # populate the item from the response here
        return item
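Note that the generated callback is named parse_item rather than parse: CrawlSpider uses the parse method internally to implement its crawling logic, so your own rules should never use parse as their callback.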
2. The CrawlSpider class
- CrawlSpider is a subclass of Spider designed to make whole-site crawling simpler. It is the usual choice for sites whose URLs follow regular patterns; it builds on Spider and adds a few attributes of its own.
3. The rules list
Syntax:
Rule(link_extractor, callback=None, cb_kwargs=None, follow=None, process_links=None, process_request=None)
rules is a collection of Rule objects used to match the target site and filter out irrelevant links. Its parameters are (a short sketch using the two process_* hooks follows this list):
- link_extractor: a LinkExtractor object that defines how links are extracted from crawled pages;
- callback: called with each Response obtained from a link extracted by link_extractor; the callback receives the response as its first argument;
- cb_kwargs: a dict of keyword arguments passed to the callback as **kwargs;
- follow: a boolean indicating whether, after a page is crawled, links should keep being extracted from it and followed; it defaults to False when a callback is given and to True otherwise;
- process_links: specifies the spider method that will be called whenever a list of links is obtained from link_extractor; it is mainly used for filtering;
- process_request: specifies a function that will be called for every Request extracted by this rule; it can modify the Request and must return a Request or None.
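The two process_* hooks are easiest to see in code. Below is a minimal sketch, not taken from the text above: the domain, URL pattern, and the helper names drop_logout_links and tag_request are all made up for illustration, and tag_request uses the Scrapy 2.0+ signature, which also receives the source response.

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class ExampleSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ['example.com']  # hypothetical site
    start_urls = ['https://example.com/']

    rules = (
        Rule(
            LinkExtractor(allow=r'/article/'),
            callback='parse_item',
            follow=True,
            process_links='drop_logout_links',  # filters the extracted link list
            process_request='tag_request',      # touches each Request before scheduling
        ),
    )

    def drop_logout_links(self, links):
        # Receives the list of Link objects produced by the LinkExtractor.
        return [link for link in links if 'logout' not in link.url]

    def tag_request(self, request, response):
        # Scrapy >= 2.0 also passes the response the links came from.
        # Must return a Request (possibly modified) or None to drop it.
        request.meta['source_url'] = response.url
        return request

    def parse_item(self, response):
        yield {'url': response.url}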
4. LinkExtractors
The purpose of LinkExtractors is to extract links. Every LinkExtractor has a single public method, extract_links(), which receives a Response object and returns a list of scrapy.link.Link objects.
A LinkExtractor is instantiated once, and its extract_links method is then called repeatedly with different responses to extract links.
Main parameters (a usage sketch follows the list):
- allow: URLs matching the given regular expression (or list of regular expressions) are extracted; if empty, everything matches;
- deny: URLs matching this regular expression (or list of regular expressions) are never extracted;
- allow_domains: the domains from which links will be extracted;
- deny_domains: the domains from which links will never be extracted;
- restrict_xpaths: an XPath expression that works together with allow to restrict where links are extracted from.
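As a quick, self-contained sketch (the HTML and URL here are invented for illustration), a LinkExtractor can be exercised directly by calling extract_links on a hand-built response:

from scrapy.http import HtmlResponse
from scrapy.linkextractors import LinkExtractor

html = b'''
<html><body>
  <div id="posts">
    <a href="/article/1">First post</a>
    <a href="/logout">Log out</a>
  </div>
</body></html>
'''

response = HtmlResponse(url='https://example.com/', body=html, encoding='utf-8')

# Instantiate once; extract_links can then be called on many responses.
extractor = LinkExtractor(
    allow=r'/article/',                    # only article URLs pass
    deny=r'/logout',                       # logout links are always dropped
    restrict_xpaths='//div[@id="posts"]',  # only look inside this element
)

for link in extractor.extract_links(response):
    print(link.url, link.text)  # https://example.com/article/1 First post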
5. Crawling CSDN articles and extracting the URL and article title
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CsdnSpider(CrawlSpider):
    name = 'csdn'
    allowed_domains = ['blog.csdn.net']
    start_urls = ['https://blog.csdn.net']

    # Rules describing which links to extract and follow
    rules = (
        # follow: whether to keep extracting links from the pages reached through this rule
        Rule(LinkExtractor(allow=r'.*/article/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print('-' * 100)
        print(response.url)
        # The article title sits in the page's <h1> element
        title = response.css('h1::text').extract_first()
        print(title)
        print('-' * 100)
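Assuming this spider lives in a Scrapy project, it is started from the project root with:
scrapy crawl csdn
Each page whose URL matches .*/article/.* is then printed together with its <h1> title.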