Scrapy Custom Extensions
A custom extension uses signals to register a specific operation at a specific point in the crawl.
Create a new file, custom_extensions.py:
from scrapy import signals


class MyExtend:
    def __init__(self, crawler):
        self.crawler = crawler
        # Register a handler on each signal we care about
        crawler.signals.connect(self.start, signals.engine_started)
        crawler.signals.connect(self.close, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the extension and hands it the crawler
        return cls(crawler)

    def start(self):
        print('signals.engine_started.start')

    def close(self):
        print('signals.spider_closed.close')
最后需要在settings.py里的修改EXTENSIONS:
EXTENSIONS = {
    'scrapy_learn.custom_extensions.MyExtend': 300,
}
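Run any spider in the project (scrapy crawl <spider_name>) and the two messages above appear: 'signals.engine_started.start' when the engine starts and 'signals.spider_closed.close' when the spider closes.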
Available signals
engine_started = object()      # fired when the engine starts
engine_stopped = object()      # fired when the engine stops
spider_opened = object()       # fired when a spider is opened
spider_idle = object()         # fired when a spider goes idle
spider_closed = object()       # fired when a spider is closed
spider_error = object()        # fired when a spider callback raises an error
request_scheduled = object()   # fired when a request is scheduled
request_dropped = object()     # fired when the scheduler drops a request
response_received = object()   # fired when a response is received
response_downloaded = object() # fired when a response has been downloaded
item_scraped = object()        # fired when an item is yielded (scraped)
item_dropped = object()        # fired when an item is dropped
With these signals you can hook custom behavior into any of these moments.
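For example, here is a minimal sketch of an extension that counts scraped items; the class name ItemCountExtension is only illustrative, not part of the project above. Note that Scrapy passes extra arguments to some handlers: an item_scraped handler can accept item, response and spider, and a spider_closed handler can accept spider and reason.

from scrapy import signals


class ItemCountExtension:
    def __init__(self, crawler):
        self.count = 0
        # Hook the handlers onto the signals we are interested in
        crawler.signals.connect(self.item_scraped, signals.item_scraped)
        crawler.signals.connect(self.spider_closed, signals.spider_closed)

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def item_scraped(self, item, response, spider):
        # Called every time a spider yields an item
        self.count += 1

    def spider_closed(self, spider, reason):
        # Called once when the spider finishes; reason is e.g. 'finished'
        print('%s closed (%s), items scraped: %d' % (spider.name, reason, self.count))

Like MyExtend, it only takes effect after being added to EXTENSIONS in settings.py.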
The Configuration File (settings.py) Explained
# 1. The bot name. This is not a spider's `name` attribute but the name of the whole crawler project;
#    many sites run crawlers of their own (Baidu, Google, etc. all do).
BOT_NAME = 'scrapy_learn'
# 2. Where the spider modules of the project live
SPIDER_MODULES = ['scrapy_learn.spiders']
NEWSPIDER_MODULE = 'scrapy_learn.spiders'
# 3. The client User-Agent header, usually spoofed to look like a real browser
USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.167 Safari/537.36'
# 4. Whether to obey robots.txt rules; a well-behaved crawler should, but scrapers usually turn it off
ROBOTSTXT_OBEY = False
# 5. Maximum number of concurrent requests (default: 16)
CONCURRENT_REQUESTS = 32
# 6. Download delay in seconds (default: 0)
DOWNLOAD_DELAY = 3
# 7. Maximum concurrent requests per domain; the download delay is also applied per domain.
#    A finer-grained limit than CONCURRENT_REQUESTS.
CONCURRENT_REQUESTS_PER_DOMAIN = 16
# Maximum concurrent requests per IP; if non-zero, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
# and the download delay is applied per IP instead
CONCURRENT_REQUESTS_PER_IP = 16
# 8. Whether cookie support is enabled (cookies are handled via a cookiejar); enabled by default
COOKIES_ENABLED = True
# Debug mode: when enabled, every cookie sent or received is logged
COOKIES_DEBUG = True
# 9. Telnet console: inspect a running crawl (how much has been crawled, what is still pending, ...)
#    and control it (pause, resume, ...). From a terminal: telnet 127.0.0.1 6023 (6023 is the default console port).
# Useful console commands:
#   est()           print a report of the engine status
#   engine.pause()  pause the engine (engine.unpause() resumes it); many more commands are documented online
TELNETCONSOLE_ENABLED = True
# 10. Default request headers
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Middlewares deserve a detailed write-up of their own; covered separately
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'scrapy_learn.middlewares.ScrapyLearnSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'scrapy_learn.middlewares.ScrapyLearnDownloaderMiddleware': 543,
#}
# 11. Item pipelines that process the items your spiders yield (a minimal pipeline sketch follows this listing)
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'scrapy_learn.pipelines.ScrapyLearnPipeline': 300,
}
# 12. Custom extensions, invoked via signals (see the extension section above)
# See https://doc.scrapy.org/en/latest/topics/extensions.html
EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
}
# AutoThrottle: automatically adapts the download delay to server load ("smart" request pacing)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# Initial download delay in seconds
AUTOTHROTTLE_START_DELAY = 5
# Maximum download delay (used under high latency)
AUTOTHROTTLE_MAX_DELAY = 60
# 波動(dòng)范圍偏陪,不用管
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# HTTP cache settings; covered in a later write-up
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
# 13. Maximum crawl depth allowed; the current depth can be read from response.meta
#     (see the spider sketch after this listing); 0 means no limit
DEPTH_LIMIT = 4
# DEPTH_PRIORITY is normally set to 0 or 1:
#   0 - depth-first: follow one branch all the way down, then move on to the others
#   1 - breadth-first: crawl level by level
# Internally the scheduler uses the depth stored in response.meta to adjust request priority.
# DEPTH_PRIORITY = 0
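For reference, the pipeline registered under ITEM_PIPELINES above lives in scrapy_learn/pipelines.py. Its real contents are project-specific; the sketch below only shows the typical shape of a pipeline (open_spider / process_item / close_spider), writing each item to a JSON-lines file whose name, items.jl, is just an example.

import json


class ScrapyLearnPipeline:
    def open_spider(self, spider):
        # Called once when the spider is opened; a good place to open files or connections
        self.file = open('items.jl', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        # Called for every item the spider yields; must return the item
        # (or raise scrapy.exceptions.DropItem to discard it)
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item

    def close_spider(self, spider):
        # Called once when the spider is closed
        self.file.close()

And a small sketch of reading the current depth inside a spider callback, as mentioned for DEPTH_LIMIT; the spider name, start URL and link selector are placeholders.

import scrapy


class DepthDemoSpider(scrapy.Spider):
    name = 'depth_demo'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # DepthMiddleware records the depth of each response in response.meta
        depth = response.meta.get('depth', 0)
        self.logger.info('parsing %s at depth %d', response.url, depth)
        for href in response.css('a::attr(href)').getall():
            # Requests that would exceed DEPTH_LIMIT are dropped automatically
            yield response.follow(href, callback=self.parse)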