Error when crawling pages iteratively with the Scrapy framework
Scrapy log:
Set the log level and log file in the settings.py file:
LOG_LEVEL = 'DEBUG'
LOG_FILE = 'log.txt'
Then observe the Scrapy log:
2017-08-15 21:58:05 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sou.zhaopin.com': <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2>
2017-08-15 21:58:05 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-15 21:58:05 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 782,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 58273,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/302': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 8, 15, 13, 58, 5, 915565),
'item_scraped_count': 59,
'log_count/DEBUG': 64,
'log_count/INFO': 7,
'memusage/max': 52699136,
'memusage/startup': 52699136,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2017, 8, 15, 13, 58, 5, 98357)}
2017-08-15 21:58:05 [scrapy.core.engine] INFO: Spider closed (finished)
The important part is the first line. When I started I did not realize that this was actually the error; it is only recorded as a DEBUG message, so the program itself never reported an error:
2017-08-15 21:58:05 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sou.zhaopin.com': <GET http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p=2>
DEBUG: Filtered offsite request to
The URL requested in the Request does not match the domains defined in allowed_domains, so Scrapy's OffsiteMiddleware filtered the Request out and it was never sent.
name = 'zhilianspider'
allowed_domains = ['http://sou.zhaopin.com']
page = 1
url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p='
start_urls = [url + str(page)]
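Incidentally, this is where the mismatch comes from: Scrapy expects allowed_domains to hold bare domain names, not URLs, so the 'http://' scheme in the entry above prevents it from ever matching the request's hostname. A one-line sketch of that alternative fix, which keeps the offsite filter enabled:

allowed_domains = ['sou.zhaopin.com']  # bare domain name, no scheme

What I did instead was leave allowed_domains alone and bypass the filter per request.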
In the Request arguments, set dont_filter = True; the requested URL is then no longer filtered against allowed_domains.
if self.page <= 10:
    self.page += 1
    yield scrapy.Request(self.url + str(self.page), callback=self.parse, dont_filter=True)
Because the allowed_domains filter is turned off, the yield has to be written inside the condition. At first I put it outside, and the crawl kept iterating and would not stop. Awkward.
Before this I had always written the yield at the same level as the if; the filter was still on back then, so that caused no problem.
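For reference, here is how the pieces above fit together in one spider. This is a minimal sketch rather than the original full code: the item-extraction part is omitted, and the ZhilianSpider class name is a placeholder of mine.

import scrapy

class ZhilianSpider(scrapy.Spider):
    name = 'zhilianspider'
    # Kept as in the post; with dont_filter=True the offsite filter is
    # bypassed anyway. The cleaner fix would be a bare domain name:
    # allowed_domains = ['sou.zhaopin.com']
    allowed_domains = ['http://sou.zhaopin.com']
    page = 1
    url = 'http://sou.zhaopin.com/jobs/searchresult.ashx?jl=%E4%B8%8A%E6%B5%B7&kw=python&sm=0&source=0&p='
    start_urls = [url + str(page)]

    def parse(self, response):
        # ... extract job items from the result page here ...

        # Follow the next page only while under the limit; with the filter
        # bypassed, an unconditional yield here would never stop.
        if self.page <= 10:
            self.page += 1
            yield scrapy.Request(self.url + str(self.page),
                                 callback=self.parse,
                                 dont_filter=True)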