Item Pipeline overview:
The Item pipeline is responsible for processing the Items that spiders extract from web pages; its main tasks are cleansing, validating and storing the data.
After a page has been parsed by a spider, the resulting Items are sent to the Item pipeline and pass through its components in a defined order.
Each Item pipeline component is a Python class that implements a simple method.
A component receives an Item, runs its method on it, and decides whether the Item should continue to the next stage of the pipeline or be dropped and processed no further.
Typical uses:
cleaning HTML data; validating parsed data (checking that the Item contains the required fields); checking for duplicates (dropping Items that are duplicates); storing the parsed data in a database.
process_item(item, spider)
This method is called for every Item that passes through a pipeline component; it must either return an Item object or raise a DropItem exception.
Dropped Items are not processed by any further pipeline components.
In addition, a pipeline class may implement the following methods:
open_spider(spider)
Called when the spider is opened.
close_spider(spider)
Called when the spider is closed.
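As a concrete illustration, the following minimal sketch combines the methods above into one pipeline that validates, de-duplicates and stores items; the 'id' and 'price' field names and the output file name are assumptions made for this example, not part of the project described here.

    # pipelines.py -- a minimal sketch, assuming items carry 'id' and 'price' fields
    import json

    from scrapy.exceptions import DropItem

    class ValidateAndStorePipeline(object):
        def open_spider(self, spider):
            # called once when the spider is opened
            self.seen_ids = set()
            self.file = open('items.jl', 'w')

        def close_spider(self, spider):
            # called once when the spider is closed
            self.file.close()

        def process_item(self, item, spider):
            # validate: the item must contain a price
            if not item.get('price'):
                raise DropItem('Missing price in %s' % item)
            # de-duplicate on a (hypothetical) unique id field
            if item['id'] in self.seen_ids:
                raise DropItem('Duplicate item %s' % item['id'])
            self.seen_ids.add(item['id'])
            # store: write one JSON object per line
            self.file.write(json.dumps(dict(item)) + '\n')
            return item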
To activate an item pipeline component, add its class path to the ITEM_PIPELINES setting in settings.py,
for example:
ITEM_PIPELINES = {
'myproject.pipeline.PricePipeline':300,
'myproject.pipeline.JsonWriterPipeline':800,
}
The integer value assigned to each class in this setting determines the order in which the pipelines run: items pass through them from the lowest value to the highest.
These integers are conventionally chosen in the 0-1000 range.
settings.py explained:
# 1. Bot (crawler) name
BOT_NAME = 'step8_king'

# 2. Module paths for the spiders
SPIDER_MODULES = ['step8_king.spiders']
NEWSPIDER_MODULE = 'step8_king.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
# 3. Client User-Agent request header
# USER_AGENT = 'step8_king (+http://www.yourdomain.com)'

# Obey robots.txt rules
# 4. robots.txt compliance (set to False to ignore robots.txt)
# ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
# 5. Number of concurrent requests
# CONCURRENT_REQUESTS = 4

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# 6. Download delay, in seconds
# DOWNLOAD_DELAY = 2

# The download delay setting will honor only one of:
# 7. Maximum concurrent requests per domain; the download delay is also applied per domain
# CONCURRENT_REQUESTS_PER_DOMAIN = 2
#    Maximum concurrent requests per IP; if set, CONCURRENT_REQUESTS_PER_DOMAIN is ignored
#    and the download delay is applied per IP instead
# CONCURRENT_REQUESTS_PER_IP = 3

# Disable cookies (enabled by default)
# 8. Whether cookies are supported (a cookiejar is used to handle them)
# COOKIES_ENABLED = True
# COOKIES_DEBUG = True

# Disable Telnet Console (enabled by default)
# 9. The Telnet console can be used to inspect and control the running crawler:
#    connect with `telnet <ip> <port>` and issue commands
# TELNETCONSOLE_ENABLED = True
# TELNETCONSOLE_HOST = '127.0.0.1'
# TELNETCONSOLE_PORT = [6023,]
# 10. Default request headers
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
#     'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#     'Accept-Language': 'en',
# }

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
# 11. Item pipelines used to process scraped items
# ITEM_PIPELINES = {
#     'step8_king.pipelines.JsonPipeline': 700,
#     'step8_king.pipelines.FilePipeline': 500,
# }

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# 12. Custom extensions, invoked via signals (a sketch of such an extension follows below)
# EXTENSIONS = {
#     # 'step8_king.extensions.MyExtension': 500,
# }
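The MyExtension class referenced above is not included in these notes; as a rough sketch of what a signal-driven extension can look like (the behaviour, logging when the spider opens and closes, is an assumption for illustration):

    # extensions.py -- a minimal sketch of a signal-driven extension
    from scrapy import signals

    class MyExtension(object):
        @classmethod
        def from_crawler(cls, crawler):
            ext = cls()
            # hook the extension methods up to Scrapy signals
            crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
            crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
            return ext

        def spider_opened(self, spider):
            spider.logger.info('MyExtension: spider %s opened', spider.name)

        def spider_closed(self, spider):
            spider.logger.info('MyExtension: spider %s closed', spider.name)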
# 13. Maximum crawl depth allowed; the current depth can be read from meta (see the sketch below); 0 means unlimited
# DEPTH_LIMIT = 3
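A minimal sketch of reading the current depth from meta inside a callback (the spider name and start URL are placeholders):

    import scrapy

    class DepthDemoSpider(scrapy.Spider):
        name = 'depth_demo'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # DepthMiddleware records the depth of each response in its meta dict
            self.logger.info('depth of %s is %s', response.url, response.meta.get('depth', 0))
            # follow links; requests deeper than DEPTH_LIMIT are filtered out by DepthMiddleware
            for href in response.css('a::attr(href)').extract():
                yield scrapy.Request(response.urljoin(href), callback=self.parse)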
# 14. Crawl order: 0 means depth-first, LIFO (the default); 1 means breadth-first, FIFO
# Last in, first out -- depth-first
# DEPTH_PRIORITY = 0
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleLifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.LifoMemoryQueue'
# First in, first out -- breadth-first
# DEPTH_PRIORITY = 1
# SCHEDULER_DISK_QUEUE = 'scrapy.squeue.PickleFifoDiskQueue'
# SCHEDULER_MEMORY_QUEUE = 'scrapy.squeue.FifoMemoryQueue'

# 15. Scheduler (request queue)
# SCHEDULER = 'scrapy.core.scheduler.Scheduler'
# from scrapy.core.scheduler import Scheduler
# 16. De-duplication of visited URLs (a sketch of a custom filter class follows below)
# DUPEFILTER_CLASS = 'step8_king.duplication.RepeatUrl'
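The RepeatUrl class referenced above is the author's own and its code is not included here; as a rough sketch, a custom duplicate filter that keeps visited URLs in an in-memory set could look like this (the implementation is an assumption, only the method names are dictated by Scrapy's scheduler):

    # duplication.py -- a minimal sketch of a custom duplicate filter
    class RepeatUrl(object):
        def __init__(self):
            self.visited_urls = set()

        @classmethod
        def from_settings(cls, settings):
            # the filter could be configured from settings here (e.g. to use Redis instead)
            return cls()

        def request_seen(self, request):
            # return True to tell the scheduler the request is a duplicate and should be ignored
            if request.url in self.visited_urls:
                return True
            self.visited_urls.add(request.url)
            return False

        def open(self):
            # called when the spider starts
            pass

        def close(self, reason):
            # called when the spider finishes
            pass

        def log(self, request, spider):
            # called for every filtered (duplicate) request
            pass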
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
"""
17. AutoThrottle algorithm
    from scrapy.contrib.throttle import AutoThrottle
    How the automatic throttling is computed:
    1. Read the minimum delay from DOWNLOAD_DELAY
    2. Read the maximum delay from AUTOTHROTTLE_MAX_DELAY
    3. Set the initial download delay from AUTOTHROTTLE_START_DELAY
    4. When a request finishes downloading, take its latency, i.e. the time between the request being sent and the response headers being received
    5. Combine it with the target concurrency AUTOTHROTTLE_TARGET_CONCURRENCY:
       target_delay = latency / self.target_concurrency
       new_delay = (slot.delay + target_delay) / 2.0   # slot.delay is the previous delay
       new_delay = max(target_delay, new_delay)
       new_delay = min(max(self.mindelay, new_delay), self.maxdelay)
       slot.delay = new_delay
"""
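To make the update rule concrete, here is a small stand-alone sketch of the same calculation with made-up numbers (latency 0.6 s, target concurrency 2.0, previous delay 1.0 s), clamped with the DOWNLOAD_DELAY = 2 and AUTOTHROTTLE_MAX_DELAY = 10 values used in this file; it only illustrates the formulas above and is not Scrapy's actual code:

    # a minimal sketch of the AutoThrottle delay update, with made-up numbers
    def update_delay(prev_delay, latency, target_concurrency, min_delay, max_delay):
        target_delay = latency / target_concurrency       # e.g. 0.6 / 2.0 = 0.3
        new_delay = (prev_delay + target_delay) / 2.0      # e.g. (1.0 + 0.3) / 2 = 0.65
        new_delay = max(target_delay, new_delay)           # never go below the target delay
        return min(max(min_delay, new_delay), max_delay)   # clamp to [DOWNLOAD_DELAY, AUTOTHROTTLE_MAX_DELAY]

    print(update_delay(prev_delay=1.0, latency=0.6, target_concurrency=2.0,
                       min_delay=2, max_delay=10))         # -> 2, clamped up to DOWNLOAD_DELAY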
"""# 開始自動限速# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# 初始下載延遲# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# 最大下載延遲#?
AUTOTHROTTLE_MAX_DELAY = 10
# The average number of requests Scrapy should be sending in parallel to each remote server
# 平均每秒并發(fā)數(shù)# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:# 是否顯示
# AUTOTHROTTLE_DEBUG = True# Enable and configure HTTP caching (disabled by default)
#Seehttp://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings"""
18. Enable HTTP caching
    Purpose: cache requests/responses that have already been fetched, so they can be reused later.
    from scrapy.downloadermiddlewares.httpcache import HttpCacheMiddleware
    from scrapy.extensions.httpcache import DummyPolicy
    from scrapy.extensions.httpcache import FilesystemCacheStorage
"""
# Whether to enable the cache
# HTTPCACHE_ENABLED = True
# Cache policy: cache every request; subsequent identical requests are served from the cache
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.DummyPolicy"
# Cache policy: honour HTTP caching headers such as Cache-Control and Last-Modified
# HTTPCACHE_POLICY = "scrapy.extensions.httpcache.RFC2616Policy"
# Cache expiration time in seconds (0 = never expire)
# HTTPCACHE_EXPIRATION_SECS = 0
# Directory the cache is stored in
# HTTPCACHE_DIR = 'httpcache'
# HTTP status codes that should not be cached
# HTTPCACHE_IGNORE_HTTP_CODES = []
# Storage backend used for the cache
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
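When the cache is enabled, responses served from it can be recognised in a callback because HttpCacheMiddleware adds the 'cached' flag; a minimal sketch (spider name and URL are placeholders):

    import scrapy

    class CacheCheckSpider(scrapy.Spider):
        name = 'cache_check'
        start_urls = ['http://example.com/']

        def parse(self, response):
            # HttpCacheMiddleware marks responses served from the cache with the 'cached' flag
            if 'cached' in response.flags:
                self.logger.info('served from cache: %s', response.url)
            else:
                self.logger.info('fetched from the network: %s', response.url)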
"""
19. Proxies; configured via environment variables
    from scrapy.contrib.downloadermiddleware.httpproxy import HttpProxyMiddleware

    Option 1: use the default HttpProxyMiddleware and set the proxies in os.environ, e.g.
        os.environ
        {
            http_proxy:  http://root:woshiniba@192.168.11.11:9999/
            https_proxy: http://192.168.11.11:9999/
        }

    Option 2: use a custom downloader middleware
        # middlewares.py
        import base64
        import random
        import six

        def to_bytes(text, encoding=None, errors='strict'):
            # helper: coerce str/unicode values to bytes before they go on the wire
            if isinstance(text, bytes):
                return text
            if not isinstance(text, six.string_types):
                raise TypeError('to_bytes must receive a unicode, str or bytes '
                                'object, got %s' % type(text).__name__)
            if encoding is None:
                encoding = 'utf-8'
            return text.encode(encoding, errors)
        class ProxyMiddleware(object):
            def process_request(self, request, spider):
                PROXIES = [
                    {'ip_port': '111.11.228.75:80', 'user_pass': ''},
                    {'ip_port': '120.198.243.22:80', 'user_pass': ''},
                    {'ip_port': '111.8.60.9:8123', 'user_pass': ''},
                    {'ip_port': '101.71.27.120:80', 'user_pass': ''},
                    {'ip_port': '122.96.59.104:80', 'user_pass': ''},
                    {'ip_port': '122.224.249.122:8088', 'user_pass': ''},
                ]
                proxy = random.choice(PROXIES)
                request.meta['proxy'] = to_bytes("http://%s" % proxy['ip_port'])
                if proxy['user_pass']:
                    # add a Proxy-Authorization header for proxies that require credentials
                    encoded_user_pass = base64.b64encode(to_bytes(proxy['user_pass']))
                    request.headers['Proxy-Authorization'] = to_bytes('Basic ') + encoded_user_pass
                    print("**************ProxyMiddleware have pass************" + proxy['ip_port'])
                else:
                    print("**************ProxyMiddleware no pass************" + proxy['ip_port'])

    Register the custom middleware in settings.py:
        DOWNLOADER_MIDDLEWARES = {
            'step8_king.middlewares.ProxyMiddleware': 500,
        }
"""
"""
20. HTTPS access
    There are two cases when crawling over HTTPS:
    1. The target site uses a trusted certificate (supported by default)
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "scrapy.core.downloader.contextfactory.ScrapyClientContextFactory"
    2. The target site requires a custom (client) certificate
        DOWNLOADER_HTTPCLIENTFACTORY = "scrapy.core.downloader.webclient.ScrapyHTTPClientFactory"
        DOWNLOADER_CLIENTCONTEXTFACTORY = "step8_king.https.MySSLFactory"

        # https.py
        from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
        from twisted.internet.ssl import (optionsForClientTLS, CertificateOptions, PrivateCertificate)

        class MySSLFactory(ScrapyClientContextFactory):
            def getCertificateOptions(self):
                from OpenSSL import crypto
                v1 = crypto.load_privatekey(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.key.unsecure', mode='r').read())
                v2 = crypto.load_certificate(crypto.FILETYPE_PEM, open('/Users/wupeiqi/client.pem', mode='r').read())
                return CertificateOptions(
                    privateKey=v1,   # a PKey object
                    certificate=v2,  # an X509 object
                    verify=False,
                    method=getattr(self, 'method', getattr(self, '_ssl_method', None))
                )

    Other related classes:
        scrapy.core.downloader.handlers.http.HttpDownloadHandler
        scrapy.core.downloader.webclient.ScrapyHTTPClientFactory
        scrapy.core.downloader.contextfactory.ScrapyClientContextFactory
    Related settings:
        DOWNLOADER_HTTPCLIENTFACTORY
        DOWNLOADER_CLIENTCONTEXTFACTORY
"""