pipelines.py
This file supports distributed processing: it stores each scraped Item in redis so the data can be handled by separate processes or machines. Because the pipeline needs to read the project configuration, it uses the from_crawler() hook.
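For context, here is a minimal sketch of the relevant settings. REDIS_ITEMS_KEY and REDIS_ITEMS_SERIALIZER are the names the code below actually reads; the pipeline path and REDIS_URL follow the usual scrapy-redis conventions and are assumptions here:

# settings.py -- a minimal sketch; the pipeline path and REDIS_URL are
# assumptions based on common scrapy-redis usage.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300,
}
REDIS_URL = 'redis://localhost:6379'      # consumed by connection.from_settings()
REDIS_ITEMS_KEY = '%(spider)s:items'      # optional; this is also the default key
REDIS_ITEMS_SERIALIZER = 'json.dumps'     # optional; default is ScrapyJSONEncoder

The pipeline source: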
from scrapy.utils.misc import load_object
from scrapy.utils.serialize import ScrapyJSONEncoder
from twisted.internet.threads import deferToThread

from . import connection

default_serialize = ScrapyJSONEncoder().encode


class RedisPipeline(object):
    """Pushes serialized item into a redis list/queue"""

    def __init__(self, server,
                 key='%(spider)s:items',
                 serialize_func=default_serialize):
        self.server = server
        self.key = key
        self.serialize = serialize_func

    @classmethod
    def from_settings(cls, settings):
        params = {
            'server': connection.from_settings(settings),
        }
        if settings.get('REDIS_ITEMS_KEY'):
            params['key'] = settings['REDIS_ITEMS_KEY']
        if settings.get('REDIS_ITEMS_SERIALIZER'):
            params['serialize_func'] = load_object(
                settings['REDIS_ITEMS_SERIALIZER']
            )
        return cls(**params)

    @classmethod
    def from_crawler(cls, crawler):
        return cls.from_settings(crawler.settings)

    def process_item(self, item, spider):
        return deferToThread(self._process_item, item, spider)

    def _process_item(self, item, spider):
        key = self.item_key(item, spider)
        data = self.serialize(item)
        self.server.rpush(key, data)
        return item

    def item_key(self, item, spider):
        """Returns redis key based on given spider.

        Override this function to use a different key depending on the item
        and/or spider.
        """
        return self.key % {'spider': spider.name}
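The docstring invites overriding item_key() when one list per spider is not enough. A hypothetical subclass, assuming an illustrative 'category' field on the item:

# Hypothetical subclass; the 'category' field is assumed for illustration.
class CategoryRedisPipeline(RedisPipeline):
    def item_key(self, item, spider):
        # Route each item to a per-category list, falling back to 'default'
        # when the item carries no category.
        category = item.get('category', 'default')
        return '%s:items:%s' % (spider.name, category)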
The pipelines file implements an item pipeline class, the same kind of object as an ordinary scrapy item pipeline. It takes the REDIS_ITEMS_KEY configured in settings as the redis key, serializes each item, and appends it to the value stored at that key (as the code shows, this value is a redis list, and each item becomes one node in it). The pipeline persists the extracted items mainly so the data can be processed later, at our convenience, e.g. by a separate consumer as sketched below.
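Since every item ends up as one JSON string in a redis list, a separate process can drain that list at its own pace. A minimal consumer sketch using redis-py, assuming the default key pattern and a spider named 'myspider':

# consumer.py -- minimal sketch; 'myspider' and localhost:6379 are assumptions.
import json
import redis

r = redis.StrictRedis(host='localhost', port=6379)
while True:
    # blpop blocks until an element is available and pops from the list head,
    # so items come out in the order the pipeline rpush-ed them (FIFO).
    _, data = r.blpop('myspider:items')
    item = json.loads(data)
    print(item)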