Scrapy-7.Scrapy-redis

Article URL: http://www.reibang.com/p/3de01adfff23

Introduction

scrapy-redis is a Redis-based component for Scrapy. Its main features are:

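Distributed crawling

Multiple spider instances can share a single Redis queue, which makes it a good fit for broad, multi-domain crawls.

Distributed post-processing

Scraped items are pushed into a Redis queue, so you can run as many post-processing processes as you need, all reading from the same shared items queue.

Plug-and-play

The provided Scheduler + Duplication Filter, Item Pipeline, and Base Spiders components are all plug-and-play.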

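Its distributed design follows a master-slave model. Roughly, every URL generated on a slave is sent over the network to the master, and the master keeps the queue of URLs to crawl in a Redis database. When a slave needs its next URL to crawl, it also fetches it remotely from the master.

In this way, the URLs crawled by all spiders are scheduled centrally by the master and stored in Redis, which is what makes resumable crawling possible.

scrapy-redis also generates a fingerprint for every URL it has requested and stores it, so the same URL is not crawled twice.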
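The fingerprint-based deduplication described above can be sketched in a few lines. This is a simplified, in-memory illustration, not the scrapy-redis implementation: the names `url_fingerprint` and `InMemoryDupeFilter` are hypothetical, the hash covers only method, URL, and body, and a Python `set` stands in for the shared Redis set that a real deployment would update with `SADD`.

```python
import hashlib


def url_fingerprint(method: str, url: str, body: bytes = b"") -> str:
    """Return a SHA-1 hex digest over the request method, URL, and body."""
    digest = hashlib.sha1()
    digest.update(method.upper().encode("utf-8"))
    digest.update(url.encode("utf-8"))
    digest.update(body)
    return digest.hexdigest()


class InMemoryDupeFilter:
    """Hypothetical stand-in for the Redis-backed filter shared by all spiders."""

    def __init__(self) -> None:
        # In a distributed crawl this set would live in Redis, not in one process.
        self._seen = set()

    def request_seen(self, method: str, url: str, body: bytes = b"") -> bool:
        """Return True if an identical request was recorded before."""
        fingerprint = url_fingerprint(method, url, body)
        if fingerprint in self._seen:
            return True
        self._seen.add(fingerprint)
        return False
```

Because every spider consults the same store, a URL already requested by one slave is skipped by all the others.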

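Installation

scrapy-redis depends on Redis, so install a Redis server before using it.

For a distributed setup, you also need to enable remote connections on Redis and set an access password.

The scrapy-redis module itself is easy to install with pip.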

pip install scrapy-redis

Using scrapy-redis

Once scrapy-redis is installed, you only need to add a few settings to your Scrapy project to enable its components.

# Enables scheduling storing requests queue in redis.
SCHEDULER = "scrapy_redis.scheduler.Scheduler"

# Ensure all spiders share same duplicates filter through redis.
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# Default requests serializer is pickle, but it can be changed to any module
# with loads and dumps functions. Note that pickle is not compatible between
# python versions.
# Caveat: In python 3.x, the serializer must return strings keys and support
# bytes as values. Because of this reason the json or msgpack module will not
# work by default. In python 2.x there is no such issue and you can use
# 'json' or 'msgpack' as serializers.
#SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat"

# Don't cleanup redis queues, allows to pause/resume crawls.
#SCHEDULER_PERSIST = True

# Schedule requests using a priority queue. (default)
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'

# Alternative queues.
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
#SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Store scraped item in redis for post-processing.
ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}

# The item pipeline serializes and stores the items in this redis key.
#REDIS_ITEMS_KEY = '%(spider)s:items'

# The items serializer is by default ScrapyJSONEncoder. You can use any
# importable path to a callable object.
#REDIS_ITEMS_SERIALIZER = 'json.dumps'

# Specify the host and port to use when connecting to Redis (optional).
#REDIS_HOST = 'localhost'
#REDIS_PORT = 6379

# Specify the full Redis URL for connecting (optional).
# If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings.
#REDIS_URL = 'redis://user:pass@hostname:9001'

# Custom redis client parameters (i.e.: socket timeout, etc.)
#REDIS_PARAMS  = {}
# Use custom redis client class.
#REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient'

# If True, it uses redis' ``SPOP`` operation. You have to use the ``SADD``
# command to add URLs to the redis queue. This could be useful if you
# want to avoid duplicates in your start urls list and the order of
# processing does not matter.
#REDIS_START_URLS_AS_SET = False

# Default start urls key for RedisSpider and RedisCrawlSpider.
#REDIS_START_URLS_KEY = '%(name)s:start_urls'

# Use other encoding than utf-8 for redis.
#REDIS_ENCODING = 'latin1'

Pick the settings you need from the list above and copy them into your Scrapy project's settings.py.
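The `REDIS_START_URLS_KEY` and `REDIS_START_URLS_AS_SET` settings listed above control where a Redis-fed spider looks for its seed URLs. As a minimal sketch, the helper below pushes seeds to that key. The function name `seed_start_urls` is hypothetical, and `client` is assumed to be a redis-py style object that exposes `lpush` and `sadd`.

```python
def seed_start_urls(client, spider_name, urls, as_set=False):
    """Push seed URLs to the default start-URLs key of a spider.

    `client` is assumed to be a redis-py style client (lpush/sadd methods).
    Returns the key that was written and the number of URLs submitted.
    """
    # Same pattern as the default REDIS_START_URLS_KEY = '%(name)s:start_urls'.
    key = "%(name)s:start_urls" % {"name": spider_name}
    urls = list(urls)
    if not urls:
        return key, 0
    if as_set:
        # Pairs with REDIS_START_URLS_AS_SET = True: duplicates collapse
        # and the processing order is not preserved.
        client.sadd(key, *urls)
    else:
        client.lpush(key, *urls)
    return key, len(urls)
```

With redis-py installed, a client created with `redis.Redis(host="localhost", port=6379)` could be passed as `client`; the host and port here are placeholders.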

Core components

To enable scrapy-redis, two settings are required:

SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

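These are the two core components of scrapy-redis, and they implement most of its logic.

Configuring the Redis connection

There are two ways to configure the Redis connection. The first is through individual parameters: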

REDIS_HOST = 'localhost'
REDIS_PORT = 6379
REDIS_PASSWORD = 'foobared'

The second is through a URL:

REDIS_URL = 'redis://user:password@hostname:9001'

The URL supports the following three formats:

redis://[:password]@host:port/db
rediss://[:password]@host:port/db
unix://[:password]@/path/to/socket.sock?db=db
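To make the three formats concrete, the sketch below splits a connection URL into its parts using only the standard library. It is an illustration rather than the parser scrapy-redis uses (in practice the URL is handed to the Redis client library). The function name `describe_redis_url` and the fallback port 6379 are assumptions.

```python
from urllib.parse import parse_qs, urlsplit


def describe_redis_url(url: str) -> dict:
    """Split a redis://, rediss://, or unix:// URL into readable parts."""
    parts = urlsplit(url)
    if parts.scheme not in ("redis", "rediss", "unix"):
        raise ValueError("unsupported scheme: %r" % parts.scheme)
    if parts.scheme == "unix":
        # Unix sockets carry the database number in the query string.
        db = parse_qs(parts.query).get("db", [None])[0]
        target = parts.path
    else:
        # TCP URLs carry the database number as the path component.
        db = parts.path.lstrip("/") or None
        target = "%s:%s" % (parts.hostname, parts.port or 6379)
    return {
        "scheme": parts.scheme,
        "tls": parts.scheme == "rediss",  # rediss:// means an SSL/TLS connection
        "target": target,
        "password": parts.password,
        "db": db,
    }
```

The `rediss://` scheme differs from `redis://` only in wrapping the connection in SSL/TLS, while `unix://` points at a local socket file instead of a host and port.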

Configuring the scheduling queue

You can also choose how the queue orders requests through a setting. There are three options:

# Default: priority queue, backed by a Redis sorted set
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue'
# First-in, first-out queue
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue'
# Last-in, first-out queue
SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue'

Choose one of them when configuring.
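The difference between the three queue classes is easiest to see by comparing the order in which the same requests come back out. The sketch below models each policy with in-memory structures. It is not the scrapy-redis code: the function names are hypothetical, and breaking priority ties by insertion order is an assumption of this sketch.

```python
import heapq
from collections import deque


def fifo_order(requests):
    """First in, first out: push on the left, pop from the right."""
    queue = deque()
    for name, _priority in requests:
        queue.appendleft(name)
    return [queue.pop() for _ in range(len(queue))]


def lifo_order(requests):
    """Last in, first out: push on the left, pop from the left."""
    queue = deque()
    for name, _priority in requests:
        queue.appendleft(name)
    return [queue.popleft() for _ in range(len(queue))]


def priority_order(requests):
    """Highest priority first, modelled with a heap instead of a sorted set."""
    heap = []
    for index, (name, priority) in enumerate(requests):
        # Negate the priority so the largest value is popped first;
        # `index` breaks ties by insertion order (an assumption of this sketch).
        heapq.heappush(heap, (-priority, index, name))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


requests = [("a", 0), ("b", 10), ("c", 5)]
print(fifo_order(requests))      # ['a', 'b', 'c']
print(lifo_order(requests))      # ['c', 'b', 'a']
print(priority_order(requests))  # ['b', 'c', 'a']
```

A FIFO queue tends toward breadth-first crawling and a LIFO queue toward depth-first, while the priority queue lets individual requests jump ahead through their `priority` value.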

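Configuring resumable crawls

Because scrapy-redis stores both the deduplication fingerprints and the request queue in Redis, a crawl can be stopped and resumed later.

First, enable persistence. With this setting set to True, Scrapy does not clear the Redis queues when it exits.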

SCHEDULER_PERSIST = True

With the fingerprints and the request queue preserved, the next run picks up the crawl queue where the previous run left off.
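A side effect of persistence is that a deliberately fresh crawl needs the old state removed first. As a minimal sketch, the helper below deletes the two keys that hold that state. It assumes the default key patterns `<spider>:requests` and `<spider>:dupefilter`, which are not stated in this article and should be checked against your own settings. The function names are hypothetical, and `client` is assumed to be a redis-py style object with a `delete` method.

```python
def crawl_state_keys(spider_name):
    """Return the (queue key, fingerprint key) pair for a spider.

    Assumes the default key patterns; adjust if your project overrides them.
    """
    return ("%s:requests" % spider_name, "%s:dupefilter" % spider_name)


def reset_crawl_state(client, spider_name):
    """Delete the pending-request queue and the fingerprint set of a spider."""
    keys = crawl_state_keys(spider_name)
    # One DELETE call removes both keys; missing keys are simply ignored by Redis.
    client.delete(*keys)
    return keys
```

Deleting only the queue key while keeping the fingerprints would leave already-seen URLs filtered out, so the two keys are removed together here.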

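Configuring the pipeline

scrapy-redis can send the items scraped by each distributed slave to the master, so that all scraped data ends up in one place on the master.

However, this feature noticeably slows down crawling, so it is usually left disabled for large-scale crawls.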

ITEM_PIPELINES = {
    'scrapy_redis.pipelines.RedisPipeline': 300
}
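On the consuming side, a separate process can drain the items that the pipeline pushed to Redis. The sketch below assumes the default key pattern `'%(spider)s:items'` and the default JSON serialization shown in the settings list earlier. The function name `pop_items` is hypothetical, and `client` is assumed to be a redis-py style object whose `blpop` returns a `(key, value)` pair or `None` on timeout.

```python
import json


def pop_items(client, spider_name, max_items=10, timeout=1):
    """Pop up to `max_items` JSON-encoded items from a spider's items list."""
    # Same pattern as the default REDIS_ITEMS_KEY = '%(spider)s:items'.
    key = "%(spider)s:items" % {"spider": spider_name}
    items = []
    while len(items) < max_items:
        popped = client.blpop(key, timeout=timeout)
        if popped is None:
            # Timed out: the list is empty for now.
            break
        _key, raw = popped
        items.append(json.loads(raw))
    return items
```

Several such consumers can run in parallel against the same key, since each pop removes the item from the list.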

Other articles in this series: