dupefilter.py
This module is responsible for deduplicating requests, and the implementation is quite clever: it uses redis's set data structure. Note, however, that the scheduler does not schedule requests off the dupefilter key maintained in this module; scheduling goes through the queue implemented in the queue.py module.
When a request is not a duplicate, it is stored into the queue and popped out at scheduling time.
import logging
import time

from scrapy.dupefilters import BaseDupeFilter
from scrapy.utils.request import request_fingerprint

from .connection import get_redis_from_settings


DEFAULT_DUPEFILTER_KEY = "dupefilter:%(timestamp)s"

logger = logging.getLogger(__name__)


# TODO: Rename class to RedisDupeFilter.
class RFPDupeFilter(BaseDupeFilter):
    """Redis-based request duplicates filter.

    This class can also be used with default Scrapy's scheduler.
    """

    logger = logger

    def __init__(self, server, key, debug=False):
        """Initialize the duplicates filter.

        Parameters
        ----------
        server : redis.StrictRedis
            The redis server instance.
        key : str
            Redis key where to store fingerprints.
        debug : bool, optional
            Whether to log filtered requests.
        """
        self.server = server
        self.key = key
        self.debug = debug
        self.logdupes = True

    @classmethod
    def from_settings(cls, settings):
        """Returns an instance from given settings.

        This uses by default the key ``dupefilter:<timestamp>``. When using the
        ``scrapy_redis.scheduler.Scheduler`` class, this method is not used as
        it needs to pass the spider name in the key.

        Parameters
        ----------
        settings : scrapy.settings.Settings

        Returns
        -------
        RFPDupeFilter
            A RFPDupeFilter instance.
        """
        server = get_redis_from_settings(settings)
        # XXX: This creates a one-time key, needed to support using this class
        # as a standalone dupefilter with Scrapy's default scheduler. If Scrapy
        # passed the spider on the open() method, this wouldn't be needed.
        # TODO: Use SCRAPY_JOB env as default and fallback to timestamp.
        key = DEFAULT_DUPEFILTER_KEY % {'timestamp': int(time.time())}
        debug = settings.getbool('DUPEFILTER_DEBUG')
        return cls(server, key=key, debug=debug)

    @classmethod
    def from_crawler(cls, crawler):
        """Returns instance from crawler.

        Parameters
        ----------
        crawler : scrapy.crawler.Crawler

        Returns
        -------
        RFPDupeFilter
            Instance of RFPDupeFilter.
        """
        return cls.from_settings(crawler.settings)

    def request_seen(self, request):
        """Returns True if request was already seen.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        bool
        """
        fp = self.request_fingerprint(request)
        # This returns the number of values added, zero if already exists.
        added = self.server.sadd(self.key, fp)
        return added == 0

    def request_fingerprint(self, request):
        """Returns a fingerprint for a given request.

        Parameters
        ----------
        request : scrapy.http.Request

        Returns
        -------
        str
        """
        return request_fingerprint(request)

    def close(self, reason=''):
        """Delete data on close. Called by Scrapy's scheduler.

        Parameters
        ----------
        reason : str, optional
        """
        self.clear()

    def clear(self):
        """Clears fingerprints data."""
        self.server.delete(self.key)

    def log(self, request, spider):
        """Logs given request.

        Parameters
        ----------
        request : scrapy.http.Request
        spider : scrapy.spiders.Spider
        """
        if self.debug:
            msg = "Filtered duplicate request: %(request)s"
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
        elif self.logdupes:
            msg = ("Filtered duplicate request %(request)s"
                   " - no more duplicates will be shown"
                   " (see DUPEFILTER_DEBUG to show all duplicates)")
            self.logger.debug(msg, {'request': request}, extra={'spider': spider})
            self.logdupes = False
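Since the docstring says this class can also be used with Scrapy's default scheduler, here is a minimal sketch of enabling it standalone; DUPEFILTER_CLASS and DUPEFILTER_DEBUG are standard Scrapy settings, and REDIS_URL is the connection setting read by get_redis_from_settings (the address is an assumption for a local redis):

# settings.py: a minimal sketch for using RFPDupeFilter on its own,
# with Scrapy's default scheduler (assumes a local redis server)
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"
REDIS_URL = 'redis://localhost:6379'
DUPEFILTER_DEBUG = True  # log every filtered duplicate, not only the first

Note that in this standalone mode from_settings() generates the one-time dupefilter:<timestamp> key, so fingerprints are not shared between runs or hosts; the shared per-spider key comes from the scheduler integration described below.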
This file looks fairly involved at first: it reimplements the request deduplication that Scrapy itself already provides. When Scrapy runs on a single machine, it only needs to consult the in-memory request queue or the persisted one (Scrapy's own persistence, enabled via JOBDIR, keeps seen-request fingerprints in a plain requests.seen file, not a database) to decide whether the request it is about to issue has already been made or is currently being scheduled; a local read is enough. In a distributed run, though, the scheduler on every host has to connect to the same request pool in the same database to decide whether the current request is a duplicate.
This file implements redis-based deduplication by subclassing BaseDupeFilter and overriding its methods. Judging from the source, scrapy-redis reuses a fingerprint interface from Scrapy itself, request_fingerprint. The interface is interesting: according to the Scrapy docs, it hashes the request to decide whether two urls are the same (the same url always produces the same hash), and even two urls with the same address and the same GET parameters in a different order produce the same result, because the url is canonicalized (query parameters sorted) before hashing, which is rather neat. So scrapy-redis simply keeps using the url fingerprint to decide whether a request has been seen before.
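This is easy to check interactively; a small sketch (the urls are just examples):

from scrapy import Request
from scrapy.utils.request import request_fingerprint

# The url is canonicalized before hashing, so the order of the GET
# parameters makes no difference to the resulting fingerprint.
fp1 = request_fingerprint(Request('http://example.com/?a=1&b=2'))
fp2 = request_fingerprint(Request('http://example.com/?b=2&a=1'))
assert fp1 == fp2  # same hex digest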
This class connects to redis and inserts fingerprints into a set under a single key. The key is the same for every instance of the same spider: redis is a key-value database, so identical keys reach identical values, and building the key from the spider name plus 'dupefilter' means that crawler instances on different hosts, as long as they belong to the same spider, all hit the same set, which serves as their shared url dedup pool. If sadd returns 0, the fingerprint already exists in the set (a set holds no duplicate values), so request_seen returns True and the request is treated as a duplicate; if it returns 1, a fingerprint was just added, meaning the request has not appeared before, so request_seen returns False, and conveniently the new fingerprint is already stored in the database as a side effect. The dupefilter check is used by the scheduler class: every request is checked before it enters scheduling, and if it is a duplicate it takes no part in scheduling and is simply discarded, since scheduling it again would just waste resources.
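The set semantics are easy to verify directly with redis-py (a sketch; the key name and fingerprint value are made up for illustration):

import redis

server = redis.StrictRedis()       # assumes redis running on localhost:6379
key = 'myspider:dupefilter'        # hypothetical key for a spider named "myspider"
print(server.sadd(key, 'somefp'))  # 1: fingerprint was new, request not seen before
print(server.sadd(key, 'somefp'))  # 0: already in the set, request is a duplicate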