Scrapy's core components include:
- Spider
- Scheduler
- Middlewares (downloader and spider)
- Item pipelines
- Engine
The Scrapy data flow is shown in the figure below:

![Scrapy architecture and data flow](scrapy.jpg)
The spider sends Requests to the engine. One thing to be clear about here: Scrapy is single-threaded, not multi-threaded; its runtime core is an event loop (Twisted's epoll/select-based reactor). The engine works like a heart that keeps the whole framework running, and every request passes through this component.
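To make that flow concrete, here is a minimal spider sketch (the spider name and URL are illustrative, not from this post). Every Request it yields is handed to the engine, which drives the event loop; callbacks all run on that single thread, so they must not block.

```python
import scrapy


class QuotesSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate the data flow.
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yielded items head towards the item pipelines;
        # yielded Requests go back through the engine to the scheduler.
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```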
The spider's Requests pass through the middlewares on their way to the scheduler. Inside the scheduler, the enqueue_request method calls request_seen:
```python
def enqueue_request(self, request):
    # Drop the request if it is filterable and the dupefilter has already seen it.
    if not request.dont_filter and self.df.request_seen(request):
        self.df.log(request, self.spider)
        return False
    # Otherwise push it to the disk queue, falling back to the memory queue.
    dqok = self._dqpush(request)
    if dqok:
        self.stats.inc_value('scheduler/enqueued/disk', spider=self.spider)
    else:
        self._mqpush(request)
        self.stats.inc_value('scheduler/enqueued/memory', spider=self.spider)
    self.stats.inc_value('scheduler/enqueued', spider=self.spider)
    return True
```
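The dont_filter flag checked above is the escape hatch for deliberately scheduling a duplicate URL. A small hedged sketch (the spider name and URL are placeholders):

```python
import scrapy


class RetrySameUrlSpider(scrapy.Spider):
    # Hypothetical spider; the URL below is a placeholder.
    name = "retry_same_url"

    def start_requests(self):
        url = "https://example.com/login"
        yield scrapy.Request(url, callback=self.parse)
        # Without dont_filter=True this second request would be dropped
        # by request_seen() before it ever reached the downloader.
        yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        self.logger.info("fetched %s", response.url)
```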
Here is the request_seen method, which ultimately delegates to request_fingerprint:
```python
def request_seen(self, request):
    fp = self.request_fingerprint(request)
    if fp in self.fingerprints:
        # Already seen: tell the scheduler to drop this request.
        return True
    self.fingerprints.add(fp)
    if self.file:
        self.file.write(fp + '\n')

def request_fingerprint(self, request):
    return request_fingerprint(request)
```
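Because the dupefilter is pluggable through the DUPEFILTER_CLASS setting, this default behaviour is easy to extend. A minimal sketch, assuming the stock RFPDupeFilter shown above; the module and class names are illustrative:

```python
from scrapy.dupefilters import RFPDupeFilter


class LoggingDupeFilter(RFPDupeFilter):
    """Same fingerprint logic as the default filter, but count dropped duplicates."""

    def __init__(self, path=None, debug=False):
        super().__init__(path, debug)
        self.dropped = 0

    def log(self, request, spider):
        self.dropped += 1
        super().log(request, spider)


# settings.py (hypothetical project layout):
# DUPEFILTER_CLASS = "myproject.dupefilters.LoggingDupeFilter"
```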
The core of request_fingerprint (in scrapy.utils.request) is shown below; it hashes the request with SHA1 and uses the digest for deduplication:
```python
if include_headers:
    include_headers = tuple(to_bytes(h.lower())
                            for h in sorted(include_headers))
cache = _fingerprint_cache.setdefault(request, {})
cache_key = (include_headers, keep_fragments)
if cache_key not in cache:
    # SHA1 over the method, the canonicalized URL, the body and any selected headers.
    fp = hashlib.sha1()
    fp.update(to_bytes(request.method))
    fp.update(to_bytes(canonicalize_url(request.url, keep_fragments=keep_fragments)))
    fp.update(request.body or b'')
    if include_headers:
        for hdr in include_headers:
            if hdr in request.headers:
                fp.update(hdr)
                for v in request.headers.getlist(hdr):
                    fp.update(v)
    cache[cache_key] = fp.hexdigest()
return cache[cache_key]
```
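A quick way to see what canonicalize_url buys you (assuming a Scrapy version that still exposes scrapy.utils.request.request_fingerprint; newer releases replace it with a RequestFingerprinter component):

```python
from scrapy import Request
from scrapy.utils.request import request_fingerprint

# canonicalize_url sorts the query parameters, so these two requests
# hash to the same fingerprint and the second one is treated as already seen.
r1 = Request("https://example.com/item?id=1&page=2")
r2 = Request("https://example.com/item?page=2&id=1")
print(request_fingerprint(r1) == request_fingerprint(r2))  # True
```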
An in-memory set of fingerprints like this filters duplicates effectively for typical crawl sizes; a later post will cover the Bloom filter, which trades a small false-positive rate for much lower memory use on very large crawls.