When a spider is stopped by an accident, we need to know how many jobs have already been done and where to resume the work. So this lesson builds a simple flag that marks how many jobs remain to do.
Coding
from multiprocessing import Pool
from page_parsing import get_item_info_from,url_list,item_info,get_links_from
from channel_extracing import channel_list
# ================================================= < < URL deduplication > > =====================================================
# Design notes:
# 1. Use two collections: the first stores only the scraped URLs (url_list); the second stores the item details for each URL (item_info)
# 2. While writing a record into the second collection, also save an extra field (key) 'index_url' -- the URL that the detail page corresponds to
# 3. If scraping is interrupted, the URL field in the second (detail) collection should be a subset of the URL set in the first collection
# 4. Subtracting the two URL sets yields the URLs that still need to be scraped
db_urls = [item['url'] for item in url_list.find()] # list comprehension: all URLs that should be crawled
index_urls = [item['url'] for item in item_info.find()] # all URLs already present in the item-detail collection
x = set(db_urls) # convert to the set data structure
y = set(index_urls)
rest_of_urls = x - y # set difference: the URLs left to crawl
# ======================================================================================================================
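The set-difference step above can be checked on its own. Here is a minimal, self-contained sketch, with hypothetical in-memory URL lists standing in for the two MongoDB collections:

```python
# Hypothetical data standing in for the two collections:
# db_urls holds every URL to crawl; index_urls holds the ones already finished.
db_urls = [
    'http://example.com/item/1',
    'http://example.com/item/2',
    'http://example.com/item/3',
]
index_urls = [
    'http://example.com/item/1',
]

# Same logic as above: set difference gives the remaining work.
rest_of_urls = set(db_urls) - set(index_urls)
print(len(rest_of_urls))  # how many jobs remain
```

Because sets ignore order and duplicates, this also deduplicates any URLs that were accidentally stored twice.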
This code is adapted from the Plan4Combat teacher. I am still considering how to simplify it.
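To actually resume, rest_of_urls can be fed back into a pool. The sketch below is a stand-alone illustration, not the lesson's exact code: it uses a thread pool from multiprocessing.dummy (which has the same Pool API as multiprocessing.Pool) and a stub worker in place of get_item_info_from, so that it runs without the page_parsing module:

```python
from multiprocessing.dummy import Pool  # thread-based Pool with the same API as multiprocessing.Pool

# Stub standing in for get_item_info_from; a real worker would fetch the page
# and write the item details (plus its URL) into the item_info collection.
def get_item_info(url):
    return {'url': url, 'status': 'done'}

# In the lesson's code this set comes from the database subtraction above.
rest_of_urls = {'http://example.com/item/2', 'http://example.com/item/3'}

pool = Pool()
# Resume by mapping the worker over only the unfinished URLs.
results = pool.map(get_item_info, sorted(rest_of_urls))
pool.close()
pool.join()
print(len(results))  # 2
```

Each restart recomputes rest_of_urls, so the pool never re-fetches pages that already have a record in item_info.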