Create the project:
scrapy startproject ang
Create the crawl spider:
scrapy genspider -t crawl angel angelimg.com
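For reference, the crawl template generates a skeleton roughly like the one below (exact contents vary slightly between Scrapy versions, and start_urls has to be pointed at the site's real entry page by hand):

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class AngelSpider(CrawlSpider):
    name = 'angel'
    allowed_domains = ['angelimg.com']
    start_urls = ['http://www.angelimg.com/']

    rules = (
        # placeholder rule from the template; replaced by the real rules below
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = {}
        return item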
Summary:
1: Rules are tried from top to bottom, and each extracted link is handled only by the first rule whose pattern matches it. Because the allow patterns are unanchored regexes, the broader /ang/\d+ pattern would also swallow the /ang/\d+/\d+ pagination URLs, so the rules have to be written "in reverse", most specific first:
rules = (
    # Album pagination pages: follow them and run the callback to pull image URLs
    # for download (see the parse_item sketch after this block)
    Rule(LinkExtractor(allow=r'http://www.angelimg.com/ang/\d+/\d+'), callback='parse_item', follow=True),
    # Album landing pages: follow only, no callback
    Rule(LinkExtractor(allow=r'http://www.angelimg.com/ang/\d+'), follow=True),
    # Index pagination: follow only, no callback
    Rule(LinkExtractor(allow=r'http://www.angelimg.com/index/\d+'), follow=True),
)
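A minimal parse_item sketch to go with the first rule. The XPath is a placeholder (the real selector depends on the album page's markup), and the image_urls key assumes the built-in ImagesPipeline is enabled in settings (see the pipeline sketch under point 2):

    def parse_item(self, response):
        # placeholder XPath -- inspect the album page to find the real image selector
        yield {
            'image_urls': response.xpath('//div[@id="content"]//img/@src').getall(),
        }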
2: The images are protected against hotlinking (the server checks where a request came from), so find the default request headers in settings and add a Referer header with the site's domain to declare the source:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'Referer': 'http://www.angelimg.com',
}
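DEFAULT_REQUEST_HEADERS applies to every request the downloader sends, including the media requests made by the images pipeline, so the Referer above covers the image downloads too. A minimal sketch of enabling the built-in ImagesPipeline in settings (the storage path is an assumption, and the pipeline needs Pillow installed):

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = './images'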
3: Add a proxy IP pool to settings and pick from it in the downloader middleware's process_request. The class also needs an initializer to receive the settings, and since Scrapy instantiates middlewares through from_crawler, that hook has to pass the settings along:
import random

class ProxyMiddleware:  # class name is illustrative; register it in DOWNLOADER_MIDDLEWARES
    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy creates middlewares via from_crawler; hand the settings on to __init__
        return cls(crawler.settings)

    def __init__(self, settings):
        self.settings = settings
        self.ips = settings.getlist("IPS")  # proxy pool defined in settings.py

    def process_request(self, request, spider):
        # attach a random proxy unless the request already carries one
        if "proxy" not in request.meta:
            request.meta['proxy'] = random.choice(self.ips)
        return None
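For settings.getlist("IPS") to return anything, the pool and the middleware registration both have to be in settings.py. A minimal sketch, assuming the project name ang from above and the illustrative class name ProxyMiddleware (the proxy addresses are placeholders):

IPS = [
    'http://111.111.111.111:8888',
    'http://222.222.222.222:8888',
]
DOWNLOADER_MIDDLEWARES = {
    'ang.middlewares.ProxyMiddleware': 543,
}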