The importance of proxies for web crawlers needs no further elaboration here. First, a flowchart of the proxy pool:
1. Fetching proxy IPs
Free proxies found online are all unreliable (you know why). A paid provider, 訊代理 (Xdaili), is recommended as dependable. This article uses its dynamic-switching plan: one request every 10s, each returning 5 proxy IPs.
```python
import time
from threading import Thread

while True:
    try:
        begin = time.time()
        proxies_list = download_proxies()  # download_proxies fetches a batch of proxy IPs from the provider
        thread_list = []
        for proxies in proxies_list:  # check connectivity in parallel threads and store survivors in Redis
            t = Thread(target=store_proxies, args=(proxies,))
            t.start()
            thread_list.append(t)
        for t in thread_list:
            t.join()
        end = time.time()
        # keep at least ~15s between provider requests
        if end - begin > 15:
            continue
        else:
            time.sleep(15.5 - (end - begin))
    except Exception as e:
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        err_logger.error(str(now) + ' ' + str(e))
        time.sleep(10)
```
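The loop above assumes a `download_proxies` helper. Below is a minimal sketch of its parsing half, assuming (hypothetically) that the provider returns JSON of the form `{"RESULT": [{"ip": ..., "port": ...}, ...]}` — the function name and the response keys are illustrative, so adapt them to the provider's actual API:

```python
import json

def parse_proxies(resp_text):
    """Convert a provider JSON response into requests-style proxy dicts.

    The {"RESULT": [{"ip": ..., "port": ...}]} shape is a hypothetical
    example; check your provider's documentation for the real format.
    """
    records = json.loads(resp_text)['RESULT']
    proxies_list = []
    for rec in records:
        addr = 'http://{}:{}'.format(rec['ip'], rec['port'])
        # requests expects a scheme -> proxy URL mapping
        proxies_list.append({'http': addr, 'https': addr})
    return proxies_list
```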
2. Proxy connectivity testing
Each proxy's connectivity is tested in its own thread; the test URL is the target site to be crawled.
```python
import requests

ping_url = 'http://www.xxxx.com'  # the target site to be crawled

def connect_check(proxies):
    # A non-200 status code, or any exception, counts as a failed test
    try:
        status_code = requests.get(ping_url, proxies=proxies, timeout=3).status_code
        return status_code == 200
    except Exception as e:
        print(e)
        return False
```
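For reference, the `proxies` value passed to requests throughout this article is a mapping from URL scheme to proxy address (the address below is a placeholder, not a real proxy):

```python
# The shape requests expects for its proxies= argument;
# 1.2.3.4:8080 is a placeholder address
proxies = {
    'http': 'http://1.2.3.4:8080',
    'https': 'http://1.2.3.4:8080',
}
```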
3. Storing proxies in Redis
Each IP that passes the connectivity test is stored as a key (with a dummy value such as 1) in a dedicated Redis database (e.g. db 1), with an expiry time (e.g. 90s) so stale proxies drop out of the pool automatically.
```python
import json
import time
import redis

conn = redis.Redis(db=1)

def store_proxies(proxies):
    conn_check = connect_check(proxies)  # the connectivity test from step 2
    if conn_check:
        proxies = json.dumps(proxies)
        duplicate_check = conn.exists(proxies)  # deduplicate against the pool
        if not duplicate_check:
            conn.setex(proxies, 90, 1)  # name, time, value (redis-py 3.x order); expires after 90s
            print('new proxies: ', proxies)
        else:
            print('Already exist proxies: ' + str(proxies))
    else:
        now = time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time()))
        print(str(now) + ' Can not connect ping_url -- proxies: ' + str(proxies))
```
4. RESTful interface
Spin up a small service with a web framework (e.g. Flask) so the crawler can fetch live proxies from the pool.