Homework approach
Yesterday I made a few attempts at crawling Lagou.com; after a few debug-and-run cycles my IP got banned.
I googled around and found surprisingly few tutorials on this topic; they are either copies of each other or gloss over the details. Frustrating...
Getting to the point: how does a crawler avoid getting banned?
Make the requests look more like they come from a browser
Previously I used only a single default User-Agent header, which makes the crawler easy for the site to spot: one browser firing off that many requests in a short time is clearly not human. So the idea is to pose as many different browsers; that lowers the number of requests attributed to any single browser and looks a bit more plausible. List a bunch of browser User-Agent strings and pick one at random for each request. Where do the strings come from? Here is a recommended site:
Click here for more User-Agent strings
Here is the code:
middlewares.py
# -*-coding:utf-8-*-
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


class RotateUserAgentMiddleware(UserAgentMiddleware):
    # initialize with an (empty) default User-Agent
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    # pick a random User-Agent from the list below and disguise the request with it
    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            print ua, '-----------------'
            request.headers.setdefault('User-Agent', ua)

    # the pool of User-Agent strings
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
settings.py
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'lagou.middlewares.RotateUserAgentMiddleware': 400,
}
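Note that assigning None to the built-in UserAgentMiddleware disables it, so it cannot overwrite the randomly chosen User-Agent; the custom middleware then takes over at order 400.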
Slow your hands down a bit
A crawler that fires N requests a second is basically announcing: "Look, I'm a crawler, see how fast my hands are", and then it is exactly this crawler that gets banned. So set a download delay (in settings.py):
DOWNLOAD_DELAY = 1
It is a bit slower, but stable and safe.
Request from multiple locations at once
How do we change where our requests appear to come from? Generally, the IP address is what marks our location, so changing the IP changes the apparent origin of the request; requesting through random IPs simulates visitors from all over the country.
So this is where a proxy IP pool comes in. Proxy IPs can be picked up from sites that provide them; click here for more proxy IPs.
How is this implemented concretely?
middlewares.py
Write a class that sets the proxy:
import random
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import base64
from settings import PROXIES


class ProxyMiddleware(object):
    # called by Scrapy for every outgoing request
    def process_request(self, request, spider):
        # randomly pick one of the proxies defined in settings
        proxy = random.choice(PROXIES)
        if proxy['user_pass'] is not None:
            # request.meta is a dict of extra request info; the 'proxy' key
            # tells Scrapy which proxy to route this request through
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # Base64 is an encoding that represents binary data using 64 printable characters
            encoded_user_pass = base64.encodestring(proxy['user_pass'])
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
            print "Successful " + proxy['ip_port']
        else:
            print "Fail " + proxy['ip_port']
            request.meta['proxy'] = "http://%s" % proxy['ip_port']
            # the next request will pick another IP; note that each request uses one valid IP
Analysis:
The comments mention request.meta, so let's first dig into what request.meta actually contains. I tested it in scrapy shell; the content returned was:
{'depth': 0,
'download_latency': 16.514000177383423,
'download_slot': 'docs.python-requests.org',
'download_timeout': 180.0,
'handle_httpstatus_list': <scrapy.utils.datatypes.SequenceExclude at 0x45a84b0>,
'proxy': 'http://125.73.40.123:8118'}
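For reference, those values are simply what comes back from fetching a page in scrapy shell and printing request.meta; the URL below is only an example:
scrapy shell "http://docs.python-requests.org"
>>> request.meta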
From this, we can start by changing the request's proxy.
Next, let's look at what request.headers contains. Debugging the same way gives:
{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip,deflate',
'Accept-Language': 'en',
'Proxy-Authorization': 'Basic ',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3'}
Looking at this dict, the field that needs modifying is Proxy-Authorization (the others are all fixed).
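As a side note, the value after 'Basic ' is just the base64 encoding of the string "user:password". A quick check with made-up credentials (they are not from this project):
import base64
print 'Basic ' + base64.encodestring('user:pass').strip()
# prints: Basic dXNlcjpwYXNz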
settings.py
Add the IPs:
PROXIES = [
    {'ip_port': '119.5.1.38:808', 'user_pass': ''},
    {'ip_port': '115.216.31.87:8118', 'user_pass': ''},
    {'ip_port': '125.73.40.123:8118', 'user_pass': ''},
    {'ip_port': '171.38.35.20:8123', 'user_pass': ''},
    {'ip_port': '171.38.35.97:8123', 'user_pass': ''},
    {'ip_port': '222.85.39.136:808', 'user_pass': ''},
]
settings.py
Register the middlewares:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    'lagou.middlewares.ProxyMiddleware': 100,
}
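A note on the order values: downloader middlewares with smaller numbers have their process_request() called earlier, so the custom ProxyMiddleware (100) has already set request.meta['proxy'] and the Proxy-Authorization header by the time the built-in HttpProxyMiddleware (110) sees the request.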
How to obtain proxy IPs
A site that provides proxy IPs was linked above; here is how to collect them in bulk.
Write a script:
get_ip.py
# -*- coding: utf-8 -*-
import requests
from bs4 import BeautifulSoup

f = open('proxy.txt', 'w')
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}

# walk the first 100 pages of the proxy list and save every HTTP/HTTPS proxy
for page in range(1, 101):
    link = 'http://www.xici.net.co/nn/' + str(page)
    html = requests.get(link, headers=headers)
    soup = BeautifulSoup(html.content, 'html.parser')
    trs = soup.find('table', id='ip_list').findAll('tr')
    for tr in trs[1:]:  # skip the table header row
        tds = tr.find_all('td')
        ip = tds[1].get_text()
        port = tds[2].get_text()
        protocol = tds[5].text.strip()
        if protocol == 'HTTP' or protocol == 'HTTPS':
            f.write('%s=%s:%s\n' % (protocol, ip, port))
            print '%s=%s:%s' % (protocol, ip, port)
f.close()
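Running this leaves one proxy per line in proxy.txt, in the form PROTOCOL=ip:port (for example HTTP=119.5.1.38:808), which is the format the validation script below expects.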
But this raises a new question: how do we make sure these IPs actually work? So each IP needs to be verified.
Verify that the proxy IPs work
test_ip.py
# -*- coding: utf-8 -*-
import requests
import threading

inFile = open('proxy.txt', 'r')
outFile = open('available.txt', 'w')
headers = {"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36"}
lock = threading.Lock()

def test():
    while True:
        # the lock keeps the threads from reading/writing the files at the same time
        with lock:
            line = inFile.readline()
        if len(line) == 0:
            break
        protocol, proxy = line.strip().split('=')
        try:
            # route both schemes through the proxy so the test request really goes through it,
            # and time out slow proxies so threads do not hang
            proxies = {'http': proxy, 'https': proxy}
            r = requests.get("https://www.lagou.com", headers=headers, proxies=proxies, timeout=10)
            # judge by the HTTP status code returned through the proxy
            if r.status_code == requests.codes.ok:
                print 'add proxy:' + proxy
                with lock:
                    outFile.write(proxy + '\n')
            else:
                print '...'
        except:
            print "fail"

# run the checks in 50 threads
all_thread = []
for i in range(50):
    t = threading.Thread(target=test)
    all_thread.append(t)
    t.start()

for t in all_thread:
    t.join()

inFile.close()
outFile.close()
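One possible way to wire the two together, instead of hard-coding PROXIES in settings.py as above, is to load available.txt at startup. A rough sketch (it assumes available.txt sits in the project root and holds one ip:port per line; this is not from the original article):
PROXIES = []
with open('available.txt') as f:
    for line in f:
        line = line.strip()
        if line:
            PROXIES.append({'ip_port': line, 'user_pass': ''})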
Homework results:
I counted everything scraped from the pages; the count is within 5 of the true total, which is a big improvement over the first crawl, which only managed 50-odd items.
Problems with this homework
Ah... this is the piece I have spent the longest researching since I started learning to crawl...
A few problems are still unsolved.
Problem 1:
In the code above:
encoded_user_pass = base64.encodestring(proxy['user_pass'])
# Base64 is an encoding that represents binary data using 64 printable characters
request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
Why is the line encoded_user_pass = base64.encodestring(proxy['user_pass']) needed?
Problem 2:
Even after switching to proxy IPs that had already passed the test, some IPs still produced messages like this during the real crawl.
Problem 3:
The results are still not stored in a database; that is the next improvement.