error1:
NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x00000000038F2B00>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.',))
Solution:
session.keep_alive = False
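For context, a minimal sketch of a session with keep-alive disabled (the URL is a placeholder); sending a Connection: close header is a common companion to this setting:
import requests

session = requests.Session()
session.keep_alive = False                    # the fix suggested above
session.headers['Connection'] = 'close'       # also ask the server not to hold the connection open
resp = session.get('https://example.com/')    # placeholder URL
print(resp.status_code)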
error2:
python hostname doesn't match either of facebookXXXXX
Solution:
import ssl
ssl.match_hostname = lambda cert, hostname: True
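Note that this monkey-patch disables hostname checking globally and is insecure. A sketch of both the patch and the narrower per-request option in requests (placeholder URL; verify=False is likewise insecure and only for debugging):
import ssl
import requests

ssl.match_hostname = lambda cert, hostname: True            # global patch from above: skip hostname checks (insecure)
resp = requests.get('https://example.com/', verify=False)   # per-request alternative: skip certificate verification (insecure)
print(resp.status_code)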
After much searching, I found the cause of the problem: too many HTTP connections were left unclosed.
Solutions:
1. Increase the number of connection retries
requests.adapters.DEFAULT_RETRIES = 5
2. Close redundant connections
requests uses the urllib3 library, whose HTTP connections are keep-alive by default; set keep_alive to False in requests to turn this off.
s = requests.session()
s.keep_alive = False
3. Operate through a single session only, i.e. create just one connection, and set the maximum number of connections or the retry count.
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
session.get(url)
A variant of the same approach, applied to a login request:
import requests
import json
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

s = requests.Session()
retry = Retry(connect=5, backoff_factor=1)
adapter = HTTPAdapter(max_retries=retry)
s.mount('http://', adapter)
s.keep_alive = False
res = s.post(self.conn.host + '/sign-in', data=json.dumps({'name': "XXX", 'pwd': "XXX"}))
response = res.json()
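The Retry class can express more than connect retries. A sketch of a fuller policy, assuming you also want to retry on read errors and on typical transient status codes (the values here are illustrative, not from the original post):
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

retry = Retry(
    total=5,                                  # overall cap on retries
    connect=3,                                # retries for connection errors
    read=2,                                   # retries for read errors
    backoff_factor=0.5,                       # sleep 0.5s, 1s, 2s, ... between attempts
    status_forcelist=[500, 502, 503, 504],    # also retry on these server errors
)
session = requests.Session()
session.mount('http://', HTTPAdapter(max_retries=retry))
session.mount('https://', HTTPAdapter(max_retries=retry))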
However, someone on Stack Overflow offered a further explanation (see the links below).
4. Install pyopenssl:
pip install -U pyopenssl
5. Sleep for a fixed interval between requests, as sketched below.
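A minimal sketch of point 5, with placeholder URLs and an arbitrary two-second pause:
import time
import requests

s = requests.Session()
for u in ["https://example.com/a", "https://example.com/b"]:   # placeholder URLs
    r = s.get(u)
    print(u, r.status_code)
    time.sleep(2)   # fixed sleep between requests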
https://github.com/requests/requests/issues/4246#event
https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url
I ran into this error while scraping Boss Zhipin (boss直聘); to summarize:
1. Too many HTTP connections left unclosed. Solution:
import requests
requests.adapters.DEFAULT_RETRIES = 5    # increase the retry count
s = requests.session()
s.keep_alive = False    # close redundant connections
s.get(url)    # the URL you want to fetch
2. Requests were too frequent and access was blocked. Solution: use a proxy.
import requests
s = requests.session()
url = "https://mail.163.com/"
s.proxies = {"https": "47.100.104.247:8080", "http": "36.248.10.47:8080"}
s.headers = header
s.get(url)
A site for finding proxies: http://ip.zdaye.com/shanghai_ip.html#Free
Notes on using proxies:
1. Proxies come in http and https varieties and cannot be mixed; using an http proxy for https raises the error above.
2. Proxies are passed in as a dict, as in the example above; both the "47.100.104.247:8080" form and the "https://47.100.104.247:8080" form are accepted.
3. An unusable proxy raises the same error. Use the following to check whether a proxy works:
import requests
s = requests.session()
url = "https://mail.163.com/"
s.keep_alive = False
s.proxies = {"https": "47.100.104.247:8080", "http": "36.248.10.47:8080"}
s.headers = header
r = s.get(url)
print(r.status_code)    # prints normally if the proxy works; otherwise the error above is raised
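Building on that check, a small hypothetical helper that probes a list of candidate proxies and returns the first one that answers (the addresses are the sample ones from above, not known-good proxies):
import requests

def find_working_proxy(candidates, test_url="https://mail.163.com/", timeout=5):
    # Try each candidate and return the first proxy that responds with HTTP 200.
    for p in candidates:
        proxies = {"http": p, "https": p}
        try:
            r = requests.get(test_url, proxies=proxies, timeout=timeout)
            if r.status_code == 200:
                return p
        except requests.exceptions.RequestException:
            continue   # dead or blocked proxy, try the next one
    return None

print(find_working_proxy(["47.100.104.247:8080", "36.248.10.47:8080"]))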
Upgrade requests:
pip install --upgrade requests
If too many requests come from the same IP, the IP gets banned as well; this is where proxies come in. In Python it is simple: just pass the proxies parameter with the request:
r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
As for proxy IPs, there is a site worth recommending; at the bottom it lists 20 free ones, usually enough for a small crawler. Using a proxy brings questions such as whether the proxy connection is alive, so add the code below to the program to configure the connection:
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
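Note the three lines above raise the retry count and disable keep-alive but do not actually bound the connection time; for that, requests takes a per-request timeout (the values below are arbitrary):
r = s.get(url, timeout=(3.05, 27))   # (connect timeout, read timeout) in seconds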
A complete example script (Python 2, hence reload(sys), xrange, and unicode):
from bs4 import BeautifulSoup
import json, requests, sys
reload(sys)
sys.setdefaultencoding('utf-8')
shop_ids = [22711693,24759450,69761921,69761921,22743334,66125712,22743270,57496584,75153221,57641884,66061653,70669333,57279088,24740739,66126129,75100027,92667587,92452007,72345827,90004047,90485109,90546031,83527455,91070982,83527745,94273474,80246564,83497073,69027373,96191554,96683472,90500524,92454863,92272204,70443082,96076068,91656438,75633029,96571687,97659144,69253863,98279207,90435377,70669359,96403354,83618952,81265224,77365611,74592526,90479676,56540304,37924067,27496773,56540319,32571869,43611843,58612870,22743340,67293664,67292945,57641749,75157068,58934198,75156610,59081304,75156647,75156702,67293838]  # Dianping shop ids to crawl
returnList = []
proxies = {
? ? # "https": "http://14.215.177.73:80",? ? "http": "http://202.108.2.42:80",
}
headers = {
? ? 'Host': 'www.dianping.com',
? ? 'Referer': 'http://www.dianping.com/shop/22711693',
? ? 'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/535.19',
? ? 'Accept-Encoding': 'gzip'}
cookies = {
? ? '_lxsdk_cuid': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
? ? 'lxsdk': '16146a366a7c8-08cd0a57dad51b-32637402-fa000-16146a366a7c8',
? ? '_hc.v': 'ec20d90c-0104-0677-bf24-391bdf00e2d4.1517308569',
? ? 's_ViewType': '10',
? ? 'cy': '16',
? ? 'cye': 'wuhan',
? ? '_lx_utm': 'utm_source%3DBaidu%26utm_medium%3Dorganic',
? ? '_lxsdk_s': '1614abc132e-f84-b9c-2bc%7C%7C34'}
requests.adapters.DEFAULT_RETRIES = 5
s = requests.session()
s.keep_alive = False
for i in shop_ids:
    url = "https://www.dianping.com/shop/%s/review_all" % i
    r = requests.get(url, headers=headers, cookies=cookies, proxies=proxies)
    # print r.text
    soup = BeautifulSoup(r.text, 'lxml')
    # number of pagination links + 1 = total number of review pages
    length = len(soup.find_all(class_='PageLink')) + 1
    # print length
    for j in xrange(length):
        urlIn = "http://www.dianping.com/shop/%s/review_all/p%s" % (i, j)
        resp = requests.get(urlIn, headers=headers, cookies=cookies, proxies=proxies)
        soupIn = BeautifulSoup(resp.text, 'lxml')
        title = soupIn.title.string[0:15]
        # print title
        coment = soupIn.select('.reviews-items li')
        for one in coment:
            try:
                # skip promoted entries, which carry the 'item' class
                if one['class'][0] == 'item':
                    continue
            except KeyError:
                pass
            name = one.select_one('.main-review .dper-info .name')
            # print name.get_text().strip()
            name = name.get_text().strip()
            star = one.select_one('.main-review .review-rank span')
            # print star['class'][1][7:8]
            star = star['class'][1][7:8]
            pl = one.select_one('.main-review .review-words')
            # reset the class so the hidden part of the review text becomes visible
            pl['class'] = {'review-words'}
            words = pl.get_text().strip()
            returnList.append([title, name, star, words])

f = open("/Users/huojian/Desktop/store_shop.sql", "w")
for one in returnList:
    f.write("\n")
    f.write(unicode(one[0]))
    f.write("\n")
    f.write(unicode(one[1]))
    f.write("\n")
    f.write(unicode(one[2]))
    f.write("\n")
    f.write(unicode(one[3]))
    f.write("\n")
f.close()