一開始使用requests.get(url)就開始報錯,后面查了下資料住涉,說需要在后面加上allow_redirects=False∧粕埽可惜當加上這個條件的時候舆声,直接返回304,獲取不了實際內容柳爽。還有的資料顯示是因為沒有hearders的問題媳握,后面設置上去了也是不行。
Traceback (most recent call last): File "E:/my_project/project/測試/簡單分布式爬蟲(cs)/爬蟲節(jié)點/HTMLdown.py", line 18, in t = h.download(url) File "E:/my_project/project/測試/簡單分布式爬蟲(cs)/爬蟲節(jié)點/HTMLdown.py", line 8, in download r = requests.get(url, headers=headers, allow_redirects=True) File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 72, in get return request('get', url, params=params, **kwargs) File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\api.py", line 58, in request return session.request(method=method, url=url, **kwargs) File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 508, in request resp = self.send(prep, **send_kwargs) File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 640, in send history = [resp for resp in gen] if allow_redirects else [] File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 640, in history = [resp for resp in gen] if allow_redirects else [] File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\requests\sessions.py", line 140, in resolve_redirects raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp) requests.exceptions.TooManyRedirects: Exceeded 30 redirects.
后面想到是否是因為重定向磷脯,而導致hearders沒有維持蛾找,后面通過session去get請求
class HtmlDownloader(object):
????def download(self, url):
????????if urlis None:
????????????return None
? ? ? ? user_agent ='Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
? ? ? ? headers = {'User_Agent': user_agent}
????????sessions = requests.session()
????????sessions.headers = headers
????????r = sessions.get(url, allow_redirects=True)
????????print(r.url)
????????if r.status_code ==200:
????????????r.encoding ='utf-8'
? ? ? ? ? ? return r.text
????????return None
if __name__ =='__main__':
????h = HtmlDownloader()
????url ='https://baike.baidu.com/view/284853.htm'
? ? t = h.download(url)
????print(t)
發(fā)現(xiàn)可以正常訪問