I. Downloading a web page
1. Version 1.0:
from urllib.request import urlopen

def download(url):
    # Fetch the page and return its raw bytes
    html = urlopen(url).read()
    return html
2. Version 1.0 is neither concise nor easy to follow, so it was upgraded.
Version 1.1:
def download(url):
    print('Downloading:', url)
    return urlopen(url).read()
3. Keep the program from crashing when fetching a page fails.
Version 2.0:
def download(url):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        # Not every exception has .reason, so fall back to the exception itself
        print('Download error:', getattr(e, 'reason', e))
    return html
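The error handling above can be checked without any network access. The following is a minimal sketch, not part of the original versions: `error_text` and `try_download` are hypothetical helper names introduced only for illustration. It relies on the fact that a URL with no scheme makes `urlopen` raise `ValueError` before it connects anywhere, and that `ValueError` has no `.reason` attribute.

```python
from urllib.request import urlopen
from urllib.error import URLError


def error_text(exc):
    """Return exc.reason when it exists, otherwise the exception's own message."""
    return str(getattr(exc, 'reason', exc))


def try_download(url):
    """Hypothetical variant of download() that returns (html, error message)."""
    try:
        return urlopen(url).read(), None
    except Exception as e:
        return None, error_text(e)


# A URL without a scheme raises ValueError before any connection is attempted.
html, err = try_download('not-a-url')
print(html, err)

# URLError carries its message in .reason, which error_text picks up.
print(error_text(URLError('timed out')))
```

Returning the message alongside `None` is only one possible design; the versions in these notes print the message and return `None` alone.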
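4. Two kinds of error codes are common: 404 and 5xx (2xx means success). A 5xx code sometimes appears during a download and indicates a server-side error, in which case we want to reconnect. (404 means the requested page does not exist, so requesting it again will generally not help.)
Version 2.1 (with reconnection):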
def download(url, num_retry=2):
    print('Downloading:', url)
    try:
        html = urlopen(url).read()
    except Exception as e:
        html = None
        print('Download error:', getattr(e, 'reason', e))
        if num_retry > 0:
            # Retry only on 5xx server errors; a 404 will not fix itself
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, num_retry=num_retry - 1)
    return html
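The retry rule can also be written as a loop and exercised offline. This is a sketch under stated assumptions, not the original method: `is_retryable`, `download_with_retry`, and the injected `fetch` callable are hypothetical names, and `flaky_fetch` is a stand-in that fails twice with a 503 before succeeding. In real use `fetch` could be something like `lambda u: urlopen(u).read()`.

```python
import io
from urllib.error import HTTPError, URLError


def is_retryable(error):
    """True only for 5xx server errors; 404 and connection errors are not retried."""
    code = getattr(error, 'code', None)
    return code is not None and 500 <= code < 600


def download_with_retry(url, fetch, num_retry=2):
    """Call fetch(url) up to num_retry + 1 times; fetch is a hypothetical callable."""
    for _ in range(num_retry + 1):
        try:
            return fetch(url)
        except URLError as e:
            print('Download error:', e.reason)
            if not is_retryable(e):
                break
    return None


calls = []


def flaky_fetch(url):
    """Stand-in fetcher: raises 503 on the first two calls, then returns a page."""
    calls.append(url)
    if len(calls) < 3:
        raise HTTPError(url, 503, 'Service Unavailable', None, io.BytesIO())
    return b'<html>ok</html>'


def missing_fetch(url):
    """Stand-in fetcher that always answers 404."""
    raise HTTPError(url, 404, 'Not Found', None, io.BytesIO())


result = download_with_retry('http://example.com', flaky_fetch)
missing = download_with_retry('http://example.com', missing_fetch)
```

With the default `num_retry=2` the loop makes at most three attempts, matching the recursive version, which calls itself at most twice after the first failure.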
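5. The user agent issue during download: set a User-agent header on the request. Use the Request class to customize the request.
Version 3.0 (final version):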
from urllib.request import Request, urlopen

def download(url, User_agent='wswp', num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read()
    except Exception as e:
        html = None
        print('Download error:', getattr(e, 'reason', e))
        if num_retry > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass User_agent along so the retry sends the same header
                return download(url, User_agent, num_retry - 1)
    return html
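To see what the header change does, the request can be inspected before it is sent. A small sketch, assuming the same `'wswp'` agent string as above; `make_request` is a hypothetical helper name. It also reads the default agent that urllib's opener would send when no header is set.

```python
from urllib.request import Request, build_opener


def make_request(url, user_agent='wswp'):
    """Build a Request that carries a custom User-agent header."""
    return Request(url, headers={'User-agent': user_agent})


request = make_request('http://example.com')

# Header stored on the Request object itself.
custom_agent = request.get_header('User-agent')

# Default header an opener adds when the request does not set one.
default_agent = dict(build_opener().addheaders)['User-agent']

print(custom_agent)
print(default_agent)
```

The default value starts with `Python-urllib/`, which some sites may reject. That is presumably why these notes set a custom agent.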
6. Bring in the urllib.error module to analyze errors:
Version 4.0:
from urllib.request import Request, urlopen
from urllib.error import URLError

def download(url, User_agent='wswp', num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    try:
        html = urlopen(request).read()
    except URLError as e:  # catch URLError so the failure can be analyzed
        html = None
        print('Download error:', e.reason)
        if num_retry > 0:
            # Only HTTPError (a subclass of URLError) carries a status code
            if hasattr(e, 'code') and 500 <= e.code < 600:
                return download(url, User_agent, num_retry - 1)
    return html
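The `hasattr(e, 'code')` check matters because `URLError` covers two situations. The sketch below builds both kinds of error by hand, so no network is needed. `status_code` is a hypothetical helper, and the two error objects are illustrative values, not real failures.

```python
import io
from urllib.error import HTTPError, URLError


def status_code(error):
    """Return the HTTP status code if the error has one, else None."""
    return getattr(error, 'code', None)


# A connection-level failure: there was no HTTP response, so no status code.
connection_problem = URLError('connection refused')

# An HTTP-level failure: the server answered, so a status code is available.
server_problem = HTTPError('http://example.com', 500,
                           'Internal Server Error', None, io.BytesIO())

print(issubclass(HTTPError, URLError))
print(status_code(connection_problem), connection_problem.reason)
print(status_code(server_problem), server_problem.reason)
```

Because `HTTPError` is a subclass of `URLError`, a single `except URLError` clause catches both, and `.reason` is available in either case.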
7. The proxy issue during download.
Version 5.0:
from urllib.request import Request, ProxyHandler, build_opener
from urllib.parse import urlparse
from urllib.error import URLError

def download(url, User_agent='wswp', proxy=None, num_retry=2):
    print('Downloading:', url)
    headers = {'User-agent': User_agent}
    request = Request(url, headers=headers)
    # To support a proxy server, stop using urlopen and open the page with an opener we build ourselves
    opener = build_opener()
    # If a proxy is set, add it to the opener
    if proxy:
        proxy_params = {urlparse(url).scheme: proxy}
        opener.add_handler(ProxyHandler(proxy_params))  # register the proxy server with our opener
    # With no proxy set, opening the page this way behaves the same as urlopen
    try:
        html = opener.open(request).read()
        # Both urlopen and opener.open(request) return an
        # <http.client.HTTPResponse object at XXX>, a file-like object
        # with a read() method and a code attribute (the HTTP status code)
    except URLError as e:  # catch URLError so the failure can be analyzed
        html = None
        print('Download error:', e.reason)
        if num_retry > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # Pass User_agent and proxy along so the retry uses the same settings
                return download(url, User_agent, proxy, num_retry - 1)
    return html
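The proxy setup can be looked at on its own, without sending a request. A minimal sketch, assuming a placeholder proxy at `http://127.0.0.1:8080` (not a real server); `build_proxy_opener` is a hypothetical helper. It passes the `ProxyHandler` straight to `build_opener`, a slight variation on the `add_handler` call used above.

```python
from urllib.parse import urlparse
from urllib.request import ProxyHandler, build_opener


def build_proxy_opener(url, proxy=None):
    """Return an opener that routes url's scheme through proxy when one is given."""
    if not proxy:
        return build_opener()
    # Key the proxy by the URL's scheme, e.g. {'https': 'http://127.0.0.1:8080'}
    proxy_params = {urlparse(url).scheme: proxy}
    return build_opener(ProxyHandler(proxy_params))


# The proxy address is a placeholder chosen for illustration.
opener = build_proxy_opener('https://example.com/page',
                            proxy='http://127.0.0.1:8080')

# Collect the proxy handlers registered on the opener to confirm the mapping.
proxy_handlers = [h for h in opener.handlers if isinstance(h, ProxyHandler)]
print(proxy_handlers[0].proxies)
```

Only requests whose scheme matches a key in the mapping go through the proxy, which is why the mapping is built from `urlparse(url).scheme`.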