A Brief Introduction to the urllib Library

References: Liao Xuefeng's Python tutorial; Cui Qingcai's blog

urllib is Python's built-in HTTP request library. It consists of four modules: urllib.request, urllib.error, urllib.parse, and urllib.robotparser.
Each one is briefly introduced below.

1. urllib.request

Used to simulate a browser sending requests.

1.1 urlopen

Syntax:

urllib.request.urlopen(url, data=None, [timeout,]*, cafile=None, capath=None, cadefault=False, context=None)

Use this function to visit Baidu and print the response:

import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))  # read() returns the response body as bytes; decode() converts it to a str

Run the code in an editor or IDE (PyCharm is recommended) and you will see the returned result:

[Screenshot: part of the returned result]

Next, try sending a POST request:

from urllib import request, parse

# Only request and parse are imported, so they are referenced directly (not as urllib.request / urllib.parse)
data = bytes(parse.urlencode({'world': 'hi'}), encoding='utf-8')
response = request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

The result is shown below. The screenshot shows that the data we set was passed successfully.

[Screenshot: the data was transmitted successfully]
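The data argument must be bytes, which is why the dictionary is first URL-encoded and then encoded as UTF-8. A minimal offline sketch of just that conversion step (the field names are illustrative only):

```python
from urllib import parse

# Illustrative form fields; any string keys and values work the same way.
form = {'world': 'hi', 'name': 'wg'}

# urlencode() turns the dict into a query string ...
encoded = parse.urlencode(form)
# ... and encode() turns that string into the bytes that urlopen() expects.
body = encoded.encode('utf-8')

print(encoded)
print(body)
```

Calling encode('utf-8') on the string is equivalent to the bytes(..., encoding='utf-8') form used above.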

The third parameter, timeout, sets how long (in seconds) to wait before the request is abandoned:

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # timeout is 0.1 seconds; an error is raised if the request takes longer
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):  # check the error type; if it is socket.timeout, print TIME OUT
        print('TIME OUT')  # TIME OUT
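Depending on where the timeout occurs, it may arrive wrapped in a URLError (as above) or as a bare socket.timeout raised while the response is being read. The helper below, is_timeout, is a hypothetical name used only for this sketch; it checks both cases without touching the network:

```python
import socket
import urllib.error

def is_timeout(exc):
    """Return True if exc represents a timeout, wrapped in URLError or not."""
    if isinstance(exc, socket.timeout):
        return True
    if isinstance(exc, urllib.error.URLError):
        return isinstance(exc.reason, socket.timeout)
    return False

# A timeout wrapped in URLError, a bare timeout, and an unrelated URLError.
wrapped = is_timeout(urllib.error.URLError(socket.timeout('timed out')))
bare = is_timeout(socket.timeout('timed out'))
other = is_timeout(urllib.error.URLError('name resolution failed'))
print(wrapped, bare, other)
```

In real code, such a helper would be called inside an except clause that catches both urllib.error.URLError and socket.timeout.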

1.1.1 About the response

Print some details of the response:

import urllib.request
import urllib.error

response = urllib.request.urlopen('http://httpbin.org/get')
print(type(response))
print(response.status)  # get the status code
print(response.getheaders())  # get the response headers as a list of (name, value) tuples
print(response.getheader('Server'))  # note: getheader, not getheaders; gets a single header value, here Server

[Screenshot: the response]

1.1.2 Configuring the request

Manually set the request headers and other parameters:

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {  # request headers
    'User-Agent': 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {  # renamed from dict to avoid shadowing the built-in type
    'name': 'wg'
}
data = bytes(parse.urlencode(form_data), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')  # keeps the structure clear
response = request.urlopen(req)
print(response.read().decode('utf-8'))

The returned data has the same form as in the earlier POST example:

[Screenshot: the returned data matches the earlier result]
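A Request object can be inspected before it is sent, which is handy for checking what will go over the wire. The sketch below reuses the URL and header value from the example above and makes no network call; add_header is an alternative to passing a headers dict:

```python
from urllib import request, parse

form_data = {'name': 'wg'}
data = parse.urlencode(form_data).encode('utf-8')

req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
# Headers can also be attached one at a time after construction.
req.add_header('User-Agent', 'Mozilla/4.0(compatible;MSIE 5.5;Windows NT)')

print(req.get_method())              # HTTP method that will be used
print(req.full_url)                  # target URL
print(req.get_header('User-agent'))  # Request stores header names capitalized
print(req.data)                      # request body as bytes
```

Note the lookup key 'User-agent': Request capitalizes only the first letter of each header name it stores.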

2. urllib.error

Used to handle exceptions raised while making requests. They fall roughly into two kinds, HTTPError and URLError:

# Exception handling module
from urllib import request, error

try:
    response = request.urlopen('http://www.onebookman.com/index.html')  # request an address that does not exist
except error.HTTPError as e:  # most specific first: HTTPError is a subclass of URLError
    print(e.reason, e.headers, e.code, 'wrong', sep='\n')
except error.ContentTooShortError as e:  # also a subclass of URLError, so it must come before it
    print(e.content)
except error.URLError as e:
    print(e.reason)
else:
    print('request suc')
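The order of the except clauses matters because HTTPError and ContentTooShortError are both subclasses of URLError, so a URLError clause placed before them would catch everything. The relationship can be checked directly:

```python
from urllib import error

# Both specific exceptions derive from URLError, which derives from OSError.
http_is_url_error = issubclass(error.HTTPError, error.URLError)
short_is_url_error = issubclass(error.ContentTooShortError, error.URLError)
url_error_is_os_error = issubclass(error.URLError, OSError)

print(http_is_url_error, short_is_url_error, url_error_is_os_error)
```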

3. urllib.parse

Parses URLs. The module's main function is urlparse.
A short piece of code makes its purpose clear:

from urllib.parse import urlparse

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)

The output is:

<class 'urllib.parse.ParseResult'> ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
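ParseResult is a named tuple, so its fields can be read by name or by index. As a small sketch, the query string can be broken down further with parse_qs from the same module:

```python
from urllib.parse import urlparse, parse_qs

result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')

print(result.scheme, result.netloc)  # access fields by name
print(result[2])                     # or by position (index 2 is the path)

# parse_qs maps each query key to a list of its values.
query = parse_qs(result.query)
print(query)
```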

3.1 urlunparse

This function is the inverse of urlparse:

from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

Output:

http://www.baidu.com/index.html;user?a=6#comment
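Because the two functions are inverses, a parsed URL can be modified and then reassembled. A minimal sketch using the named tuple's _replace method together with geturl():

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)

# Parsing and then unparsing gives back the original URL.
round_trip = urlunparse(parts)

# Swap out only the query component and rebuild the URL.
modified = parts._replace(query='a=6').geturl()

print(round_trip)
print(modified)
```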

3.2 urlencode

Joins a dictionary into the query parameters of a GET request:

from urllib.parse import urlencode

params = {
    'name': 'wg',
    'age': 18
}
base_url = 'http://www.baidu.com?'  # the '?' separates the base URL from the query string
url = base_url + urlencode(params)
print(url)  # http://www.baidu.com?name=wg&age=18
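urlencode also percent-encodes characters that are not safe in a URL, and with doseq=True it expands list values into repeated parameters. A short sketch with made-up parameter names:

```python
from urllib.parse import urlencode

# Spaces become '+', and non-ASCII text is percent-encoded as UTF-8.
encoded = urlencode({'q': 'hello world', 'lang': '中文'})

# doseq=True emits one key=value pair per list element.
repeated = urlencode({'tag': ['a', 'b']}, doseq=True)

print(encoded)
print(repeated)
```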

4. urllib.robotparser

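robotparser implements a parser for robots.txt files. It reads the format and content of a robots.txt file and provides methods to check whether a given User-Agent may access a particular resource on the site. When writing a web crawler, this module helps keep the crawler from fetching useless or duplicate content and from getting trapped in endless loops of dynamically generated ASP/PHP pages.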
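A minimal sketch of how the module can be used. To stay offline, the rules are fed in with parse() instead of being downloaded with set_url() and read(); the rules, the host example.com, and the crawler name MySpider are all made up for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real crawler would fetch it from the site.
rules = [
    'User-agent: *',
    'Disallow: /private/',
]

rp = RobotFileParser()
rp.parse(rules)

# can_fetch(useragent, url) reports whether the rules permit the request.
allowed_public = rp.can_fetch('MySpider', 'http://example.com/index.html')
allowed_private = rp.can_fetch('MySpider', 'http://example.com/private/data.html')
print(allowed_public, allowed_private)
```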

That concludes this brief introduction to the urllib library.
The end.

Last edited: 2018.08.13 00:01:34
