I. Introduction
# Introduction: requests can simulate browser requests. Compared with urllib, which we used earlier, the requests API is more convenient (under the hood it wraps urllib3)
# Note: after requests downloads a page it does not execute any JavaScript; we have to analyze the target site ourselves and issue the follow-up requests
# Install: pip3 install requests
# Request methods: the most commonly used are requests.get() and requests.post()
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r = requests.post('http://httpbin.org/post', data = {'key':'value'})
>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})
>>> r = requests.delete('http://httpbin.org/delete')
>>> r = requests.head('http://httpbin.org/get')
>>> r = requests.options('http://httpbin.org/get')
# Before studying requests in earnest, it helps to be familiar with the HTTP protocol
http://www.cnblogs.com/linhaifeng/p/6266327.html
# Official documentation
http://docs.python-requests.org/en/master/
II. GET requests
1. Basic request
import requests
response=requests.get('http://dig.chouti.com/')
print(response.status_code)
print(response.text)
2. The headers parameter
When sending a request we usually need to include request headers; they are the key to disguising ourselves as a browser. Commonly useful headers:
Referer # large sites often use this header to determine where a request came from
User-Agent # client browser information
Cookie # cookies are sent in the request headers, but requests has a dedicated parameter for them, so do not put them in headers={}
# Adding headers (the server inspects the request headers and may refuse access without them, e.g. when visiting https://www.zhihu.com/explore)
import requests
response=requests.get('https://www.zhihu.com/explore')
response.status_code #500
# Customize the headers
headers={
'User-Agent':'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.76 Mobile Safari/537.36',
}
response=requests.get('https://www.zhihu.com/explore',
headers=headers)
print(response.status_code) #200
3. The params parameter
Building the GET parameters by hand
# Disguise ourselves as a browser in the request headers; otherwise Baidu will not return the normal page content
import requests
response=requests.get('https://www.baidu.com/s?wd=python&pn=1',
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
})
print(response.text)
# If the search keyword contains Chinese or other special characters, it must be URL-encoded
from urllib.parse import urlencode
wd='egon老師'
encode_res=urlencode({'k':wd},encoding='utf-8')
keyword=encode_res.split('=')[1]
print(keyword)
# Then splice it into the URL
url='https://www.baidu.com/s?wd=%s&pn=1' %keyword
response=requests.get(url,
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
})
res1=response.text
Using the params parameter
# The steps above can be handled by the params parameter of requests, which essentially calls urlencode
response=requests.get('https://www.baidu.com/s',
params={
'wd':'egon老師',
'pn':1
},
headers={
'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
})
res2=response.text
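To see what params produces without sending a request, the query string can be rebuilt with the standard library. This is only a sketch: build_url is a hypothetical helper, and it assumes requests encodes the dict roughly the way urlencode does.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_url(base, params):
    """Hypothetical helper: append URL-encoded query parameters to base."""
    return base + '?' + urlencode(params, encoding='utf-8')

url = build_url('https://www.baidu.com/s', {'wd': 'egon老師', 'pn': 1})
print(url)

# Decoding the query string gives the original values back (as strings)
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```

Non-ASCII characters show up as %XX escapes in the URL, and numbers are sent as text, which is why pn comes back as the string '1'.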
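4. The cookies parameter
# Log in to GitHub, copy the cookies from the browser, and from then on you can log in with the cookie alone, without entering a username and password
# Username: egonlin  Email: 378533872@qq.com  Password: lhf@123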
import requests
Cookies={ 'user_session':'wGMHFJKgDcmRIVvcA14_Wrt_3xaUyJNsBnPbYzEL6L0bHcfc',
}
response=requests.get('https://github.com/settings/emails',
cookies=Cookies) # GitHub places no particular restrictions on request headers, so we do not need a custom User-Agent; other sites may require one
print('378533872@qq.com' in response.text) #True
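Browser developer tools usually show cookies as one Cookie header string, while the cookies parameter takes a dict. The sketch below converts one into the other; cookie_string_to_dict is a hypothetical helper and the cookie values are made up.

```python
def cookie_string_to_dict(cookie_string):
    """Hypothetical helper: turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for pair in cookie_string.split(';'):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first '=' only, because values may contain '='
        name, _, value = pair.partition('=')
        cookies[name] = value
    return cookies

# Made-up cookie values for illustration only
raw = 'user_session=abc123==; logged_in=yes'
cookies = cookie_string_to_dict(raw)
print(cookies)
```

The resulting dict could then be passed as cookies=cookies in a call such as the one above.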
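III. POST requests
1. Introduction
# GET requests
GET is the default HTTP request method
* No request body (in normal use)
* The data travels in the URL, so its size is limited in practice by the URL-length limits of browsers and servers (the HTTP specification itself sets no fixed limit)
* GET request data is exposed in the browser's address bar
Common ways a GET request is issued:
1. Entering a URL directly in the browser's address bar always produces a GET request
2. Clicking a hyperlink on a page also produces a GET request
3. Submitting a form uses GET by default, but the form can be set to POST
# POST requests
(1). The data does not appear in the address bar
(2). The protocol sets no upper limit on the data size (servers may still impose one)
(3). There is a request body
(4). Non-ASCII characters such as Chinese in a form-encoded request body are URL-encoded
# Note: requests.post() is used exactly like requests.get(); the one special feature is the data parameter, which holds the request body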
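The difference between the two methods can be inspected without touching the network. The sketch below uses only the standard library (urllib.request.Request, not requests) to build one GET and one POST carrying the same made-up fields; nothing is actually sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

fields = {'key': 'value', 'name': '老師'}
encoded = urlencode(fields)  # non-ASCII characters become %XX escapes

# GET: the data rides in the URL and there is no body
get_req = Request('http://httpbin.org/get?' + encoded)
# POST: the same data goes into the request body instead
post_req = Request('http://httpbin.org/post', data=encoded.encode('ascii'))

for req in (get_req, post_req):
    print(req.get_method(), req.full_url, req.data)
```

The GET line prints a URL with the query string and a body of None; the POST line prints a clean URL and the form-encoded bytes as the body.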
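2. Sending a POST request to simulate a browser login
When analyzing a login, deliberately enter a wrong username or password and capture the traffic to see the flow. The example below logs in to GitHub automatically.
Target site analysis:
Open in the browser
https://github.com/login
then enter a wrong username and password and capture the traffic
The login turns out to be a POST to:
https://github.com/session
The request headers include a cookie
The request body contains
commit:Sign in
utf8:✓
authenticity_token:sGTzT+SC798ARcvdddXYc7caxdZ2gW624PstcBous7Q6uYUwv9ZMwL9FOR32WS7bB7vAR/lpMjGTtUUNS54x5w==
login:378533872@qq.com
password:123456789
Flow analysis:
- Send a GET to https://github.com/login to obtain the initial cookie and the authenticity_token
- Send a POST to https://github.com/session carrying the initial cookie and the request body (authenticity_token, username, password, etc.)
- Finally, receive the logged-in cookie
- If the password is sent in encrypted form, enter a wrong username with the correct password and copy the encrypted password from the browser; GitHub sends the password in plain text
Implementation: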
import requests
import re
# First request
r1=requests.get('https://github.com/login')
r1_cookie=r1.cookies.get_dict() # initial cookie (not yet authorized)
# Extract the CSRF token from the page
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0]
# Second request: POST to the login endpoint with the initial cookie and the token, plus the username and password
data={
'commit':'Sign in',
'utf8':'✓',
'authenticity_token':authenticity_token,
'login':'317828332@qq.com',
'password':'alex3714'
}
r2=requests.post('https://github.com/session',
data=data,
cookies=r1_cookie
)
login_cookie=r2.cookies.get_dict() # logged-in cookie
# Third request: from now on login_cookie is all we need, e.g. to visit personal settings
r3=requests.get('https://github.com/settings/emails',
cookies=login_cookie)
print('317828332@qq.com' in r3.text) #True
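The token-extraction step can be tried offline. The sketch below runs the same regular expression against a made-up HTML fragment; extract_token is a hypothetical helper that returns None instead of raising IndexError when the page has no token (for example, if the site changes its markup).

```python
import re

# Made-up HTML fragment standing in for the login page
sample_html = '''
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN-123==" />
  <input type="text" name="login" />
</form>
'''

def extract_token(html):
    """Hypothetical helper: return the CSRF token, or None if it is absent."""
    match = re.search(r'name="authenticity_token".*?value="(.*?)"', html, re.S)
    return match.group(1) if match else None

token = extract_token(sample_html)
print(token)
print(extract_token('<html></html>'))
```

The re.S flag is an addition here so that the pattern still matches if the name and value attributes end up on different lines.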
requests.session()
Saves cookie information for us automatically
import requests
import re
session=requests.session()
# First request
r1=session.get('https://github.com/login')
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] # extract the CSRF token from the page
# Second request
data={
'commit':'Sign in',
'utf8':'✓',
'authenticity_token':authenticity_token,
'login':'317828332@qq.com',
'password':'alex3714'
}
r2=session.post('https://github.com/session',
data=data,
)
# Third request
r3=session.get('https://github.com/settings/emails')
print('317828332@qq.com' in r3.text) #True
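3. Additional notes
requests.post(url='xxxxxxxx',
data={'xxx':'yyy'}) # no Content-Type specified; the default Content-Type is application/x-www-form-urlencoded
# If we set Content-Type to application/json ourselves but still pass the values via data, the body is still form-encoded, so a server that parses it as JSON cannot read the values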
requests.post(url='',
data={'':1,},
headers={
'content-type':'application/json'
})
requests.post(url='',
json={'':1,},
) # the default Content-Type is application/json
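The mismatch described above can be reproduced with the standard library alone. This sketch assumes that a data= dict is serialized roughly like urlencode and a json= dict roughly like json.dumps; the payload is made up.

```python
import json
from urllib.parse import urlencode

payload = {'key': 'value', 'n': 1}

# Roughly what a data= dict becomes: a form-encoded body
form_body = urlencode(payload)
# Roughly what a json= dict becomes: a JSON document
json_body = json.dumps(payload)

print(form_body)
print(json_body)

# A server that trusts "Content-Type: application/json" and tries to
# parse the form-encoded body as JSON fails
try:
    json.loads(form_body)
    parsed_ok = True
except json.JSONDecodeError:
    parsed_ok = False
print(parsed_ok)
```

If a JSON body really is required, either pass json= or serialize the dict yourself and send the resulting string as data together with the matching header.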
IV. The Response object
1. Response attributes
import requests
respone=requests.get('http://www.reibang.com')
# Response attributes
print(respone.text)
print(respone.content)
print(respone.status_code)
print(respone.headers)
print(respone.cookies)
print(respone.cookies.get_dict())
print(respone.cookies.items())
print(respone.url)
print(respone.history)
print(respone.encoding)
# Closing the response: response.close()
from contextlib import closing
with closing(requests.get('xxx',stream=True)) as response:
    for line in response.iter_content(): # iterate over the binary page content
        pass
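2. Encoding issues
# Encoding issues
import requests
response=requests.get('http://www.autohome.com/news')
# response.encoding='gbk' # The autohome pages are encoded as gb2312, while requests falls back to ISO-8859-1 for text responses whose headers declare no charset; without setting gbk the Chinese text comes out garbled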
print(response.text)
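The garbling can be reproduced offline. The sketch below encodes a sample string as gb2312, the way such a server would send it, and then decodes the bytes once with ISO-8859-1 and once with gbk; the sample text is only an illustration.

```python
text = '汽车之家'
raw = text.encode('gb2312')          # bytes as the server would send them

wrong = raw.decode('iso-8859-1')     # decoding with the fallback charset
right = raw.decode('gbk')            # gbk is a superset of gb2312

print(repr(wrong))
print(right)
```

ISO-8859-1 maps every byte to some character, so the wrong decoding never raises an error; it silently yields one meaningless character per byte.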
3. Getting binary data
import requests
response=requests.get('https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000&sec=1509868306530&di=712e4ef3ab258b36e9f4b48e85a81c9d&imgtype=0&src=http%3A%2F%2Fc.hiphotos.baidu.com%2Fimage%2Fpic%2Fitem%2F11385343fbf2b211e1fb58a1c08065380dd78e0c.jpg')
with open('a.jpg','wb') as f:
    f.write(response.content)
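Getting a binary stream
# The stream parameter: fetch the content piece by piece. When downloading a video of, say, 100 GB, reading it all with response.content and writing it to a file in one go is not reasonable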
import requests
response=requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
stream=True)
with open('b.mp4','wb') as f:
    for chunk in response.iter_content(chunk_size=8192): # the default chunk_size is 1 byte, which is very slow
        f.write(chunk)
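The chunk-by-chunk idea is independent of requests and can be tested in memory. In the sketch below, io.BytesIO objects stand in for the streamed response and the output file, and copy_in_chunks is a hypothetical helper.

```python
import io

def copy_in_chunks(source, dest, chunk_size=8192):
    """Hypothetical helper: copy source to dest one chunk at a time."""
    total = 0
    # iter() with a sentinel keeps calling read() until it returns b''
    for chunk in iter(lambda: source.read(chunk_size), b''):
        dest.write(chunk)
        total += len(chunk)
    return total

payload = b'x' * 20000            # stand-in for a large download
source = io.BytesIO(payload)      # stand-in for the streamed response
dest = io.BytesIO()               # stand-in for the output file
copied = copy_in_chunks(source, dest)
print(copied)
```

Only one chunk is held in memory at a time, so the memory needed does not grow with the size of the download.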
4. Parsing JSON
# Parsing JSON
import requests
response=requests.get('http://httpbin.org/get')
import json
res1=json.loads(response.text) # too much work
res2=response.json() # get the JSON data directly
print(res1 == res2) #True
5伶授、Redirection and History
From the official documentation:
By default Requests will perform location redirection for all verbs except HEAD.
We can use the history property of the Response object to track redirection.
The Response.history list contains the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.
For example, GitHub redirects all HTTP requests to HTTPS:
>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]
If you're using GET, OPTIONS, POST, PUT, PATCH or DELETE, you can disable redirection handling with the allow_redirects parameter:
>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]
If you're using HEAD, you can enable redirection as well:
>>> r = requests.head('http://github.com', allow_redirects=True)
>>> r.url
'https://github.com/'
>>> r.history
[<Response [301]>]
Verifying this with GitHub's redirect to the home page after login
import requests
import re
# First request
r1=requests.get('https://github.com/login')
r1_cookie=r1.cookies.get_dict() # initial cookie (not yet authorized)
authenticity_token=re.findall(r'name="authenticity_token".*?value="(.*?)"',r1.text)[0] # extract the CSRF token from the page
# Second request: POST to the login endpoint with the initial cookie and the token, plus the username and password
data={
'commit':'Sign in',
'utf8':'✓',
'authenticity_token':authenticity_token,
'login':'317828332@qq.com',
'password':'alex3714'
}
# Test 1: without allow_redirects=False, a Location header in the response causes a redirect to the new page, and r2 is the response of the new page
r2=requests.post('https://github.com/session',
data=data,
cookies=r1_cookie
)
print(r2.status_code) #200
print(r2.url) # the page after the redirect
print(r2.history) # the responses from before the redirect
print(r2.history[0].text) # the text of the pre-redirect response
# Test 2: with allow_redirects=False, no redirect is followed even if the response carries a Location header, and r2 is still the response of the original page
r2=requests.post('https://github.com/session',
data=data,
cookies=r1_cookie,
allow_redirects=False
)
print(r2.status_code) #302
print(r2.url) # the pre-redirect page, https://github.com/session
print(r2.history) #[]
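The relationship between allow_redirects, the final URL, and history can be modelled without a network. This is a toy sketch, not how requests is implemented: FAKE_SITE is a made-up routing table and fetch is a hypothetical client.

```python
# Made-up routing table: url -> (status code, Location header or None)
FAKE_SITE = {
    'http://example.test/': (301, 'https://example.test/'),
    'https://example.test/': (200, None),
}

def fetch(url, allow_redirects=True, max_redirects=30):
    """Hypothetical client: return (final_url, status, history)."""
    history = []
    status, location = FAKE_SITE[url]
    while allow_redirects and location is not None:
        if len(history) >= max_redirects:
            raise RuntimeError('too many redirects')
        history.append((url, status))   # remember the redirecting response
        url = location
        status, location = FAKE_SITE[url]
    return url, status, history

followed = fetch('http://example.test/')
not_followed = fetch('http://example.test/', allow_redirects=False)
print(followed)
print(not_followed)
```

With redirects enabled, the result describes the last page and history holds the intermediate responses; with redirects disabled, the first response is returned as is and history stays empty.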
V. Advanced usage
1. SSL Cert Verification
# Certificate verification (most sites use https)
import requests
respone=requests.get('https://www.12306.cn') # for an SSL request the certificate is checked first; if it is invalid an error is raised and the program aborts
# Improvement 1: suppress the error, although a warning is still printed
import requests
respone=requests.get('https://www.12306.cn',verify=False) # skip certificate verification; prints a warning and returns 200
print(respone.status_code)
# Improvement 2: suppress the error and also silence the warning
import requests
from requests.packages import urllib3
urllib3.disable_warnings() # turn off the warnings
respone=requests.get('https://www.12306.cn',verify=False)
print(respone.status_code)
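# Improvement 3: supply a certificate
# Many sites use https but can be accessed without a client certificate; in most cases sending one is optional
# Zhihu, Baidu, and the like work either way
# Where there is a hard requirement it must be sent, e.g. when only designated users who hold the certificate are allowed to access a particular site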
import requests
respone=requests.get('https://www.12306.cn',
cert=('/path/server.crt',
'/path/key'))
print(respone.status_code)
2. Using proxies
# Official documentation: http://docs.python-requests.org/en/master/user/advanced/#proxies
# Proxy settings: the request is sent to the proxy first, and the proxy forwards it for us (getting an IP banned is common)
import requests
proxies={
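# 'http':'http://egon:123@localhost:9743', # proxy with credentials: the username and password go before the @ (use this entry in place of the next one; a dict cannot hold two 'http' keys)
'http':'http://localhost:9743',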
'https':'https://localhost:9743',
}
respone=requests.get('https://www.12306.cn',
proxies=proxies)
print(respone.status_code)
# SOCKS proxies are supported; install with: pip install requests[socks]
import requests
proxies = {
'http': 'socks5://user:pass@host:port',
'https': 'socks5://user:pass@host:port'
}
respone=requests.get('https://www.12306.cn',
proxies=proxies)
print(respone.status_code)
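The keys of the proxies dict are URL schemes, and the scheme of the target URL decides which proxy is used. The sketch below shows a simplified version of that lookup; pick_proxy is a hypothetical helper, and the real lookup in requests supports more key forms than this.

```python
from urllib.parse import urlsplit

def pick_proxy(url, proxies):
    """Hypothetical helper: return the proxy URL for this request, or None."""
    scheme = urlsplit(url).scheme
    return proxies.get(scheme)

proxies = {
    'http': 'http://localhost:9743',
    'https': 'https://localhost:9743',
}
print(pick_proxy('https://www.12306.cn', proxies))
print(pick_proxy('http://httpbin.org/get', proxies))
print(pick_proxy('ftp://example.test/file', proxies))
```

A request whose scheme has no entry in the dict, such as the ftp URL here, gets no proxy in this simplified model.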
3. Timeout settings
# Timeout settings
# Two forms of timeout: float or tuple
# timeout=0.1 # a single float applies to both the connect timeout and the read timeout
# timeout=(0.1,0.2) # 0.1 is the connect timeout, 0.2 is the read (data-receiving) timeout
import requests
respone=requests.get('https://www.baidu.com',
timeout=0.0001)
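How the two forms of timeout relate can be written down as a small function. This is a sketch of the rule only; split_timeout is a hypothetical helper and not part of requests.

```python
def split_timeout(timeout):
    """Hypothetical helper: return (connect_timeout, read_timeout)."""
    if isinstance(timeout, tuple):
        connect, read = timeout
        return connect, read
    # A single number (or None) applies to both phases
    return timeout, timeout

print(split_timeout(0.1))
print(split_timeout((0.1, 0.2)))
print(split_timeout(None))
```

A timeout of None means waiting indefinitely, so setting an explicit value is generally the safer choice in a crawler.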
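4. Authentication
# Official documentation: http://docs.python-requests.org/en/master/user/authentication/
# Authentication: some sites pop up a dialog asking for a username and password when you log in (much like a JavaScript alert); until you authenticate, the HTML cannot be fetched
# Under the hood the credentials are simply assembled into a request header
# r.headers['Authorization'] = _basic_auth_str(self.username, self.password)
# Most sites do not use this default encoding scheme; they implement their own
# In that case we have to write our own function, similar to _basic_auth_str, that follows the site's scheme
# and add the resulting string to the request headers
# r.headers['Authorization'] =func('.....')
# Here is the default scheme anyway, although sites usually do not use it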
import requests
from requests.auth import HTTPBasicAuth
r=requests.get('xxx',auth=HTTPBasicAuth('user','password'))
print(r.status_code)
# HTTPBasicAuth can be abbreviated as follows
import requests
r=requests.get('xxx',auth=('user','password'))
print(r.status_code)
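The header that Basic authentication produces is easy to build by hand. The sketch below follows the standard Basic scheme; basic_auth_header is a hypothetical helper that encodes the credentials as UTF-8, and the library's own _basic_auth_str may differ in such details.

```python
import base64

def basic_auth_header(username, password):
    """Sketch of a Basic auth value: 'Basic ' + base64('username:password')."""
    credentials = f'{username}:{password}'.encode('utf-8')
    return 'Basic ' + base64.b64encode(credentials).decode('ascii')

header = basic_auth_header('user', 'password')
print(header)

# Base64 is an encoding, not encryption: anyone can reverse it
decoded = base64.b64decode(header.split(' ', 1)[1]).decode('utf-8')
print(decoded)
```

Because the credentials can be recovered this easily, Basic authentication should only be used over https.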
5. Exception handling
# Exception handling
import requests
from requests.exceptions import * # see requests.exceptions for the available exception types
try:
    r=requests.get('http://www.baidu.com',timeout=0.00001)
except ReadTimeout:
    print('===:')
# except ConnectionError: # network unreachable
#     print('-----')
# except Timeout:
#     print('aaaaa')
except RequestException:
    print('Error')
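The order of the except clauses matters because the exception classes inherit from one another. The sketch below uses stand-in classes that mirror part of the requests.exceptions hierarchy (ReadTimeout derives from Timeout, which derives from RequestException), so it runs without requests installed; handle is a hypothetical helper.

```python
# Stand-in classes that mirror part of the requests.exceptions hierarchy
class RequestException(Exception):
    pass

class Timeout(RequestException):
    pass

class ReadTimeout(Timeout):
    pass

def handle(exc):
    """Raise exc and report which except clause caught it."""
    try:
        raise exc
    except ReadTimeout:
        return 'read timeout'
    except Timeout:
        return 'timeout'
    except RequestException:
        return 'request error'

print(handle(ReadTimeout()))
print(handle(Timeout()))
print(handle(RequestException()))
```

The most specific class is listed first. If RequestException came first, it would catch everything and the later clauses would never run.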
6. Uploading files
import requests
files={'file':open('a.jpg','rb')}
respone=requests.post('http://httpbin.org/post',files=files)
print(respone.status_code)