Python Web Scraping: Study Notes on the requests Module


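Preface

The barrier to entry for web scraping is low. The hard part is that nearly every real project afterwards will test your patience. In the age of big data, data is money, so more and more companies take it seriously and use scrapers to collect publicly available data for their projects.

In the previous article I introduced web scraping: what a scraper is and roughly how it works.

This post moves on to the library scrapers use most often, requests, and finishes with a hands-on case study: simulating a GitHub login.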

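1. Using the requests module

This article covers the requests HTTP module, which sends requests and receives responses. Alternatives exist, such as urllib, but requests is the one used most in day-to-day work. Its API is simple and readable; compared with the more verbose urllib, it cuts the amount of scraper code considerably and makes most tasks easier to implement, so it is the module I recommend.

Key points

Use the headers parameter
Send requests with parameters
Carry cookies in headers
Use the cookies parameter
Convert a cookieJar
Use the timeout parameter
Use the proxies parameter for proxy IPs
Use the verify parameter to skip CA certificate verification
Master the requests module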

1.1 Sending a GET request with requests

1. Goal: send a request to Baidu with requests and fetch the page source.

2. Run the code below and look at the printed output.

demo1.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Send a GET request to the URL
response = requests.get(url)

# Print the response body
print(response.text)
```

1.2 The response object

The output of the code above contains a lot of garbled characters. This happens because the text was decoded with a different character set from the one used to encode it. The approach below fixes the garbled Chinese.

demo2.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Send a GET request to the URL
response = requests.get(url)

# Print the response body
print(response.content.decode())  # note this line
```

1. response.text is the result of decoding the response body with the encoding that requests infers for it (from the HTTP headers, or by detection with the chardet or charset_normalizer library when the headers give no hint).

2. Data travels over the network as bytes, so response.text = response.content.decode('the inferred encoding').

3. You can search the page source for charset and try the encoding named there, but note that it is not always accurate.

1.3 The difference between response.text and response.content

response.text

Type: str

Decoding: requests makes an educated guess at the text encoding based on the HTTP headers.

The encoding can also be set manually:

demo3.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Send a GET request to the URL
response = requests.get(url)
# Tell requests which encoding to use when building response.text
response.encoding = 'utf-8'

# Print the response body
print(response.text)
```

response.content

Type: bytes

Decoding: none is applied; you choose the encoding yourself when calling decode().

1.4 Fixing garbled Chinese by decoding response.content

response.content.decode() uses utf-8 by default
response.content.decode('GBK')
Common character encodings

utf-8

gbk

gb2312

ascii (pronounced "ask-ee")

iso-8859-1

Key point: use decode() on response.content to fix garbled Chinese
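Trying one encoding after another can be wrapped in a small helper. The sketch below is only an illustration: `decode_with_fallback` and its list of candidate encodings are names made up here, and the sample bytes stand in for `response.content`.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk", "iso-8859-1")):
    """Try each candidate encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # These bytes are not valid in this encoding; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Stand-in for response.content from a GBK-encoded page.
gbk_bytes = "中文乱码".encode("gbk")
text, used = decode_with_fallback(gbk_bytes)
print(used, text)
```

iso-8859-1 accepts any byte sequence, so it only makes sense as the last candidate: it never fails, but it may produce the wrong characters.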

1.5 Other commonly used attributes and methods of the response object

In response = requests.get(url), response is the response object returned for the request. Besides text and content, which give the response body, it has other commonly used attributes and methods.

response.url: the URL of the response; it sometimes differs from the requested URL, for example after a redirect
response.status_code: the HTTP status code
response.headers: the response headers
response.request.headers: the headers of the request that produced this response
response.request._cookies: the cookies sent with that request, as a cookieJar
response.cookies: the cookies set by the response (via Set-Cookie), as a cookieJar
response.json(): parses a JSON response body into a Python object (dict or list)

demo4.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Send a GET request to the URL
response = requests.get(url)
response.encoding = 'utf-8'
print(response.url)
print(response.status_code)
print(response.request.headers)
print(response.headers)
print(response.request._cookies)
print(response.cookies)
```

Key point: know the other commonly used attributes of the response object

2. Sending requests with the requests module

2.1 Sending a request with headers

Start with code that fetches the Baidu home page.

demo5.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Send a GET request to the URL
response = requests.get(url)

# Print the response body
print(response.content.decode())

# Print the headers of the request that was sent
print(response.request.headers)
```

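2.1.1 Think about it

1. Compare the Baidu page source shown in the browser with the source fetched by the code. What is different?

Ways to view the page source:

Right-click > View page source
Right-click > Inspect

2. Compare the response for the URL as seen in the browser with the Baidu home page source fetched by the code. What is different?

How to view the response for a URL:
Right-click > Inspect
Click Network
Check Preserve log
Refresh the page
Under the Name column, find the URL that matches the address bar and open its Response tab

3. The Baidu source fetched by the code is much shorter. Why?

The request needs to carry header information.

Recall what a scraper does: it imitates a browser so that the server returns the same content a browser would get.

A request has many header fields. User-Agent is the essential one: it describes the client's operating system and browser.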

2.1.2 Sending a request with headers

`requests.get(url, headers=headers)`

The headers parameter takes the request headers as a dict
Each header field name is a key, and the field's content is the corresponding value

demo6.py

```python
import requests

# Target URL
url = 'http://www.baidu.com'

# Build the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"}

# Send a GET request to the URL, passing headers as a keyword argument
response = requests.get(url, headers=headers)

# Print the response body
print(response.content.decode())

# Print the headers of the request that was sent
print(response.request.headers)
```
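To see why the server can tell a script from a browser, it helps to look at the User-Agent that requests sends when none is supplied. The sketch below uses `requests.utils.default_headers()` and prepares a request without sending it, so it needs no network access. The exact version string printed depends on the installed requests release.

```python
import requests

# Headers that requests attaches when the caller supplies none.
default_headers = requests.utils.default_headers()
print(default_headers["User-Agent"])  # python-requests/<installed version>

# A browser-like User-Agent, copied from a real browser's request headers.
browser_headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"
    )
}

# Preparing a request (without sending it) shows the header that would go out.
prepared = requests.Request(
    "GET", "http://www.baidu.com", headers=browser_headers
).prepare()
print(prepared.headers["User-Agent"])
```

A server that sees the default python-requests value can easily treat the client as a script, which is one plausible reason for the shorter page returned earlier.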

2.2 Sending a request with parameters

When using Baidu you will often see a `?` in the URL. What follows the question mark is the request parameters, also called the query string.

2.2.1 Putting parameters in the URL

Send the request directly to a URL that already contains the parameters.

demo7.py

```python
import requests

# Target URL, with the query string already included
url = 'https://www.baidu.com/s?wd=Python'

# Build the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"}

# Send a GET request to the URL
response = requests.get(url, headers=headers)

# Save the raw response bytes to a file
with open('baidu.html', 'wb') as f:
    f.write(response.content)
```

2.2.2 Passing a parameter dict through params

Build a dict of request parameters
Pass the dict when sending the request, through the params keyword argument

demo8.py

```python
import requests

# Target URL, without the query string
url = 'https://www.baidu.com/s'

# The request parameters are a dict, equivalent to wd=python
kw = {'wd': 'python'}

# Build the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36"}

# Send a GET request; requests appends the params to the URL
response = requests.get(url, headers=headers, params=kw)

# Save the raw response bytes to a file
with open('baidu1.html', 'wb') as f:
    f.write(response.content)
```

Key point: know how to send requests with parameters
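The effect of params can be checked without touching the network by preparing a request and inspecting its final URL. This is a small sketch using `requests.Request(...).prepare()`; the second search term is just an example value.

```python
import requests

url = "https://www.baidu.com/s"
params = {"wd": "python"}

# Build the request without sending it, then inspect the final URL.
prepared = requests.Request("GET", url, params=params).prepare()
print(prepared.url)

# Non-ASCII values are percent-encoded as UTF-8 automatically.
prepared_cn = requests.Request("GET", url, params={"wd": "爬虫"}).prepare()
print(prepared_cn.url)
```

Letting requests build the query string avoids hand-encoding special characters in the URL.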

2.3 Carrying cookies in the headers parameter

Websites often use the Cookie request header to keep track of a user's login state, so adding Cookie to the headers parameter lets us imitate a logged-in user. GitHub serves as the example.

2.3.1 Capturing and analysing the GitHub login

Open the browser, right-click, choose Inspect, click Network and check Preserve log
Visit the GitHub login URL: https://github.com/login
Enter the username and password and sign in, then visit a URL that only returns the correct content when logged in, for example Your profile in the top-right menu, which opens https://github.com/USER_NAME
Once the URL is settled, find the User-Agent and Cookie headers that the request needs

2.3.2 Complete the code

Copy the User-Agent and Cookie from the browser
The header names and values in the headers parameter must match those in the browser
The value for the Cookie key in the headers dict is a string

demo9.py

```python
import requests

# Both values are copied from the browser's request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36',
    'Cookie': 'your cookie'
}

url = 'https://github.com/Zhimin7'

response = requests.get(url, headers=headers)

# Save the page fetched with the cookie
with open('github_withcookie.html', 'wb') as f:
    f.write(response.content)
```

Comparing the pages

Next, write a scraper that sends no cookie and compare the results.

demo10.py

```python
import requests

# Same User-Agent as before, but no Cookie header
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
}

url = 'https://github.com/Zhimin7'

response = requests.get(url, headers=headers)

# Save the page fetched without the cookie
with open('github_without_cookie.html', 'wb') as f:
    f.write(response.content)
```

Open the two saved files side by side and the difference is obvious: only the request that carried the cookie gets the logged-in page.

2.4 Using the cookies parameter

The previous section carried the cookie inside headers. There is also a dedicated cookies parameter.

1. Form of the cookies parameter: a dict

cookies = {'cookie name': 'cookie value'}

The dict corresponds to the Cookie string in the request headers
The left side of each equals sign is the cookie's key
The right side of each equals sign is the cookie's value

2. How to use the cookies parameter

response = requests.get(url, cookies=cookies)

3. Converting a cookie string into the dict the cookies parameter needs

cookie_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[1] for cookie in temp.split('; ')}

Split on the first equals sign only, because a cookie value may itself contain '='. If dict comprehensions are not yet second nature, an explicit loop does the same thing.

demo11.py

```python
temp = 'octo=GH1.1.1102395001.1582362358; _ga=GA1.2.454155278.1582362359; _device_id=0442b4dd494cafc0301c2ad3e9eeca31; experiment:homepage_signup_flow=eyJ2ZXJzaW9uIjoiMSIsInJvbGxPdXRQbGFjZW1lbnQiOjI1LjY3MjIzNTIyOTQ0MTk1Miwic3ViZ3JvdXAiOiJjb250cm9sIiwiY3JlYXRlZEF0IjoiMjAyMC0wMy0yNlQxNDozNToxNC45ODdaIiwidXBkYXRlZEF0IjoiMjAyMC0wMy0yNlQxNDozNToxNC45ODdaIn0=; user_session=vsC4WPrJRjDLSTC3Up0h0D5i0Knfyah9hGXzhfrchfW_5eyc; __Host-user_session_same_site=vsC4WPrJRjDLSTC3Up0h0D5i0Knfyah9hGXzhfrchfW_5eyc; logged_in=yes; dotcom_user=Zhimin7; has_recent_activity=1; tz=Asia%2FShanghai; _gh_sess=e9HSDZpXyMNlwvsRH7kjV39DisarWcGKdXqnr65Z3VfFlChN0onUNHwROBPqX2yfS9WudAE71IQF2h7TRiVQ3rvVp1KbvbmfOOkULatFZsHoVRi5UUCI%2FY8wz0QVBLXF3VY0WgLwoUoZhaJ5MhPG%2F22am%2Bowt2XigTISZm289i%2BCYxkDvWz8N7J61WTPz9i3--3YPo3PUW%2B3asHJSS--AmjAHcbcaKfU%2BneNyzA13w%3D%3D'
# Cookies in the header are separated by '; '
cookie_list = temp.split('; ')
cookies = {}

for cookie in cookie_list:
    # Split on the first '=' only, because a value may contain '='
    name, value = cookie.split('=', 1)
    cookies[name] = value
print(cookies)
```
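The correspondence between a cookies dict and the Cookie request header can be checked offline. In the sketch below, `parse_cookie_string` and the sample cookie string are hypothetical names and values chosen for illustration; the request is prepared but never sent.

```python
import requests


def parse_cookie_string(cookie_string):
    """Turn 'a=1; b=2' (the value of a Cookie header) into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        # partition splits on the first '=' only, so '=' inside a value survives.
        name, _, value = pair.strip().partition("=")
        if name:
            cookies[name] = value
    return cookies


# Hypothetical cookie string; the second value deliberately ends with '='.
raw = "logged_in=yes; token=abc123=="
cookies = parse_cookie_string(raw)
print(cookies)

# Prepare (but do not send) a request to see the Cookie header requests builds.
prepared = requests.Request("GET", "https://github.com/", cookies=cookies).prepare()
print(prepared.headers["Cookie"])
```

The prepared header contains the same name=value pairs as the original string, which is what the cookies parameter is expected to send.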

2.5 Converting a cookieJar object to a cookies dict

A Response object obtained with requests has a cookies attribute. Its value is a cookieJar containing the cookies the server set locally. How do we convert it to a dict?

1. Conversion

cookie_dict = requests.utils.dict_from_cookiejar(response.cookies)

2. Here response.cookies is the cookieJar object.

3. requests.utils.dict_from_cookiejar returns the cookies as a dict.

demo12.py

```python
import requests

url = 'http://www.baidu.com'
response = requests.get(url)
print(type(response.cookies))
print(response.cookies)

# Convert the cookieJar to a dict
dict_cookies = requests.utils.dict_from_cookiejar(response.cookies)
print(dict_cookies)

# Convert the dict back to a cookieJar
jar_cookies = requests.utils.cookiejar_from_dict(dict_cookies)
print(jar_cookies)
```

Note, however, that this conversion drops the domain information, so it is not used very often. A later section shows how to keep a session alive with cookies.
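The loss of the domain can be seen directly by building a jar by hand and converting it back and forth. This is a minimal sketch that needs no network access; the cookie name, value, domain and path are placeholders.

```python
import requests
from requests.cookies import RequestsCookieJar

# Build a jar by hand; "example.com" and the cookie values are placeholders.
jar = RequestsCookieJar()
jar.set("sessionid", "abc123", domain="example.com", path="/account")

# CookieJar -> dict keeps only name/value pairs.
as_dict = requests.utils.dict_from_cookiejar(jar)
print(as_dict)

# dict -> CookieJar has nothing to restore the domain and path from.
rebuilt = requests.utils.cookiejar_from_dict(as_dict)

original = next(iter(jar))
restored = next(iter(rebuilt))
print(repr(original.domain), original.path)
print(repr(restored.domain), restored.path)
```

After the round trip the cookie no longer records which site it belongs to, which is why a session object is usually the better tool for keeping login state.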

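2.6 Using the timeout parameter

When browsing we often run into network hiccups, where a request waits a long time and still gets no result.

In a scraper, a request that hangs drags down the efficiency of the whole project, so we force each request to return within a set time and raise an error otherwise.

1. How to use the timeout parameter

response = requests.get(url, timeout=3)

timeout=3 means that if the server has not responded within 3 seconds, requests raises a requests.exceptions.Timeout exception; unless it is caught, the program stops with an error.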
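In practice the timeout error is usually caught so that one slow request does not stop the whole scraper. The helper below is a sketch, not part of requests: `fetch_with_retry` and its `get` parameter are hypothetical names, and `get` exists only so the function can be exercised without network access.

```python
import requests


def fetch_with_retry(url, retries=3, timeout=3, get=requests.get):
    """Return the response, or None if every attempt timed out.

    `get` defaults to requests.get; it is a parameter only so the function
    can be tested with a stand-in that does not touch the network.
    """
    for attempt in range(1, retries + 1):
        try:
            return get(url, timeout=timeout)
        except requests.exceptions.Timeout:
            # Covers both connection timeouts and read timeouts.
            print(f"attempt {attempt} timed out after {timeout}s")
    return None
```

A call such as `fetch_with_retry('http://www.baidu.com')` would then try up to three times before giving up and returning None.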

2.7 Using proxies

2.7.1 How a proxy works

1. A proxy IP is an IP address that points to a proxy server

2. The proxy server sends requests to the target server on our behalf

A proxy server acts as a bridge between the client and the server: Python sends the request to the proxy server, and the proxy server forwards it to the target server. The response travels back the same way: the server returns it to the proxy server, and the proxy server passes it on to the client.

2.7.2 Forward and reverse proxies

The proxy IP given in the proxies parameter points to a forward proxy server, which implies there are also reverse proxy servers. Here is how the two differ.

The distinction is made from the point of view of the party sending the request
A proxy that forwards requests on behalf of the browser or client (the sender) is a forward proxy, for example a VPN
A proxy that forwards requests on behalf of the server that finally handles them, rather than the client, is a reverse proxy, for example NGINX; the browser does not know the server's real IP address

2.7.3 Types of proxy IP (proxy server)

Transparent proxy: it nominally "hides" your IP address, but the target server can still see who you are.
Anonymous proxy: the target server knows you are using a proxy but cannot tell who you are.
High-anonymity proxy: the target server cannot even tell that a proxy is in use, which makes this the best choice.

Proxies are also classified by the protocol of the requests they carry, and you need one that matches the protocol the website uses:

http proxy: the target URL uses http
https proxy: the target URL uses https

2.7.4 Using the proxies parameter

To make the server believe the requests come from different clients, and to avoid having our IP banned for hitting the same domain too often, we use proxy IPs.

response = requests.get(url, proxies=proxies)

Form of proxies: a dict

```python
proxies = {
    'http': 'http://12.32.56.78:8000',
    'https': 'https://12.32.56.78:8000'
}
```

Note: if the proxies dict contains several key-value pairs, requests picks the proxy whose key matches the scheme of the URL being requested.
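The scheme-based choice can be observed with `requests.utils.select_proxy`, the helper requests uses to look up a proxy for a URL. The sketch below only inspects the lookup and sends nothing; the proxy addresses are the placeholder values from the example dict.

```python
from requests.utils import select_proxy

# Placeholder proxy addresses, not real servers.
proxies = {
    "http": "http://12.32.56.78:8000",
    "https": "https://12.32.56.78:8000",
}

for url in ("http://www.baidu.com", "https://github.com/login"):
    # select_proxy returns the proxy URL matching the target URL's scheme.
    print(url, "->", select_proxy(url, proxies))
```

If the dict has no entry for a URL's scheme, the lookup returns None and the request would go out directly.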

2.8 Skipping CA certificate verification with verify

While browsing you will sometimes see the warning "Your connection is not private".

Cause: the site's certificate was not issued by a trusted certificate authority (CA).

requests verifies certificates by default and raises an SSL error for such a site, so a scraper that still needs the page has to turn verification off.

```python
import requests

url = ''  # fill in a URL whose certificate is not trusted

# verify=False skips certificate verification (an InsecureRequestWarning is emitted)
response = requests.get(url, verify=False)
```

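3. Sending POST requests with the requests module

Where are POST requests used?

Registration and login

Submitting text content

So a scraper needs to imitate the browser's POST requests in both of these cases.

3.1 How to send a POST request with requests

response = requests.post(url, data=data)
The data parameter takes a dict
Apart from data, requests.post accepts the same parameters as requests.get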
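What the data dict turns into can be inspected by preparing a POST request without sending it. This is a sketch under stated assumptions: the endpoint and the form fields are made up for illustration.

```python
import requests

# Hypothetical endpoint and form fields, for illustration only.
url = "https://example.com/login"
data = {"login": "octocat", "password": "secret"}

prepared = requests.Request("POST", url, data=data).prepare()

# A dict passed as data is sent as a URL-encoded form body.
print(prepared.headers["Content-Type"])
print(prepared.body)
```

This is the same encoding a browser uses when it submits an ordinary HTML form, which is why data= is the natural fit for login requests.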

4. Keeping state with requests.session

The Session class in requests automatically handles the cookies produced while sending requests and receiving responses, which is how it keeps state across requests.

4.1 What requests.session does and when to use it

What requests.session does

It handles cookies automatically: the next request carries the cookies from the previous one

When to use requests.session

When several consecutive requests depend on cookies produced along the way

4.2 How to use requests.session

After a session instance requests a site, the cookies the server sets are stored in the session; the next request made with the same session sends those cookies along.

```python
import requests

session = requests.session()  # create a session object
response = session.get(url, headers=headers)  # same keyword arguments as requests.get
response = session.post(url, data=data)  # same keyword arguments as requests.post
```

session.get and session.post take exactly the same parameters as requests.get and requests.post.
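The way a session carries cookies forward can be demonstrated without a server by placing a cookie in the session by hand and preparing follow-up requests. This is a sketch: the cookie name, value and both domains are placeholders, and `prepare_request` is used only to inspect what would be sent.

```python
import requests

session = requests.session()

# Pretend an earlier response set this cookie; the name, value and domain
# are placeholders.
session.cookies.set("user_session", "abc123", domain="example.com")

# prepare_request merges the session's cookies into the outgoing request.
same_site = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)
other_site = session.prepare_request(
    requests.Request("GET", "https://example.org/")
)

print(same_site.headers.get("Cookie"))
print(other_site.headers.get("Cookie"))
```

The cookie is attached only to the request for the domain it belongs to, so a session can safely be reused while still keeping each site's login state separate.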

4.3 Example: simulating a GitHub login

github_session.py

```python
import requests
from lxml import etree


class GitHub(object):
    def __init__(self):
        self.session = requests.session()
        self.session.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.66 Safari/537.36'
        }
        self.login_url = 'https://github.com/login'

    def login(self):
        # Fetch the login page and parse it into an lxml element tree
        response = self.session.get(self.login_url)
        html = etree.HTML(response.content.decode())
        return html

    # The XPath expressions below match GitHub's login form as it was when
    # this was written; adjust them if the page layout has changed.
    def get_token(self, html):
        authenticity_token = html.xpath('//form/input[1]/@value')[0]
        return authenticity_token

    def get_timestamp_secret(self, html):
        timestamp_secret = html.xpath('//div[@class="auth-form-body mt-3"]/input[11]/@value')[0]
        return timestamp_secret

    def get_timestamp(self, html):
        timestamp = html.xpath('//div[@class="auth-form-body mt-3"]/input[10]/@value')[0]
        return timestamp

    def get_profile(self):
        url_session = 'https://github.com/session'
        url_profile = 'https://github.com/Zhimin7'
        # Load the login page once so the token and timestamps belong together
        html = self.login()
        data = {
            'commit': 'Sign in',
            'authenticity_token': self.get_token(html),
            'ga_id': '',
            'login': 'your email',
            'password': 'your password',
            'webauthn-support': 'supported',
            'webauthn-iuvpaa-support': 'supported',
            'return_to': '',
            'allow_signup': '',
            'client_id': '',
            'integration': '',
            'required_field_86b0': '',
            'timestamp': self.get_timestamp(html),
            'timestamp_secret': self.get_timestamp_secret(html)
        }
        # Log in; the session stores the cookies GitHub sets
        self.session.post(url_session, data=data)
        # Request the profile page with the logged-in session
        html = self.session.get(url_profile).content
        with open('github.html', 'wb') as f:
            f.write(html)
        print('Done')


if __name__ == "__main__":
    github = GitHub()
    github.get_profile()
```
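Positional XPath expressions such as input[10] break as soon as the form gains or loses a field. One alternative, sketched below with the standard library only, is to collect every hidden input by name. `HiddenInputCollector` is a hypothetical helper, and the HTML is a made-up sample that only imitates the shape of a login form, not GitHub's actual markup.

```python
from html.parser import HTMLParser


class HiddenInputCollector(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        if attributes.get("type") == "hidden" and attributes.get("name"):
            # A hidden input without a value attribute is treated as empty.
            self.fields[attributes["name"]] = attributes.get("value") or ""


# Made-up HTML that imitates the shape of a login form.
sample_html = """
<form action="/session" method="post">
  <input type="hidden" name="authenticity_token" value="TOKEN123">
  <input type="text" name="login">
  <input type="password" name="password">
  <input type="hidden" name="timestamp" value="1600000000000">
  <input type="hidden" name="timestamp_secret" value="SECRET456">
</form>
"""

collector = HiddenInputCollector()
collector.feed(sample_html)
print(collector.fields)
```

The resulting dict could then be merged with the login and password fields to form the data for the POST, assuming the real page exposes its tokens as hidden inputs in the same way.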

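Related links

Python Web Scraping: What Is a Python Scraper, and How Do You Use One?

Finally

If you have read this far, the article must have been worth your time, and I hope you will give it the full treatment (like, follow, comment). Writing this much took real effort, and your encouragement is what keeps me going.

The road ahead is long and far; I will keep searching, high and low.

I'm 啃書君, someone who is focused on learning. Follow me, and see you next time for more!

respect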
