urllib usage
from urllib import request
from urllib import parse
import json
url = 'http://top.hengyan.com/dianji/default.aspx?p=1'
# Build the request headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3722.400 QQBrowser/10.5.3738.400'
}
"""
url :目標(biāo)url
data=None :默認(rèn)為None表示是get請(qǐng)求,如果不為None說明是get請(qǐng)求
timeout:設(shè)置請(qǐng)求的過期時(shí)間
cafile=None, capath=None, cadefault=False:證書相關(guān)參數(shù)
context=None :忽略證書認(rèn)證
"""
# url不能添加請(qǐng)求頭
response = request.urlopen(url=url, timeout=10)
# Add request headers via a Request object
req = request.Request(url=url, headers=headers)
response = request.urlopen(req, timeout=10)
code = response.status
url = response.url
b_content = response.read()
html = b_content.decode('utf-8')
# Save locally
with open('hengyan.html', 'w', encoding='utf-8') as file:
    file.write(html)
#####################POST request#############################
# 世紀(jì)佳緣 (jiayuan.com)
def get_ssjy_data(page=1):
    url = 'http://search.jiayuan.com/v2/search_v2.php'
    # Build the request headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3722.400 QQBrowser/10.5.3738.400'
    }
    form_data = {
        'sex': 'f',
        'key': '',
        'stc': '1:11,2:20.28,23:1',
        'sn': 'default',
        'sv': '1',
        'p': str(page),
        'f': 'search',
        'listStyle': 'bigPhoto',
        'pri_uid': '0',
        'jsversion': 'v5',
    }
    form_data = parse.urlencode(form_data).encode('utf-8')
    # Build the request object
    req = request.Request(url=url, data=form_data, headers=headers)
    response = request.urlopen(req, timeout=10)
    if response.status == 200:
        content = response.read().decode('utf-8').replace('##jiayser##//', '').replace('##jiayser##', '')
        data = json.loads(content)
        userinfos = data['userInfo']
        for user in userinfos:
            age = user['age']
            name = user['nickname']
            gender = user['sex']
        # Get the next page
        total_page = int(data['pageTotal'])
        print('page ' + str(page) + ' crawled')
        if page < total_page:
            # There is another page to fetch
            next_page = page + 1
            # Recursively fetch the next page
            get_ssjy_data(page=next_page)
        else:
            # All data fetched
            print('crawling finished')
if __name__ == '__main__':
    get_ssjy_data()
requests usage
I. What is requests?
requests is a further wrapper around urllib: it has all of urllib's features with a much more convenient API. It is a network-request module that simulates a browser issuing requests.
二艘狭、為什么使用requests模塊挎扰?
1.自動(dòng)處理url編碼
2.自動(dòng)處理post請(qǐng)求參數(shù)
3.簡(jiǎn)化cookie和代理的操作
cookie的操作:
a.創(chuàng)建一個(gè)cookiejar對(duì)象
b.創(chuàng)建一個(gè)handler對(duì)象
c.創(chuàng)建一個(gè)opener對(duì)象
代理的操作:
a.創(chuàng)建handler對(duì)象,代理ip和端口封裝到該對(duì)象
b.創(chuàng)建opener對(duì)象
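For comparison, the urllib cookie and proxy steps above can be sketched like this (the proxy address is a placeholder, not a working proxy):

```python
from http.cookiejar import CookieJar
from urllib import request

# cookies: a. cookiejar object  b. handler object  c. opener object
cookie_jar = CookieJar()
cookie_handler = request.HTTPCookieProcessor(cookie_jar)
cookie_opener = request.build_opener(cookie_handler)

# proxy: a. handler carrying the proxy ip and port  b. opener object
proxy_handler = request.ProxyHandler({'http': 'http://127.0.0.1:8888'})  # placeholder proxy
proxy_opener = request.build_opener(proxy_handler)
```

Requests issued through `cookie_opener.open(...)` will store and resend cookies automatically; with requests, a single `Session` object replaces all of this boilerplate.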
三巢音、如何安裝遵倦?
安裝:pip3 install requests
使用流程:
1.指定url
2.使用requests模塊發(fā)起請(qǐng)求
3.獲取響應(yīng)的二進(jìn)制數(shù)據(jù)
4.進(jìn)行持久化存儲(chǔ)
requests包括五中請(qǐng)求:get,post,ajax的get請(qǐng)求,ajax的post請(qǐng)求官撼,綜合
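The four-step usage flow above can be wrapped in a small helper (the function name is my own; it is a sketch, not run here):

```python
import requests

def fetch_and_save(url, path):
    # 1. the url is specified by the caller
    headers = {'User-Agent': 'Mozilla/5.0'}
    # 2. issue the request with the requests module
    response = requests.get(url, headers=headers, timeout=10)
    # 3. get the binary data of the response
    data = response.content
    # 4. persist it to storage
    with open(path, 'wb') as f:
        f.write(data)
    return response.status_code
```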
requests GET request
import requests
url = 'http://top.hengyan.com/dianji/default.aspx?'
# Put the GET request parameters in a dict
params = {
'p': 1
}
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36',
}
response = requests.get(url=url, headers=headers, params=params)
# Get the html page source
html = response.text
# Get the binary data of the page
b_content = response.content
# Get the response status code
code = response.status_code
# Get the response headers
response_headers = response.headers
# Get the requested url
url = response.url
# Get the cookies (e.g. after simulating a login with requests)
cookies = response.cookies
# Convert the RequestsCookieJar to a dict
cookies_dict = requests.utils.dict_from_cookiejar(cookies)
# Convert a dict back to a RequestsCookieJar
cookiesjar_obj = requests.utils.cookiejar_from_dict(cookies_dict)
requests POST usage
import re
import json
import requests
url = 'http://search.jiayuan.com/v2/search_v2.php'
# Build the request headers
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3722.400 QQBrowser/10.5.3738.400'
}
form_data = {
'sex': 'f',
'key': '',
'stc': '1:11,2:20.28,23:1',
'sn': 'default',
'sv': '1',
'p': '1',
'f': 'search',
'listStyle': 'bigPhoto',
'pri_uid': '0',
'jsversion': 'v5',
}
response = requests.post(url=url, data=form_data, headers=headers)
if response.status_code == 200:
    # The payload is wrapped in ##jiayser## ... ##jiayser##// markers
    pattern = re.compile('##jiayser##(.*?)##jiayser##//', re.S)
    json_str = re.findall(pattern=pattern, string=response.text)[0]
    json_data = json.loads(json_str)
Custom request header information:
from fake_useragent import UserAgent
# Build a randomized request header
headers = {
    'User-Agent': UserAgent().random
}
Packing GET request parameters:
params = {
    'param_name': 'value'
}
The role of session in the requests library:
It maintains a single session, so information such as cookies is preserved across requests
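A minimal sketch of a session carrying state across requests (the cookie name and value are placeholders):

```python
import requests

session = requests.Session()
# headers set on the session are sent with every request it makes
session.headers.update({'User-Agent': 'Mozilla/5.0'})
# cookies stored on the session persist across requests automatically
session.cookies.set('token', 'abc123')  # placeholder cookie value
```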
cookie:
1. Cookies carry per-user data
2. Example need: crawl a given user's personal homepage on Douban
Purpose of cookies: the server uses cookies to record the client's state
Implementation flow:
1. Perform the login operation (to obtain the cookie)
2. When requesting the personal homepage, carry the cookie with that request
Note: a session object stores cookies automatically when it sends requests
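The two-step flow can be sketched with a session; the URLs and form field names here are hypothetical (the real site's endpoints and parameters will differ), and the function is not invoked:

```python
import requests

def fetch_profile(login_url, profile_url, username, password):
    session = requests.Session()
    # 1. perform the login; the session stores the returned cookies
    session.post(login_url, data={'user': username, 'pwd': password}, timeout=10)
    # 2. request the personal page; the stored cookies are attached automatically
    return session.get(profile_url, timeout=10).text
```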