https://zhuanlan.zhihu.com/p/379836932
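1. Fetching data
urllib2: part of the Python 2 standard library (in Python 3 its functionality lives in urllib.request)
requests: must be installed separately; friendlier API
selenium: whereas requests fetches data by speaking HTTP directly, selenium drives a real browser to fetch it, so it is slower.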
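As a rough sketch of the standard-library route, the snippet below fetches a URL with urllib.request and decodes the body. The helper name fetch_text and the User-Agent string are made up for illustration, and the demo uses a data: URL only so that it runs without network access.

```python
from urllib.request import Request, urlopen


def fetch_text(url, timeout=10.0):
    """Fetch url and return the body as text (illustrative helper)."""
    # The User-Agent value is a placeholder, not something any site requires.
    request = Request(url, headers={"User-Agent": "notes-example/0.1"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset declared by the response, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # A data: URL keeps the demo offline; swap in an http(s) URL for real use.
    print(fetch_text("data:text/plain;charset=utf-8,hello%20world"))
```

requests wraps the same steps (building the request, reading the bytes, decoding) behind requests.get and response.text.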
1.1 Commonly used requests features
https://docs.python-requests.org/zh_CN/latest/user/quickstart.html
https://blog.csdn.net/qq_41556318/article/details/86527763
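- requests.get
GET and POST are both ways to request data; they differ in the HTTP method used, and GET is usually enough for simply fetching a page. Headers, query parameters, and similar settings are passed as dictionaries; anything left unset is sent empty or with its default value.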
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get("http://httpbin.org/get", params=payload)
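To see what the params dictionary turns into, here is a minimal standard-library sketch that builds the same kind of query string by hand. build_url is a hypothetical helper, not part of requests; requests does this encoding itself and exposes the result as r.url.

```python
from urllib.parse import urlencode, urlsplit


def build_url(base_url, params):
    """Append params to base_url as a URL-encoded query string."""
    query = urlencode(params)
    if not query:
        return base_url
    # Use "&" when the URL already carries a query string, "?" otherwise.
    separator = "&" if urlsplit(base_url).query else "?"
    return f"{base_url}{separator}{query}"


payload = {"key1": "value1", "key2": "value2"}
print(build_url("http://httpbin.org/get", payload))
# http://httpbin.org/get?key1=value1&key2=value2
```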
The headers parameter can be omitted, in which case the defaults are used.
timeout limits how long to wait for the server
>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)
- response.text
https://blog.csdn.net/qq_38900441/article/details/79946377
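response.content holds the raw bytes, while response.text is the automatically decoded string. The guessed encoding is sometimes wrong, so text such as Chinese characters may need response.encoding set explicitly before it displays correctly.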
import requests

response = requests.get('https://www.baidu.com')
response.encoding = 'utf-8'  # override the guessed encoding before reading .text
re_text = response.text
print(re_text)
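The difference between the raw bytes and the decoded text can be reproduced without any network call. The sketch below uses a made-up byte string standing in for response.content; decode_body is a hypothetical helper that mimics what choosing response.encoding does. ISO-8859-1 is used as the "wrong" guess because requests falls back to it for text responses that declare no charset.

```python
def decode_body(raw, encoding):
    """Decode raw bytes with the given encoding, like .text after setting .encoding."""
    return raw.decode(encoding, errors="replace")


# Stand-in for response.content: UTF-8 bytes of a short Chinese string.
raw = "中文页面".encode("utf-8")

garbled = decode_body(raw, "iso-8859-1")  # wrong guess: unreadable characters
correct = decode_body(raw, "utf-8")       # right encoding: original text

print(repr(garbled))
print(correct)
```

When the right encoding is not known in advance, response.apparent_encoding gives requests' best guess from the content itself.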
- JSON conversion
requests has a built-in JSON decoder for handling JSON data
>>> import requests
>>> r = requests.get('https://api.github.com/events')
>>> r.json()
[{u'repository': {u'open_issues': 0, u'url': 'https://github.com/...
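r.json() is roughly the standard json module applied to the response body. A small offline sketch, assuming a made-up sample body (the field names are illustrative, not the real GitHub schema) and a hypothetical helper parse_json_body that returns None instead of raising when the body is not JSON:

```python
import json


def parse_json_body(text):
    """Return the parsed JSON value, or None if text is not valid JSON."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


# Made-up sample body for illustration only.
body = '[{"id": "1", "type": "PushEvent", "repo": {"name": "octocat/hello"}}]'
events = parse_json_body(body)
print(events[0]["type"], events[0]["repo"]["name"])

# An HTML error page is a common reason for JSON parsing to fail.
print(parse_json_body("<html>error</html>"))
```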
- headers
Shows the headers the server sent back with the response
>>> r.headers
{
'content-encoding': 'gzip',
'transfer-encoding': 'chunked',
'connection': 'close',
'server': 'nginx/1.0.4',
'x-runtime': '148ms',
'etag': '"e1ca502697e5c9317743dc078f67693f"',
'content-type': 'application/json'
}
- status_code
https://www.stubbornhuang.com/555/
The normal (success) status code is 200
>>> r = requests.get('http://httpbin.org/get')
>>> r.status_code
200
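Codes other than 200 fall into a few standard ranges. The sketch below classifies a status code using the standard library's http.HTTPStatus; describe_status is a hypothetical helper written for these notes.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found (client error)'."""
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown"  # not a registered status code
    if 200 <= code < 300:
        category = "success"
    elif 300 <= code < 400:
        category = "redirect"
    elif 400 <= code < 500:
        category = "client error"
    elif 500 <= code < 600:
        category = "server error"
    else:
        category = "other"
    return f"{code} {phrase} ({category})"


for code in (200, 404, 503):
    print(describe_status(code))
```

In requests itself, r.raise_for_status() raises an exception for 4xx and 5xx responses, which is often simpler than checking r.status_code by hand.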
- session
Logging in with an account almost always requires a session to keep the login state, since the site only redirects to the target page after login
import requests

session = requests.Session()
# url1, name, and password are placeholders to be defined by the caller
response = session.post(url1, data={"userName": name, "pwd": password})
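What a session buys you is that cookies set at login are sent again on later requests. The sketch below shows that idea offline, with the standard library standing in for requests: a urllib opener with a cookie jar plays the role of requests.Session, and a tiny local server plays the role of the site. The server, its paths, the cookie value, and the credentials are all invented for the demo.

```python
import http.cookiejar
import threading
import urllib.parse
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoSite(BaseHTTPRequestHandler):
    """Hypothetical site: POST sets a login cookie, GET reports login state."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        form = urllib.parse.parse_qs(self.rfile.read(length).decode("utf-8"))
        if form.get("userName") == ["demo"] and form.get("pwd") == ["demo-pass"]:
            self.send_response(200)
            self.send_header("Set-Cookie", "sid=abc123; Path=/")
        else:
            self.send_response(403)
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        logged_in = "sid=abc123" in self.headers.get("Cookie", "")
        body = b"welcome" if logged_in else b"please log in"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def run_demo():
    """Return the page body before and after logging in with one shared cookie jar."""
    server = HTTPServer(("127.0.0.1", 0), DemoSite)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_address[1]}"
    # The opener acts like a session: its cookie jar is reused on every call.
    # ProxyHandler({}) ignores any system proxy for this local demo.
    opener = urllib.request.build_opener(
        urllib.request.ProxyHandler({}),
        urllib.request.HTTPCookieProcessor(http.cookiejar.CookieJar()),
    )
    try:
        before = opener.open(base + "/", timeout=5).read().decode()
        form = urllib.parse.urlencode({"userName": "demo", "pwd": "demo-pass"})
        opener.open(base + "/login", data=form.encode(), timeout=5).close()
        after = opener.open(base + "/", timeout=5).read().decode()
    finally:
        server.shutdown()
        server.server_close()
    return before, after


if __name__ == "__main__":
    print(run_demo())
```

With requests, the same flow is session.post(...) for the login followed by session.get(...) on the protected page; calling requests.get directly for the second step would start without the login cookie.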
1.2 The selenium module
https://python-selenium-zh.readthedocs.io/zh_CN/latest/
- Setting up the browser & opening the login URL
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # Keys provides special keyboard keys such as ENTER
driver = webdriver.Chrome()
driver.get(url)  # url: the login page address, defined elsewhere
- Entering information
name = driver.find_element(by='name',value="userName")
pwd = driver.find_element(by='name',value="pwd")
name.send_keys('yvegmn')
pwd.send_keys('yvegmnaa')
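- Clicking
In principle you locate an element first and then act on it. A click can usually be replaced by pressing Enter (Keys.ENTER), or you can call click().
driver.find_element(by='css selector', value='.layui-btn.layui-btn-fluid').click()
With a class-name lookup such as find_elements_by_class_name, the class name cannot contain spaces; replacing each space with . works. (The find_element_by_* helpers were removed in Selenium 4; find_element(by=..., value=...) is the current form.) click() sometimes fails, in which case sending Enter is more convenient.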
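The space-to-dot trick works because the result is a CSS selector that requires every listed class on the same element. A small sketch of that conversion, using a hypothetical helper class_names_to_css (not part of selenium):

```python
def class_names_to_css(class_attr):
    """Turn an HTML class attribute such as 'a b' into the CSS selector '.a.b'."""
    names = class_attr.split()
    if not names:
        raise ValueError("class attribute is empty")
    # Each class gets a leading dot; joining them means "has all these classes".
    return "".join("." + name for name in names)


selector = class_names_to_css("layui-btn layui-btn-fluid")
print(selector)  # .layui-btn.layui-btn-fluid

# With a live driver (assumed to exist), the selector would be used as:
# driver.find_element(by="css selector", value=selector).click()
```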
- Getting the page source
driver.page_source
2. Parsing data
2.1 Regular expression matching with re
See the Miscellaneous chapter