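Yesterday I set out to scrape Python job listings from Lagou. My usual approach, bs4 + requests, came back with empty data, which left me feeling so down. After some searching online I finally understood why: Lagou loads its listings with Ajax, so looking up HTML elements with bs4 finds nothing. Today I'm summarizing what I learned, which also helps consolidate my own knowledge.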
Analyzing the page
Log in to the Lagou website and open the browser's developer tools.
First, let's send a request with requests and save the response as an HTML file so we can inspect the data.
import requests
import random
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.2995.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2986.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.0 Safari/537.36',
    'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.0 Safari/537.36'
]
headers = {
    'Host': 'www.lagou.com',
    'Referer': 'https://www.lagou.com/zhaopin/Python/?labelWords=label',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': random.choice(user_agents)
}
url = 'https://www.lagou.com/jobs/list_Python?px=default&city=%E4%B8%8A%E6%B5%B7#filterBox'
r = requests.get(url, headers=headers)
result = r.text
# print(r.text)
# Write the response to lagou.html
with open('lagou.html', 'w', encoding='utf-8') as f:
    f.write(result)
Run the code and open lagou.html: the job listing data is not in the page.
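One way to confirm this without reading the saved file by eye is to extract only the text a static parser can see and check whether an expected job title appears in it. The sketch below uses the standard library's html.parser rather than bs4 so that it runs with no extra installs. The sample HTML and the job title in it are made up for illustration and are not Lagou's real markup.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text that is present in the static HTML itself."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the list container is empty, and the data would
# only be filled in later by JavaScript.
sample_html = (
    '<html><body><h1>Jobs</h1><ul id="list"></ul>'
    '<script>var pending = "Python Engineer";</script></body></html>'
)
text = visible_text(sample_html)
print("Python Engineer" in text)
```

If a title you can see in the browser is missing from this text, the listings are most likely being loaded by a separate request after the page loads.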
Next, look at the Network tab in Chrome's developer tools and filter by type XHR. Find the positionAjax.json request, whose name contains the keywords Ajax and Json, and click Preview.
Expand the nested nodes in order (content, then positionResult, then result), and there is the data we want.
Now let's try writing it. Note that this request is a POST: send the form data along with it and adjust the request headers.
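# The data comes from the Ajax endpoint, not the list page requested earlier
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'
# Form data sent with the POST request
data = {
    'first': 'true',
    'pn': 1,  # page number
    'kd': 'Python'  # search keyword
}
r = requests.post(url, headers=headers, data=data).json()
# The job list is nested under content -> positionResult -> result
positions = r['content']['positionResult']['result']
print(positions)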
Run it, and the returned data is exactly what we want!
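The chained lookup on content, positionResult and result raises a KeyError if the response has a different shape, for instance an error message instead of results. A small helper can walk the keys defensively and return an empty list in that case. This is only a sketch: the function name extract_positions is my own, and the fields inside each sample position (positionName, salary) are hypothetical placeholders, not confirmed field names.

```python
def extract_positions(payload):
    """Walk content -> positionResult -> result, returning [] on any mismatch."""
    node = payload
    for key in ("content", "positionResult", "result"):
        if not isinstance(node, dict) or key not in node:
            return []
        node = node[key]
    return node if isinstance(node, list) else []


# Hypothetical payloads for illustration only
ok_payload = {
    "content": {
        "positionResult": {
            "result": [{"positionName": "Python Engineer", "salary": "15k-25k"}]
        }
    }
}
error_payload = {"success": False, "msg": "error"}

print(extract_positions(ok_payload))
print(extract_positions(error_payload))
```

With a real response you would call it as extract_positions(requests.post(url, headers=headers, data=data).json()) and treat an empty list as a signal to inspect the raw response.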
Pagination
The form contains a pn parameter, which is the page number. Try jumping between pages in the browser and watch how the data changes.
import time

# The Ajax endpoint stays the same for every page; only the form data changes
url = 'https://www.lagou.com/jobs/positionAjax.json?px=default&city=%E4%B8%8A%E6%B5%B7&needAddtionalResult=false&isSchoolJob=0'
for i in range(1, 17):
    data = {
        'first': 'true',
        'pn': i,  # page number
        'kd': 'Python'
    }
    r = requests.post(url, headers=headers, data=data)
    time.sleep(3)  # pause between requests
    print(r.url)
This prints the request URL for each of the 16 pages.
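The loop above hard-codes 16 pages. As a possible variation, the sketch below stops as soon as a page comes back empty and collects the positions instead of printing URLs. The function crawl_pages and its fetch_page argument are names I made up: fetch_page is assumed to be any callable that takes a page number and returns that page's list of positions, so the loop can be tried offline with a fake before pointing it at the real site.

```python
import time


def crawl_pages(fetch_page, max_pages=16, delay=3.0):
    """Fetch pages 1..max_pages, stopping early at the first empty page."""
    all_positions = []
    for page in range(1, max_pages + 1):
        positions = fetch_page(page)
        if not positions:
            break  # no more results, so stop requesting further pages
        all_positions.extend(positions)
        if page < max_pages:
            time.sleep(delay)  # pause between requests
    return all_positions


# Offline stand-in for the real request: two pages of data, then nothing
fake_pages = {1: [{"id": 1}], 2: [{"id": 2}]}
requested = []


def fake_fetch(page):
    requested.append(page)
    return fake_pages.get(page, [])


collected = crawl_pages(fake_fetch, max_pages=5, delay=0)
print(collected)
print(requested)
```

For the real site, fetch_page would wrap the requests.post call with 'pn' set to the page number and return the list found under content, positionResult, result.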
That's the overall approach to scraping Lagou. The full code is on GitHub, and you're welcome to take a look. If you find it useful, give it a star! Let's keep encouraging each other!
Finally, here is a screenshot of the scraped data: [image failed to upload]