To collect recruitment information from Lagou.com, this project crawls the basic details of data analyst positions. Lagou was chosen as the data source mainly because, compared with other recruitment sites, its postings are complete and tidy: fields are rarely missing, and almost all of the information is presented in a standardized form, which greatly reduces the data cleaning and preparation work up front.
Mimicking Browser Behavior
A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. To crawl Lagou's data analyst positions, we start by mimicking how a user browses the site: search for the position, then select a city, and the site returns the data analyst postings for that city.
Many sites today use a technique called Ajax (asynchronous loading): when a page opens, part of the content is shown immediately and the rest is loaded afterwards. This is why many pages appear to render bit by bit, or keep loading more information as you scroll. The upside is that pages feel very fast to load.
When many companies in a city have published positions, the results are split across multiple pages, and clicking "next page" does not change the URL in the address bar, so the data arrives via a POST request plus asynchronous loading. Inspecting the page (right-click > Inspect) confirms it: the request uses the POST method, must carry cookies, submits form data, and returns JSON.
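The core of that exchange can be reproduced with urllib from the standard library. The sketch below is a minimal illustration only: the Cookie value is a placeholder to be copied from the browser, and most of the headers the site expects are trimmed (the full header set used in this project appears in the complete script further down).

# Minimal sketch: POST the form data with cookies, then parse the JSON reply.
from urllib import request, parse
import json

url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0'
form = bytes(parse.urlencode({'first': 'true', 'pn': 1, 'kd': '数据分析'}), encoding='utf-8')
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90',
    'Cookie': '...',  # placeholder: copy a fresh cookie string from the browser
}
req = request.Request(url, form, headers, method='POST')
reply = json.loads(request.urlopen(req).read().decode('utf-8'))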
Breaking Down the Process
1. Enter the data analyst position and select a city. As the figure below shows, the URL changes with the city, so by iterating over cities and passing a different city parameter we can generate a request URL for each one. Visiting a single city's listing page is enough to obtain the full list of cities Lagou covers. A sketch of the URL construction follows.
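For illustration (using the same {b} placeholder as the script below), each city name is percent-encoded and substituted into the request URL:

# Illustration only: the real city list is scraped in get_city_list() below.
from urllib import parse

json_url = 'https://www.lagou.com/jobs/positionAjax.json?city={b}&needAddtionalResult=false&isSchoolJob=0'
for name in ['北京', '上海', '深圳']:
    print(json_url.format(b=parse.quote(name)))  # e.g. ...?city=%E5%8C%97%E4%BA%AC&...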
2. Lagou's data is generated dynamically with Ajax in JavaScript, so the URL does not change when you click "next page". Open Chrome's Inspect panel, select the Network tab, and filter for "json": you will find positionAjax.json. As the figure below shows, the request needs only three parameters: the current city (city), the current page number (pn), and the position keyword (kd). With these we can fetch every page of data via POST.
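So paging is driven entirely by the form payload; only pn changes from page to page:

from urllib import parse

# The three form fields observed in positionAjax.json's request body:
formdata = {'first': 'true',   # whether this is the first request
            'pn': 2,           # page number; increment it to walk the listing
            'kd': '数据分析'}   # search keyword, i.e. the position name
body = parse.urlencode(formdata)  # POST body: first=true&pn=2&kd=%E6%95%B0...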
如下圖火脉,每個城市的數(shù)據(jù)分析崗位的總體數(shù)量是totalCount,result里面是每頁的數(shù)據(jù)分析崗位信息柒啤。每頁最多15個招聘信息忘分,拉勾網(wǎng)最多呈現(xiàn)30頁,這樣就可以知道每個城市的數(shù)據(jù)分析崗位的頁數(shù)(也可以通過1直接獲劝仔蕖)妒峦。
3. Click into an individual posting and you can see that Lagou's job detail pages are addressed as http://www.lagou.com/jobs/+*****(PositionId).html, where the PositionId can be obtained by analyzing the JSON returned by the XHR.
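For illustration, with a made-up PositionId the detail URL would be assembled like this (real IDs come from the positionId field of each entry in result):

positionId = 1234567  # hypothetical value for illustration
detail_url = 'https://www.lagou.com/jobs/{}.html'.format(positionId)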
Crawling Strategy
The strategy for crawling data analyst postings on Lagou is as follows:
1笛丙、構(gòu)造城市和職位組合的url,通過get方法解析頁面獲取拉勾網(wǎng)上所有的城市假颇;
2胚鸯、通過所有的城市信息,通過post方法獲取不同城市包含的數(shù)據(jù)分析崗位總數(shù)量笨鸡;
3姜钳、通過總數(shù)量我們可以獲取每個城市的數(shù)據(jù)分析崗位總頁數(shù),通過post方法獲取職位信息和PositionId形耗;
4哥桥、根據(jù)PositionId,通過get方法解析每個職位對應(yīng)的職位描述信息激涤。
Code Implementation
# coding:utf-8
from urllib import request, parse
import json
from pandas import DataFrame
from lxml import etree
import random
import math
import time
def get_city_list():
    base_url = 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E6%B7%B1%E5%9C%B3'
    headers = {'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
               'Accept-Language': 'zh-CN,zh;q=0.8',
               'Connection': 'keep-alive',
               'Cookie': 'user_trace_token=20170823172848-77f8f03e-87e5-11e7-9ed0-525400f775ce; LGUID=20170823172848-77f8f6f1-87e5-11e7-9ed0-525400f775ce; JSESSIONID=ABAAABAACBHABBI08566127D8146453353170657FD7089A; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; _putrc=CD839C2A99BB2E8F; login=true; unick=%E5%BC%A0%E5%8F%88%E4%BA%AE; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=44; TG-TRACK-CODE=search_code; SEARCH_ID=201ece3cee414cfdb6e8461e5484ff28; index_location_city=%E6%B7%B1%E5%9C%B3; _gid=GA1.2.943013801.1503976181; _gat=1; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504012158,1504059663,1504116125,1504140367; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504140876; _ga=GA1.2.279566316.1503480294; LGSID=20170831085014-59a334e8-8de6-11e7-9f82-525400f775ce; LGRID=20170831085842-887351d7-8de7-11e7-9f97-525400f775ce',
               'Host': 'www.lagou.com',
               'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E5%85%A8%E5%9B%BD',
               'Upgrade-Insecure-Requests': '1',
               'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
               }
    req = request.Request(base_url, headers=headers, method='GET')
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    selector = etree.HTML(html)
    # Hot cities and all other cities are listed separately in the city filter
    city1 = selector.xpath('//li[@class="hot"]/a/text()')
    city2 = selector.xpath('//li[@class="other"]/a/text()')
    city = city1 + city2
    city.pop(0)   # the first and last entries are filter options, not city names
    city.pop(-1)
    return city
def page_counts(totalCount):
    # Each result page holds at most 15 postings, and Lagou serves at most 30 pages
    pages = math.ceil(totalCount / float(15))
    if pages > 30:
        pages = 30
    return pages
def get_html(url, header, pn=1):
    formdata = {'first': 'true', 'pn': pn, 'kd': '数据分析'}
    data = bytes(parse.urlencode(formdata), encoding='utf-8')
    req = request.Request(url, data, header, method='POST')
    response = request.urlopen(req)
    html = response.read().decode('utf-8')
    # time.sleep(5)  # uncomment to throttle requests and reduce the risk of being blocked
    return html
def get_city_pages(url, header):
    referer = 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city={a}'
    cities = get_city_list()
    print(cities)
    data = []
    for eachCity in cities:
        scity = parse.quote(eachCity)
        url1 = url.format(b=str(scity))
        referer1 = referer.format(a=scity)
        header['Referer'] = referer1
        html = get_html(url1, header)
        # Parse the JSON response and pull out the total position count
        jdict = json.loads(html)
        jcontent = jdict['content']
        jpositionResult = jcontent['positionResult']
        totalCount = jpositionResult['totalCount']
        data.append([totalCount, url1, referer1])
    return data
if __name__ == '__main__':
    # Route requests through a randomly chosen public HTTP proxy
    iplist = ['14.153.53.123:3128', '113.105.146.77:8086', '219.135.164.250:8080', '219.128.75.149:8123']
    proxy_support = request.ProxyHandler({'http': random.choice(iplist)})
    opener = request.build_opener(proxy_support)
    request.install_opener(opener)
    # {b} is replaced with the percent-encoded city name in get_city_pages()
    json_url = 'https://www.lagou.com/jobs/positionAjax.json?city={b}&needAddtionalResult=false&isSchoolJob=0'
    # Note: no Accept-Encoding header, so the reply comes back uncompressed
    # (urllib does not automatically decompress a gzipped response)
    json_headers = {'Accept': 'application/json, text/javascript, */*; q=0.01',
                    'Accept-Language': 'zh-CN,zh;q=0.8',
                    'Connection': 'keep-alive',
                    'Content-Length': '55',
                    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
                    'Cookie': 'user_trace_token=20170823172848-77f8f03e-87e5-11e7-9ed0-525400f775ce; LGUID=20170823172848-77f8f6f1-87e5-11e7-9ed0-525400f775ce; JSESSIONID=ABAAABAACBHABBI08566127D8146453353170657FD7089A; PRE_UTM=; PRE_HOST=; PRE_SITE=; PRE_LAND=https%3A%2F%2Fwww.lagou.com%2F; _putrc=CD839C2A99BB2E8F; login=true; unick=%E5%BC%A0%E5%8F%88%E4%BA%AE; showExpriedIndex=1; showExpriedCompanyHome=1; showExpriedMyPublish=1; hasDeliver=44; _gid=GA1.2.943013801.1503976181; _ga=GA1.2.279566316.1503480294; LGSID=20170831085014-59a334e8-8de6-11e7-9f82-525400f775ce; LGRID=20170831085859-92c5f026-8de7-11e7-9f98-525400f775ce; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504012158,1504059663,1504116125,1504140367; Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504140893; TG-TRACK-CODE=search_code; SEARCH_ID=318e5a8812994350a50e589a318bd332; index_location_city=%E6%B7%B1%E5%9C%B3',
                    'Host': 'www.lagou.com',
                    'Origin': 'https://www.lagou.com',
                    'Referer': 'https://www.lagou.com/jobs/list_%E6%95%B0%E6%8D%AE%E5%88%86%E6%9E%90?px=default&city=%E6%B3%89%E5%B7%9E',
                    'X-Anit-Forge-Code': '0',
                    'X-Anit-Forge-Token': 'None',
                    'X-Requested-With': 'XMLHttpRequest'
                    }
    positionName = []     # position title
    positionLables = []   # position tags (Lagou's own spelling of "labels")
    firstType = []        # position category, level 1
    secondType = []       # position category, level 2
    bussinessZones = []   # business zone
    district = []         # district
    city = []             # city
    education = []        # education requirement
    workYear = []         # years of experience
    salary = []           # salary
    companyName = []      # company name
    companySize = []      # company size
    companyStage = []     # financing stage
    industryField = []    # industry field
    totalCounts = []      # total positions in the city
    positionIds = []      # position ID
    data = get_city_pages(json_url, json_headers)
    for totalCount, url, refer in data:
        pages = page_counts(totalCount)
        json_headers['Referer'] = refer
        while pages > 0:
            # The form body is one byte longer once pn reaches two digits
            if pages > 9:
                json_headers['Content-Length'] = '56'
            else:
                json_headers['Content-Length'] = '55'
            html = get_html(url, json_headers, pn=pages)
            jdict = json.loads(html)
            jcontent = jdict['content']
            jpositionResult = jcontent['positionResult']
            jresult = jpositionResult['result']
            for each in jresult:
                positionName.append(each['positionName'])
                positionLables.append(each['positionLables'])
                firstType.append(each['firstType'])
                secondType.append(each['secondType'])
                bussinessZones.append(each['businessZones'])
                district.append(each['district'])
                city.append(each['city'])
                education.append(each['education'])
                workYear.append(each['workYear'])
                salary.append(each['salary'])
                companyName.append(each['companyFullName'])
                companySize.append(each['companySize'])
                companyStage.append(each['financeStage'])
                industryField.append(each['industryField'])
                totalCounts.append(totalCount)
                positionIds.append(each['positionId'])
            pages = pages - 1
    positionData = {'positionName': positionName, 'positionLables': positionLables,
                    'positionType1': firstType, 'positionType2': secondType,
                    'bussinessZones': bussinessZones, 'district': district, 'city': city,
                    'education': education, 'workYear': workYear, 'salary': salary,
                    'companyName': companyName, 'companySize': companySize,
                    'financeStage': companyStage, 'industryField': industryField,
                    'cityPositionCounts': totalCounts, 'positionID': positionIds}
    frame = DataFrame(positionData)
    frame.to_csv('LagouPositionSociety.csv', index=False, na_rep='NULL')