It feels like ages since I last wrote any Python, haha. Work has kept me busy lately, so I haven't had time to study it.
As it happens, a friend of mine is job hunting, also for Java positions, so I took the chance to practice some crawling and scrape the Java listings on Lagou.
I've always used bs4 before; today let's try XPath instead.
Finding the request URL
First open Lagou, pick a city, then click the Java category directly.
Watch the address bar and you'll see a URL.
That URL is actually useless, so don't let it fool you. Scroll to the bottom of the page and click page 2, and you'll see the address in the bar change.
Click through pages 3 and 4 and you'll notice that only the number after "Java" changes. So is this the URL we need? Actually no: requesting it from code won't return what we want. So open the developer tools, type "java", and hit search.
That request returns HTML, but scroll down through it and you'll find the company list is empty; still no data. So keep looking.
The companyAjax request further down looks like the most promising candidate, but it isn't the one; it's the positionAjax request above it. At first I assumed it was the lower one, requested it, and kept getting "you are visiting too frequently". I genuinely believed I was being rate limited, so I even tried over my phone's 4G connection, with the same result. Then I clicked the one above and found it's exactly the request we're after: it returns JSON, and fairly comprehensive JSON at that.
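The job list sits fairly deep inside that JSON. Here is a rough sketch of the shape I saw, navigated the same way the crawler does; the key path matches the response above, but the sample values are made up:

```python
# Hypothetical, trimmed-down sample of the positionAjax JSON shape;
# real responses carry many more fields per job.
sample = {
    'content': {
        'positionResult': {
            'totalCount': 2,
            'result': [
                {'positionId': 111, 'companyShortName': 'FooTech', 'salary': '15k-25k'},
                {'positionId': 222, 'companyShortName': 'BarSoft', 'salary': '20k-30k'},
            ],
        },
    },
}

# The list of jobs lives under content -> positionResult -> result.
jobs = sample['content']['positionResult']['result']
names = [job['companyShortName'] for job in jobs]
print(names)  # ['FooTech', 'BarSoft']
```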
Scraping the data
- URL:
https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0
- Request method:
POST
- Request payload:
data = {
    'first': False,
    'pn': 1,
    'kd': 'java',
}
pn is the page number, and kd is the search keyword.
Note that you need to set the headers.
data = {
    'first': False,
    'pn': 1,
    'kd': 'java',
}

def get_job(data):
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0'
    page = requests.post(url=url, cookies=cookie, headers=headers, data=data)
    page.encoding = 'utf-8'
    result = page.json()
    jobs = result['content']['positionResult']['result']
    for job in jobs:
        companyShortName = job['companyShortName']  # short company name
        positionId = job['positionId']  # detail-page ID
        companyFullName = job['companyFullName']  # full company name
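Before firing the real request, it can help to check what requests will actually put on the wire. A small sketch using a PreparedRequest, which builds the request without sending anything over the network:

```python
import requests

url = ('https://www.lagou.com/jobs/positionAjax.json'
       '?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0')
data = {'first': False, 'pn': 1, 'kd': 'java'}

# prepare() urlencodes the data dict into the POST body without sending it
prepared = requests.Request('POST', url, data=data).prepare()
print(prepared.method)  # POST
print(prepared.body)    # first=False&pn=1&kd=java
```

This makes it easy to spot, for example, that the booleans end up as the literal strings 'False'/'True' in the form body.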
The JSON response is already quite comprehensive, but if you want fuller details you need the detail page, so click any listing.
You can see that string of digits in the address: it is the position's ID, the same positionId returned in the JSON above.
So we only need to splice it into a URL to request the detail page:
detail_url = 'https://www.lagou.com/jobs/{}.html'.format(positionId)
response = requests.get(url=detail_url, headers=headers, cookies=cookies)
response.encoding = 'utf-8'
tree = etree.HTML(response.text)
desc = tree.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')
I don't know why, but some companies clearly have a job description on the page and yet I can't extract it. A bit frustrating; forgive me, I'm a rookie. If anyone knows why, please let me know.
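One guess: the positional step dd[2] breaks whenever a detail page has an extra dd before the description (a notice, a banner, and so on), so the xpath lands on the wrong element. Anchoring on the description's class instead of its position is more tolerant. A sketch with made-up HTML; I'm assuming the description block carries class job_bt, which may not match every page:

```python
from lxml import etree

# Made-up detail-page fragment: an extra <dd> pushes the description
# to the third slot, so the positional xpath dd[2] misses it.
html = '''
<dl id="job_detail">
  <dd class="job_request"><p>15k-25k</p></dd>
  <dd class="job-notice"><p>This position is hot!</p></dd>
  <dd class="job_bt"><div><p>Solid Java fundamentals</p><p>Spring experience a plus</p></div></dd>
</dl>
'''
tree = etree.HTML(html)

by_position = tree.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')
by_class = tree.xpath('//*[@id="job_detail"]/dd[@class="job_bt"]//p/text()')
print(by_position)  # [] -- dd[2] is the notice, which has no div/p
print(by_class)     # ['Solid Java fundamentals', 'Spring experience a plus']
```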
Full code:
#!/usr/bin/env python3
# -*- coding:utf-8 -*-
import requests
from lxml import etree
cookie = {
    'Cookie': 'JSESSIONID=ABAAABAAAGGABCBF0273ED764F089FC46DF6B525A6828FC; '
              'user_trace_token=20170901085741-8ea70518-8eb0-11e7-902f-5254005c3644; '
              'LGUID=20170901085741-8ea7093b-8eb0-11e7-902f-5254005c3644; '
              'index_location_city=%E6%B7%B1%E5%9C%B3; '
              'TG-TRACK-CODE=index_navigation; _gat=1; '
              '_gid=GA1.2.807135798.1504227456; _ga=GA1.2.1721572155.1504227456; '
              'LGSID=20170901085741-8ea70793-8eb0-11e7-902f-5254005c3644; '
              'LGRID=20170901095027-ed9ebf87-8eb7-11e7-902f-5254005c3644; '
              'Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504227456; '
              'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504230623;'
              'SEARCH_ID=a274b85f40b54d4da62d5e5740427a0a'
}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/60.0.3112.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'Host': 'www.lagou.com',
    'Origin': 'https://www.lagou.com',
    'Referer': 'https://www.lagou.com/jobs/list_java?city=%E6%B7%B1%E5%9C%B3&cl=false&fromSearch=true&labelWords=&suginput=',
}
cookies = {
    'Cookie': 'user_trace_token=20170901085741-8ea70518-8eb0-11e7-902f-5254005c3644;'
              'LGUID=20170901085741-8ea7093b-8eb0-11e7-902f-5254005c3644; '
              'index_location_city=%E6%B7%B1%E5%9C%B3; SEARCH_ID=7277bc08d137413dac2590cea0465e39; '
              'TG-TRACK-CODE=search_code; JSESSIONID=ABAAABAAAGGABCBF0273ED764F089FC46DF6B525A6828FC; '
              'PRE_UTM=; PRE_HOST=; '
              'PRE_SITE=https%3A%2F%2Fwww.lagou.com%2Fjobs%2Flist_java%3Fcity%3D%25E6%25B7%25B1%25E5%259C%25B3%26cl%3Dfalse%26fromSearch%3Dtrue%26labelWords%3D%26suginput%3D; '
              'PRE_LAND=https%3A%2F%2Fwww.lagou.com%2Fjobs%2F3413383.html; _gat=1; _'
              'gid=GA1.2.807135798.1504227456; Hm_lvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504227456; '
              'Hm_lpvt_4233e74dff0ae5bd0a3d81c6ccf756e6=1504252636; _ga=GA1.2.1721572155.1504227456; '
              'LGSID=20170901153335-dd437749-8ee7-11e7-903c-5254005c3644; '
              'LGRID=20170901155728-336ca29d-8eeb-11e7-9043-5254005c3644',
}
data = {
    'first': False,
    'pn': 1,
    'kd': 'java',
}
def get_job(data):
    url = 'https://www.lagou.com/jobs/positionAjax.json?city=%E6%B7%B1%E5%9C%B3&needAddtionalResult=false&isSchoolJob=0'
    page = requests.post(url=url, cookies=cookie, headers=headers, data=data)
    page.encoding = 'utf-8'
    result = page.json()
    jobs = result['content']['positionResult']['result']
    for job in jobs:
        companyShortName = job['companyShortName']  # short company name
        positionId = job['positionId']  # detail-page ID
        companyFullName = job['companyFullName']  # full company name
        companyLabelList = job['companyLabelList']  # benefits
        companySize = job['companySize']  # company size
        industryField = job['industryField']  # industry
        createTime = job['createTime']  # post time
        district = job['district']  # district
        education = job['education']  # education requirement
        financeStage = job['financeStage']  # financing stage
        firstType = job['firstType']  # primary category
        secondType = job['secondType']  # secondary category
        formatCreateTime = job['formatCreateTime']  # formatted post time
        publisherId = job['publisherId']  # publisher ID
        salary = job['salary']  # salary
        workYear = job['workYear']  # required experience
        positionName = job['positionName']  # position name
        jobNature = job['jobNature']  # full-time / part-time
        positionAdvantage = job['positionAdvantage']  # perks
        positionLables = job['positionLables']  # skill tags (the typo is in the API key itself)
        detail_url = 'https://www.lagou.com/jobs/{}.html'.format(positionId)
        response = requests.get(url=detail_url, headers=headers, cookies=cookies)
        response.encoding = 'utf-8'
        tree = etree.HTML(response.text)
        desc = tree.xpath('//*[@id="job_detail"]/dd[2]/div/p/text()')
        print(companyFullName)
        print('%s Lagou link: -> %s' % (companyShortName, detail_url))
        print('Position: %s' % positionName)
        print('Category: %s' % firstType)
        print('Salary: %s' % salary)
        print('Perks: %s' % positionAdvantage)
        print('District: %s' % district)
        print('Job type: %s' % jobNature)
        print('Experience: %s' % workYear)
        print('Education: %s' % education)
        print('Posted: %s' % createTime)
        print('Skill tags: %s' % ', '.join(positionLables))
        print('Industry: %s' % industryField)
        for des in desc:
            print(des)
def crawl(data):
    for x in range(1, 50):
        data['pn'] = x
        get_job(data)

if __name__ == '__main__':
    crawl(data)
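Hard-coding range(1, 50) happens to fit Java in Shenzhen, but the real page count can be derived from the totalCount field in the same JSON. A small helper; the 15-listings-per-page figure is what I observed, so treat it as an assumption:

```python
import math

PAGE_SIZE = 15  # listings per page as observed on Lagou; may change

def total_pages(total_count, page_size=PAGE_SIZE):
    """Number of pages needed to cover total_count listings."""
    return math.ceil(total_count / page_size)

print(total_pages(731))  # 49
print(total_pages(30))   # 2
```

You could call this once with the first page's totalCount and loop over exactly that many pages, ideally with a time.sleep() between requests to stay polite.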
Lastly: all told I've been learning Python on and off for about two months now, but only very superficially. When I get the time I plan to work through the Python Cookbook properly. Onward!