This article is only an exercise in writing a crawler; no data is saved, and the site and API addresses have been masked.
Target: http://xxx.com
By inspecting the network requests we can see these two JSON files:
https://xxx.cn/www/2.0/schoolprovinceindex/2018/318/12/1/1.json
https://xxx.cn/www/2.0/schoolspecialindex/2018/318/12/1/1.json
Here 318 is the school id and 12 is the province id, which stands for Tianjin.
The two files hold the school's admission score lines by province and by major, respectively.
So the code for this page is:
import requests

# Browser-like headers (User-Agent and Referer) so the request looks like a normal page visit
HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Referer": "https://xxx.cn/school/search"
}

# Path segments: year / school id / province id (12 = Tianjin)
url = 'https://xxx.cn/www/2.0/schoolprovinceindex/2018/1217/12/1/1.json'
response = requests.get(url, headers=HEADERS)
print(response.json())
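To pull out just the useful fields, here is a minimal sketch; it assumes the JSON layout that the get_info code later in this article relies on (data → item → a list of records with max, min, average and local_batch_name):

# Assumed layout, mirroring the fields used later in the article.
item = response.json()['data']['item'][0]
print(item['local_batch_name'], item['max'], item['min'], item['average'])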
Next we need a way to get the school ids. Analysing the requests in the same way, we find:
https://xxxl.cn/gkcx/api/?uri=apigkcx/api/school/hotlists
which is POSTed the following data:
data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":2,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}
One of the parameters is page, which is the page number of the results.
So the code for this part is:
import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Referer": "https://xxx.cn/school/search"
}

url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
# The captured payload; "page" selects the page of results, "size" the page size
data = {"access_token": "", "admissions": "", "central": "", "department": "", "dual_class": "",
        "f211": "", "f985": "", "is_dual_class": "", "keyword": "", "page": 2, "province_id": "",
        "request_type": 1, "school_type": "", "size": 20, "sort": "view_total", "type": "",
        "uri": "apigkcx/api/school/hotlists"}
response = requests.post(url, headers=HEADERS, data=data)
print(response.json())
With a little processing we can pull out each school's id; for readability and to make the later data handling easier, we store them in dictionaries:
import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Referer": "https://xxx.cn/school/search"
}

school_info = []  # collected {'id': ..., 'name': ...} dicts


def get_schoolid(pagenum):
    """Fetch one page of the school hot list and collect each school's id and name."""
    url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
    data = {"access_token": "", "admissions": "", "central": "", "department": "", "dual_class": "",
            "f211": "", "f985": "", "is_dual_class": "", "keyword": "", "page": pagenum, "province_id": "",
            "request_type": 1, "school_type": "", "size": 20, "sort": "view_total", "type": "",
            "uri": "apigkcx/api/school/hotlists"}
    response = requests.post(url, headers=HEADERS, data=data)
    school_json = response.json()
    schools = school_json['data']['item']
    for school in schools:
        school_id = school['school_id']
        school_name = school['name']
        school_dict = {
            'id': school_id,
            'name': school_name
        }
        school_info.append(school_dict)


def main():
    get_schoolid(2)
    print(school_info)


if __name__ == '__main__':
    main()
The result is a list of dictionaries, one per school, each holding the school's id and name.
因?yàn)橹笪覀兿胍闅v所有頁面的學(xué)校id臼婆,所以保留了一個(gè)pagenum參數(shù)抒痒,用作循環(huán)。
The next step is to add the code that fetches each school's summary admission scores and its detailed per-major score lines:
import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    "Referer": "https://xxx.cn/school/search"
}

school_info = []   # {'id': ..., 'name': ...} for every school found
simple_list = []   # per-school summary score lines
pro_list = []      # per-major score lines


def get_schoolid(pagenum):
    """Collect the id and name of every school on one page of the hot list."""
    url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
    data = {"access_token": "", "admissions": "", "central": "", "department": "", "dual_class": "",
            "f211": "", "f985": "", "is_dual_class": "", "keyword": "", "page": pagenum, "province_id": "",
            "request_type": 1, "school_type": "", "size": 20, "sort": "view_total", "type": "",
            "uri": "apigkcx/api/school/hotlists"}
    response = requests.post(url, headers=HEADERS, data=data)
    school_json = response.json()
    schools = school_json['data']['item']
    for school in schools:
        school_id = school['school_id']
        school_name = school['name']
        school_dict = {
            'id': school_id,
            'name': school_name
        }
        school_info.append(school_dict)


def get_info(id, name):
    """Fetch one school's 2018 admission score line for Tianjin (province id 12)."""
    simple_url = 'https://xxx.cn/www/2.0/schoolprovinceindex/2018/%s/12/1/1.json' % id
    simple_response = requests.get(simple_url, headers=HEADERS)
    simple_info = simple_response.json()['data']['item'][0]
    simple_infodict = {
        'name': name,
        'max': simple_info['max'],
        'min': simple_info['min'],
        'average': simple_info['average'],
        'local_batch_name': simple_info['local_batch_name']
    }
    simple_list.append(simple_infodict)


def get_score(id, name):
    """Fetch one school's 2018 per-major score lines for Tianjin."""
    professional_url = 'https://xxx.cn/www/2.0/schoolspecialindex/2018/%s/12/1/1.json' % id
    professional_response = requests.get(professional_url, headers=HEADERS)
    for pro_info in professional_response.json()['data']['item']:
        pro_dict = {
            'name': name,
            'spname': pro_info['spname'],
            'max': pro_info['max'],
            'min': pro_info['min'],
            'average': pro_info['average'],
            'min_section': pro_info['min_section'],
            'local_batch_name': pro_info['local_batch_name']
        }
        pro_list.append(pro_dict)


def main():
    print('\033[0;36m' + '=' * 15 + '2018全國(guó)高校錄取分?jǐn)?shù)信息查詢系統(tǒng)' + '=' * 15 + '\033[0m' + '\n')
    get_schoolid(1)
    for school in school_info:
        id = school['id']
        name = school['name']
        try:
            get_info(id, name)
            print('[*]正在抓取2018%s在天津市錄取分?jǐn)?shù)信息' % name)
        except Exception:
            print('[*]%s暫時(shí)未查到錄取分?jǐn)?shù)信息' % name)
        try:
            get_score(id, name)
            print('[*]正在抓取2018%s專業(yè)分?jǐn)?shù)線信息' % name)
        except Exception:
            print('[*]%s暫時(shí)未查到專業(yè)分?jǐn)?shù)線信息' % name)
    print('\033[0;36m[*]信息抓取結(jié)束，即將開始整理信息\033[0m')
    print('\033[0;36m[*]即將展示天津市各高校2018分?jǐn)?shù)信息\033[0m')
    for school in simple_list:
        print('學(xué)校名稱:{name}，最高分:{max}，最低分:{min}，平均分:{average}'.format(**school))
    print('\033[0;36m[*]即將展示天津市各高校2018專業(yè)分?jǐn)?shù)線信息\033[0m')
    for school in pro_list:
        print('學(xué)校名稱:{name}，專業(yè)名稱:{spname}，最高分:{max}，最低分:{min}，平均分:{average}，最低位次:{min_section}'.format(**school))


if __name__ == '__main__':
    main()
因?yàn)橐还灿?42頁,io密集型可以使用多線程提高爬蟲速度寨辩,但是要注意共同變量的問題吓懈,由于之前總結(jié)過python多線程的相關(guān)內(nèi)容,接下來我們可以通過pandas保存到excel靡狞,我們可以先將字典轉(zhuǎn)換成dataframe耻警,然后保存為excel。
You can also analyse the data with pyecharts or similar visualisation libraries.
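For instance, a bar chart of each school's minimum and average scores could be drawn roughly like this (a sketch assuming pyecharts 1.x and the simple_list built above):

# A sketch assuming pyecharts >= 1.0 and the simple_list collected above.
from pyecharts import options as opts
from pyecharts.charts import Bar

bar = (
    Bar()
    .add_xaxis([s['name'] for s in simple_list])
    .add_yaxis('最低分', [s['min'] for s in simple_list])
    .add_yaxis('平均分', [s['average'] for s in simple_list])
    .set_global_opts(
        title_opts=opts.TitleOpts(title='2018天津市各高校錄取分?jǐn)?shù)'),
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=45)),
    )
)
bar.render('scores.html')  # writes an interactive HTML chart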