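Scraping Admission Scores for Every Major at Universities Nationwide with a Python Crawler

This article is only an exercise in writing a crawler. No data was saved, and the site's URLs and endpoints have been masked.

Target: http://xxx.com


By analyzing the network requests, we can see these two JSON files: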

https://xxx.cn/www/2.0/schoolprovinceindex/2018/318/12/1/1.json
https://xxx.cn/www/2.0/schoolspecialindex/2018/31/11/1/1.json

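In the first URL, 318 is the school id and 12 is the province id, which stands for Tianjin.
The two files correspond to the school's score lines by province and its score lines by major, respectively.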
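
To make the URL layout described above concrete, here is a minimal sketch that assembles such a URL from its parts. The names `BASE` and `build_url` are hypothetical, the host is the masked one from this article, and the trailing `/1/1` segments are copied unchanged from the captured URLs because their meaning is not known here.

```python
# Hypothetical helper: the path layout is inferred from the two example URLs.
BASE = "https://xxx.cn/www/2.0"  # masked host, as in the article


def build_url(kind, year, school_id, province_id):
    """Return the JSON URL for one school in one province.

    kind is "schoolprovinceindex" (score lines by province) or
    "schoolspecialindex" (score lines by major). The trailing "/1/1"
    segments are kept exactly as they appear in the captured requests.
    """
    return f"{BASE}/{kind}/{year}/{school_id}/{province_id}/1/1.json"


print(build_url("schoolprovinceindex", 2018, 318, 12))
```

Keeping the school id and province id as arguments means the same helper could serve both JSON files, assuming the path layout really is the same for every school.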




So the code for the current page is:

import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    'Referer': 'https://xxx.cn/school/search'
}
url = 'https://xxx.cn/www/2.0/schoolprovinceindex/2018/1217/12/1/1.json'
response = requests.get(url,headers=HEADERS)
print(response.json())

Next we need a way to obtain the school ids. Analyzing the requests in the same way turns up:

https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists

which is called by POSTing the following data:

data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":2,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}

One of the parameters is page, which corresponds to the page number.
So the code for this part is:

import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    'Referer': 'https://xxx.cn/school/search'
}
url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":2,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}
response = requests.post(url,headers=HEADERS,data=data)
print(response.json())
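
Since page is the only field that changes from one request to the next, the payload can be produced by a small helper. This is just a sketch: `BASE_PAYLOAD` and `build_payload` are hypothetical names, and the field values are copied from the captured request above.

```python
# Fields copied from the captured request; only "page" varies between calls.
BASE_PAYLOAD = {
    "access_token": "", "admissions": "", "central": "", "department": "",
    "dual_class": "", "f211": "", "f985": "", "is_dual_class": "",
    "keyword": "", "page": 1, "province_id": "", "request_type": 1,
    "school_type": "", "size": 20, "sort": "view_total", "type": "",
    "uri": "apigkcx/api/school/hotlists",
}


def build_payload(page):
    """Return a fresh copy of the payload with the requested page number."""
    payload = dict(BASE_PAYLOAD)  # shallow copy so the template is never mutated
    payload["page"] = page
    return payload


print(build_payload(2)["page"])
```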

With a little processing we can extract each school's id. To keep things tidy and make later data handling easier, we store each one in a dictionary:

import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    'Referer': 'https://xxx.cn/school/search'
}

school_info = []
def get_schoolid(pagenum):
    url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
    data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":pagenum,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}
    response = requests.post(url,headers=HEADERS,data=data)
    school_json = response.json()
    schools = school_json['data']['item']
    for school in schools:
        school_id = school['school_id']
        school_name = school['name']
        school_dict = {
        'id':school_id,
        'name':school_name
        }
        school_info.append(school_dict)

def main():
    get_schoolid(2)
    print(school_info)
if __name__ == '__main__':
    main()

The result is a list of dictionaries, each holding one school's id and name.



因?yàn)橹笪覀兿胍闅v所有頁面的學(xué)校id臼婆,所以保留了一個(gè)pagenum參數(shù)抒痒,用作循環(huán)。
Next we add the functions that fetch each school's summary information and its detailed scores by major:

import requests

HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;",
    "Accept-Language": "zh-CN,zh;q=0.8",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0",
    'Referer': 'https://xxx.cn/school/search'
}

school_info = []
simple_list = []
pro_list = []
name_list = []
def get_schoolid(pagenum):
    url = 'https://xxx.cn/gkcx/api/?uri=apigkcx/api/school/hotlists'
    data = {"access_token":"","admissions":"","central":"","department":"","dual_class":"","f211":"","f985":"","is_dual_class":"","keyword":"","page":pagenum,"province_id":"","request_type":1,"school_type":"","size":20,"sort":"view_total","type":"","uri":"apigkcx/api/school/hotlists"}
    response = requests.post(url,headers=HEADERS,data=data)
    school_json = response.json()
    schools = school_json['data']['item']
    for school in schools:
        school_id = school['school_id']
        school_name = school['name']
        school_dict = {
        'id':school_id,
        'name':school_name
        }
        school_info.append(school_dict)

def get_info(id,name):
    simple_url  = 'https://xxx.cn/www/2.0/schoolprovinceindex/2018/%s/12/1/1.json'%id
    simple_response = requests.get(simple_url,headers=HEADERS)
    simple_info = simple_response.json()['data']['item'][0]
    simple_infodict = {
        'name':name,
        'max':simple_info['max'],
        'min':simple_info['min'],
        'average':simple_info['average'],
        'local_batch_name':simple_info['local_batch_name']
    }
    simple_list.append(simple_infodict)
def get_score(id,name):
    professional_url  = 'https://xxx.cn/www/2.0/schoolspecialindex/2018/%s/12/1/1.json'%id
    professional_response = requests.get(professional_url,headers=HEADERS)
    for pro_info in professional_response.json()['data']['item']:
        pro_dict = {
        'name':name,
        'spname':pro_info['spname'],
        'max':pro_info['max'],
        'min':pro_info['min'],
        'average':pro_info['average'],
        'min_section':pro_info['min_section'],
        'local_batch_name':pro_info['local_batch_name']
        }
        pro_list.append(pro_dict)

def main():
    # Banner in cyan; the colour code is emitted once and reset at the end
    print('\033[0;36m' + '='*15 + '2018 National University Admission Score Lookup' + '='*15 + '\033[0m' + '\n')
    get_schoolid(1)
    for school in school_info:
        school_id = school['id']
        name = school['name']
        # A missing or empty 'item' list raises, which we treat as "no data"
        try:
            get_info(school_id,name)
            print('[*] Fetched 2018 admission scores in Tianjin for %s'%name)
        except Exception:
            print('[*] No admission score information found yet for %s'%name)
        try:
            get_score(school_id,name)
            print('[*] Fetched 2018 score lines by major for %s'%name)
        except Exception:
            print('[*] No score lines by major found yet for %s'%name)

    print('\033[0;36m[*] Crawling finished; organizing the data\033[0m')
    print('\033[0;36m[*] 2018 admission scores in Tianjin, by university\033[0m')
    for school in simple_list:
        print('School: {name}, highest: {max}, lowest: {min}, average: {average}'.format(**school))
    print('\033[0;36m[*] 2018 score lines by major in Tianjin, by university\033[0m')
    for school in pro_list:
        print('School: {name}, major: {spname}, highest: {max}, lowest: {min}, average: {average}, lowest rank: {min_section}'.format(**school))
if __name__ == '__main__':
    main()


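There are 142 pages in total. Because the work is I/O-bound, multithreading can speed up the crawler, but shared variables need care. I have covered Python multithreading in an earlier write-up, so I will not go into it here.

Next, the data can be saved to Excel with pandas: first convert the dictionaries to a DataFrame, then write it out as an Excel file.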
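
One way to sidestep the shared-variable problem is to have each worker return its own list and merge the lists in the calling thread. The sketch below uses the standard library's `ThreadPoolExecutor`. `crawl_pages` and `fake_fetch` are hypothetical names, the worker count of 8 is an arbitrary choice, and `fake_fetch` replaces the real HTTP request so the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(fetch_page, pages, max_workers=8):
    """Fetch pages concurrently and return the merged list of records.

    Each worker returns its own list and the merge happens in the calling
    thread, so no list is mutated from several threads at once.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as `pages`
        per_page = list(pool.map(fetch_page, pages))
    return [record for page in per_page for record in page]


def fake_fetch(pagenum):
    """Stand-in for the real request: one made-up school per page."""
    return [{"id": pagenum, "name": f"school-{pagenum}"}]


results = crawl_pages(fake_fetch, range(1, 143))  # pages 1 through 142
print(len(results))
```

If the workers did append to a shared list such as school_info, a `threading.Lock` around the append would be the usual safeguard. Returning values simply avoids the question.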



The data can also be analyzed with tools such as pyecharts.
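
For the pandas export, a minimal sketch might look like the following. It assumes the records are flat dictionaries like those in simple_list or pro_list. `to_frame` and `save_excel` are hypothetical names and the sample values are made up purely to show the shape of the data. Writing .xlsx files also requires an Excel engine such as openpyxl to be installed.

```python
import pandas as pd


def to_frame(records):
    """Convert a list of flat dicts (like simple_list) to a DataFrame."""
    return pd.DataFrame(records)


def save_excel(records, path):
    """Write the records to an .xlsx file without the index column."""
    to_frame(records).to_excel(path, index=False)


# Made-up values, only to show the shape of the data
sample = [
    {"name": "School A", "max": 650, "min": 600, "average": 620},
    {"name": "School B", "max": 630, "min": 590, "average": 605},
]
frame = to_frame(sample)
print(frame.columns.tolist())
```

Calling `save_excel(simple_list, 'scores.xlsx')` after the crawl would then produce one row per school; the file name here is only an example.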