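I have been reading web-scraping articles and watching videos for a while. I always feel that I understand them, but as soon as I work on a project of my own I hit plenty of difficulties, which means my fundamentals are still not solid enough. Today I will try to log in to my school's official site and then scrape the information I want.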
System: win10 1803
Tools: PyCharm 1703
Python version: 3.6
Packet-capture tool: Charles
Modules used: requests, PIL, BeautifulSoup/lxml, os
Our school's academic administration system: http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/default2.aspx
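Logging in with the packet-capture tool running captures the data that the form submits.
This is the data we have to POST to the server to simulate a login. We also need to submit the captcha, but the captcha is generated randomly on every load, so we need to find the URL that serves the captcha image.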
# Download the captcha
import os

import requests
from PIL import Image

s = requests.session()  # create a session and reuse it for every later request
imgUrl = "http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/CheckCode.aspx?"
imgresponse = s.get(imgUrl, stream=True)
print(s.cookies)
image = imgresponse.content
DstDir = os.getcwd() + "\\"
print("Captcha saved to: " + DstDir + "code.png" + "\n")
try:
    # the with block closes the file by itself, so no finally/close is needed
    with open(DstDir + "code.png", "wb") as png:
        png.write(image)
except IOError:
    print("IO Error\n")
# Open the file that was just saved so the captcha can be typed in by hand
img = Image.open(DstDir + "code.png")
img.show()
Hmm, I saved the image as PNG from the start and it opens automatically, so fetching and using the random captcha is taken care of.
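The save step can also be wrapped in a small function. This is only a sketch: `save_captcha` and its parameters are names made up for illustration, not part of the original script, and it uses `pathlib` so that no path separator has to be written by hand.

```python
from pathlib import Path


def save_captcha(content: bytes, directory: str, filename: str = "code.png") -> Path:
    """Write raw captcha bytes to directory/filename and return the full path.

    save_captcha and its parameters are hypothetical names for this sketch.
    """
    target = Path(directory) / filename
    target.write_bytes(content)
    return target
```

With the session from the script it would be called roughly as `path = save_captcha(s.get(imgUrl).content, os.getcwd())`, and the returned path can then be handed to `Image.open(path).show()`.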
url = 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/default2.aspx'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
data = {
    '__VIEWSTATE': 'dDwtMTg3MTM5OTI5MTs7Pu9NgXuEf8Rr/BvvkUWH8oCYiXB2',
    'TextBox1': 'my student ID',
    'TextBox2': 'my password',
    'TextBox3': input('Enter the captcha: '),
    'Button1': '',
    'lbLanguage': '',
}
response = s.post(url=url, data=data, headers=headers)
# a successful login redirects to a URL that contains 'xh=<student ID>'
if 'xh=' in response.url:
    print('Login succeeded')
else:
    print('Login failed')
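After logging in you land on a different link: http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/xs_main.aspx?xh=15040****
The part at the end is my student ID, so I added a small check at the end of the login code: if the URL contains 'xh=', the login succeeded.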
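A plain substring test would also match any other parameter whose name happens to end in `xh`. A slightly stricter check, sketched here with the standard library, parses the query string and looks for a non-empty `xh` parameter. The function name `login_succeeded` is invented for this example.

```python
from urllib.parse import parse_qs, urlparse


def login_succeeded(final_url: str) -> bool:
    """Return True when the final URL carries a non-empty xh parameter.

    login_succeeded is a hypothetical helper, not part of the original script.
    """
    query = parse_qs(urlparse(final_url).query)
    # parse_qs skips blank values, so a bare "xh=" does not count
    return bool(query.get("xh"))
```

In the script it would be used as `login_succeeded(response.url)` right after the login POST.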
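After logging in, though, I could not scrape the grades directly; I kept getting an empty list back. I stared at it for a long time without finding the problem and in the end asked a friend. The cause is that the grades live at a different URL, not the one the login redirects to.
We have to start from the initial URL and work our way to the URL from which the grades can actually be scraped, so that link has to be assembled. Start by searching the source of the initial page.
The only URL I could find on the initial page is this one. Compared with the URL we need there seems to be a difference: after xm= we need an encoded string, but here it is my name. I decided to first try whether this URL could fetch the grades and to dig further only if it could not (in fact I had already found the final URL and then tried this one too, and the result was the same!).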
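from lxml import etree

# headers1 is the header dict defined further down; the request goes through the shared session s
link0 = s.post(response.url, headers=headers1).text
s1 = etree.HTML(link0)
link = s1.xpath('//*[@id="headDiv"]/ul/li[4]/ul/li[3]/a/@href')
# xpath() returns a list; take the single href it contains
url2 = 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/' + link[0]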
Scrape the href attribute, then join it onto the base path to get the link we need.
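The join can also be left to `urllib.parse.urljoin`, which resolves a relative href against the page it came from, so the session segment in the path does not have to be typed again. This is a sketch: `build_grades_url` is a made-up name, and the sample href is only modelled on the Referer shown in `headers1`, not copied from the real page.

```python
from urllib.parse import urljoin


def build_grades_url(page_url: str, href: str) -> str:
    """Resolve a relative href against the URL of the page it was found on.

    build_grades_url is a hypothetical helper for this sketch.
    """
    return urljoin(page_url, href)


# Illustrative values; the real href comes from the xpath() result
page = "http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/xs_main.aspx?xh=150402216"
href = "xscjcx.aspx?xh=150402216&xm=%BB%C6%BA%C6&gnmkdm=N121605"
grades_url = build_grades_url(page, href)
print(grades_url)
```

Because the href replaces only the last path segment, the result keeps the `(jnw0uoqufqsohg3jngkaci55)` part of the page URL.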
Fetching the grades requires its own form data. Because I clicked "all grades", the term and academic-year fields are submitted empty; you can fill in other values to get the grades for a specific term or year.
data2 = {
'__EVENTTARGET': '',
'__EVENTARGUMENT': '',
'__VIEWSTATE': 'dDwxMzc0MjAwNjg2O3Q8cDxsPFNvcnRFeHByZXM7c2ZkY2JrO2RnMztkeWJ5c2NqO1NvcnREaXJlO3hoO3N0cl90YWJfYmpnO2NqY3hfbHNiO3p4Y2pjeHhzOz47bDxrY21jO1xlO2JqZztcZTthc2M7MTUwNDAyMjE2O3pmX2N4Y2p0al8xNTA0MDIyMTY7XGU7MDs+PjtsPGk8MT47PjtsPHQ8O2w8aTw0PjtpPDEwPjtpPDE5PjtpPDMwPjtpPDMyPjtpPDM0PjtpPDM2PjtpPDM4PjtpPDM5PjtpPDQxPjtpPDQzPjtpPDQ1PjtpPDQ3PjtpPDQ5PjtpPDUxPjtpPDUzPjtpPDU1PjtpPDU3PjtpPDU5PjtpPDYxPjtpPDYyPjtpPDYzPjtpPDY1PjtpPDY3PjtpPDY5PjtpPDcxPjtpPDczPjtpPDc1PjtpPDc3PjtpPDc5PjtpPDgwPjs+O2w8dDx0PDt0PGk8MTk+O0A8XGU7MjAwMS0yMDAyOzIwMDItMjAwMzsyMDAzLTIwMDQ7MjAwNC0yMDA1OzIwMDUtMjAwNjsyMDA2LTIwMDc7MjAwNy0yMDA4OzIwMDgtMjAwOTsyMDA5LTIwMTA7MjAxMC0yMDExOzIwMTEtMjAxMjsyMDEyLTIwMTM7MjAxMy0yMDE0OzIwMTQtMjAxNTsyMDE1LTIwMTY7MjAxNi0yMDE3OzIwMTctMjAxODsyMDE4LTIwMTk7PjtAPFxlOzIwMDEtMjAwMjsyMDAyLTIwMDM7MjAwMy0yMDA0OzIwMDQtMjAwNTsyMDA1LTIwMDY7MjAwNi0yMDA3OzIwMDctMjAwODsyMDA4LTIwMDk7MjAwOS0yMDEwOzIwMTAtMjAxMTsyMDExLTIwMTI7MjAxMi0yMDEzOzIwMTMtMjAxNDsyMDE0LTIwMTU7MjAxNS0yMDE2OzIwMTYtMjAxNzsyMDE3LTIwMTg7MjAxOC0yMDE5Oz4+Oz47Oz47dDx0PHA8cDxsPERhdGFUZXh0RmllbGQ7RGF0YVZhbHVlRmllbGQ7PjtsPGtjeHptYztrY3h6ZG07Pj47Pjt0PGk8OT47QDzlv4Xkv67or7476YCJ5L+u6K++O+WFrOWFseWfuuehgOivvjvlrp7ot7Xor7475LiT5Lia5qC45b+D6K++O+S4k+S4muivvjvkuJPkuJrpgInkv67or74757u85ZCI5a6e6Le16K++O1xlOz47QDwxOzI7Mzs0OzU7Njs3Ozg7XGU7Pj47Pjs7Pjt0PHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPFxlOz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWtpuWPt++8mjE1MDQwMjIxNjtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWnk+WQje+8mum7hOa1qTtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOWtpumZou+8muS/oeaBr+S4juiuoeeul+acuuezuztvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOS4k+S4mu+8mjtvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOmAmuS/oeW3peeoiztvPHQ+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDzkuJPkuJrmlrnlkJHvvJo7Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7VmlzaWJsZTs+O2w86KGM5pS/54+t77yaMjAxNee6p+mAmuS/oeW3peeoizLnj607bzx0Pjs+Pjs+Ozs+O3Q8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7
PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PDtsPGk8MTM+Oz47bDx0PEAwPDs7Ozs7Ozs7Ozs+Ozs+Oz4+O3Q8cDxwPGw8VGV4dDtWaXNpYmxlOz47bDzoh7Pku4rmnKrpgJrov4for77nqIvmiJDnu6nvvJo7bzx0Pjs+Pjs+Ozs+O3Q8QDA8cDxwPGw8UGFnZUNvdW50O18hSXRlbUNvdW50O18hRGF0YVNvdXJjZUl0ZW1Db3VudDtEYXRhS2V5czs+O2w8aTwxPjtpPDE+O2k8MT47bDw+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpibG9jazs+Pj47Ozs7Ozs7Ozs7PjtsPGk8MD47PjtsPHQ8O2w8aTwxPjs+O2w8dDw7bDxpPDA+O2k8MT47aTwyPjtpPDM+O2k8ND47aTw1Pjs+O2w8dDxwPHA8bDxUZXh0Oz47bDxKWDAyMTAxNDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w86K6h566X5py65a+86K66Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDzlv4Xkv67or747Pj47Pjs7Pjt0PHA8cDxsPFRleHQ7PjtsPDIuMDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w8MDs+Pjs+Ozs+O3Q8cDxwPGw8VGV4dDs+O2w8Jm5ic3BcOzs+Pjs+Ozs+Oz4+Oz4+Oz4+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47cDxsPHN0eWxlOz47bDxESVNQTEFZOm5vbmU7Pj4+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDw7Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPHA8cDxsPFZpc2libGU7PjtsPG88Zj47Pj47cDxsPHN0eWxlOz47bDxESVNQTEFZOm5vbmU7Pj4+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpub25lOz4+Pjs7Ozs7Ozs7Ozs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+Pjs+Ozs7Ozs7Ozs7Oz47Oz47dDxAMDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+O3A8bDxzdHlsZTs+O2w8RElTUExBWTpub25lOz4+Pjs7Ozs7Ozs7Ozs+Ozs+O3Q8QDA8cDxwPGw8VmlzaWJsZTs+O2w8bzxmPjs+PjtwPGw8c3R5bGU7PjtsPERJU1BMQVk6bm9uZTs+Pj47Ozs7Ozs7Ozs7Pjs7Pjt0PEAwPDtAMDw7O0AwPHA8bDxIZWFkZXJUZXh0Oz47bDzliJvmlrDlhoXlrrk7Pj47Ozs7PjtAMDxwPGw8SGVhZGVyVGV4dDs+O2w85Yib5paw5a2m5YiGOz4+Ozs7Oz47QDA8cDxsPEhlYWRlclRleHQ7PjtsPOWIm+aWsOasoeaVsDs+Pjs7Ozs+Ozs7Pjs7Ozs7Ozs7Oz47Oz47dDxwPHA8bDxUZXh0O1Zpc2libGU7PjtsPOacrOS4k+S4muWFsTExMeS6ujtvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxWaXNpYmxlOz47bDxvPGY+Oz4+Oz47Oz47dDxwPHA8bDxUZXh0Oz47bDxaSlU7Pj47Pjs7Pjt0PHA8cDxsPEltYWdlVXJsOz47bDwuL2V4Y2VsLzk3MDUyNzEuanBnOz4+Oz47Oz47Pj47Pj47PgGGyciVaIkDb4w+sTsVpJ8ImvRN',
'btn_zcj': '(unable to decode value)',
'hidLanguage': '',
'ddl_kcxz': '',
'ddlXQ': '',
'ddlXN': '',}
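The `__VIEWSTATE` above was copied from a captured request. ASP.NET WebForms pages normally embed it in a hidden `<input name="__VIEWSTATE">`, and the value can differ between page loads, so a copied one may stop working. Below is a sketch for reading it out of the page HTML instead. It assumes the grades page exposes the field this way, and it uses the standard-library `html.parser` in place of lxml purely to stay dependency-free; with lxml the equivalent XPath would be `//input[@name="__VIEWSTATE"]/@value`. The names `ViewStateParser` and `extract_viewstate` are invented for this example.

```python
from html.parser import HTMLParser
from typing import Optional


class ViewStateParser(HTMLParser):
    """Remember the value of <input name="__VIEWSTATE"> while parsing a page."""

    def __init__(self) -> None:
        super().__init__()
        self.viewstate: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and attributes.get("name") == "__VIEWSTATE":
            self.viewstate = attributes.get("value")


def extract_viewstate(html: str) -> Optional[str]:
    """Return the __VIEWSTATE value found in html, or None when it is absent."""
    parser = ViewStateParser()
    parser.feed(html)
    return parser.viewstate


# Shortened, made-up page fragment used only to exercise the function
sample = '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTg3MTM5OTI5MTs7" /></form>'
print(extract_viewstate(sample))
```

In the script this would mean fetching the grades page once with the session, calling `extract_viewstate` on its text, and putting the result into `data2['__VIEWSTATE']` before the POST.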
headers1 = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
    'Referer': 'http://220.178.71.156:85/(jnw0uoqufqsohg3jngkaci55)/xscjcx.aspx?xh=150402216&xm=%BB%C6%BA%C6&gnmkdm=N121605',
    'Connection': 'keep-alive',
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Upgrade-Insecure-Requests': '1'
}
# url3 is the final grades URL; as noted earlier, url2 gave me the same result
data4 = s.post(url3, data=data2, headers=headers1).text
s2 = etree.HTML(data4)
with open('C:/Users/linx00/Desktop/cj.txt', 'w', encoding='utf-8') as f:
    years = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[1]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[1]/text()')
    xueqi = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[2]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[2]/text()')
    kcmc = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[4]/a/text()|//*[@id="Form1"]/div[2]/div/span/table/tr/td[4]/text()')
    # The cells above are marked up differently (some wrap the text in <a>), so an XPath union (|) catches both forms
    kcxz = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[5]/text()')
    xuefen = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[7]/text()')
    jidian = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[8]/text()')
    chengji = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[9]/text()')
    bkchengji = s2.xpath('//*[@id="Form1"]/div[2]/div/span/table/tr/td[11]/text()')
    f.write('{}\n,{}\n,{}\n,{}\n,{}\n,{}\n,{}\n,{}\n'.format(years,xueqi, kcmc, kcxz, xuefen, jidian, chengji, bkchengji))
That saves all the grades I wanted. (The saved output is not pretty, though; I am not sure how to format it properly. If any expert can help, please do!)
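One possible way to tidy the output, offered as a sketch rather than the definitive fix: each `xpath()` call returns one column, so zipping the columns together gives one row per course, which the standard `csv` module can write. `columns_to_csv` is a made-up helper, and the sample values below are invented stand-ins for the real `xpath()` results.

```python
import csv
import io
from itertools import zip_longest


def columns_to_csv(header, *columns):
    """Turn parallel column lists into CSV text with one course per row.

    columns_to_csv is a hypothetical helper for this sketch.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    # zip_longest pads shorter columns with "" instead of dropping rows
    for row in zip_longest(*columns, fillvalue=""):
        writer.writerow([str(cell).strip() for cell in row])
    return buffer.getvalue()


# Made-up sample values standing in for the xpath() results
years = ["2015-2016", "2015-2016"]
kcmc = ["Course A", "Course B"]
chengji = ["85", "90"]
text = columns_to_csv(["year", "course", "grade"], years, kcmc, chengji)
print(text)
```

The returned text can be written to a `.csv` file with `f.write(text)` and opened in a spreadsheet. If the columns come back with different lengths, the padded cells show where a selector missed a row.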