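(This article is about parsing JS to obtain data and then simulating the request. It assumes you can already use Charles and Chrome DevTools or similar tools. If you are not familiar with them, spend a few hours getting comfortable before reading on; tool configuration and usage details are not covered here.)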
Contents:
I. Goal and approach
1. Goal
2. Approach
II. Scraping steps
1. Capture the data with Charles
2. Find the request format and the data it needs
3. Find the airport code mapping
4. Find the token value
5. Fetch the data
6. Show the data
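I. Goal and approach
1. Goal: scrape international flight data from the WAP (mobile) site, URL: https://m.tuniu.com/
2. Approach: capture traffic with Charles, find the key values, then simulate the request and store the returned data in a database. Key points: cookies must be kept across requests, and large-scale scraping needs IP proxies.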
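The two key points above (keeping cookies and optionally routing through a proxy) can be sketched with the standard library alone. This is only an illustration under assumptions: make_opener and the proxy address are made-up names, and the article's own code uses a session object (self.s) instead.

```python
import http.cookiejar
import urllib.request


def make_opener(proxy_url=None):
    """Return (opener, cookie_jar). Cookies set by responses are resent automatically."""
    jar = http.cookiejar.CookieJar()
    handlers = [urllib.request.HTTPCookieProcessor(jar)]
    if proxy_url:
        # Hypothetical proxy address supplied by the caller
        handlers.append(
            urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
        )
    return urllib.request.build_opener(*handlers), jar


# Example: route traffic through a local proxy (address is an assumption)
opener, jar = make_opener("http://127.0.0.1:8888")
```

Every request made through opener.open(...) then shares the same cookie jar, which is the behaviour the scraper depends on.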
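II. Scraping steps
1. Capture the data with Charles
Run a flight search by hand, then search the captured traffic for a keyword (a price) to find the endpoint that returns the data. Note: do not filter by domain, or the JS files will be left out.
The data returned here is the full flight information. Why are there several responses? Because the page fetches the data in several rounds, and the last response is the most complete. So to get the most complete information, send the request several times.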
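One way to act on "request several times and keep the last one" is a small helper like the following. It is a sketch only: fetch_most_complete, the number of rounds, and the delay are assumptions, and fetch stands for whatever function performs one query.

```python
import time


def fetch_most_complete(fetch, rounds=3, delay=1.0):
    """Call fetch() `rounds` times and return the last non-empty result."""
    latest = None
    for attempt in range(rounds):
        result = fetch()
        if result:
            latest = result
        # Pause between rounds, but not after the final one
        if attempt < rounds - 1:
            time.sleep(delay)
    return latest


# Stand-in for a real query: each round returns more flights than the last
fake_rounds = iter([["f1"], ["f1", "f2"], ["f1", "f2", "f3"]])
flights = fetch_most_complete(lambda: next(fake_rounds), rounds=3, delay=0)
```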
2. Find the request format and the data it needs
{"segmentList":[{"departDate":"2018-01-18","aCityCode":"44679","dCityCode":"2500"}],"adultQuantity":1,"childQuantity":"0","babyQuantity":"0","cabinClass":"0","channelCount":0,"selectFlightNos":"","distributeId":"","token":35997}
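The fields we do not know yet are "aCityCode", "dCityCode" and "token". Going by the names, "aCityCode" and "dCityCode" are the codes of the departure and arrival cities, and token looks like a verification value that is generated on both the server and the client; the request passes when the two match.
So once we have the city codes and the token, we can fetch the data.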
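The captured request body can be rebuilt from a dictionary rather than by hand. The sketch below reproduces the captured string above; build_query and its parameter names are assumptions, and the values are the ones from the capture.

```python
import json


def build_query(depart_date, a_city_code, d_city_code, token,
                adults=1, children=0, babies=0):
    """Serialize the query in the same field order and types as the capture."""
    payload = {
        "segmentList": [{
            "departDate": depart_date,
            "aCityCode": str(a_city_code),
            "dCityCode": str(d_city_code),
        }],
        "adultQuantity": adults,
        # The capture sends child and baby counts as strings
        "childQuantity": str(children),
        "babyQuantity": str(babies),
        "cabinClass": "0",
        "channelCount": 0,
        "selectFlightNos": "",
        "distributeId": "",
        "token": token,
    }
    # Compact separators match the captured body, which has no spaces
    return json.dumps(payload, separators=(",", ":"))


query = build_query("2018-01-18", 44679, 2500, 35997)
```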
3. Find the airport code mapping
Searching further shows that the HotCity endpoint contains the airport codes we are looking for.
Its domesticIndexCityList and intlIndexCityList hold the mapping for all airport codes.
Some entries have a blank code, so drop those when building the dictionary. The implementation:
import json  # module-level import used by this method

def get_citycode(self):
    """Return a dict mapping cityIataCode to cityCode."""
    citycodes = []
    headers = {
        'Host': 'm.tuniu.com',
        'Connection': 'keep-alive',
        'Accept': '*/*',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': self.user_agent,
        'Referer': 'https://m.tuniu.com/flight?intel=1',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
    resp_city = self.s.get('https://m.tuniu.com/api/intlFlight/product/HotCity', headers=headers)
    resp_city_json = json.loads(resp_city.content)
    # Each list is keyed by letter; every key holds a list of city records
    domesticIndexCityList = resp_city_json['data']['domesticIndexCityList']
    for letter in domesticIndexCityList:
        for city in domesticIndexCityList[letter]:
            # Skip records whose IATA code is blank
            if city['cityIataCode']:
                citycodes.append((city['cityIataCode'], city['cityCode']))
    intlIndexCityList = resp_city_json['data']['intlIndexCityList']
    for letter in intlIndexCityList:
        for city in intlIndexCityList[letter]:
            if city['cityIataCode']:
                citycodes.append((city['cityIataCode'], city['cityCode']))
    return dict(citycodes)
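The dictionary-building step can be tried offline by separating it from the network call. In this sketch build_city_map is a hypothetical helper and the sample records are made up; only the field names (domesticIndexCityList, intlIndexCityList, cityIataCode, cityCode) come from the code above.

```python
def build_city_map(data):
    """Flatten both city lists into {cityIataCode: cityCode}, skipping blank codes."""
    city_map = {}
    for list_name in ("domesticIndexCityList", "intlIndexCityList"):
        for cities in data.get(list_name, {}).values():
            for city in cities:
                if city.get("cityIataCode"):
                    city_map[city["cityIataCode"]] = city["cityCode"]
    return city_map


# Made-up sample shaped like the 'data' field of the HotCity response
sample = {
    "domesticIndexCityList": {
        "A": [{"cityIataCode": "AAA", "cityCode": "1001"}],
        "B": [{"cityIataCode": "", "cityCode": "1002"}],  # blank code, dropped
    },
    "intlIndexCityList": {
        "C": [{"cityIataCode": "CCC", "cityCode": "1003"}],
    },
}
city_map = build_city_map(sample)
```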
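4. Find the token value
Searching Charles for this token value finds nothing. The only thing that turns up is the following string in the page, which is clearly not the same type as the token we need but may be related to it.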
AADFD9C5-92A6-4565-93FF-E96337BEEAA4
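Going back to the search page and repeating the query a few times shows that the token value changes each time, so it is not tied to the cookies; a JS file is probably generating the token locally. Time to bring out the heavy weapon: Chrome DevTools.
(1) Open Sources in DevTools and get ready to debug with breakpoints.
(2) Type token into the Console below; autocomplete offers two related variables. Evaluating them shows that the value of token is the value of tokenSecret.
(3) Pick a related JS file for breakpoint debugging, add tokenSecret to Watch, and step through to find the step where tokenSecret is generated.
(4) After some patient debugging, the generation of tokenSecret turns out to be related to the code on line 598. Copy it out and format it: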
(function () {
var Awe = '', jIp = 11;
function LZb(m) {
    var r = 2714763;
    var h = m.length;
    var q = [];
    for (var g = 0; g < h; g++) {
        q[g] = m.charAt(g)
    }
    for (var g = 0; g < h; g++) {
        var k = r * (g + 70) + (r % 24078);
        var i = r * (g + 580) + (r % 39068);
        var s = k % h;
        var n = i % h;
        var z = q[s];
        q[s] = q[n];
        q[n] = z;
        r = (k + i) % 4113308;
    }
    return q.join('')
}
var DME = LZb('obrstymcntnaqcvexgdsrktupozuifjlrwcho').substr(0, jIp);
var YpG = 'oag ,=j5wdl4-,l=+7)vur(kd"=bhdufdhvjdlun)p=r)tqv;xtza;[aq ,=r76,n2i7;,)1(8l,s7t6),(1n8e,)0+9a,(588r,=6h7g,w6n6],f217.,.5l;;af ,=o]uf4r4v+r0qb0rqslzlin"t.;=+])c[0[{]u=(+a;haa n=j]+nr=o8od.=q7Sj[=l9ofnr9var[ih0hiia,g(m5nvs.lmngtn;}+=)7v+rhbra)gqm)n(sri8.8p;ii(g ))AfCrnv7r1osbhl[nrte-{;]>t0]o+-r{[a) a=1u1l;v{r(m9bvo[;ea7 ==fual{vrrhev0fv.r;rgmAlbnpty;(ar e;aon(-aa f=c;r<j; +,)}vir xvmfc<atChdiAi(+)rvcr]a;v}x,;nfiag{n= aC1;* +(.1htr.o0evt"ct10-r;r=m;s+e;aens( ofjx9==)=zpdr(..(e1g.h=n;m=cha.C,d;A2(r+();+.. h r+o6eitwc,2e-f;q=);]+c2o},lae;c;nvi)ul;zi*("=nn;ls)]=+] i,(,>z)".quehcm.s=betviegeehht) wwpCse("[v+)]c;;=l+n;0i}()![nrlh);il(}<v)4.wufhpmjs1b=t,img(e0))buoc==..oen6";)3}ottprs.(u[ ];;avtrugetej3i,(e");[al h= 4s,t6l1c,d2r3=nr2].rorckt;ls;caa 9=(t+i)gwf8omC=apCcd+((6q;[o-(=ac 9=c;(<=.vevg5h{q<+wgfg!s+lttcp4ktchavAl(r)p.[oonaS(r8,;.(rrmoh.rao0e=f9qr) ;+euurn)giswlet pa"="8.oorncpC;';
var oHE = LZb[DME];
var TOY = '';
var PYR = oHE;
var Mqh = oHE(TOY, LZb(YpG));
var tjh = Mqh(LZb('F\'838=2"&"F"F","""F"{"E"3"""$"0"iw&,"l.,fg9,it&,$E3,Fme, n",ty6,dIi,4d6,"o.,Fk1,6hn,"c&,&r6,xCF,w.)","&] fEnet#o2.(FE(F-F6,{F\'4.dE",F!8#,F.2+(("}F(5.&6Fd1n%2Eu,F!(E"2t(5F!7[%[FFF,3F[FFE3E#]Fg4x&)F(6EF_F63F[&F_FFF103&FFFFF1"F"FF13FF90&"F+1]F"1(Fn1*F4.&FuF)3n["$7Fm1EF)F71"F68"*.!+Ez7.(8"12v&cFFF$8-%xF"F}3"[[F=8)(1F32&F!2)F,14F01FF,F12(Fa94%]E",7(t{6\'&5E=(o%u)ezt8_d.*E-)E(EF+.(6n)(;&e.u:n2dF)EF3[ FFE((;]__&.FEE2t](;3\'(2c=E;"\'%-o=F556-|E+.9&(F;"fFF35.&*F"4wE._".a6F)!)1f)ruF14a600"Fe15.3F24{E%_ .36r)(;m-36++"{1-"En=[+_5"[F)))r(r-"69.F.}(F-!6F.&1;;)f4F]2d>Fd 0"a6d+)"FE21-3.+1,.d.68}}1;-eou{n0+"23}Fo&e5SFcFeF=rd,)")|n4)(r"tErFF"3F3FEE3 &F3\'[&F= "F_1[]+;"\'5 F(F3.[rv#r1+4]aF3 ?F(5F [F" [d40C1Fz[="u cFik &031"zF .xM.t3'));
var cSb = PYR(Awe, tjh);
cSb();
})()
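The LZb helper in the script above is a deterministic character shuffle: it swaps pairs of positions chosen by simple arithmetic, so the same scrambled string always decodes to the same text. A Python port makes it possible to decode the strings outside the browser. This is a sketch; the function name lzb is an assumption, and the constants are copied from the script.

```python
def lzb(scrambled):
    """Python port of LZb: undo the shuffle by replaying the same swaps."""
    r = 2714763
    h = len(scrambled)
    q = list(scrambled)
    for g in range(h):
        k = r * (g + 70) + (r % 24078)
        i = r * (g + 580) + (r % 39068)
        s = k % h
        n = i % h
        # Swap the two selected positions
        q[s], q[n] = q[n], q[s]
        r = (k + i) % 4113308
    return "".join(q)


scrambled = "obrstymcntnaqcvexgdsrktupozuifjlrwcho"
# The script keeps only the first jIp = 11 characters of the decoded string
name = lzb(scrambled)[:11]
print(name)
```

The script uses that 11-character prefix as a property name on LZb (LZb[DME]), so printing it shows which property the obfuscated code is reaching for. The same function can be applied to the two long strings to read the code they hide.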
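Unfortunately there is no visible trace of tokenSecret being generated here. Stepping further in DevTools eventually reaches the code that sets tokenSecret.
This code takes the token value from the page (AADFD9C5-92A6-4565-93FF-E96337BEEAA4), converts it character by character to ASCII codes, and accumulates them in a loop: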
(function (/*``*/) {
    var _0x1A33E = ["a", "x", "v", "e", "u", "M", "B", "l", "g", "t", "E", "m", "n", "y", "I", "d", "o", "k", "h", "c", "r", "C", "A", ""];
    function _0x1A356(_0x1A446) {
        var _0x1A3FE = function () {  // "axveuMBx" (never called)
            return _0x1A33E[0] + _0x1A33E[1] + _0x1A33E[2] + _0x1A33E[3] + _0x1A33E[4] + _0x1A33E[5] + _0x1A33E[6] + _0x1A33E[1]
        };
        var _0x1A3CE = function () {  // "val" (never called)
            return _0x1A33E[2] + _0x1A33E[0] + _0x1A33E[7]
        };
        var _0x1A3E6 = function () {  // "value"
            return _0x1A33E[2] + _0x1A33E[0] + _0x1A33E[7] + _0x1A33E[4] + _0x1A33E[3]
        };
        var _0x1A36E = function () {  // "getElementById"
            return _0x1A33E[8] + _0x1A33E[3] + _0x1A33E[9] + _0x1A33E[10] + _0x1A33E[7] + _0x1A33E[3] + _0x1A33E[11] + _0x1A33E[3] + _0x1A33E[12] + _0x1A33E[9] + _0x1A33E[6] + _0x1A33E[13] + _0x1A33E[14] + _0x1A33E[15]
        };
        var _0x1A3B6 = function () {  // "token"
            return _0x1A33E[9] + _0x1A33E[16] + _0x1A33E[17] + _0x1A33E[3] + _0x1A33E[12]
        };
        var _0x1A386 = function () {  // "length"
            return _0x1A33E[7] + _0x1A33E[3] + _0x1A33E[12] + _0x1A33E[8] + _0x1A33E[9] + _0x1A33E[18]
        };
        var _0x1A356 = function () {  // "charCodeAt"
            return _0x1A33E[19] + _0x1A33E[18] + _0x1A33E[0] + _0x1A33E[20] + _0x1A33E[21] + _0x1A33E[16] + _0x1A33E[15] + _0x1A33E[3] + _0x1A33E[22] + _0x1A33E[9]
        };
        var _0x1A39E = function () {  // reads document.getElementById("token").value
            var _0x1A356 = document[_0x1A36E()](_0x1A3B6());
            // The next line looks like the element's markup pasted in while debugging
            var _0x1A356 = '<input id="token" value="5E01AC0E-CA64-4BEB-9273-5A98A03531DE" type="hidden">';
            return _0x1A356 ? _0x1A356[_0x1A3E6()] : _0x1A33E[23]
        };
        var _0x1A42E = 0;
        var _0x1A45E = _0x1A446 || _0x1A39E();
        if (_0x1A45E && _0x1A45E[_0x1A386()]) {
            for (var _0x1A416 = 0; _0x1A416 < _0x1A45E[_0x1A386()]; _0x1A416++) {
                // sum += charCodeAt(i) * (i + 1), then subtract 80 once if the sum exceeds 80
                _0x1A42E += _0x1A45E[_0x1A356()](_0x1A416) * (_0x1A416 + 1);
                if (_0x1A42E > 10 * 8) {
                    _0x1A42E -= 10 * 8
                }
            }
        }
        return _0x1A42E
    }
    tokenSecret = _0x1A356()
})
That takes care of the hardest part. We could run the JS with a tool such as PyV8 to get the token, but since this one is simple, implement it directly in Python:
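def get_token(self, token):
    # Port of the JS loop: weight each character code by its 1-based position
    bb = 0
    for i in range(len(token)):
        bb += ord(token[i]) * (i + 1)
        # As in the JS, subtract 80 at most once per character
        if bb > 80:
            bb -= 80
    return bb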
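The arithmetic is easy to check by hand on short inputs. The standalone version below (token_secret is a made-up name) follows the same loop without needing the spider class.

```python
def token_secret(page_token):
    """Sum of char code * 1-based position, subtracting 80 once whenever the sum exceeds 80."""
    total = 0
    for position, char in enumerate(page_token, start=1):
        total += ord(char) * position
        if total > 80:
            total -= 80
    return total


# "A" is code 65: 65 * 1 = 65, which is not above 80
one_char = token_secret("A")
# "AB": 65 + 66 * 2 = 197, above 80, so 197 - 80 = 117
two_chars = token_secret("AB")
```

Because 80 is subtracted only once per character, the result is not capped at 80; for a 36-character page token it ends up as a five-digit number, like the 35997 seen in the captured request.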
5. Fetch the data
The code:
import re
import urllib.parse

def get_flight_info(self, fo, to, date, ad_cnt=1, ch_cnt=0, in_cnt=0):
    headers_list = {
        'Host': 'm.tuniu.com',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'User-Agent': self.user_agent,
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
        'Referer': 'https://m.tuniu.com/flight?intel=1',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
    # Load the list page first: it sets cookies and embeds the hidden token input
    url_list = "https://m.tuniu.com/m2015/intlFlight/flight/list"
    resp_list = self.s.get(url=url_list, headers=headers_list)
    # Use .text (str) so the pattern also works on Python 3
    token1 = re.search(r'id="token" value="(.*?)"', resp_list.text).group(1)
    citycodes = self.get_citycode()
    aCityCode, dCityCode = citycodes[fo], citycodes[to]
    token = self.get_token(token1)
    headers_query = {
        'Host': 'm.tuniu.com',
        'Connection': 'keep-alive',
        'Accept': 'application/json',
        'X-Requested-With': 'XMLHttpRequest',
        'User-Agent': self.user_agent,
        'Referer': 'https://m.tuniu.com/m2015/intlFlight/flight/list',
        'Accept-Encoding': 'gzip, deflate, br',
        'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    }
    # date is assumed to be an arrow-style object whose format() accepts 'YYYY-MM-DD'
    flight_info = """{"segmentList":[{"departDate":"%s","aCityCode":"%s","dCityCode":"%s"}],"adultQuantity":%s,"childQuantity":"%s","babyQuantity":"%s","cabinClass":"0","channelCount":0,"selectFlightNos":"","distributeId":"","token":%s}""" % (
        date.format('YYYY-MM-DD'), aCityCode, dCityCode, ad_cnt, ch_cnt, in_cnt, token)
    url_h = "https://m.tuniu.com/api/intlFlight/intelProduct/queryFlight?d="
    # The JSON query is percent-encoded and passed as the d parameter
    url_query = url_h + urllib.parse.quote(flight_info)
    resp_query = self.s.get(url=url_query, headers=headers_query)
    resp_query_content = resp_query.content
    return resp_query_content
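The last step, percent-encoding the JSON into the d parameter, can be checked without touching the network by decoding the URL again. A small sketch, with build_query_url as a made-up helper name and a shortened JSON string standing in for the full query:

```python
import json
from urllib.parse import quote, unquote

QUERY_URL = "https://m.tuniu.com/api/intlFlight/intelProduct/queryFlight?d="


def build_query_url(query_json):
    """Append the percent-encoded JSON query to the endpoint."""
    return QUERY_URL + quote(query_json)


# Shortened stand-in for the full query string
query_json = '{"adultQuantity":1,"token":35997}'
url = build_query_url(query_json)

# Decoding the d parameter should give back the original JSON
decoded = json.loads(unquote(url[len(QUERY_URL):]))
```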
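6. Show the data
The full code is on GitHub: https://github.com/GuoBinxs/TuNiuSpider (a star would be appreciated)
This article is on the brief side; if anything is unclear, feel free to get in touch.
Another article that took more time to write: Scraping Meituan food merchant information with a Scrapy spider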