One Approach to Scraping Images from Jandan (jandan.net)

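Every learning process needs practice projects, and when you study web scraping you always end up wanting to scrape something. Most tutorials online are about scraping images from websites, and image sites usually have their own anti-scraping measures. Yesterday I came across a post about scraping images from Jandan, so I gave it a try.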

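The anti-scraping technique is analyzed in detail in the blog post "Python爬蟲(15):煎蛋網加密處理方式" (Python Crawler (15): How Jandan Handles Encryption), which also describes the approach; it is worth reading carefully. My idea here is the same, but I do not port the site's JS function to Python. (I still think porting is the best way; unfortunately I am not familiar with encryption algorithms and could not follow the code, so I will have to study them next.) Instead I use a shortcut: lift the JS decryption function out as-is, build an HTML file around it, and feed in the scraped image hash values so that the browser decodes them back into addresses. Once you have the addresses, downloading them is easy by whatever means you like; I used Thunder (Xunlei). (To collect the image hashes: download every page that contains images and extract each <span class="img-hash">***</span> value.)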
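For readers who would rather fetch the decoded addresses from Python than hand them to a download manager, here is a minimal sketch using only the standard library. The function names, the `images` folder, and the shortened User-Agent string are all hypothetical choices for illustration; this is not what the post itself used.

```python
import os
from urllib.parse import urlsplit
from urllib.request import Request, urlopen


def filename_for(url):
    """Use the last path segment of the image URL as the local file name."""
    return os.path.basename(urlsplit(url).path)


def download(url, folder='images'):
    """Fetch one image and save it under `folder` (a hypothetical directory name)."""
    os.makedirs(folder, exist_ok=True)
    # Send a browser-like User-Agent, as the scraper further down does.
    request = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    target = os.path.join(folder, filename_for(url))
    with urlopen(request, timeout=30) as response, open(target, 'wb') as f:
        f.write(response.read())
    return target
```

Calling `download()` once per decoded address, with a short pause between calls, would replace the manual step.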

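To build the HTML file, I took the two relevant lines from jandan_load_img() and copied jdXFKzuIDxRVqKYQfswJ5elNfow1x0JrJH() out verbatim (the listing below uses the name jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM). I ran it with the developer tools open, watched for errors, and whenever a function was missing I looked for it in the original site's JS and added it. Every function except hex_md5() can be found in the site's JS. A Baidu search showed that hex_md5() lives in md5.js, and the whole md5.js file is included below. (I had meant to copy just hex_md5() out as well, but md5.js has so many parameters that I could not tell how much else it would drag in, so I simply reference md5.js directly.)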

Here is a screenshot first:

[Image: html.png]

The Python code for scraping the image hash values is below.

All image hashes are saved to img_hash.txt.

# -*- coding:utf-8 -*-
import time

import requests
from lxml import etree

# Comment pages 1 through 40 of the image board
urls = ['http://jandan.net/ooxx/page-{}#comments'.format(i) for i in range(1, 41)]
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko)'
                         ' Chrome/22.0.1207.1 Safari/537.1'}
img_hash = []
print('Downloading:', end='')
for i, url in enumerate(urls, start=1):
    html = requests.get(url, headers=headers).text
    root = etree.HTML(html)
    # Each image's encoded address sits in a <span class="img-hash"> element
    for span_img_hash in root.xpath('//span[@class="img-hash"]'):
        img_hash.append(span_img_hash.text)
    print(i, '\t', end='')  # progress: number of the page just fetched
    time.sleep(3)  # pause between pages to go easy on the server
print('Download completed!')
# Append the list's repr to the file; its contents are pasted into hashlist later
with open('img_hash.txt', 'a') as f:
    f.write(str(img_hash))
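Because the script writes `str(img_hash)`, img_hash.txt holds a Python list literal rather than one hash per line. The sketch below shows one way to read it back and print it as a JavaScript array that can be pasted into the `hashlist` variable of the HTML page. The helper names are hypothetical, and it assumes the scraper was run only once: the file is opened in append mode, so a second run would leave two lists back to back, which `ast.literal_eval` rejects.

```python
import ast
import json


def load_hashes(path='img_hash.txt'):
    """Parse the list literal that the scraper wrote with str(img_hash)."""
    with open(path) as f:
        text = f.read()
    # literal_eval only accepts Python literals, so nothing in the file is executed.
    hashes = ast.literal_eval(text)
    # Drop empty entries, which appear when a span had no text.
    return [h for h in hashes if h]


def to_js_array(hashes):
    """Render the hashes as a JavaScript array literal for the hashlist variable."""
    return json.dumps(hashes)
```

Running `print(to_js_array(load_hashes()))` in the folder that contains img_hash.txt prints the array to copy.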

The HTML file is as follows:

  • get_url() is a function I added; it passes each hash value as the argument to jandan_load_img()
  • Open img_hash.txt and copy the hash values in it into the hashlist variable of get_url()
<!DOCTYPE html>
<html>
<head>
    <title></title>
    <script type="text/ecmascript" src="md5.js"></script>
    <script type="text/javascript">
        function jandan_load_img(e) {
            var c = jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM(e, "aPz8sQnzRxiHfhgesalhIBhfKZczglYq");
            var a = c.replace(/(\/\/\w+\.sinaimg\.cn\/)(\w+)(\/.+\.(gif|jpg|jpeg))/, "$1large$3");
            return a
        }
        var jdjDMYMvK51QlNY6NdLY1OkZw6dpQvspIM = function(o, y, g) {
            var d = o;
            var l = "DECODE";
            var y = y ? y : "";
            var g = g ? g : 0;
            var h = 4;
            y = md5(y);
            var x = md5(y.substr(0, 16));
            var v = md5(y.substr(16, 16));
            if (h) {
                if (l == "DECODE") {
                    var b = md5(microtime());
                    var e = b.length - h;
                    var u = b.substr(e, h)
                }
            } else {
                var u = ""
            }
            var t = x + md5(x + u);
            var n;
            if (l == "DECODE") {
                g = g ? g + time() : 0;
                tmpstr = g.toString();
                if (tmpstr.length >= 10) {
                    o = tmpstr.substr(0, 10) + md5(o + v).substr(0, 16) + o
                } else {
                    var f = 10 - tmpstr.length;
                    for (var q = 0; q < f; q++) {
                        tmpstr = "0" + tmpstr
                    }
                    o = tmpstr + md5(o + v).substr(0, 16) + o
                }
                n = o
            }
            var k = new Array(256);
            for (var q = 0; q < 256; q++) {
                k[q] = q
            }
            var r = new Array();
            for (var q = 0; q < 256; q++) {
                r[q] = t.charCodeAt(q % t.length)
            }
            for (var p = q = 0; q < 256; q++) {
                p = (p + k[q] + r[q]) % 256;
                tmp = k[q];
                k[q] = k[p];
                k[p] = tmp
            }
            var m = "";
            n = n.split("");
            for (var w = p = q = 0; q < n.length; q++) {
                w = (w + 1) % 256;
                p = (p + k[w]) % 256;
                tmp = k[w];
                k[w] = k[p];
                k[p] = tmp;
                m += chr(ord(n[q]) ^ (k[(k[w] + k[p]) % 256]))
            }
            if (l == "DECODE") {
                m = base64_encode(m);
                var c = new RegExp("=","g");
                m = m.replace(c, "");
                m = u + m;
                m = base64_decode(d)
            }
            return m
        };
        function md5(a) {
            return hex_md5(a)
        }
        function base64_encode(a) {
            return window.btoa(a)
        }
        function base64_decode(a) {
            return window.atob(a)
        }
        function microtime(b) {
            var a = new Date().getTime();
            var c = parseInt(a / 1000);
            return b ? (a / 1000) : (a - (c * 1000)) / 1000 + " " + c
        }
        function chr(a) {
            return String.fromCharCode(a)
        }
        function ord(a) {
            return a.charCodeAt()
        }
        function get_url() {
            var hashlist = ['Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93MDQyNGJqMzBpYTB0M3dnMi5qcGc=', 'Ly93dzMuc2luYWltZy5jbi9tdzEwMjQvMDA3M29iNlBneTFmdWpvNWdodGNiZzMwNnkwYW11MHkuZ2lm', 'Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am5teGpqdWdqMzExMTFqazRxcC5qcGc='];
            // var urllist = new Array()
            var content = '';
            for (hash in hashlist){
                var url = 'http:' + jandan_load_img(hashlist[hash]);
                // urllist[hash] = url;
                content += '<a href="'+url+'">'+url+'</a>';
                content += '<br>'
            }
            document.getElementById("content").innerHTML = content;
        }
    </script>
</head>
<body>
    <button onclick="get_url()">click here</button>
    <div id="content"></div>
</body>
</html>
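One detail in the listing above is worth pointing out: the last statement of the DECODE branch is `m = base64_decode(d)`, and `d` is the untouched input, so the value returned is simply the Base64 decoding of the hash and the key-mixing work before it is discarded. Under that reading, and only for the version of the function shown here, a Python equivalent of jandan_load_img() could be as short as the sketch below. The Python names are illustrative, and the site may of course change its scheme.

```python
import base64
import re

# Mirrors the regex in the JS listing: swap the size segment (e.g. mw600) for "large".
_SIZE_RE = re.compile(r'(//\w+\.sinaimg\.cn/)(\w+)(/.+\.(gif|jpg|jpeg))')


def jandan_load_img(img_hash):
    """Hypothetical Python port of the listed JS: Base64-decode, then request the large image."""
    # atob() tolerates missing "=" padding; b64decode does not, so restore it first.
    padded = img_hash + '=' * (-len(img_hash) % 4)
    decoded = base64.b64decode(padded).decode('utf-8')
    # JS replace() without the g flag rewrites only the first match, hence count=1.
    return _SIZE_RE.sub(r'\1large\3', decoded, count=1)


if __name__ == '__main__':
    # First entry of hashlist in the HTML listing.
    sample = 'Ly93eDQuc2luYWltZy5jbi9tdzYwMC8wMDc2QlNTNWx5MWZ1am93MDQyNGJqMzBpYTB0M3dnMi5qcGc='
    print('http:' + jandan_load_img(sample))
```

As in get_url(), the result is a protocol-relative address, so `http:` has to be prepended before downloading.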

md5.js:

/*
 * A JavaScript implementation of the RSA Data Security, Inc. MD5 Message
 * Digest Algorithm, as defined in RFC 1321.
 * Version 2.1 Copyright (C) Paul Johnston 1999 - 2002.
 * Other contributors: Greg Holt, Andrew Kepert, Ydnar, Lostinet
 * Distributed under the BSD License
 * See http://pajhome.org.uk/crypt/md5 for more info.
 */
/*
 * Configurable variables. You may need to tweak these to be compatible with
 * the server-side, but the defaults work in most cases.
 */
var hexcase = 0; /* hex output format. 0 - lowercase; 1 - uppercase  */
var b64pad = ""; /* base-64 pad character. "=" for strict RFC compliance */
var chrsz = 8; /* bits per input character. 8 - ASCII; 16 - Unicode  */
/*
 * These are the functions you'll usually want to call
 * They take string arguments and return either hex or base-64 encoded strings
 */
function hex_md5(s){ return binl2hex(core_md5(str2binl(s), s.length * chrsz));}
function b64_md5(s){ return binl2b64(core_md5(str2binl(s), s.length * chrsz));}
function str_md5(s){ return binl2str(core_md5(str2binl(s), s.length * chrsz));}
function hex_hmac_md5(key, data) { return binl2hex(core_hmac_md5(key, data)); }
function b64_hmac_md5(key, data) { return binl2b64(core_hmac_md5(key, data)); }
function str_hmac_md5(key, data) { return binl2str(core_hmac_md5(key, data)); }
/*
 * Perform a simple self-test to see if the VM is working
 */
function md5_vm_test()
{
 return hex_md5("abc") == "900150983cd24fb0d6963f7d28e17f72";
}
/*
 * Calculate the MD5 of an array of little-endian words, and a bit length
 */
function core_md5(x, len)
{
 /* append padding */
 x[len >> 5] |= 0x80 << ((len) % 32);
 x[(((len + 64) >>> 9) << 4) + 14] = len;
 var a = 1732584193;
 var b = -271733879;
 var c = -1732584194;
 var d = 271733878;
 for(var i = 0; i < x.length; i += 16)
 {
 var olda = a;
 var oldb = b;
 var oldc = c;
 var oldd = d;
 a = md5_ff(a, b, c, d, x[i+ 0], 7 , -680876936);
 d = md5_ff(d, a, b, c, x[i+ 1], 12, -389564586);
 c = md5_ff(c, d, a, b, x[i+ 2], 17, 606105819);
 b = md5_ff(b, c, d, a, x[i+ 3], 22, -1044525330);
 a = md5_ff(a, b, c, d, x[i+ 4], 7 , -176418897);
 d = md5_ff(d, a, b, c, x[i+ 5], 12, 1200080426);
 c = md5_ff(c, d, a, b, x[i+ 6], 17, -1473231341);
 b = md5_ff(b, c, d, a, x[i+ 7], 22, -45705983);
 a = md5_ff(a, b, c, d, x[i+ 8], 7 , 1770035416);
 d = md5_ff(d, a, b, c, x[i+ 9], 12, -1958414417);
 c = md5_ff(c, d, a, b, x[i+10], 17, -42063);
 b = md5_ff(b, c, d, a, x[i+11], 22, -1990404162);
 a = md5_ff(a, b, c, d, x[i+12], 7 , 1804603682);
 d = md5_ff(d, a, b, c, x[i+13], 12, -40341101);
 c = md5_ff(c, d, a, b, x[i+14], 17, -1502002290);
 b = md5_ff(b, c, d, a, x[i+15], 22, 1236535329);
 a = md5_gg(a, b, c, d, x[i+ 1], 5 , -165796510);
 d = md5_gg(d, a, b, c, x[i+ 6], 9 , -1069501632);
 c = md5_gg(c, d, a, b, x[i+11], 14, 643717713);
 b = md5_gg(b, c, d, a, x[i+ 0], 20, -373897302);
 a = md5_gg(a, b, c, d, x[i+ 5], 5 , -701558691);
 d = md5_gg(d, a, b, c, x[i+10], 9 , 38016083);
 c = md5_gg(c, d, a, b, x[i+15], 14, -660478335);
 b = md5_gg(b, c, d, a, x[i+ 4], 20, -405537848);
 a = md5_gg(a, b, c, d, x[i+ 9], 5 , 568446438);
 d = md5_gg(d, a, b, c, x[i+14], 9 , -1019803690);
 c = md5_gg(c, d, a, b, x[i+ 3], 14, -187363961);
 b = md5_gg(b, c, d, a, x[i+ 8], 20, 1163531501);
 a = md5_gg(a, b, c, d, x[i+13], 5 , -1444681467);
 d = md5_gg(d, a, b, c, x[i+ 2], 9 , -51403784);
 c = md5_gg(c, d, a, b, x[i+ 7], 14, 1735328473);
 b = md5_gg(b, c, d, a, x[i+12], 20, -1926607734);
 a = md5_hh(a, b, c, d, x[i+ 5], 4 , -378558);
 d = md5_hh(d, a, b, c, x[i+ 8], 11, -2022574463);
 c = md5_hh(c, d, a, b, x[i+11], 16, 1839030562);
 b = md5_hh(b, c, d, a, x[i+14], 23, -35309556);
 a = md5_hh(a, b, c, d, x[i+ 1], 4 , -1530992060);
 d = md5_hh(d, a, b, c, x[i+ 4], 11, 1272893353);
 c = md5_hh(c, d, a, b, x[i+ 7], 16, -155497632);
 b = md5_hh(b, c, d, a, x[i+10], 23, -1094730640);
 a = md5_hh(a, b, c, d, x[i+13], 4 , 681279174);
 d = md5_hh(d, a, b, c, x[i+ 0], 11, -358537222);
 c = md5_hh(c, d, a, b, x[i+ 3], 16, -722521979);
 b = md5_hh(b, c, d, a, x[i+ 6], 23, 76029189);
 a = md5_hh(a, b, c, d, x[i+ 9], 4 , -640364487);
 d = md5_hh(d, a, b, c, x[i+12], 11, -421815835);
 c = md5_hh(c, d, a, b, x[i+15], 16, 530742520);
 b = md5_hh(b, c, d, a, x[i+ 2], 23, -995338651);
 a = md5_ii(a, b, c, d, x[i+ 0], 6 , -198630844);
 d = md5_ii(d, a, b, c, x[i+ 7], 10, 1126891415);
 c = md5_ii(c, d, a, b, x[i+14], 15, -1416354905);
 b = md5_ii(b, c, d, a, x[i+ 5], 21, -57434055);
 a = md5_ii(a, b, c, d, x[i+12], 6 , 1700485571);
 d = md5_ii(d, a, b, c, x[i+ 3], 10, -1894986606);
 c = md5_ii(c, d, a, b, x[i+10], 15, -1051523);
 b = md5_ii(b, c, d, a, x[i+ 1], 21, -2054922799);
 a = md5_ii(a, b, c, d, x[i+ 8], 6 , 1873313359);
 d = md5_ii(d, a, b, c, x[i+15], 10, -30611744);
 c = md5_ii(c, d, a, b, x[i+ 6], 15, -1560198380);
 b = md5_ii(b, c, d, a, x[i+13], 21, 1309151649);
 a = md5_ii(a, b, c, d, x[i+ 4], 6 , -145523070);
 d = md5_ii(d, a, b, c, x[i+11], 10, -1120210379);
 c = md5_ii(c, d, a, b, x[i+ 2], 15, 718787259);
 b = md5_ii(b, c, d, a, x[i+ 9], 21, -343485551);
 a = safe_add(a, olda);
 b = safe_add(b, oldb);
 c = safe_add(c, oldc);
 d = safe_add(d, oldd);
 }
 return Array(a, b, c, d);
}
/*
 * These functions implement the four basic operations the algorithm uses.
 */
function md5_cmn(q, a, b, x, s, t)
{
 return safe_add(bit_rol(safe_add(safe_add(a, q), safe_add(x, t)), s),b);
}
function md5_ff(a, b, c, d, x, s, t)
{
 return md5_cmn((b & c) | ((~b) & d), a, b, x, s, t);
}
function md5_gg(a, b, c, d, x, s, t)
{
 return md5_cmn((b & d) | (c & (~d)), a, b, x, s, t);
}
function md5_hh(a, b, c, d, x, s, t)
{
 return md5_cmn(b ^ c ^ d, a, b, x, s, t);
}
function md5_ii(a, b, c, d, x, s, t)
{
 return md5_cmn(c ^ (b | (~d)), a, b, x, s, t);
}
/*
 * Calculate the HMAC-MD5, of a key and some data
 */
function core_hmac_md5(key, data)
{
 var bkey = str2binl(key);
 if(bkey.length > 16) bkey = core_md5(bkey, key.length * chrsz);
 var ipad = Array(16), opad = Array(16);
 for(var i = 0; i < 16; i++)
 {
 ipad[i] = bkey[i] ^ 0x36363636;
 opad[i] = bkey[i] ^ 0x5C5C5C5C;
 }
 var hash = core_md5(ipad.concat(str2binl(data)), 512 + data.length * chrsz);
 return core_md5(opad.concat(hash), 512 + 128);
}
/*
 * Add integers, wrapping at 2^32. This uses 16-bit operations internally
 * to work around bugs in some JS interpreters.
 */
function safe_add(x, y)
{
 var lsw = (x & 0xFFFF) + (y & 0xFFFF);
 var msw = (x >> 16) + (y >> 16) + (lsw >> 16);
 return (msw << 16) | (lsw & 0xFFFF);
}
/*
 * Bitwise rotate a 32-bit number to the left.
 */
function bit_rol(num, cnt)
{
 return (num << cnt) | (num >>> (32 - cnt));
}
/*
 * Convert a string to an array of little-endian words
 * If chrsz is ASCII, characters >255 have their hi-byte silently ignored.
 */
function str2binl(str)
{
 var bin = Array();
 var mask = (1 << chrsz) - 1;
 for(var i = 0; i < str.length * chrsz; i += chrsz)
 bin[i>>5] |= (str.charCodeAt(i / chrsz) & mask) << (i%32);
 return bin;
}
/*
 * Convert an array of little-endian words to a string
 */
function binl2str(bin)
{
 var str = "";
 var mask = (1 << chrsz) - 1;
 for(var i = 0; i < bin.length * 32; i += chrsz)
 str += String.fromCharCode((bin[i>>5] >>> (i % 32)) & mask);
 return str;
}
/*
 * Convert an array of little-endian words to a hex string.
 */
function binl2hex(binarray)
{
 var hex_tab = hexcase ? "0123456789ABCDEF" : "0123456789abcdef";
 var str = "";
 for(var i = 0; i < binarray.length * 4; i++)
 {
 str += hex_tab.charAt((binarray[i>>2] >> ((i%4)*8+4)) & 0xF) +
   hex_tab.charAt((binarray[i>>2] >> ((i%4)*8 )) & 0xF);
 }
 return str;
}
/*
 * Convert an array of little-endian words to a base-64 string
 */
function binl2b64(binarray)
{
 var tab = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 var str = "";
 for(var i = 0; i < binarray.length * 4; i += 3)
 {
 var triplet = (((binarray[i >> 2] >> 8 * ( i %4)) & 0xFF) << 16)
    | (((binarray[i+1 >> 2] >> 8 * ((i+1)%4)) & 0xFF) << 8 )
    | ((binarray[i+2 >> 2] >> 8 * ((i+2)%4)) & 0xFF);
 for(var j = 0; j < 4; j++)
 {
  if(i * 8 + j * 6 > binarray.length * 32) str += b64pad;
  else str += tab.charAt((triplet >> 6*(3-j)) & 0x3F);
 }
 }
 return str;
}
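If the decoding is ever ported to Python, md5.js does not need to be translated: the standard library's `hashlib` produces the same digest. A small sketch, checked against the same value that md5_vm_test() uses; the Python function below is a stand-in written for illustration, not part of md5.js.

```python
import hashlib


def hex_md5(s):
    """Stand-in for md5.js's hex_md5(): lowercase hex MD5 digest of a string."""
    # With chrsz = 8, md5.js hashes one byte per character. latin-1 mirrors that for
    # code points up to 255 and raises for anything higher, where md5.js would
    # silently drop the high byte.
    return hashlib.md5(s.encode('latin-1')).hexdigest()


# Same self-test as md5_vm_test() in md5.js.
assert hex_md5('abc') == '900150983cd24fb0d6963f7d28e17f72'
```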

If this post helped you, please leave me a comment. Thanks!
