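A recent project needed to translate some English articles automatically, so I searched Jianshu for a small Youdao Translate crawler and found an article, "Cracking Youdao Translate's Anti-Crawler Mechanism".
Probably because Youdao has changed things on its side, a few details in that article no longer work, so I correct them here.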
Open http://fanyi.youdao.com/ in a browser, then right-click -> Inspect -> Network tab. After typing beautiful into the translation box, we can see the request come through.
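From this we can already pin down quite a lot:
URL:
http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule
The request method is POST.
Request parameters
After trying a few other values, we found that only these parameters actually change:
i : the word or sentence to translate
salt : the salt used for the signature
sign : the signature string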
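Then, following the original author's approach, I dug into that bewildering JS file and finally found the signing code.
Good, no surprises: it matches what the author described.
Only the secret key has changed, to ebSeFb%=XZ%T[KZ)c(sy!
What I am curious about is how Youdao rotates the key, and what its strategy is.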
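The salt and signature described above can be sketched as follows. This is a minimal sketch that mirrors the formula used in the complete code at the end of this post (MD5 over client + text + salt + key); the helper names make_salt and make_sign are mine, and the key is the one quoted above, which Youdao may well have changed again.

```python
import hashlib
import random
import time

CLIENT = "fanyideskweb"               # value of the "client" form field
SECRET_KEY = "ebSeFb%=XZ%T[KZ)c(sy!"  # key quoted above; may be outdated


def make_salt():
    """Millisecond timestamp plus a small random offset, as a string."""
    return str(int(time.time() * 1000) + random.randint(1, 10))


def make_sign(content, salt, client=CLIENT, secret_key=SECRET_KEY):
    """MD5 hex digest of client + content + salt + secret_key."""
    raw = client + content + salt + secret_key
    return hashlib.md5(raw.encode("utf-8")).hexdigest()


salt = make_salt()
sign = make_sign("beautiful", salt)
print(salt, sign)
```

Because the salt is part of the hashed string, the same salt value has to be sent in the form data alongside the sign.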
I first took the code from the author's article, swapped in the new key, and ran it as is, BUT, BUT, BUT it did not work:
{'errorCode': 50}
Time to calm down and think. Perhaps the problem is that no headers were sent; Youdao probably has some simple anti-crawler check. Try again:
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "Cookie": "_ntes_nnid=c686062b6d8c9e3f11e2a8413b5bb9a8,1517022642199; OUTFOX_SEARCH_USER_ID_NCOO=1367486017.479911; OUTFOX_SEARCH_USER_ID=722357816@10.168.11.24; DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; JSESSIONID=abcCzqE6R9jTv5rTtoWgw; fanyi-ad-id=40789; fanyi-ad-closed=1; ___rl__test__cookies=1519344925194",
    "Referer": "http://fanyi.youdao.com/",
    "X-Requested-With": "XMLHttpRequest"
}
Running it still fails:
Desktop python3 aa.py
Enter the word to translate: hello
Traceback (most recent call last):
  File "aa.py", line 94, in <module>
    print(YouDaoFanYi().translate(content))
  File "aa.py", line 77, in translate
    dictResult = json.loads(result)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 349, in loads
    s = s.decode(detect_encoding(s), 'surrogatepass')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
What is going on? Printing the response body made things even more puzzling:
b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03\xabV*)J\xcc+\xceI,I\rJ-.\xcd)Q\xb2\x8a\x8e\xaeV*I\x072\x94\x9e\xec]\xf0t\xe9^%\x1d\xa5\xe2\xa2d 7#5''_\xa966VG)\xb5\xa8(\xbf\xc89?%U\xc9\xca@G\xa9\xa4\xb2\x00\xc8PJ\xcd3\xaa\xca\xd0u\xf6\x08\x06\xe9\xc8M,*\x81\x99X\r\x94*)\xcaL-\x06\x1a\xae\x04\x94\xcc\xd3Sx\xb1p\xc5\xf3%\xbb^N_\xf7\xb4a\xe6\xfb==\n\xcf\x9a\xbb\x9e.m\x7f\xd61\xed\xe9\x94%/\xb6n\x7f\xb6y\xc5\xb3\x96\xfeg\xd3\xb7=\x9f\xd5\xf2|\xca\x8a\x17\xeb\xd7\xc6\x14\xc5\xe4\x01\xb5f\xe6\x95\xe8)<\x9d\xd6\xf4~\xcf\xec\xa7\x93;\x9e\xef\x9d\x0e\x15\x07\x1a\xa9\xe1\x01r\x9f\xe6\x93]\xbb\x9eN\xe8\x05\xcak<\xdb<U\xf3\xe9\xfc\xe6g[f\x83\x15\x01\x9d\rq\xa8am-\x00\x9f\x1b\xb6\x04\xf7\x00\x00\x00"
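I kept inspecting and looking for the cause, and then spotted the detail that made it click: the response is gzip-compressed. The body above starts with \x1f\x8b, the gzip magic number, and our request advertises Accept-Encoding: gzip.
That should be the heart of the problem: python3's urllib does not decompress gzip response bodies for us, so we have to do it ourselves.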
def readData(resp):
    # urllib already undoes chunked transfer encoding, so only gzip needs handling here.
    encoding = resp.info().get('Content-Encoding')
    if encoding != 'gzip':
        return resp.read().decode("utf-8")
    # Use one decompressor for the whole stream; 16 + MAX_WBITS selects the gzip format.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    raw = b""
    while True:
        chunk = resp.read(4096)
        if not chunk:
            break
        raw += decomp.decompress(chunk)
    raw += decomp.flush()
    # Decode once at the end so multi-byte characters are never split across chunks.
    return raw.decode("utf-8")
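Chunked gzip decompression like this can be checked offline, without calling Youdao at all. The following is a rough sketch under my own assumptions: gunzip_stream is a hypothetical helper name, the payload is made-up demo data, and io.BytesIO stands in for the HTTP response object.

```python
import gzip
import io
import json
import zlib


def gunzip_stream(fileobj, chunk_size=4096):
    """Read a gzip stream chunk by chunk and return the decompressed bytes."""
    # A single decompressor must see every chunk, because gzip state
    # carries over from one chunk to the next.
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = bytearray()
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        out += decomp.decompress(chunk)
    out += decomp.flush()
    return bytes(out)


# Made-up demo payload, compressed the way a gzip-encoded response body would be.
payload = json.dumps({"demo": "x" * 10000}).encode("utf-8")
body = gzip.compress(payload)

print(body[:2])  # the two gzip magic bytes
restored = gunzip_stream(io.BytesIO(body), chunk_size=64)
print(restored == payload)
```

The tiny chunk size forces many reads, which is exactly the case where creating a fresh decompressor per chunk would break.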
Good, that should do it. Try once more:
Desktop python3 aa.py
Enter the word to translate: beautiful
['美麗的']
Done, everything works now.
Finally, here is the complete code:
# -*- coding: utf-8 -*-
import urllib.request
import urllib.parse
import json
import zlib
import time
import random
import hashlib
class YouDaoFanYi:
    def __init__(self):
        self.url = "http://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule"
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.109 Safari/537.36",
            "Accept": "application/json, text/javascript, */*; q=0.01",
            "Accept-Encoding": "gzip, deflate",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
            "Cookie": "_ntes_nnid=c686062b6d8c9e3f11e2a8413b5bb9a8,1517022642199; OUTFOX_SEARCH_USER_ID_NCOO=1367486017.479911; OUTFOX_SEARCH_USER_ID=722357816@10.168.11.24; DICT_UGC=be3af0da19b5c5e6aa4e17bd8d90b28a|; JSESSIONID=abcCzqE6R9jTv5rTtoWgw; fanyi-ad-id=40789; fanyi-ad-closed=1; ___rl__test__cookies=1519344925194",
            "Referer": "http://fanyi.youdao.com/",
            "X-Requested-With": "XMLHttpRequest"
        }
        self.data = {
            "from": "AUTO",
            "to": "AUTO",
            "smartresult": "dict",
            "client": "fanyideskweb",
            "doctype": "json",
            "version": "2.1",
            "keyfrom": "fanyi.web",
            "action": "FY_BY_REALTIME",
            "typoResult": "false"
        }
        self.client = 'fanyideskweb'
        self.secretKey = 'ebSeFb%=XZ%T[KZ)c(sy!'

    @staticmethod
    def readData(resp):
        # urllib already undoes chunked transfer encoding, so only gzip needs handling here.
        encoding = resp.info().get('Content-Encoding')
        if encoding != 'gzip':
            return resp.read().decode("utf-8")
        # Use one decompressor for the whole stream; 16 + MAX_WBITS selects the gzip format.
        decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)
        raw = b""
        while True:
            chunk = resp.read(4096)
            if not chunk:
                break
            raw += decomp.decompress(chunk)
        raw += decomp.flush()
        # Decode once at the end so multi-byte characters are never split across chunks.
        return raw.decode("utf-8")

    def translate(self, content):
        data = dict(self.data)
        # salt: millisecond timestamp plus a small random offset
        salt = str(int(time.time() * 1000) + random.randint(1, 10))
        # sign: MD5 of client + text + salt + secret key
        sign = hashlib.md5((self.client + content + salt + self.secretKey).encode('utf-8')).hexdigest()
        data["client"] = self.client
        data["salt"] = salt
        data["sign"] = sign
        data["i"] = content
        data = urllib.parse.urlencode(data).encode('utf-8')
        request = urllib.request.Request(url=self.url, data=data, headers=self.headers, method='POST')
        with urllib.request.urlopen(request) as response:
            result = YouDaoFanYi.readData(response)
        dictResult = json.loads(result)
        paragraphs = []
        for paragraph in dictResult["translateResult"]:
            line = ""
            for a in paragraph:
                line = line + a["tgt"]
            if line:
                paragraphs.append(line)
        return paragraphs


# Youdao Translate supports multi-paragraph text, so translate returns a list whose
# elements are the translated paragraphs: as many paragraphs come out as went in.
content = input('Enter the word to translate: ').replace("\\n", "\n")
print(YouDaoFanYi().translate(content))
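One loose end: translate assumes the response always contains translateResult, yet earlier we got back {'errorCode': 50} with no such key, which would surface as a bare KeyError. A slightly more defensive version of the parsing step might look like the sketch below. The function name extract_paragraphs and the sample JSON are my own illustrations, not part of Youdao's API, and I make no claim about what each error code means.

```python
import json


def extract_paragraphs(result_text):
    """Turn the JSON response text into a list of translated paragraphs."""
    result = json.loads(result_text)
    if "translateResult" not in result:
        # Show the whole response (for example {'errorCode': 50}) instead of a KeyError.
        raise RuntimeError("translation failed: %r" % (result,))
    paragraphs = []
    for paragraph in result["translateResult"]:
        # Each paragraph is a list of segments; join their "tgt" fields.
        line = "".join(part["tgt"] for part in paragraph)
        if line:
            paragraphs.append(line)
    return paragraphs


# Hand-written sample in the same shape the code above expects.
sample = '{"translateResult": [[{"tgt": "A"}, {"tgt": "B"}], [{"tgt": "C"}]]}'
print(extract_paragraphs(sample))
```

With this in place, a rejected request reports the server's answer directly, which makes problems like a stale key or missing headers easier to spot.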