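I recently picked up a lot of new things from some experts, and one of them is font-based anti-scraping. It is a fairly new anti-scraping technique: the site renders key pieces of information on the page with a custom font, so a scraper cannot read them directly. Take Maoyan as an example: I have personally seen at least five versions of its anti-scraping scheme, it keeps being updated, and scraping it keeps getting harder. This post records my attempt.
First of all, the scraper has to send request headers. Without them you get banned for a day (don't ask me how I know).
Even with headers, frequent requests still get you banned for a day, so build a pool of headers and attach a randomly chosen one to each request.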
import random

headers_pool = [
'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; XH; rv:8.578.498) fr, Gecko/20121021 Camino/8.723+ (Firefox compatible)',
'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)',
'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; Avant Browser; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
'Mozilla/5.0 (X11; U; UNICOS lcLinux; en-US) Gecko/20140730 (KHTML, like Gecko, Safari/419.3) Arora/0.8.0',
'Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)',
'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)'
]
# pick one User-Agent string at random; it is sent as {"User-Agent": headers}
headers = random.choice(headers_pool)
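Inspecting the page shows that values such as the rating cannot be read from the source. My guess was that these spots use the site's own encoding and that the code points cannot be decoded as utf-8. More precisely, the text itself decodes fine: the code points (for example U+E8DB) lie in Unicode's Private Use Area, so they have no standard glyph and only the site's custom font can draw them as digits.
Looking further at the page source, each of these characters is written as an HTML numeric character reference: &# starts the reference, and the x means the number that follows is hexadecimal.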
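To connect these references to the font, it helps to turn each one into the glyph name that fontTools reports (names of the form uniXXXX, like the ones in the mapping printed at the end of this post). The helper below is only a minimal sketch: entity_to_glyph_name is a name made up for illustration, and the uppercase four-digit format is an assumption based on the glyph names seen in this font.

```python
import html

def entity_to_glyph_name(entity):
    """Convert a hexadecimal character reference such as '&#xe8db;' to 'uniE8DB'."""
    char = html.unescape(entity)      # '&#xe8db;' becomes the single character U+E8DB
    return "uni%04X" % ord(char)      # format the code point as four uppercase hex digits

print(entity_to_glyph_name("&#xe8db;"))
```

The resulting name can then be looked up in whatever glyph-to-digit mapping has been built for the current font.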
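From related blog posts I learned that for sites like this the font file is declared in an @font-face rule inside a style tag.
Download the font file from the link it contains, for example: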
//vfile.meituan.net/colorstone/7069663754cc604f99f5b05d6791d33d2276.woff
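# soup is the BeautifulSoup object for the film page
stone_font = soup.select("style")[0]
# the font link is protocol-relative (it starts with //), so match that form
font_path = re.search(r"//[^\s\"')]+\.woff", str(stone_font)).group()
url = parse.urljoin("https://", font_path)
res = requests.get(url)
with open("font.woff", "wb") as f:
    f.write(res.content)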
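The URL extraction step can be tried without touching the network. The sketch below wraps it in a function and runs it on a hand-written style string; extract_font_url and the sample @font-face text are assumptions made for illustration, and only the .woff URL itself comes from the page.

```python
import re
from urllib import parse

def extract_font_url(style_text):
    """Return an absolute https URL for the first .woff link in a style block, or None."""
    match = re.search(r"//[^\s\"')]+\.woff", style_text)
    if match is None:
        return None
    # urljoin supplies the scheme for a protocol-relative URL
    return parse.urljoin("https://", match.group())

# made-up style text wrapped around the real font URL
sample = ("@font-face { src: url('//vfile.meituan.net/colorstone/"
          "7069663754cc604f99f5b05d6791d33d2276.woff') format('woff'); }")
print(extract_font_url(sample))
```

Returning None when nothing matches avoids the AttributeError that calling .group() on a failed search would raise.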
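The font file can then be opened with FontCreator, where you can see that the digits 0-9 have been given special code points. Here is the key point: Maoyan's font is called dynamic because the encoding changes all the time. Two visits return two different font files, and the same digit gets a different code point in each. What makes it nasty is that even the glyph shapes change slightly. That makes no difference to the human eye, but it makes it hard for a program to recognize glyphs that differ only in small details.
To analyze the font file further we need the Python library fontTools. Open the font file (.woff) with fontTools, identify and match the glyphs by eye in FontCreator, and you have the code points for the digits 0-9 in the first font.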
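from fontTools.ttLib import TTFont

base_woff = TTFont("font.woff")
# skip the first two glyph names, which are not digits
base_uni_list = base_woff.getGlyphOrder()[2:]
# digits matched by eye in FontCreator, in the same order as base_uni_list
base_val_list = ["0", "9", "8", "1", "6", "3", "5", "2", "4", "7"]
for uni1 in base_uni_list:
    obj1 = base_woff['glyf'][uni1]  # glyph object for this name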
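Because the glyph names and the hand-matched digits are in the same order, pairing them gives a lookup table for the base font. A minimal sketch follows; the glyph names here are hypothetical placeholders, not the names in the actual font.woff.

```python
# Hypothetical glyph names standing in for base_woff.getGlyphOrder()[2:]
base_uni_list = ["uniAAAA", "uniBBBB", "uniCCCC"]
# Digits matched by eye, in the same order as the glyph names
base_val_list = ["0", "9", "8"]

# zip pairs the two lists position by position
base_dic = dict(zip(base_uni_list, base_val_list))
print(base_dic)
```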
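Each obj1 in the loop is the glyph for one character. The first two entries in the glyph order are not digits we want, so they are dropped. Now refresh the page, download the new font file, and open it: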
stone_font = soup.select("style")[0]
# the font link is protocol-relative (it starts with //), so match that form
font_path = re.search(r"//[^\s\"')]+\.woff", str(stone_font)).group()
url = parse.urljoin("https://", font_path)
res = requests.get(url)
with open("font2.woff", "wb") as f:
    f.write(res.content)
current_woff = TTFont("font2.woff")
current_uni_list = current_woff.getGlyphOrder()[2:]
for uni2 in current_uni_list:
    obj2 = current_woff['glyf'][uni2]
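Reading the .coordinates attribute of a glyph object gives the glyph's list of points. obj1 is our reference set and obj2 is the set to be recognized. The first ten lists below are the obj1 glyphs and the last ten are the obj2 glyphs.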
[(43, 333), (43, 468), (69, 544), (103, 624), (148, 667), (202, 704), (282, 716), (398, 710), (459, 617), (487, 574), (504, 508), (509, 447), (509, 335), (520, 271), (514, 213), (508, 166), (502, 126), (466, 46), (362, -39), (282, -39), (176, -39), (115, 37), (38, 127), (55, 335), (135, 350), (135, 154), (166, 95), (218, 36), (282, 35), (343, 35), (375, 96), (429, 155), (428, 515), (385, 576), (344, 629), (281, 635), (212, 635), (181, 569), (148, 507), (135, 335)]
[(142, 151), (159, 89), (181, 71), (215, 35), (264, 35), (307, 35), (349, 54), (368, 66), (389, 106), (408, 137), (422, 191), (428, 231), (432, 245), (435, 273), (435, 300), (435, 308), (435, 313), (434, 319), (409, 276), (360, 249), (309, 223), (266, 223), (168, 223), (43, 354), (43, 462), (50, 574), (108, 655), (173, 710), (273, 710), (343, 710), (463, 633), (494, 551), (524, 489), (524, 356), (509, 214), (494, 131), (450, 49), (403, 4), (343, -39), (262, -39), (175, -39), (122, 9), (67, 57), (56, 144), (425, 469), (425, 544), (383, 587), (341, 635), (284, 635), (223, 635), (135, 539), (131, 459), (135, 387), (177, 355), (220, 301), (344, 301), (425, 391), (425, 466)]
[(181, 371), (71, 412), (71, 521), (71, 614), (128, 656), (185, 710), (378, 710), (436, 654), (495, 583), (495, 524), (499, 412), (386, 371), (454, 349), (495, 300), (524, 251), (524, 184), (509, 88), (458, 25), (391, -39), (174, -39), (108, 25), (42, 86), (31, 186), (42, 257), (114, 354), (163, 524), (163, 470), (189, 439), (230, 406), (283, 406), (331, 414), (369, 439), (402, 470), (402, 567), (368, 601), (334, 626), (283, 635), (229, 650), (197, 603), (163, 570), (134, 186), (134, 146), (141, 111), (169, 75), (207, 55), (242, 35), (284, 35), (301, 46), (343, 46), (369, 57), (390, 77), (432, 112), (432, 183), (432, 248), (346, 332), (274, 332), (217, 332), (175, 290), (121, 249)]
[(381, -25), (292, -26), (292, 550), (259, 516), (206, 486), (153, 468), (106, 426), (120, 539), (186, 570), (244, 625), (271, 644), (292, 660), (324, 674), (330, 713), (381, 710), (382, -32)]
[(420, 521), (408, 562), (386, 597), (349, 635), (296, 635), (250, 635), (220, 612), (177, 580), (154, 522), (148, 449), (128, 407), (134, 352), (161, 408), (207, 425), (254, 449), (306, 449), (390, 449), (459, 382), (522, 316), (522, 211), (522, 143), (493, 83), (453, 24), (412, -7), (360, -39), (293, -39), (180, -27), (110, 43), (39, 124), (39, 317), (48, 530), (117, 626), (195, 710), (301, 710), (388, 710), (443, 661), (498, 614), (519, 528), (142, 211), (142, 166), (175, 122), (182, 79), (232, 57), (253, 35), (292, 35), (347, 35), (389, 81), (421, 127), (438, 206), (430, 273), (379, 326), (349, 370), (288, 370), (216, 370), (184, 326), (142, 282)]
[(128, 180), (148, 103), (190, 69), (221, 35), (276, 35), (340, 31), (384, 79), (427, 123), (427, 250), (387, 290), (345, 329), (284, 331), (257, 331), (220, 321), (230, 400), (236, 400), (243, 397), (245, 399), (303, 399), (394, 459), (394, 522), (394, 571), (330, 635), (274, 635), (222, 635), (187, 604), (153, 558), (142, 504), (52, 520), (76, 612), (127, 661), (186, 719), (272, 710), (332, 710), (371, 683), (433, 659), (487, 569), (487, 520), (501, 462), (462, 432), (447, 394), (386, 371), (451, 356), (523, 262), (523, 190), (523, 109), (466, 31), (383, -40), (276, -33), (179, -40), (110, 18), (46, 75), (43, 167), (134, 180)]
[(137, 173), (147, 103), (222, 35), (278, 35), (343, 35), (433, 136), (433, 226), (433, 292), (385, 341), (345, 387), (276, 380), (232, 380), (198, 366), (164, 341), (143, 297), (58, 320), (120, 697), (494, 697), (494, 611), (202, 611), (160, 414), (217, 436), (264, 460), (300, 460), (396, 460), (463, 398), (528, 328), (519, 223), (514, 124), (470, 51), (399, -39), (286, -39), (186, -39), (126, 17), (56, 74), (43, 170), (143, 173)]
[(515, 60), (515, -26), (31, -26), (41, 6), (42, 24), (52, 62), (66, 86), (89, 110), (100, 134), (121, 158), (179, 212), (208, 245), (278, 294), (358, 370), (380, 400), (422, 455), (422, 508), (415, 558), (384, 598), (345, 635), (284, 635), (219, 622), (180, 597), (133, 559), (140, 488), (48, 498), (43, 603), (108, 653), (180, 710), (391, 710), (453, 648), (514, 608), (514, 506), (514, 460), (480, 381), (436, 330), (416, 307), (381, 274), (361, 242), (298, 201), (270, 167), (231, 152), (206, 121), (182, 97), (173, 85), (163, 79), (156, 60)]
[(331, -13), (336, 161), (-1, 149), (16, 246), (347, 713), (421, 701), (412, 232), (534, 231), (511, 163), (416, 149), (421, -26), (331, -18), (331, 232), (336, 552), (101, 246), (321, 232)]
[(48, 625), (48, 696), (523, 698), (523, 627), (486, 590), (441, 546), (404, 491), (388, 429), (347, 367), (297, 241), (276, 175), (252, 83), (246, -26), (151, -11), (152, 17), (162, 69), (169, 114), (173, 183), (209, 306), (278, 419), (310, 478), (374, 572), (402, 624), (49, 611)]
[(515, 60), (515, -26), (31, -26), (31, 6), (42, 36), (41, 62), (66, 86), (80, 110), (121, 158), (147, 185), (218, 245), (278, 294), (333, 332), (358, 370), (385, 400), (422, 457), (422, 508), (422, 566), (384, 598), (345, 635), (284, 635), (219, 635), (181, 596), (141, 559), (151, 488), (48, 498), (73, 603), (110, 656), (180, 710), (391, 710), (514, 593), (514, 500), (514, 467), (497, 420), (480, 381), (436, 330), (429, 307), (381, 274), (347, 242), (298, 201), (261, 167), (245, 144), (206, 121), (183, 122), (182, 104), (173, 74), (164, 73), (140, 60), (515, 57)]
[(410, 521), (408, 574), (386, 597), (349, 650), (296, 635), (253, 635), (220, 612), (177, 580), (154, 522), (134, 492), (121, 449), (128, 407), (128, 352), (161, 406), (207, 425), (257, 449), (306, 449), (395, 449), (522, 316), (527, 219), (522, 143), (493, 83), (463, 12), (412, -7), (360, -39), (293, -39), (180, -39), (39, 124), (39, 317), (39, 530), (117, 626), (185, 710), (301, 710), (388, 710), (443, 661), (498, 614), (510, 533), (420, 521), (142, 209), (142, 166), (182, 79), (226, 57), (253, 35), (292, 35), (347, 50), (389, 81), (426, 127), (430, 206), (430, 281), (395, 321), (344, 372), (288, 370), (228, 370), (180, 326), (142, 291), (142, 211)]
[(374, -26), (280, -26), (288, 547), (265, 529), (206, 486), (156, 462), (125, 439), (112, 526), (190, 545), (234, 617), (258, 637), (292, 662), (303, 686), (325, 720), (367, 719), (382, -22)]
[(137, 173), (147, 103), (185, 69), (219, 44), (278, 35), (343, 35), (388, 98), (433, 136), (439, 214), (428, 292), (399, 336), (345, 380), (276, 392), (232, 380), (198, 361), (159, 332), (143, 310), (58, 320), (129, 689), (494, 697), (494, 622), (202, 611), (162, 414), (195, 437), (264, 460), (305, 450), (396, 460), (463, 394), (539, 328), (519, 223), (528, 134), (465, 51), (399, -34), (278, -39), (177, -39), (115, 17), (65, 74), (43, 166)]
[(48, 615), (48, 698), (523, 707), (523, 627), (495, 580), (461, 541), (420, 491), (383, 429), (344, 367), (338, 295), (297, 241), (278, 175), (252, 83), (244, -26), (151, -26), (152, 17), (161, 69), (169, 109), (185, 183), (217, 310), (265, 406), (296, 478), (341, 525), (374, 583), (418, 611), (48, 596)]
[(134, 151), (154, 89), (215, 35), (264, 35), (307, 35), (368, 73), (408, 137), (422, 191), (428, 217), (435, 273), (435, 296), (435, 308), (435, 299), (434, 319), (409, 276), (360, 249), (315, 223), (259, 220), (168, 223), (43, 354), (43, 570), (108, 641), (173, 710), (288, 710), (343, 710), (463, 633), (508, 562), (525, 489), (524, 356), (524, 214), (464, 49), (403, 4), (343, -32), (262, -29), (175, -39), (122, 9), (67, 57), (56, 144), (142, 151), (425, 466), (421, 544), (389, 589), (341, 635), (290, 635), (223, 648), (179, 586), (135, 539), (135, 459), (135, 387), (185, 345), (220, 301), (344, 301), (384, 345), (428, 391)]
[(133, 173), (148, 103), (185, 69), (221, 43), (276, 35), (340, 35), (398, 79), (427, 123), (427, 250), (387, 290), (345, 331), (284, 331), (257, 331), (228, 321), (230, 395), (236, 400), (239, 399), (245, 399), (303, 399), (348, 428), (408, 459), (394, 522), (394, 571), (346, 604), (319, 635), (274, 635), (222, 635), (153, 572), (142, 504), (52, 520), (69, 605), (127, 660), (188, 724), (272, 710), (332, 700), (383, 683), (447, 664), (460, 614), (487, 569), (487, 520), (487, 481), (462, 432), (436, 394), (386, 371), (451, 356), (487, 313), (523, 262), (523, 190), (523, 95), (453, 27), (397, -52), (276, -42), (184, -40), (116, 18), (52, 75), (43, 167), (133, 180)]
[(43, 335), (41, 468), (69, 544), (95, 624), (202, 710), (282, 710), (398, 722), (459, 617), (487, 574), (520, 442), (509, 349), (520, 271), (514, 222), (508, 166), (494, 126), (466, 55), (401, 4), (362, -39), (282, -49), (190, -49), (115, 29), (43, 127), (135, 335), (135, 154), (218, 35), (282, 35), (341, 29), (385, 96), (418, 155), (428, 335), (428, 515), (385, 561), (344, 635), (281, 643), (218, 635), (181, 583), (135, 515), (135, 348)]
[(323, -26), (335, 151), (22, 149), (14, 230), (332, 703), (416, 707), (435, 227), (525, 233), (520, 149), (421, 155), (412, -41), (331, -25), (331, 232), (331, 551), (96, 232), (331, 241)]
[(181, 371), (71, 412), (71, 521), (71, 603), (185, 710), (378, 710), (436, 647), (495, 596), (495, 519), (481, 412), (386, 373), (454, 349), (489, 291), (523, 251), (524, 184), (524, 98), (458, 25), (379, -52), (283, -39), (174, -39), (108, 25), (42, 86), (42, 186), (42, 257), (114, 354), (181, 386), (163, 524), (173, 470), (197, 439), (230, 406), (284, 406), (333, 406), (369, 439), (402, 470), (402, 567), (334, 635), (283, 642), (229, 635), (163, 570), (134, 198), (134, 140), (166, 111), (169, 75), (220, 55), (242, 35), (284, 35), (316, 35), (369, 57), (390, 77), (432, 117), (432, 248), (346, 332), (281, 332), (232, 330), (175, 290), (148, 249), (134, 188)]
You can see that obj1[0] and obj2[7] are very similar, yet not identical. Pulling those two out:
[(43, 333), (43, 468), (69, 544), (103, 624), (148, 667), (202, 704), (282, 716), (398, 710), (459, 617), (487, 574), (504, 508), (509, 447), (509, 335), (520, 271), (514, 213), (508, 166), (502, 126), (466, 46), (362, -39), (282, -39), (176, -39), (115, 37), (38, 127), (55, 335), (135, 350), (135, 154), (166, 95), (218, 36), (282, 35), (343, 35), (375, 96), (429, 155), (428, 515), (385, 576), (344, 629), (281, 635), (212, 635), (181, 569), (148, 507), (135, 335)]
[(43, 335), (41, 468), (69, 544), (95, 624), (202, 710), (282, 710), (398, 722), (459, 617), (487, 574), (520, 442), (509, 349), (520, 271), (514, 222), (508, 166), (494, 126), (466, 55), (401, 4), (362, -39), (282, -49), (190, -49), (115, 29), (43, 127), (135, 335), (135, 154), (218, 35), (282, 35), (341, 29), (385, 96), (418, 155), (428, 335), (428, 515), (385, 561), (344, 635), (281, 643), (218, 635), (181, 583), (135, 515), (135, 348)]
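The data is very similar but slightly different, and the two lists also differ in length (40 points versus 38). In other words, the number of points used to draw the same digit varies too, which makes judging similarity awkward. The solutions in the blog posts I found all seem to target earlier versions: they compare glyph objects directly, or they rely on matching glyphs having the same number of points. I did not find a more complete approach. Here I use the k-nearest neighbors algorithm (which I only just learned). The results are decent but not perfect, so take this as a reference.
First, we only have one training set, and every item in it is a class of its own, so k is 1. I wonder whether adding a few more training sets would help, but that is for later.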
import numpy as np
from fontTools.ttLib import TTFont

obj1 = TTFont("font.woff")
obj1_uni = obj1.getGlyphOrder()[2:]
# replace each (x, y) point with x + y, so every glyph becomes a flat list
obj1_points = [[x + y for x, y in obj1['glyf'][name].coordinates] for name in obj1_uni]
obj2 = TTFont("font2.woff")
obj2_uni = obj2.getGlyphOrder()[2:]
obj2_points = [[x + y for x, y in obj2['glyf'][name].coordinates] for name in obj2_uni]
Each point has two components, and I could not see how to put that into the matrix here, so I add them up and use the sum of the x and y coordinates to stand in for the point.
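One consequence of this choice is worth seeing concretely: different points can share the same sum, so some shape information is lost. The sketch below uses made-up points. flatten_points is a hypothetical alternative, not what this post uses, that keeps both coordinates by laying them out as x1, y1, x2, y2 and so on.

```python
def sum_points(points):
    """The approach used in this post: one value (x + y) per point."""
    return [x + y for x, y in points]

def flatten_points(points):
    """Hypothetical alternative: keep x and y as separate values."""
    return [value for point in points for value in point]

a = [(100, 200), (50, 60)]
b = [(200, 100), (60, 50)]   # mirrored points, a different outline

print(sum_points(a) == sum_points(b))          # the sums cannot tell them apart
print(flatten_points(a) == flatten_points(b))  # the flattened lists can
```

I have not checked whether keeping both coordinates actually improves recognition on Maoyan's fonts.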
# length of the glyph with the most points, across both fonts
len_max = max(len(points) for points in obj1_points + obj2_points)
# pad every glyph in place with zeros up to that length
for points in obj1_points + obj2_points:
    points.extend([0] * (len_max - len(points)))
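This finds the length of the glyph with the most points and pads every other glyph with zeros to that length (data initialization). Next, the k-nearest neighbors algorithm finds, for each unknown point set, the most similar point set whose digit we already know.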
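The padding step on its own looks like this on a few short made-up lists. pad_to_same_length is a name chosen for this sketch; unlike the in-place extend used in this post, it returns new lists.

```python
def pad_to_same_length(rows, fill=0):
    """Return copies of the rows, right-padded with `fill` to the longest row's length."""
    len_max = max(len(row) for row in rows)
    return [row + [fill] * (len_max - len(row)) for row in rows]

rows = [[285, 243, 250], [306, 251], [293]]
padded = pad_to_same_length(rows)
print(padded)
```

After padding, every row has the same length, so the rows can be stacked into one NumPy array.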
# knn has to be defined before the loop that calls it
def knn(trainData, testData):
    rowSize = trainData.shape[0]
    diff = np.tile(testData, (rowSize, 1)) - trainData
    sqrDiff = diff ** 2
    sqrDiffSum = sqrDiff.sum(axis=1)
    distances = sqrDiffSum ** 0.5
    sortDistance = distances.argsort()
    return sortDistance[0]

a = np.array(obj1_points)
for i in range(len(obj2_uni)):
    b = np.array(obj2_points[i])
    x = knn(a, b)
    print(x)
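np.array(): builds an array (matrix)
array.shape[0]: number of rows in the matrix
np.tile(testData, (rowSize, 1)) - trainData: repeats testData rowSize times along the row axis, then subtracts trainData
diff ** 2: squares every element
sqrDiff.sum(axis=1): sums the values (axis=1 sums across each row)
sqrDiffSum ** 0.5: takes the square root of each sum
distances.argsort(): returns the indices that would sort the values in ascending order (the index of the smallest value comes first)
sortDistance[0]: the index of the smallest distance (that is, the index of the nearest point set)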
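To see these steps work together, here is the same function (slightly condensed) run on a tiny made-up training set. The numbers are invented for illustration and have nothing to do with the real glyph data.

```python
import numpy as np

def knn(trainData, testData):
    rowSize = trainData.shape[0]
    # repeat the test vector once per training row, then subtract
    diff = np.tile(testData, (rowSize, 1)) - trainData
    distances = ((diff ** 2).sum(axis=1)) ** 0.5   # Euclidean distance to each row
    return distances.argsort()[0]                  # index of the nearest row

train = np.array([[0, 0, 0],
                  [10, 10, 10],
                  [3, 4, 0]])
test = np.array([9, 11, 10])

nearest = knn(train, test)
print(nearest)
```

The test vector differs from the second training row by only (1, 1, 0), so that row has the smallest distance and its index is returned.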
The result:
7, 4, 3, 6, 9, 5, 5, 0, 8, 2
It should be:
7, 4, 3, 6, 9, 1, 5, 0, 8, 2
Let's look at the data:
obj2[5], obj2[6]:
[285, 243, 250, 299, 342, 441, 545, 613, 645, 708, 731, 743, 734, 753, 685, 609, 538, 479, 391, 397, 613, 749, 883, 998, 1053, 1096, 1070, 1014, 880, 738, 513, 407, 311, 233, 136, 131, 124, 200, 293, 891, 965, 978, 976, 925, 871, 765, 674, 594, 522, 530, 521, 645, 729, 819, 0, 0, 0, 0, 0]
[306, 251, 254, 264, 311, 375, 477, 550, 677, 677, 676, 615, 588, 549, 625, 636, 638, 644, 702, 776, 867, 916, 965, 950, 954, 909, 857, 725, 646, 572, 674, 787, 912, 982, 1032, 1066, 1111, 1074, 1056, 1007, 968, 894, 830, 757, 807, 800, 785, 713, 618, 480, 345, 234, 144, 134, 127, 210, 313, 0, 0]
obj1[1], obj1[5]:
[293, 248, 252, 250, 299, 342, 403, 434, 495, 545, 613, 659, 677, 708, 735, 743, 748, 753, 685, 609, 532, 489, 391, 397, 505, 624, 763, 883, 983, 1053, 1096, 1045, 1013, 880, 723, 625, 499, 407, 304, 223, 136, 131, 124, 200, 894, 969, 970, 976, 919, 858, 674, 590, 522, 532, 521, 645, 816, 891, 0]
[308, 251, 259, 256, 311, 371, 463, 550, 677, 677, 674, 615, 588, 541, 630, 636, 640, 644, 702, 853, 916, 965, 965, 909, 857, 791, 711, 646, 572, 688, 788, 905, 982, 1042, 1054, 1092, 1056, 1007, 963, 894, 841, 757, 807, 785, 713, 632, 497, 343, 243, 139, 128, 121, 210, 314, 0, 0, 0, 0, 0]
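By eye, obj2[5] is clearly closer to obj1[1]. But obj2[5] and obj1[5] have the same number of points, so their padded tails are both zero and add nothing to the k-nearest-neighbors distance, which keeps the total small. obj2[5] and obj1[1] differ by four points, and the values near the end are large, so those positions contribute a very large distance. Even though the rest of the values are close and contribute little, the total distance ends up larger and the answer is wrong.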
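The effect can be reproduced with a few made-up numbers. In the sketch below, unknown and same_digit stand for the same shape drawn with different numbers of points, while other is a different shape that happens to have the same number of points as unknown. All values are invented for illustration.

```python
import numpy as np

def distance(u, v):
    """Euclidean distance between two equally long sequences."""
    return float(np.sqrt(((np.array(u) - np.array(v)) ** 2).sum()))

unknown    = [300, 250, 900, 0, 0]      # 3 real values, 2 padding zeros
same_digit = [300, 250, 600, 900, 880]  # same shape drawn with 5 points
other      = [500, 600, 400, 0, 0]      # different shape, but also 3 points

print(distance(unknown, same_digit))
print(distance(unknown, other))
```

The padding zeros in unknown are compared against large real values in same_digit, so the distance to the correct match ends up larger than the distance to the wrong one.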
Full code:
import requests
from bs4 import BeautifulSoup
import random
from urllib import parse
import re
from fontTools.ttLib import TTFont
import numpy as np
def downloader():
    url = "https://maoyan.com/films/344264"
    headers_pool = [
        'Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; XH; rv:8.578.498) fr, Gecko/20121021 Camino/8.723+ (Firefox compatible)',
        'Mozilla/5.0 (X11; U; Linux i686; nl; rv:1.8.1b2) Gecko/20060821 BonEcho/2.0b2 (Debian-1.99+2.0b2+dfsg-1)',
        'Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0; Avant Browser; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)',
        'Mozilla/5.0 (X11; U; UNICOS lcLinux; en-US) Gecko/20140730 (KHTML, like Gecko, Safari/419.3) Arora/0.8.0',
        'Mozilla/5.0 (compatible; MSIE 9.0; AOL 9.7; AOLBuild 4343.19; Windows NT 6.1; WOW64; Trident/5.0; FunWebProducts)',
        'Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0; Acoo Browser 1.98.744; .NET CLR 3.5.30729)'
    ]
    # send a randomly chosen User-Agent with both requests
    headers = random.choice(headers_pool)
    res = requests.get(url=url, headers={"User-Agent": headers})
    with open("maoyan.html", "wb") as f:
        f.write(res.content)
    soup = BeautifulSoup(res.text, "html.parser")
    stone_font = soup.select("style")[0]
    # the font link is protocol-relative (it starts with //), so match that form
    font_path = re.search(r"//[^\s\"')]+\.woff", str(stone_font)).group()
    url = parse.urljoin("https://", font_path)
    res = requests.get(url, headers={"User-Agent": headers})
    with open("font2.woff", "wb") as f:
        f.write(res.content)
def knn(trainData, testData):
    rowSize = trainData.shape[0]
    diff = np.tile(testData, (rowSize, 1)) - trainData
    sqrDiff = diff ** 2
    sqrDiffSum = sqrDiff.sum(axis=1)
    distances = sqrDiffSum ** 0.5
    sortDistance = distances.argsort()
    return sortDistance[0]
def prepare():
    obj1 = TTFont("font.woff")
    obj1_uni = obj1.getGlyphOrder()[2:]
    obj1_points = [0] * len(obj1_uni)
    for i in range(len(obj1_uni)):
        obj1_points[i] = list(obj1['glyf'][obj1_uni[i]].coordinates)
        for index in range(len(obj1_points[i])):
            obj1_points[i][index] = sum(obj1_points[i][index])
    obj2 = TTFont("font2.woff")
    obj2_uni = obj2.getGlyphOrder()[2:]
    obj2_points = [0] * len(obj2_uni)
    for i in range(len(obj2_uni)):
        obj2_points[i] = list(obj2['glyf'][obj2_uni[i]].coordinates)
        for index in range(len(obj2_points[i])):
            obj2_points[i][index] = sum(obj2_points[i][index])
    len_max = 0
    for i in range(len(obj2_uni)):
        if len(obj2_points[i]) > len_max: len_max = len(obj2_points[i])
    for i in range(len(obj1_uni)):
        if len(obj1_points[i]) > len_max: len_max = len(obj1_points[i])
    for i in range(len(obj2_uni)):
        obj2_points[i].extend([0] * (len_max - len(obj2_points[i])))
    for i in range(len(obj1_uni)):
        obj1_points[i].extend([0] * (len_max - len(obj1_points[i])))
    return obj1_points, obj2_points
def create_dic(obj1_points, obj2_points):
    base_woff = TTFont("font.woff")
    base_uni_list = base_woff.getGlyphOrder()[2:]
    base_val_list = ["0", "9", "8", "1", "6", "3", "5", "2", "4", "7"]
    # base_dic = dict(zip(base_uni_list, base_val_list))
    current_woff = TTFont("font2.woff")
    current_uni_list = current_woff.getGlyphOrder()[2:]
    mapping_dic = {}
    a = np.array(obj1_points)
    for i in range(len(obj2_points)):
        b = np.array(obj2_points[i])
        index = knn(a, b)
        mapping_dic[current_uni_list[i]] = base_val_list[index]
    return mapping_dic
def change_html():
    pass

if __name__ == "__main__":
    downloader()
    obj1_points, obj2_points = prepare()
    mapping_dic = create_dic(obj1_points, obj2_points)
    print(mapping_dic)
This time the result came out right again:
{'uniE8DB': '4', 'uniF691': '5', 'uniE101': '9', 'uniF072': '7', 'uniEDAF': '3', 'uniEBB2': '0', 'uniE826': '6', 'uniEF4E': '1', 'uniF574': '8', 'uniEEEC': '2'}
Very satisfying, but there is no guarantee it succeeds 100% of the time.
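change_html is left empty in the full code. As a rough idea of what it could do, the sketch below substitutes the recognized digits back into the page text. apply_mapping, the sample markup, and its class name are assumptions made for illustration; the three mapping entries are taken from the result above. It also assumes the text still contains the raw &#x...; references, as res.text does before any parsing.

```python
import re

def apply_mapping(html_text, mapping_dic):
    """Replace references like '&#xe8db;' with the digit mapped to 'uniE8DB'."""
    def replace(match):
        glyph_name = "uni" + match.group(1).upper()
        # leave the reference untouched if the glyph is not in the mapping
        return mapping_dic.get(glyph_name, match.group(0))
    return re.sub(r"&#x([0-9a-fA-F]{4});", replace, html_text)

mapping_dic = {"uniE8DB": "4", "uniF691": "5", "uniE101": "9"}
sample = '<span class="stonefont">&#xe101;.&#xf691;</span>'
print(apply_mapping(sample, mapping_dic))
```

Any reference that is missing from the mapping stays as it was, which makes recognition gaps easy to spot in the output.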
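So the most important problem is still that the same glyph can be drawn with different numbers of points, and I have not yet come up with a suitable solution. If you can solve it, please leave a comment; advice is welcome.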
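One direction that might be worth trying is to resample every glyph's value list to a fixed length by linear interpolation instead of padding with zeros, so that no position is compared against padding. This is only a sketch under that assumption: resample is a hypothetical helper, the inputs are made up, and I have not tested whether it fixes the misclassification on Maoyan's fonts.

```python
import numpy as np

def resample(values, size):
    """Linearly interpolate a sequence to exactly `size` values."""
    old_positions = np.linspace(0.0, 1.0, num=len(values))
    new_positions = np.linspace(0.0, 1.0, num=size)
    return np.interp(new_positions, old_positions, values)

short = [0, 10]       # a line described with 2 values
long = [0, 5, 10]     # the same line described with 3 values

print(resample(short, 5))
print(resample(long, 5))
```

Both inputs describe the same straight line, and after resampling they produce the same five values, which is the behavior zero padding cannot give. Real glyphs with several contours may need each contour handled separately.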
References: http://www.reibang.com/p/79c4272c0969
https://blog.csdn.net/weixin_41861700/article/details/103108239
https://www.cnblogs.com/lyuzt/p/10471617.html