Continuing from the previous post, we arrive at level 5. Address: http://www.heibanke.com/lesson/crawler_ex04/
The interface is still familiar, except that there is now a captcha.
image.png
Clearly this level is mainly about getting past the captcha. First, fill in a few random characters and click submit; the page reports a wrong password. Press F12 and look at the request:
image.png
You can see that 5 parameters are submitted, with captcha_0 and captcha_1 being new. captcha_1 is the captcha text I just typed in, so what is captcha_0? Let's look at the page source:
image.png
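It looks like captcha_0 is a value generated dynamically by the backend, presumably used on the server side to match the captcha. That makes no difference to us, though: we just read this value from the page and submit it as-is.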
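As a rough illustration of "read the value and submit it as-is", here is a minimal sketch that uses only the standard library. The HTML snippet is a made-up stand-in for the real page: only the id `id_captcha_0` and the `captcha` image class come from the script later in this post, while the hash value, the image path, and the token placeholder are invented. `CaptchaFieldParser` is a hypothetical helper name.

```python
from html.parser import HTMLParser
from urllib import parse

# Made-up stand-in for the form on the real page; the hash and path are invented.
SAMPLE_HTML = '''
<form method="post">
  <img src="/captcha/image/abc123/" class="captcha" alt="captcha">
  <input id="id_captcha_0" name="captcha_0" type="hidden" value="abc123">
  <input id="id_captcha_1" name="captcha_1" type="text">
</form>
'''


class CaptchaFieldParser(HTMLParser):
    """Collects the hidden captcha_0 value and the captcha image path."""

    def __init__(self):
        super().__init__()
        self.captcha_0 = None
        self.image_src = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'input' and attrs.get('id') == 'id_captcha_0':
            self.captcha_0 = attrs.get('value')
        elif tag == 'img' and 'captcha' in (attrs.get('class') or '').split():
            self.image_src = attrs.get('src')


parser = CaptchaFieldParser()
parser.feed(SAMPLE_HTML)

# The five parameters seen in the F12 request, ready to be POSTed.
form = {
    'csrfmiddlewaretoken': 'token-from-cookie',  # placeholder value
    'username': 'pkxutao',
    'password': 0,
    'captcha_0': parser.captcha_0,               # echoed back unchanged
    'captcha_1': 'ABCD',                         # whatever the image shows
}
body = parse.urlencode(form).encode('utf-8')
print(parser.captcha_0, parser.image_src)
print(body)
```

The point is that captcha_0 never needs to be understood, only copied from the page that served the image into the next request.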
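The page as a whole is very simple; the hard part is recognizing the captcha. Once the captcha can be read, the password can simply be tried one value at a time (the author has already hinted that the password consists only of digits). For recognition I used pillow + pytesseract, but the recognition rate turned out to be very low. I don't know why; maybe I'm using it wrong? Figuring the author wouldn't pick a very hard password, I decided to just type the captchas in by hand (who am I kidding, it was the only option I had left).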
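One way to live with an unreliable OCR step is to accept its answer only when it looks plausible and fall back to typing otherwise. The following is a minimal sketch under stated assumptions: `plausible_captcha` and `choose_code` are hypothetical helpers, the expected length of 4 and the alphanumeric alphabet are guesses rather than anything the site documents, and the raw text would come from something like `pytesseract.image_to_string(img)`.

```python
import re


def plausible_captcha(raw, expected_len=4):
    """Return a cleaned OCR guess, or None if it does not look like a captcha.

    expected_len=4 is an assumption for illustration, not a documented fact.
    """
    # OCR output often carries spaces, newlines, and stray punctuation.
    cleaned = re.sub(r'[^0-9A-Za-z]', '', raw or '')
    return cleaned if len(cleaned) == expected_len else None


def choose_code(ocr_text, ask=input):
    """Use the OCR guess when it is plausible, otherwise ask a human."""
    guess = plausible_captcha(ocr_text)
    return guess if guess is not None else ask('captcha> ').strip()


print(plausible_captcha(' ab c d\n'))  # abcd
print(plausible_captcha('a?b'))        # None
```

This does not make the OCR any more accurate; it only avoids wasting a request on a guess that is obviously malformed.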
The code is as follows:
# -*- coding: utf-8 -*-
from io import BytesIO
from urllib import parse, request

import pytesseract  # only needed for the commented-out OCR attempt below
from bs4 import BeautifulSoup
from PIL import Image


def get_page(url, params):
    """POST params to url and return the response body as text."""
    print('get url %s' % url)
    data = parse.urlencode(params).encode('utf-8')
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      r'Chrome/45.0.2454.85 Safari/537.36 115Browser/6.0.3',
        'Connection': 'keep-alive',
        'Cookie': r'Hm_lvt_74e694103cf02b31b28db0a346da0b6b=1514366315; csrftoken=1yFgXVZtw2rACmTYDGABYKs9VWLWqbeH; sessionid=m4paft1uuvhm3thrwvdgwut2rvu8uz8d; Hm_lpvt_74e694103cf02b31b28db0a346da0b6b=1514428404',
        # The header name is "Referer" (the original had "Refer" and an earlier level's URL)
        'Referer': url
    }
    req = request.Request(url, data, headers=header)
    page = request.urlopen(req).read()
    page = page.decode('utf-8')
    return page


count = 0
url = "http://www.heibanke.com/lesson/crawler_ex04/"
token = '1yFgXVZtw2rACmTYDGABYKs9VWLWqbeH'
username = 'pkxutao'
# Build the POST parameters
data = {
    'csrfmiddlewaretoken': token,
    'username': username,
    'password': -1
}
h3 = ''
# The "恭喜" (congratulations) check is actually redundant, because the loop body detects success itself
while "恭喜" not in h3:
    data['password'] = count
    # Each POST carries the captcha answer typed during the previous iteration.
    # The very first request has no captcha fields; it only fetches a fresh captcha.
    has_captcha = 'captcha_1' in data
    result = get_page(url, data)
    soup = BeautifulSoup(result, "html.parser")
    # First get captcha_0
    temp = soup.find_all('input', id='id_captcha_0')
    if len(temp) == 0:
        # No captcha field in the response, so the password was correct
        break
    captcha_0 = temp[0]['value']
    data['captcha_0'] = captcha_0
    # Then get captcha_1, i.e. the captcha text itself
    captcha = soup.find_all('img', class_='captcha')[0]['src']
    resp = request.urlopen('http://www.heibanke.com' + captcha)
    img = Image.open(BytesIO(resp.read()))
    # Show the captcha and read it from the keyboard
    img.show()
    code = input()
    print('Captcha entered: %s' % code)
    # I meant to recognize it automatically with pytesseract here,
    # but the recognition rate was far too low, so I switched to manual input
    # img = img.convert('L')
    # img.show()
    # code = pytesseract.image_to_string(img)
    # print('Captcha recognized: %s' % code)
    data['captcha_1'] = code
    h3 = soup.find_all("h3")[0].text
    print(h3)
    # Move to the next password only if a captcha was submitted and not rejected.
    # '验证码输入错误' ("captcha entered incorrectly") must match the site's own message text.
    if has_captcha and '验证码输入错误' not in h3:
        count += 1
print("Level cleared, the password is %s" % count)
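The loop above interleaves submitting and parsing, which makes the control flow a little hard to follow. The same idea can be sketched with the network call and the captcha reader passed in as functions, so it can be tried offline. Everything here is hypothetical (`crack`, `submit`, `read_captcha`, and the fake server), and the status strings `'ok'`, `'bad_captcha'`, and `'bad_password'` stand in for the checks on page text in the real script.

```python
def crack(submit, read_captcha, max_password=100):
    """Try numeric passwords in order; retry the same one if the captcha was wrong.

    submit(password, code) -> 'ok' | 'bad_captcha' | 'bad_password'
    read_captcha() -> str   (a human typing, or OCR)
    """
    password = 0
    while password <= max_password:
        status = submit(password, read_captcha())
        if status == 'ok':
            return password
        if status == 'bad_password':
            password += 1        # captcha was accepted, so move on
        # on 'bad_captcha' the same password is retried with a fresh captcha
    return None


def make_fake_server(secret, captcha_answers):
    """Offline stand-in for the site: checks the captcha first, then the password."""
    answers = iter(captcha_answers)

    def submit(password, code):
        if code != next(answers):
            return 'bad_captcha'
        return 'ok' if password == secret else 'bad_password'

    return submit


# The simulated human mistypes every third captcha.
typed = iter(['good', 'good', 'oops'] * 20)
submit = make_fake_server(secret=22, captcha_answers=['good'] * 60)
print(crack(submit, lambda: next(typed)))  # 22
```

Keeping the retry rule in one place makes it explicit that a rejected captcha must not advance the password counter.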
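The pytesseract captcha-recognition code is commented out; try it if you feel like experimenting. The password I ended up with was 22, and logging in on the page with it: bingo! But... it turns out this is the last level. A pity, this game was actually quite fun.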