知乎新版登陸已經(jīng)增加了驗(yàn)證碼,此文章不再適用
感謝簡(jiǎn)書作者Andrew_liu
提供的思路,雖然知乎改版后古程,該文章上提供的方法已經(jīng)失效Python爬蟲(七)--Scrapy模擬登錄
利用Scrapy提供的cookie中間價(jià)很容易做到網(wǎng)頁的模擬登陸,下面就來介紹怎么利用這個(gè)cookie中間件來登陸知乎礁鲁。
### Preliminary Analysis

Open https://www.zhihu.com and use Chrome DevTools to locate the login form, as shown in the figure below.

The form contains a hidden `_xsrf` field, and Scrapy's XPath support makes its value easy to grab: `//div[@data-za-module="SignInForm"]//form//input[@name="_xsrf"]/@value`.

Next, switch to the Network tab in DevTools, enter your account and password, click the login button, and capture the POST request produced by the login, as shown below:
Note: because Zhihu redirects to a new page after a successful login, when capturing packets in the Network tab you must stop recording quickly after logging in (the round button in the upper-left of the figure above: red while recording, black when stopped), otherwise the login request vanishes from the list. Ticking the "Preserve log" checkbox also keeps requests across the redirect.
點(diǎn)擊這個(gè)鏈接,確認(rèn)這個(gè)鏈接就是提交登陸的url
從FormData里面可以看到https://www.zhihu.com/email/login就是登陸POST請(qǐng)求的url格了,需要提交4個(gè)參數(shù)看铆,其中_xsrf就是首頁可以獲取到的隱藏表單參數(shù),remember_me是是否記住cookie的開關(guān)盛末,email和password對(duì)應(yīng)賬號(hào)和密碼
此處可能有不一樣的地方弹惦,因?yàn)槲业闹踬~號(hào)是email注冊(cè)的,根據(jù)這個(gè)url的特征推測(cè)別的賬號(hào)類型可能存在不一樣的Url
### ****編寫蜘蛛代碼
1.繼承CrawlSpider,并重寫spider的start_request方法悄但,讓spier先訪問登錄頁再去爬取start_urls中的鏈接棠隐,在start_requests方法中,讓spider先去訪問知乎首頁檐嚣,去獲取隱藏的表單項(xiàng)_xsrf
```python
def start_requests(self):
    return [Request("https://www.zhihu.com/", headers=self.headers,
                    meta={"cookiejar": 1}, callback=self.post_login)]
```
The headers must be customized because Zhihu blocks spiders, presumably by checking the User-Agent. You can change the spider's default User-Agent in settings.py, or define your own the way I did:
```python
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
    "Connection": "keep-alive",
    "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36",
    "Referer": "http://www.zhihu.com"
}
```
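If you prefer the settings.py route mentioned above, a minimal sketch looks like this (USER_AGENT and DEFAULT_REQUEST_HEADERS are standard Scrapy settings; the values are simply lifted from this spider's headers):

```python
# settings.py -- set project-wide defaults instead of passing headers per request
USER_AGENT = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) "
              "AppleWebKit/537.36 (KHTML, like Gecko) "
              "Chrome/48.0.2564.97 Safari/537.36")
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    "Referer": "http://www.zhihu.com",
}
```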
The `cookiejar` key in `meta` is the keyword understood by Scrapy's cookies middleware (see the Scrapy documentation for details). Because only one cookie session needs to be kept here, we simply pass 1. Note that the 1 does not mean "one cookie"; it is merely a key, used later to look up the cookies saved in that cookiejar.
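Since the value is just a key, distinct keys give you distinct, isolated cookie sessions. A minimal sketch of what that enables (not part of the original spider; `self.accounts` is a hypothetical list of credentials):

```python
def start_requests(self):
    # One independent cookie session per account: each distinct
    # cookiejar key gets its own cookie store in the middleware.
    for i, account in enumerate(self.accounts):  # hypothetical attribute
        yield Request("https://www.zhihu.com/", headers=self.headers,
                      meta={"cookiejar": i, "account": account},
                      callback=self.post_login)
```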
2.解析首頁內(nèi)容,獲取到_xsrf的值岩臣,同時(shí)提交登錄請(qǐng)求:
```python
def post_login(self, response):
    self.log("preparing login...")
    xsrf = Selector(response).xpath(
        '//div[@data-za-module="SignInForm"]//form//input[@name="_xsrf"]/@value'
    ).extract()[0]
    self.log(xsrf)
    return FormRequest("https://www.zhihu.com/login/email",
                       meta={'cookiejar': response.meta['cookiejar']},
                       headers=self.headers,
                       formdata={
                           '_xsrf': xsrf,
                           'password': 'xxxxxxx',
                           'email': 'xxx@gmail.com',
                           'remember_me': 'true',
                       },
                       callback=self.after_login)
```
- 將登錄成功后獲取到的cookie傳遞給每一個(gè)start_urls中鏈接的ruquest
```python
def after_login(self, response):
    for url in self.start_urls:
        yield Request(url, meta={'cookiejar': 1}, headers=self.headers)
```
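To guard against silent login failures, you could inspect the login response before crawling anything. A sketch under assumptions: the endpoint returned JSON at the time, but the exact reply shape (an `r` field where 0 means success) is an assumption here, not something confirmed by this post:

```python
import json

def after_login(self, response):
    result = json.loads(response.body)
    if result.get("r") != 0:  # assumed success flag; adjust to the real reply
        self.log("login failed: %s" % response.body)
        return
    for url in self.start_urls:
        yield Request(url, meta={'cookiejar': 1}, headers=self.headers)
```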
4. Cookies held in the cookiejar are not automatically attached to every request, so links extracted through a Rule also need the cookies added by hand. Use the Rule's `process_request` parameter to rebuild each Request with the cookies attached:
```python
rules = (
    Rule(SgmlLinkExtractor(allow=(r'/question/\d*')), process_request="request_question"),
)
```
同時(shí)提供request_question函數(shù)
```python
def request_question(self, request):
    return Request(request.url, meta={'cookiejar': 1},
                   headers=self.headers, callback=self.parse_question)
```
5.由于已經(jīng)有了process_link ,Rule中的callback參數(shù)就不再起作用了,而是調(diào)用新構(gòu)造的Request中的callback函數(shù)辟躏。
```python
def parse_question(self, response):
    sel = Selector(response)
    item = zhihuItem()
    item['qestionTitle'] = sel.xpath("//div[@id='zh-question-title']//h2/text()").extract_first()
    item['image_urls'] = sel.xpath("//img[@class='origin_image zh-lightbox-thumb lazy']/@data-original").extract()
    return item
```
這個(gè)parse_question方法僅僅是獲取問題名稱和問題下面的所有圖片鏈接谷扣。
### Complete Code
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import Selector
from scrapy.http import Request, FormRequest
from MySpider.items import zhihuItem


class zhihuSpider(CrawlSpider):
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    start_urls = ["https://www.zhihu.com/collection/38624707"]
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'/question/\d*')), process_request="request_question"),
    )
    headers = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
        "Accept-Encoding": "gzip,deflate",
        "Accept-Language": "en-US,en;q=0.8,zh-TW;q=0.6,zh;q=0.4",
        "Connection": "keep-alive",
        "Content-Type": "application/x-www-form-urlencoded; charset=UTF-8",
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.97 Safari/537.36",
        "Referer": "http://www.zhihu.com"
    }

    def start_requests(self):
        # Visit the homepage first to pick up the hidden _xsrf form field.
        return [Request("https://www.zhihu.com/", headers=self.headers,
                        meta={"cookiejar": 1}, callback=self.post_login)]

    def post_login(self, response):
        self.log("preparing login...")
        xsrf = Selector(response).xpath(
            '//div[@data-za-module="SignInForm"]//form//input[@name="_xsrf"]/@value'
        ).extract()[0]
        self.log(xsrf)
        # Submit the login form, carrying the cookies from the homepage visit.
        return FormRequest("https://www.zhihu.com/login/email",
                           meta={'cookiejar': response.meta['cookiejar']},
                           headers=self.headers,
                           formdata={
                               '_xsrf': xsrf,
                               'password': 'xxxxxxx',
                               'email': 'xxx@gmail.com',
                               'remember_me': 'true',
                           },
                           callback=self.after_login)

    def after_login(self, response):
        # Hand the logged-in cookies to every start URL.
        for url in self.start_urls:
            yield Request(url, meta={'cookiejar': 1}, headers=self.headers)

    def request_question(self, request):
        # Rebuild each Rule-extracted request so it carries the cookies too.
        return Request(request.url, meta={'cookiejar': 1},
                       headers=self.headers, callback=self.parse_question)

    def parse_question(self, response):
        sel = Selector(response)
        item = zhihuItem()
        item['qestionTitle'] = sel.xpath("//div[@id='zh-question-title']//h2/text()").extract_first()
        item['image_urls'] = sel.xpath("//img[@class='origin_image zh-lightbox-thumb lazy']/@data-original").extract()
        return item
```
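With the item class and settings in place, run the spider from the project root with `scrapy crawl zhihu`.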