1. Background
- Environment
Operating system: Windows 10
Python version: Python 3.6
Scrapy version: Scrapy 1.5.1
The goal of this post is to demonstrate logging in with scrapy.FormRequest in the wei.py file; most of the explanation lives in the code comments.
2. Code
- Project directory structure
(project tree screenshot omitted)
- wei.py
# -*- coding: utf-8 -*-
import scrapy
import json


class WeiSpider(scrapy.Spider):
    name = 'wei'
    allowed_domains = ['weibo.cn']
    # start_urls = ['http://weibo.cn/']
    # Once the engine feeds start_urls to the scheduler, the downloader issues
    # GET requests for them. Since we need to send a POST request instead,
    # start_urls must be commented out.

    # def parse(self, response):
    #     pass

    # Override this method
    def start_requests(self):
        # This method is called before the downloader starts issuing requests,
        # so here we can intercept the default request and change how it is sent.
        login_url = "https://passport.weibo.cn/sso/login"  # URL of the POST login endpoint
        # Form data submitted with the POST request
        data = {
            'username': 'USERNAME',
            'password': 'PASSWORD',
            'savestate': '1',
            'r': 'https://weibo.cn/?luicode=20000174',
            'ec': '0',
            'pagerefer': 'https://weibo.cn/pub/?vt=',
            'entry': 'mweibo',
            'wentry': '',
            'loginfrom': '',
            'client_id': '',
            'code': '',
            'qq': '',
            'mainpageflag': '1',
            'hff': '',
            'hfp': ''
        }
        yield scrapy.FormRequest(url=login_url, formdata=data, callback=self.parse_login)

    def parse_login(self, response):
        # print(response.text)
        # Check whether the login succeeded
        if json.loads(response.text)["retcode"] == 20000000:
            print("Login succeeded!")
            # Visit the home page
            main_url = "https://weibo.cn/?since_id=0&max_id=H0moBsJrC&prev_page=1&page=1"
            yield scrapy.Request(url=main_url, callback=self.parse_info)
        else:
            print("Login failed!")

    def parse_info(self, response):
        print(response.text)
        """ Parse the response here
        xxxxxxxxx
        """
- settings.py
# -*- coding: utf-8 -*-
# Scrapy settings for Weibo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Weibo'
SPIDER_MODULES = ['Weibo.spiders']
NEWSPIDER_MODULE = 'Weibo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 2
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Session (cookie) handling is enabled by default in Scrapy; it can be switched off here
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
DEFAULT_REQUEST_HEADERS = {
    'Accept': '*/*',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Content-Type': 'application/x-www-form-urlencoded',
    'Connection': 'keep-alive',
    # 'Host': 'passport.weibo.cn',  # Must stay commented out: once a fixed Host header is set,
    # every subsequent request gets the host part of its URL redirected to that host
    'Origin': 'https://passport.weibo.cn',
    'Referer': 'https://passport.weibo.cn/signin/login?entry=mweibo&r=https%3A%2F%2Fweibo.cn%2F%3Fluicode%3D20000174&backTitle=%CE%A2%B2%A9&vt='
}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Weibo.middlewares.WeiboSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Weibo.middlewares.WeiboDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'Weibo.pipelines.WeiboPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
In the settings file we mainly configure the request headers. After a successful login, Scrapy keeps the session cookies by default; if you don't want them kept, that can be configured in the settings file.
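Both of the settings below are standard Scrapy options: the first turns cookie handling off entirely, the second logs every cookie exchanged, which is handy for verifying that the login session is actually being carried over:

# COOKIES_ENABLED = False  # turn off the cookies middleware (this also breaks the logged-in
#                          # session, so only do it if you attach cookies to requests yourself)
COOKIES_DEBUG = True       # log every Cookie / Set-Cookie header while the middleware is on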
微博會(huì)對(duì)一個(gè)太過(guò)頻繁訪問(wèn)的用戶進(jìn)行凍結(jié)奠旺,而這時(shí),我們可以使用很多個(gè)賬戶進(jìn)行登錄施流,再將登錄后的Cookies保存到Cookies池(推介使用Redis數(shù)據(jù)庫(kù)來(lái)存儲(chǔ)我們的Cookie池)并進(jìn)行存儲(chǔ)并定時(shí)檢測(cè)响疚,拋出檢測(cè)后過(guò)期或不可用的Cookies。
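A minimal sketch of such a pool, assuming a local Redis instance and the redis-py client (pip install redis); the key name cookies:pool and the helper names are invented for illustration:

import json
import random
import redis

r = redis.StrictRedis(host='localhost', port=6379, db=0)
POOL_KEY = 'cookies:pool'  # hypothetical hash: account -> cookies JSON

def save_cookies(account, cookies):
    # Store one logged-in account's cookies in the pool
    r.hset(POOL_KEY, account, json.dumps(cookies))

def random_cookies():
    # Pick a random account's cookies for the next request
    accounts = r.hkeys(POOL_KEY)
    return json.loads(r.hget(POOL_KEY, random.choice(accounts))) if accounts else None

def drop_cookies(account):
    # Evict an account whose cookies a periodic check found expired or banned
    r.hdel(POOL_KEY, account)

A spider would then pass the result of random_cookies() to the cookies= argument of scrapy.Request, so each crawl runs under a different account.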