This article continues from the previous introductory post on crawling. That post already made good use of libraries such as requests and BeautifulSoup, along with the simpler approaches commonly used for crawlers. The goal of this one is to get properly acquainted with each component of the Scrapy framework and to use Scrapy to crawl Weibo.
Before diving in, here is an overview of Scrapy's architecture:
1. Engine: the core of the framework; it triggers the data flow and coordinates all the other components.
2. Scheduler: accepts the requests sent by the engine and enqueues them.
3. Downloader: takes the Requests handed over by the scheduler and returns the downloaded page content (the Response) for the spider.
4. Spider: the main body of your code; it holds the crawl logic and the parsing rules for the pages.
5. Item Pipeline: mainly responsible for cleaning and storing the scraped data.
6. Downloader Middlewares: a hook framework sitting between the engine and the downloader, processing the requests and responses that pass between them.
7. Spider Middlewares: a hook framework sitting between the engine and the spider, processing the responses fed into the spider as well as the items and new requests it produces.
Environment
Windows 8, Python 3, PyCharm, Scrapy, MongoDB, and the PyMongo library.
The later posts on distributed crawling and CAPTCHA handling will also use Redis and PIL; they can all be installed up front as shown below.
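Assuming pip is used, the third-party packages can be installed in one go (Pillow is the maintained fork that provides PIL):
pip install scrapy pymongo redis pillow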
1. Create a Scrapy project
- Open PyCharm's Terminal and run:
scrapy startproject weibo
* Name the spider to create and specify the site it will crawl:
scrapy genspider weibospider weibo.cn
The first argument is the spider's name, the second is the base_url it is allowed to visit.
At this point a weibospider.py file appears under the project's spiders directory; opening it shows a skeleton roughly like the one below.
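This is the default template generated by genspider (the exact template can differ slightly between Scrapy versions):
# -*- coding: utf-8 -*-
import scrapy


class WeibospiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['weibo.cn']
    start_urls = ['http://weibo.cn/']

    def parse(self, response):
        pass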
2. The Scrapy Selector
The previous article used BeautifulSoup and PyQuery; Scrapy ships with its own data-extraction tool, the Selector, which is built on lxml and supports both XPath and CSS selectors. The examples below use the page from the official documentation.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
In PyCharm's Terminal, run scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
This drops you into a Python shell with the request already made; response.text prints the HTML.
The main Selector usages are summarized below:
- result = response.selector.xpath('//a'). The selector part can be omitted; result = response.xpath('//a') is equivalent. This selects every a node under the root.
- result.xpath('./img') selects the img nodes under those a nodes. Without the leading dot the expression starts from the root: ./img extracts relative to each a node, whereas //img would extract from the root.
- The results above are Selector objects. How do you turn them into strings held in a list? Call result.xpath('./img').extract(); after that you can use ordinary list operations.
- How do you get the text inside a tag? result.xpath('//a/text()').extract(): just add text().
- How do you get an attribute value?
result.xpath('//a/@href').extract() returns every href. But what if you only want the node with one particular attribute value?
result.xpath('//a[@href="image1.html"]').extract()
Note that the inner quotes must be double quotes, otherwise an error is raised. Also, extract() is not recommended for fetching a single element; use extract_first() instead:
result.xpath('//a[@href="image1.html"]').extract_first()
Want the text? Add text():
result.xpath('//a[@href="image1.html"]/text()').extract_first()
Next, the CSS selectors that Selector also supports.
How do you select the a tags as above, and get them back as a list?
result.css('a').extract()
How do you select the img children of the a tags? A simple space is enough:
result.css('a img').extract()
How do you select the child tag of an a tag with a specific attribute?
result.css('a[href="image1.html"] img').extract_first()
How do you get an attribute value or the text content?
result.css('a[href="image1.html"] img::attr(src)').extract_first()
result.css('a[href="image1.html"]::text').extract_first()
Besides the CSS and XPath selectors, regular-expression matching is also provided via re(), and re_first() returns only the first match.
Note that result cannot be used with re() or re_first() directly, otherwise an error is raised; they must be chained after css() or xpath(), as in the example below.
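For instance, the link texts on the sample page look like "Name: My image 1", so the names can be pulled out with a capture group (still inside the same scrapy shell session):
response.xpath('//a/text()').re(r'Name:\s*(.*)')
response.xpath('//a/text()').re_first(r'Name:\s*(.*)')
The first call returns a list of plain strings, the second just the first of them.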
3. The main Scrapy components in detail
- The Spider module
The Spider is the most important and most fundamental module of the framework: it holds the main logic for crawling and parsing. Its cycle, or the way it operates, is as follows:
1. Build the initial Requests from the starting URLs and attach a callback. When a Request succeeds, the resulting Response is generated and passed to that callback as its argument.
2. Inside the callback, analyse the returned content. There are two kinds of output: either a dict or an Item object holding the extracted result, which can be processed and saved directly; or the next link parsed out of the page, from which you construct a new Request with its own callback.
3. If a dict or Item is returned and a pipeline is configured, the pipeline processes and stores it.
4. If a Request is returned, then once it completes its Response is passed to the new callback defined on that Request; there you can again use the Selector described above to parse the content and build Items from it.
The most commonly used class is scrapy.spiders.Spider, which provides the start_requests() method. Its attributes and methods are listed here, with a minimal example after the list.
Attributes: name, the spider's name; allowed_domains, the domains it is allowed to crawl; start_urls, the starting URLs.
Methods:
1. start_requests(): generates the initial requests and must return an iterable. By default it builds Requests from the URLs in start_urls using GET; to issue a POST instead, use FormRequest.
2. parse(): the default callback, invoked whenever a Response has no callback specified.
3. closed(): called when the spider is closed.
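A minimal sketch tying these together (the URL and form data are placeholders, not part of the Weibo project):
import scrapy
from scrapy import FormRequest


class DemoSpider(scrapy.Spider):
    name = 'demo'                      # unique spider name
    allowed_domains = ['example.com']  # off-domain links are filtered out
    start_urls = ['http://example.com/']

    def start_requests(self):
        # the default implementation issues GET requests for start_urls;
        # override it like this when the site expects a POST
        for url in self.start_urls:
            yield FormRequest(url, formdata={'key': 'value'}, callback=self.parse)

    def parse(self, response):
        # default callback when a Request does not name one
        yield {'title': response.xpath('//title/text()').extract_first()}

    def closed(self, reason):
        # called once when the spider shuts down
        self.logger.info('Spider closed: %s', reason)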
* The Downloader Middleware module
As the diagram at the start of the article shows, when the Scheduler takes a Request off the queue and sends it to the Downloader, the Request passes through the Downloader Middleware; and when the Downloader finishes the download and the Response travels back towards the Spider, it passes through this module as well.
The module has two main methods.
1. process_request()
process_request() is called before the Scheduler's Request reaches the Downloader, i.e. anywhere between the Request being pulled off the queue and the download being executed. You can use it to modify the request, for example to set the User-Agent or cookies. For a single fixed User-Agent, though, it is simpler to set it in settings.py:
USER_AGENT = 'XXXXXXX'
2. process_response()
After the Downloader executes a Request it produces a Response, and the Scrapy engine sends that Response on to the Spider for parsing; process_response() lets you inspect or modify it before it is delivered, as in the sketch below.
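A hypothetical sketch (not part of the Weibo project): a middleware that reschedules any request whose response did not come back with status 200. Note that Scrapy's built-in RetryMiddleware together with the RETRY_HTTP_CODES setting (used at the end of this article) already covers this case.
class RetryNon200Middleware(object):
    def process_response(self, request, response, spider):
        # process_response must return either a Response (pass it on) or a Request (reschedule it)
        if response.status != 200:
            spider.logger.debug('Got status %s for %s, retrying', response.status, request.url)
            return request.replace(dont_filter=True)
        return response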
Here is a demo. The user_agent below is a single randomly chosen string; replace it (and add more) to suit the site you are crawling, the approach is the same either way. Add the following to middlewares.py:
import random


class RandomUserAgentMiddleware(object):
    def __init__(self):
        # the list can hold any number of user agent strings; one is picked at random per request
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
Then open settings.py, find the commented-out DOWNLOADER_MIDDLEWARES, uncomment it and add the entry:
'weibo.middlewares.RandomUserAgentMiddleware': 543,
Here weibo is the project name passed to startproject above and the last part is the class name; adjust both to your own project.
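Put together, the setting looks like this (543 is simply the middleware's priority):
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.RandomUserAgentMiddleware': 543,
}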
* The Spider Middleware module
After the Downloader generates a Response, the Response is sent to the Spider; before it gets there it first passes through the Spider Middleware. Likewise, the Items and Requests the Spider produces pass through the Spider Middleware on their way out.
- The Item Pipeline module
The Item Pipeline has four main uses (a small illustration follows the list):
1. Cleaning the HTML data.
2. Validating the scraped data and checking the scraped fields.
3. Detecting and dropping duplicates.
4. Saving the results to a database.
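A hypothetical sketch of points 2 and 3 (not part of the Weibo project): items missing a required field are dropped, and duplicates are filtered by id.
from scrapy.exceptions import DropItem


class ValidateAndDedupePipeline(object):
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if not item.get('id'):
            raise DropItem('Missing id in %s' % item)          # validation
        if item['id'] in self.seen_ids:
            raise DropItem('Duplicate item %s' % item['id'])   # de-duplication
        self.seen_ids.add(item['id'])
        return item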
Now for the Weibo crawling example.
Weibo's anti-crawling measures are aggressive, so before crawling we need a cookies pool. Rather than reinventing the wheel, I strongly recommend the ready-made CookiesPool project:
URL: https://github.com/Python3WebSpider/CookiesPOOL
Download and unpack it, then follow the usage instructions in this article:
URL: https://blog.csdn.net/qq_38661599/article/details/80945233
With that preparation done, analyse the page structure. Open https://m.weibo.cn, log in, go to a user's profile page and bring up the developer tools. Watching the XHR tab you can see the Ajax requests (the page is rendered dynamically from JSON); the requests whose name starts with getIndex return that user's data page by page, and the Preview tab shows the JSON.
Here is the code.
The first part is the spider, weibospider.py:
import scrapy
import json
from scrapy import Request, Spider
from weibo.items import *


class WeibospiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['http://weibo.cn/']
    # Ajax endpoints: user profile, follow list, fan list and weibo list
    user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'
    follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'  # people the user follows
    fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&page={page}'  # the user's fans
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'
    start_users = ['1977459170', '1742566624']  # list of starting uids; add as many as you like

    def start_requests(self):
        for uid in self.start_users:
            yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_user(self, response):
        # this Response arrives from the Downloader via the Downloader Middleware described above
        result = json.loads(response.text)
        if result.get('data').get('userInfo'):  # the user's profile is stored under this key
            user_info = result.get('data').get('userInfo')
            user_item = UserItem()  # instantiate the UserItem class defined in items.py
            field_map = {
                'id': 'id', 'name': 'screen_name', 'avatar': 'profile_image_url', 'cover': 'cover_image_phone',
                'gender': 'gender', 'description': 'description', 'fans_count': 'followers_count',
                'follows_count': 'follow_count', 'weibos_count': 'statuses_count', 'verified': 'verified',
                'verified_reason': 'verified_reason', 'verified_type': 'verified_type'
            }  # every key we want to extract
            for field, attr in field_map.items():
                user_item[field] = user_info.get(attr)
            yield user_item
            uid = user_info.get('id')
            yield Request(self.follow_url.format(uid=uid, page=1), callback=self.parse_follows,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.fan_url.format(uid=uid, page=1), callback=self.parse_fans,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.weibo_url.format(uid=uid, page=1), callback=self.parse_weibos,
                          meta={'page': 1, 'uid': uid})

    def parse_follows(self, response):
        # the request above went through the Downloader Middleware; the Downloader returns the Response here
        result = json.loads(response.text)  # parse the JSON
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get('card_group'):
            # parse the followed users
            follows = result.get('data').get('cards')[-1].get('card_group')
            for follow in follows:
                if follow.get('user'):
                    uid = follow.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
            uid = response.meta.get('uid')
            # follow list
            user_relation_item = UserRelationItem()
            follows = [{'id': follow.get('user').get('id'), 'name': follow.get('user').get('screen_name')}
                       for follow in follows]
            user_relation_item['id'] = uid
            user_relation_item['follows'] = follows
            user_relation_item['fans'] = []
            yield user_relation_item
            # next page of follows
            page = response.meta.get('page') + 1
            yield Request(self.follow_url.format(uid=uid, page=page), callback=self.parse_follows,
                          meta={'page': page, 'uid': uid})

    def parse_fans(self, response):
        """
        Parse the user's fans.
        :param response: Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get('card_group'):
            # parse the fan users
            fans = result.get('data').get('cards')[-1].get('card_group')
            for fan in fans:
                if fan.get('user'):
                    uid = fan.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
            uid = response.meta.get('uid')
            # fan list
            user_relation_item = UserRelationItem()
            fans = [{'id': fan.get('user').get('id'), 'name': fan.get('user').get('screen_name')}
                    for fan in fans]
            user_relation_item['id'] = uid
            user_relation_item['fans'] = fans
            user_relation_item['follows'] = []
            yield user_relation_item
            # next page of fans
            page = response.meta.get('page') + 1
            yield Request(self.fan_url.format(uid=uid, page=page),
                          callback=self.parse_fans, meta={'page': page, 'uid': uid})

    def parse_weibos(self, response):
        """
        Parse the user's weibo list.
        :param response: Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards'):
            weibos = result.get('data').get('cards')
            for weibo in weibos:
                mblog = weibo.get('mblog')
                if mblog:
                    weibo_item = WeiboItem()
                    field_map = {
                        'id': 'id', 'attitudes_count': 'attitudes_count', 'comments_count': 'comments_count',
                        'reposts_count': 'reposts_count', 'picture': 'original_pic', 'pictures': 'pics',
                        'created_at': 'created_at', 'source': 'source', 'text': 'text', 'raw_text': 'raw_text',
                        'thumbnail': 'thumbnail_pic',
                    }
                    for field, attr in field_map.items():
                        weibo_item[field] = mblog.get(attr)
                    weibo_item['user'] = response.meta.get('uid')
                    yield weibo_item
            # next page of weibos
            uid = response.meta.get('uid')
            page = response.meta.get('page') + 1
            yield Request(self.weibo_url.format(uid=uid, page=page), callback=self.parse_weibos,
                          meta={'uid': uid, 'page': page})
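The spider (and the pipelines below) import UserItem, WeiboItem and UserRelationItem from weibo/items.py, which this article does not show. A minimal items.py consistent with the fields used above might look like this; the collection attribute is an assumption that matches how MongoPipeline uses it later.
from scrapy import Item, Field


class UserItem(Item):
    collection = 'users'  # MongoDB collection name used by MongoPipeline
    id = Field()
    name = Field()
    avatar = Field()
    cover = Field()
    gender = Field()
    description = Field()
    fans_count = Field()
    follows_count = Field()
    weibos_count = Field()
    verified = Field()
    verified_reason = Field()
    verified_type = Field()
    crawled_at = Field()


class UserRelationItem(Item):
    collection = 'users'
    id = Field()
    follows = Field()
    fans = Field()


class WeiboItem(Item):
    collection = 'weibos'
    id = Field()
    attitudes_count = Field()
    comments_count = Field()
    reposts_count = Field()
    picture = Field()
    pictures = Field()
    source = Field()
    text = Field()
    raw_text = Field()
    thumbnail = Field()
    created_at = Field()
    user = Field()
    crawled_at = Field()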
middlewares.py holds the code that plugs into the cookies pool and the proxy pool:
import json
import logging
from scrapy import signals
import requests


class ProxyMiddleware():
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        # fetch one proxy from the proxy pool's HTTP interface
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        # only switch to a proxy once the request has already been retried
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('Using proxy ' + proxy)
                request.meta['proxy'] = uri

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )


class CookiesMiddleware():
    def __init__(self, cookies_url):
        self.logger = logging.getLogger(__name__)
        self.cookies_url = cookies_url

    def get_random_cookies(self):
        # fetch one set of cookies from the cookies pool's HTTP interface
        try:
            response = requests.get(self.cookies_url)
            if response.status_code == 200:
                cookies = json.loads(response.text)
                return cookies
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        self.logger.debug('Fetching cookies')
        cookies = self.get_random_cookies()
        if cookies:
            request.cookies = cookies
            self.logger.debug('Using cookies ' + json.dumps(cookies))

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            cookies_url=settings.get('COOKIES_URL')
        )
pipelines.py holds the time-cleaning code and the storage code; it processes the items yielded by the spider:
import re, time
import logging
import pymongo
from weibo.items import *


class TimePipeline():
    def process_item(self, item, spider):
        # stamp every user / weibo item with the crawl time
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            now = time.strftime('%Y-%m-%d %H:%M', time.localtime())
            item['crawled_at'] = now
        return item


class WeiboPipeline():
    def parse_time(self, date):
        # normalize Weibo's relative timestamps ("剛剛", "5分鐘前", "昨天 12:00", "03-01", ...)
        if re.match('剛剛', date):
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time()))
        if re.match(r'\d+分鐘前', date):
            minute = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(minute) * 60))
        if re.match(r'\d+小時(shí)前', date):
            hour = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(hour) * 60 * 60))
        if re.match('昨天.*', date):
            date = re.match('昨天(.*)', date).group(1).strip()
            date = time.strftime('%Y-%m-%d', time.localtime(time.time() - 24 * 60 * 60)) + ' ' + date
        if re.match(r'\d{2}-\d{2}', date):
            date = time.strftime('%Y-', time.localtime()) + date + ' 00:00'
        return date

    def process_item(self, item, spider):
        if isinstance(item, WeiboItem):
            if item.get('created_at'):
                item['created_at'] = item['created_at'].strip()
                item['created_at'] = self.parse_time(item.get('created_at'))
            if item.get('pictures'):
                item['pictures'] = [pic.get('url') for pic in item.get('pictures')]
        return item


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # index on id so the upserts below stay fast
        self.db[UserItem.collection].create_index([('id', pymongo.ASCENDING)])
        self.db[WeiboItem.collection].create_index([('id', pymongo.ASCENDING)])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            # upsert by id
            self.db[item.collection].update({'id': item.get('id')}, {'$set': item}, True)
        if isinstance(item, UserRelationItem):
            # merge follow / fan lists without duplicates
            self.db[item.collection].update(
                {'id': item.get('id')},
                {'$addToSet':
                    {
                        'follows': {'$each': item['follows']},
                        'fans': {'$each': item['fans']}
                    }
                }, True)
        return item
Finally, settings.py, which holds the project configuration:
# -*- coding: utf-8 -*-
# Scrapy settings for weibo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'weibo'
SPIDER_MODULES = ['weibo.spiders']
NEWSPIDER_MODULE = 'weibo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'weibo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2,mt;q=0.2',
    'Connection': 'keep-alive',
    'Host': 'm.weibo.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'weibo.middlewares.WeiboSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.CookiesMiddleware': 554,
    'weibo.middlewares.ProxyMiddleware': 555,
}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weibo.pipelines.TimePipeline': 300,
    'weibo.pipelines.WeiboPipeline': 301,
    'weibo.pipelines.MongoPipeline': 302,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGO_URI = 'localhost'
MONGO_DATABASE = 'weibo'
COOKIES_URL = 'http://localhost:5000/weibo/random'
PROXY_URL = 'http://localhost:5555/random'
RETRY_HTTP_CODES = [401, 403, 408, 414, 500, 502, 503, 504]
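With MongoDB, the cookies pool (port 5000 above) and the proxy pool (port 5555 above) running, the crawl is started from the project root using the spider name defined earlier:
scrapy crawl weibospider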
The main goal of this article was to give an overview of the Scrapy framework: what each module is for and how to use it. Next time we will walk through another complete, typical Scrapy example: crawling JD (京東).