This article continues from the previous introductory post on crawling. That post already made good use of libraries such as requests and BeautifulSoup, along with the simpler approaches commonly used for crawlers. The goal of this one is to get properly acquainted with each component of the Scrapy framework and to use Scrapy to crawl Weibo.
Before diving in, here is an overview of Scrapy's architecture:
1. Engine: the core of the framework; it triggers the data flow and coordinates all the other components.
2. Scheduler: accepts the requests sent by the engine and enqueues them.
3. Downloader: takes the Requests handed over by the scheduler and returns the downloaded page content (the Response) for the spider.
4. Spider: the main body of your code; it holds the crawl logic and the parsing rules for the pages.
5. Item Pipeline: mainly responsible for cleaning and storing the scraped data.
6. Downloader Middlewares: a hook framework sitting between the engine and the downloader, processing the requests and responses that pass between them.
7. Spider Middlewares: a hook framework sitting between the engine and the spider, processing the responses fed into the spider as well as the items and new requests it produces.
Environment
Windows 8, Python 3, PyCharm, Scrapy, MongoDB, and the PyMongo library.
The later posts on distributed crawling and CAPTCHA handling will also use Redis and PIL; they can all be installed up front as shown below.
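Assuming pip is used, the third-party packages can be installed in one go (Pillow is the maintained fork that provides PIL):
pip install scrapy pymongo redis pillow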
1. Create a Scrapy project
- Open PyCharm's Terminal and run:
scrapy startproject weibo
* Name the spider to create and specify the site it will crawl:
scrapy genspider weibospider weibo.cn
The first argument is the spider's name, the second is the base_url it is allowed to visit.
At this point a weibospider.py file appears under the project's spiders directory; opening it shows a skeleton roughly like the one below.
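This is the default template generated by genspider (the exact template can differ slightly between Scrapy versions):
# -*- coding: utf-8 -*-
import scrapy


class WeibospiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['weibo.cn']
    start_urls = ['http://weibo.cn/']

    def parse(self, response):
        pass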
2. The Scrapy Selector
The previous article used BeautifulSoup and PyQuery; Scrapy ships with its own data-extraction tool, the Selector, which is built on lxml and supports both XPath and CSS selectors. The examples below use the page from the official documentation.
URL: http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
In PyCharm's Terminal, run scrapy shell http://doc.scrapy.org/en/latest/_static/selectors-sample1.html
This drops you into a Python shell with the request already made; response.text prints the HTML.
The main Selector usages are summarized below:
- result = response.selector.xpath('//a'). The selector part can be omitted; result = response.xpath('//a') is equivalent. This selects every a node under the root.
- result.xpath('./img') selects the img nodes under those a nodes. Without the leading dot the expression starts from the root: ./img extracts relative to each a node, whereas //img would extract from the root.
- The results above are Selector objects. How do you turn them into strings held in a list? Call result.xpath('./img').extract(); after that you can use ordinary list operations.
- How do you get the text inside a tag? result.xpath('//a/text()').extract(): just add text().
- How do you get an attribute value?
result.xpath('//a/@href').extract() returns every href. But what if you only want the node with one particular attribute value?
result.xpath('//a[@href="image1.html"]').extract()
Note that the inner quotes must be double quotes, otherwise an error is raised. Also, extract() is not recommended for fetching a single element; use extract_first() instead:
result.xpath('//a[@href="image1.html"]').extract_first()
Want the text? Add text():
result.xpath('//a[@href="image1.html"]/text()').extract_first()
Next, the CSS selectors that Selector also supports.
How do you select the a tags as above, and get them back as a list?
result.css('a').extract()
How do you select the img children of the a tags? A simple space is enough:
result.css('a img').extract()
How do you select the child tag of an a tag with a specific attribute?
result.css('a[href="image1.html"] img').extract_first()
How do you get an attribute value or the text content?
result.css('a[href="image1.html"] img::attr(src)').extract_first()
result.css('a[href="image1.html"]::text').extract_first()
Besides the CSS and XPath selectors, regular-expression matching is also provided via re(), and re_first() returns only the first match.
Note that result cannot be used with re() or re_first() directly, otherwise an error is raised; they must be chained after css() or xpath(), as in the example below.
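For instance, the link texts on the sample page look like "Name: My image 1", so the names can be pulled out with a capture group (still inside the same scrapy shell session):
response.xpath('//a/text()').re(r'Name:\s*(.*)')
response.xpath('//a/text()').re_first(r'Name:\s*(.*)')
The first call returns a list of plain strings, the second just the first of them.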
3. The main Scrapy components in detail
- The Spider module
The Spider is the most important and most fundamental module of the framework: it holds the main logic for crawling and parsing. Its cycle, or the way it operates, is as follows:
1. Build the initial Requests from the starting URLs and attach a callback. When a Request succeeds, the resulting Response is generated and passed to that callback as its argument.
2. Inside the callback, analyse the returned content. There are two kinds of output: either a dict or an Item object holding the extracted result, which can be processed and saved directly; or the next link parsed out of the page, from which you construct a new Request with its own callback.
3. If a dict or Item is returned and a pipeline is configured, the pipeline processes and stores it.
4. If a Request is returned, then once it completes its Response is passed to the new callback defined on that Request; there you can again use the Selector described above to parse the content and build Items from it.
The most commonly used class is scrapy.spiders.Spider, which provides the start_requests() method. Its attributes and methods are listed here, with a minimal example after the list.
Attributes: name, the spider's name; allowed_domains, the domains it is allowed to crawl; start_urls, the starting URLs.
Methods:
1. start_requests(): generates the initial requests and must return an iterable. By default it builds Requests from the URLs in start_urls using GET; to issue a POST instead, use FormRequest.
2. parse(): the default callback, invoked whenever a Response has no callback specified.
3. closed(): called when the spider is closed.
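A minimal sketch tying these together (the URL and form data are placeholders, not part of the Weibo project):
import scrapy
from scrapy import FormRequest


class DemoSpider(scrapy.Spider):
    name = 'demo'                      # unique spider name
    allowed_domains = ['example.com']  # off-domain links are filtered out
    start_urls = ['http://example.com/']

    def start_requests(self):
        # the default implementation issues GET requests for start_urls;
        # override it like this when the site expects a POST
        for url in self.start_urls:
            yield FormRequest(url, formdata={'key': 'value'}, callback=self.parse)

    def parse(self, response):
        # default callback when a Request does not name one
        yield {'title': response.xpath('//title/text()').extract_first()}

    def closed(self, reason):
        # called once when the spider shuts down
        self.logger.info('Spider closed: %s', reason)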
* The Downloader Middleware module
As the diagram at the start of the article shows, when the Scheduler takes a Request off the queue and sends it to the Downloader, the Request passes through the Downloader Middleware; and when the Downloader finishes the download and the Response travels back towards the Spider, it passes through this module as well.
The module has two main methods.
1. process_request()
process_request() is called before the Scheduler's Request reaches the Downloader, i.e. anywhere between the Request being pulled off the queue and the download being executed. You can use it to modify the request, for example to set the User-Agent or cookies. For a single fixed User-Agent, though, it is simpler to set it in settings.py:
USER_AGENT = 'XXXXXXX'
2. process_response()
After the Downloader executes a Request it produces a Response, and the Scrapy engine sends that Response on to the Spider for parsing; process_response() lets you inspect or modify it before it is delivered, as in the sketch below.
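A hypothetical sketch (not part of the Weibo project): a middleware that reschedules any request whose response did not come back with status 200. Note that Scrapy's built-in RetryMiddleware together with the RETRY_HTTP_CODES setting (used at the end of this article) already covers this case.
class RetryNon200Middleware(object):
    def process_response(self, request, response, spider):
        # process_response must return either a Response (pass it on) or a Request (reschedule it)
        if response.status != 200:
            spider.logger.debug('Got status %s for %s, retrying', response.status, request.url)
            return request.replace(dont_filter=True)
        return response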
Here is a demo. The user_agent below is a single randomly chosen string; replace it (and add more) to suit the site you are crawling, the approach is the same either way. Add the following to middlewares.py:
import random


class RandomUserAgentMiddleware(object):
    def __init__(self):
        # the list can hold any number of user agent strings; one is picked at random per request
        self.user_agents = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Safari/537.36'
        ]

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)
Then open settings.py, find the commented-out DOWNLOADER_MIDDLEWARES, uncomment it and add the entry:
'weibo.middlewares.RandomUserAgentMiddleware': 543,
Here weibo is the project name passed to startproject above and the last part is the class name; adjust both to your own project.
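Put together, the setting looks like this (543 is simply the middleware's priority):
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.RandomUserAgentMiddleware': 543,
}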
* The Spider Middleware module
After the Downloader generates a Response, the Response is sent to the Spider; before it gets there it first passes through the Spider Middleware. Likewise, the Items and Requests the Spider produces pass through the Spider Middleware on their way out.
- The Item Pipeline module
The Item Pipeline has four main uses (a small illustration follows the list):
1. Cleaning the HTML data.
2. Validating the scraped data and checking the scraped fields.
3. Detecting and dropping duplicates.
4. Saving the results to a database.
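A hypothetical sketch of points 2 and 3 (not part of the Weibo project): items missing a required field are dropped, and duplicates are filtered by id.
from scrapy.exceptions import DropItem


class ValidateAndDedupePipeline(object):
    def __init__(self):
        self.seen_ids = set()

    def process_item(self, item, spider):
        if not item.get('id'):
            raise DropItem('Missing id in %s' % item)          # validation
        if item['id'] in self.seen_ids:
            raise DropItem('Duplicate item %s' % item['id'])   # de-duplication
        self.seen_ids.add(item['id'])
        return item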
Now for the Weibo crawling example.
Weibo's anti-crawling measures are aggressive, so before crawling we need a cookies pool. Rather than reinventing the wheel, I strongly recommend the ready-made CookiesPool project:
URL: https://github.com/Python3WebSpider/CookiesPOOL
Download and unpack it, then follow the usage instructions in this article:
URL: https://blog.csdn.net/qq_38661599/article/details/80945233
With that preparation done, analyse the page structure. Open https://m.weibo.cn, log in, go to a user's profile page and bring up the developer tools. Watching the XHR tab you can see the Ajax requests (the page is rendered dynamically from JSON); the requests whose name starts with getIndex return that user's data page by page, and the Preview tab shows the JSON.
Here is the code.
The first part is the spider, weibospider.py:
import scrapy
import json
from scrapy import Request, Spider
from weibo.items import *


class WeibospiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['http://weibo.cn/']
    # Ajax endpoints: user profile, follow list, fan list and weibo list
    user_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&value={uid}&containerid=100505{uid}'
    follow_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_followers_-_{uid}&page={page}'  # people the user follows
    fan_url = 'https://m.weibo.cn/api/container/getIndex?containerid=231051_-_fans_-_{uid}&page={page}'  # the user's fans
    weibo_url = 'https://m.weibo.cn/api/container/getIndex?uid={uid}&type=uid&page={page}&containerid=107603{uid}'
    start_users = ['1977459170', '1742566624']  # list of starting uids; add as many as you like

    def start_requests(self):
        for uid in self.start_users:
            yield Request(self.user_url.format(uid=uid), callback=self.parse_user)

    def parse_user(self, response):
        # this Response arrives from the Downloader via the Downloader Middleware described above
        result = json.loads(response.text)
        if result.get('data').get('userInfo'):  # the user's profile is stored under this key
            user_info = result.get('data').get('userInfo')
            user_item = UserItem()  # instantiate the UserItem class defined in items.py
            field_map = {
                'id': 'id', 'name': 'screen_name', 'avatar': 'profile_image_url', 'cover': 'cover_image_phone',
                'gender': 'gender', 'description': 'description', 'fans_count': 'followers_count',
                'follows_count': 'follow_count', 'weibos_count': 'statuses_count', 'verified': 'verified',
                'verified_reason': 'verified_reason', 'verified_type': 'verified_type'
            }  # every key we want to extract
            for field, attr in field_map.items():
                user_item[field] = user_info.get(attr)
            yield user_item
            uid = user_info.get('id')
            yield Request(self.follow_url.format(uid=uid, page=1), callback=self.parse_follows,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.fan_url.format(uid=uid, page=1), callback=self.parse_fans,
                          meta={'page': 1, 'uid': uid})
            yield Request(self.weibo_url.format(uid=uid, page=1), callback=self.parse_weibos,
                          meta={'page': 1, 'uid': uid})

    def parse_follows(self, response):
        # the request above went through the Downloader Middleware; the Downloader returns the Response here
        result = json.loads(response.text)  # parse the JSON
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get('card_group'):
            # parse the followed users
            follows = result.get('data').get('cards')[-1].get('card_group')
            for follow in follows:
                if follow.get('user'):
                    uid = follow.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
            uid = response.meta.get('uid')
            # follow list
            user_relation_item = UserRelationItem()
            follows = [{'id': follow.get('user').get('id'), 'name': follow.get('user').get('screen_name')}
                       for follow in follows]
            user_relation_item['id'] = uid
            user_relation_item['follows'] = follows
            user_relation_item['fans'] = []
            yield user_relation_item
            # next page of follows
            page = response.meta.get('page') + 1
            yield Request(self.follow_url.format(uid=uid, page=page), callback=self.parse_follows,
                          meta={'page': page, 'uid': uid})

    def parse_fans(self, response):
        """
        Parse the user's fans.
        :param response: Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards') and len(result.get('data').get('cards')) and \
                result.get('data').get('cards')[-1].get('card_group'):
            # parse the fan users
            fans = result.get('data').get('cards')[-1].get('card_group')
            for fan in fans:
                if fan.get('user'):
                    uid = fan.get('user').get('id')
                    yield Request(self.user_url.format(uid=uid), callback=self.parse_user)
            uid = response.meta.get('uid')
            # fan list
            user_relation_item = UserRelationItem()
            fans = [{'id': fan.get('user').get('id'), 'name': fan.get('user').get('screen_name')}
                    for fan in fans]
            user_relation_item['id'] = uid
            user_relation_item['fans'] = fans
            user_relation_item['follows'] = []
            yield user_relation_item
            # next page of fans
            page = response.meta.get('page') + 1
            yield Request(self.fan_url.format(uid=uid, page=page),
                          callback=self.parse_fans, meta={'page': page, 'uid': uid})

    def parse_weibos(self, response):
        """
        Parse the user's weibo list.
        :param response: Response object
        """
        result = json.loads(response.text)
        if result.get('ok') and result.get('data').get('cards'):
            weibos = result.get('data').get('cards')
            for weibo in weibos:
                mblog = weibo.get('mblog')
                if mblog:
                    weibo_item = WeiboItem()
                    field_map = {
                        'id': 'id', 'attitudes_count': 'attitudes_count', 'comments_count': 'comments_count',
                        'reposts_count': 'reposts_count', 'picture': 'original_pic', 'pictures': 'pics',
                        'created_at': 'created_at', 'source': 'source', 'text': 'text', 'raw_text': 'raw_text',
                        'thumbnail': 'thumbnail_pic',
                    }
                    for field, attr in field_map.items():
                        weibo_item[field] = mblog.get(attr)
                    weibo_item['user'] = response.meta.get('uid')
                    yield weibo_item
            # next page of weibos
            uid = response.meta.get('uid')
            page = response.meta.get('page') + 1
            yield Request(self.weibo_url.format(uid=uid, page=page), callback=self.parse_weibos,
                          meta={'uid': uid, 'page': page})
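The spider (and the pipelines below) import UserItem, WeiboItem and UserRelationItem from weibo/items.py, which this article does not show. A minimal items.py consistent with the fields used above might look like this; the collection attribute is an assumption that matches how MongoPipeline uses it later.
from scrapy import Item, Field


class UserItem(Item):
    collection = 'users'  # MongoDB collection name used by MongoPipeline
    id = Field()
    name = Field()
    avatar = Field()
    cover = Field()
    gender = Field()
    description = Field()
    fans_count = Field()
    follows_count = Field()
    weibos_count = Field()
    verified = Field()
    verified_reason = Field()
    verified_type = Field()
    crawled_at = Field()


class UserRelationItem(Item):
    collection = 'users'
    id = Field()
    follows = Field()
    fans = Field()


class WeiboItem(Item):
    collection = 'weibos'
    id = Field()
    attitudes_count = Field()
    comments_count = Field()
    reposts_count = Field()
    picture = Field()
    pictures = Field()
    source = Field()
    text = Field()
    raw_text = Field()
    thumbnail = Field()
    created_at = Field()
    user = Field()
    crawled_at = Field()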
middlewares.py holds the code that plugs into the cookies pool and the proxy pool:
import json
import logging
from scrapy import signals
import requests


class ProxyMiddleware():
    def __init__(self, proxy_url):
        self.logger = logging.getLogger(__name__)
        self.proxy_url = proxy_url

    def get_random_proxy(self):
        # fetch one proxy from the proxy pool's HTTP interface
        try:
            response = requests.get(self.proxy_url)
            if response.status_code == 200:
                proxy = response.text
                return proxy
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        # only switch to a proxy once the request has already been retried
        if request.meta.get('retry_times'):
            proxy = self.get_random_proxy()
            if proxy:
                uri = 'https://{proxy}'.format(proxy=proxy)
                self.logger.debug('Using proxy ' + proxy)
                request.meta['proxy'] = uri

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            proxy_url=settings.get('PROXY_URL')
        )


class CookiesMiddleware():
    def __init__(self, cookies_url):
        self.logger = logging.getLogger(__name__)
        self.cookies_url = cookies_url

    def get_random_cookies(self):
        # fetch one set of cookies from the cookies pool's HTTP interface
        try:
            response = requests.get(self.cookies_url)
            if response.status_code == 200:
                cookies = json.loads(response.text)
                return cookies
        except requests.ConnectionError:
            return False

    def process_request(self, request, spider):
        self.logger.debug('Fetching cookies')
        cookies = self.get_random_cookies()
        if cookies:
            request.cookies = cookies
            self.logger.debug('Using cookies ' + json.dumps(cookies))

    @classmethod
    def from_crawler(cls, crawler):
        settings = crawler.settings
        return cls(
            cookies_url=settings.get('COOKIES_URL')
        )
pipelines.py holds the time-cleaning code and the storage code; it processes the items yielded by the spider:
import re, time
import logging
import pymongo
from weibo.items import *


class TimePipeline():
    def process_item(self, item, spider):
        # stamp every user / weibo item with the crawl time
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            now = time.strftime('%Y-%m-%d %H:%M', time.localtime())
            item['crawled_at'] = now
        return item


class WeiboPipeline():
    def parse_time(self, date):
        # normalize Weibo's relative timestamps ("剛剛", "5分鐘前", "昨天 12:00", "03-01", ...)
        if re.match('剛剛', date):
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time()))
        if re.match(r'\d+分鐘前', date):
            minute = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(minute) * 60))
        if re.match(r'\d+小時(shí)前', date):
            hour = re.match(r'(\d+)', date).group(1)
            date = time.strftime('%Y-%m-%d %H:%M', time.localtime(time.time() - float(hour) * 60 * 60))
        if re.match('昨天.*', date):
            date = re.match('昨天(.*)', date).group(1).strip()
            date = time.strftime('%Y-%m-%d', time.localtime(time.time() - 24 * 60 * 60)) + ' ' + date
        if re.match(r'\d{2}-\d{2}', date):
            date = time.strftime('%Y-', time.localtime()) + date + ' 00:00'
        return date

    def process_item(self, item, spider):
        if isinstance(item, WeiboItem):
            if item.get('created_at'):
                item['created_at'] = item['created_at'].strip()
                item['created_at'] = self.parse_time(item.get('created_at'))
            if item.get('pictures'):
                item['pictures'] = [pic.get('url') for pic in item.get('pictures')]
        return item


class MongoPipeline(object):
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE')
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]
        # index on id so the upserts below stay fast
        self.db[UserItem.collection].create_index([('id', pymongo.ASCENDING)])
        self.db[WeiboItem.collection].create_index([('id', pymongo.ASCENDING)])

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        if isinstance(item, UserItem) or isinstance(item, WeiboItem):
            # upsert by id
            self.db[item.collection].update({'id': item.get('id')}, {'$set': item}, True)
        if isinstance(item, UserRelationItem):
            # merge follow / fan lists without duplicates
            self.db[item.collection].update(
                {'id': item.get('id')},
                {'$addToSet':
                    {
                        'follows': {'$each': item['follows']},
                        'fans': {'$each': item['fans']}
                    }
                }, True)
        return item
Finally, settings.py, which holds the project configuration:
# -*- coding: utf-8 -*-
# Scrapy settings for weibo project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
# http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
# http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'weibo'
SPIDER_MODULES = ['weibo.spiders']
NEWSPIDER_MODULE = 'weibo.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'weibo (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'application/json, text/plain, */*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,ja;q=0.4,zh-TW;q=0.2,mt;q=0.2',
    'Connection': 'keep-alive',
    'Host': 'm.weibo.cn',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36',
    'X-Requested-With': 'XMLHttpRequest',
}
# Configure maximum concurrent requests performed by Scrapy (default: 16)
# CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
# DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
# CONCURRENT_REQUESTS_PER_DOMAIN = 16
# CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
# COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
# TELNETCONSOLE_ENABLED = False
# Override the default request headers:
# DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
# }
# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
# SPIDER_MIDDLEWARES = {
# 'weibo.middlewares.WeiboSpiderMiddleware': 543,
# }
# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'weibo.middlewares.CookiesMiddleware': 554,
    'weibo.middlewares.ProxyMiddleware': 555,
}
# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
# EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
# }
# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'weibo.pipelines.TimePipeline': 300,
    'weibo.pipelines.WeiboPipeline': 301,
    'weibo.pipelines.MongoPipeline': 302,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# AUTOTHROTTLE_ENABLED = True
# The initial download delay
# AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
# AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
# AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
# AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
# HTTPCACHE_ENABLED = True
# HTTPCACHE_EXPIRATION_SECS = 0
# HTTPCACHE_DIR = 'httpcache'
# HTTPCACHE_IGNORE_HTTP_CODES = []
# HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
MONGO_URI = 'localhost'
MONGO_DATABASE = 'weibo'
COOKIES_URL = 'http://localhost:5000/weibo/random'
PROXY_URL = 'http://localhost:5555/random'
RETRY_HTTP_CODES = [401, 403, 408, 414, 500, 502, 503, 504]
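With MongoDB, the cookies pool (port 5000 above) and the proxy pool (port 5555 above) running, the crawl is started from the project root using the spider name defined earlier:
scrapy crawl weibospider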
The main goal of this article was to give an overview of the Scrapy framework: what each module is for and how to use it. Next time we will walk through another complete, typical Scrapy example: crawling JD (京東).