Redis series: Sentinel failover

Failover

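Following on from the Sentinel network built in the previous chapter, this chapter analyzes Sentinel failover. Sentinel is the high-availability solution for Redis's distributed storage, and failover is the core problem any high-availability solution has to solve. Before analyzing Sentinel's approach, consider three questions: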

  1. How is a failure confirmed?
  2. Once a failure has occurred, who performs the failover?
  3. How is the failover carried out?

I believe these are three questions that every failover scheme has to answer. In Redis it is the Sentinel nodes that face them, while masters and slaves are the objects Sentinel operates on, so Sentinel plays the role of supervisor. In practice a group of Sentinels usually monitors a master together. This gives the Sentinel cluster some fault tolerance: when one Sentinel node fails, the Sentinel system as a whole can keep serving.

Confirming the failure

Subjective down (SDOWN)

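From the network-building code in the previous chapter we know that, through a periodic time-event handler, Sentinel sends an INFO command to masters and slaves every 10 s and a PING command at least once per second. The PING command is what probes the master node.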

  • When Sentinel sends PING to the master, only +PONG, -LOADING, and -MASTERDOWN count as valid replies. If Sentinel gets no valid reply for the whole configured down-after-milliseconds interval, it sets SRI_S_DOWN in the flags field of the corresponding sentinelRedisInstance and considers the instance subjectively down.
  • flags is an int (typically 32 bits on the platforms Redis runs on), of which only the low 13 bits are used here. The individual flag bits are defined as follows:
#define SRI_MASTER  (1<<0)
#define SRI_SLAVE   (1<<1)
#define SRI_SENTINEL (1<<2)
#define SRI_S_DOWN (1<<3)   /* Subjectively down (no quorum). */
#define SRI_O_DOWN (1<<4)   /* Objectively down (confirmed by others). */
#define SRI_MASTER_DOWN (1<<5) /* A Sentinel with this flag set thinks that
                                   its master is down. */
#define SRI_FAILOVER_IN_PROGRESS (1<<6) /* Failover is in progress for
                                           this master. */
#define SRI_PROMOTED (1<<7)            /* Slave selected for promotion. */
#define SRI_RECONF_SENT (1<<8)     /* SLAVEOF <newmaster> sent. */
#define SRI_RECONF_INPROG (1<<9)   /* Slave synchronization in progress. */
#define SRI_RECONF_DONE (1<<10)     /* Slave synchronized with new master. */
#define SRI_FORCE_FAILOVER (1<<11)  /* Force failover with master up. */
#define SRI_SCRIPT_KILL_SENT (1<<12) /* SCRIPT KILL already sent on -BUSY */

For example (bit positions 15 to 0 on the first row, their values on the second):

15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1

Bits 0 to 12 correspond to the flags listed above, from top to bottom. This encoding saves space and lets several states be represented at once; 0 and 1 indicate whether the instance is in that state. A simple bitwise OR sets a flag without affecting the others.
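
As a minimal sketch of the bit operations just described, the helpers below set, clear, and test flags. The three flag values are copied from the listing above; the helper names flags_set, flags_clear, and flags_has are hypothetical and do not exist in Redis.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values copied from the listing above. */
#define SRI_MASTER  (1<<0)
#define SRI_S_DOWN  (1<<3)
#define SRI_O_DOWN  (1<<4)

/* Hypothetical helper: set the bits in mask, leaving the others untouched. */
static int flags_set(int flags, int mask) { return flags | mask; }

/* Hypothetical helper: clear the bits in mask, leaving the others untouched. */
static int flags_clear(int flags, int mask) { return flags & ~mask; }

/* Hypothetical helper: true when every bit in mask is set. */
static bool flags_has(int flags, int mask) { return (flags & mask) == mask; }
```

Starting from SRI_MASTER, calling flags_set(f, SRI_S_DOWN) marks the instance subjectively down while keeping the master bit, and flags_clear(f, SRI_S_DOWN) reverses it. This mirrors the `|=` and `&= ~` statements in the Sentinel code.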

  • down-after-milliseconds can be set in the configuration file sentinel.conf or by a command sent from a client connected to Sentinel. It is configured per master: each master that a Sentinel monitors can have its own down-after-milliseconds value, and the slaves and Sentinels attached to that master inherit it and use it to decide whether a node is down.
  • The PING probing applies not only to masters but also to slave and Sentinel instances. The master is used as the example here.
  • Because subjective down depends on down-after-milliseconds, and that value is configurable, Sentinels monitoring the same master may be configured with different subjective-down intervals.

On every timer cycle Sentinel iterates over its instances and checks whether each one is subjectively down. This periodic event was covered in the previous chapter.
sentinelHandleRedisInstance

/* Perform scheduled operations for the specified Redis instance. */
void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
    /* ========== MONITORING HALF ============ */
    /* Every kind of instance */
    sentinelReconnectInstance(ri);
    sentinelSendPeriodicCommands(ri);

    /* ============== ACTING HALF ============= */
    /* We don't proceed with the acting half if we are in TILT mode.
     * TILT happens when we find something odd with the time, like a
     * sudden change in the clock. */
    if (sentinel.tilt) {
        if (mstime()-sentinel.tilt_start_time < SENTINEL_TILT_PERIOD) return;
        sentinel.tilt = 0;
        sentinelEvent(LL_WARNING,"-tilt",NULL,"#tilt mode exited");
    }

    /* Every kind of instance */
    sentinelCheckSubjectivelyDown(ri);

    /* Masters and slaves */
    if (ri->flags & (SRI_MASTER|SRI_SLAVE)) {
        /* Nothing so far. */
    }

    /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}

sentinelCheckSubjectivelyDown

/* Is this instance down from our point of view? */
void sentinelCheckSubjectivelyDown(sentinelRedisInstance *ri) {
    mstime_t elapsed = 0;

    if (ri->link->act_ping_time)
        elapsed = mstime() - ri->link->act_ping_time;
    else if (ri->link->disconnected)
        elapsed = mstime() - ri->link->last_avail_time;

    /* Check if we are in need for a reconnection of one of the
     * links, because we are detecting low activity.
     *
     * 1) Check if the command link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have a
     *    pending ping for more than half the timeout. */
    if (ri->link->cc &&
        (mstime() - ri->link->cc_conn_time) >
        SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        ri->link->act_ping_time != 0 && /* There is a pending ping... */
        /* The pending ping is delayed, and we did not received
         * error replies as well. */
        (mstime() - ri->link->act_ping_time) > (ri->down_after_period/2) &&
        (mstime() - ri->link->last_pong_time) > (ri->down_after_period/2))
    {
        instanceLinkCloseConnection(ri->link,ri->link->cc);
    }

    /* 2) Check if the pubsub link seems connected, was connected not less
     *    than SENTINEL_MIN_LINK_RECONNECT_PERIOD, but still we have no
     *    activity in the Pub/Sub channel for more than
     *    SENTINEL_PUBLISH_PERIOD * 3.
     */
    if (ri->link->pc &&
        (mstime() - ri->link->pc_conn_time) >
         SENTINEL_MIN_LINK_RECONNECT_PERIOD &&
        (mstime() - ri->link->pc_last_activity) > (SENTINEL_PUBLISH_PERIOD*3))
    {
        instanceLinkCloseConnection(ri->link,ri->link->pc);
    }

    /* Update the SDOWN flag. We believe the instance is SDOWN if:
     *
     * 1) It is not replying.
     * 2) We believe it is a master, it reports to be a slave for enough time
     *    to meet the down_after_period, plus enough time to get two times
     *    INFO report from the instance. */
    if (elapsed > ri->down_after_period ||
        (ri->flags & SRI_MASTER &&
         ri->role_reported == SRI_SLAVE &&  mstime() - ri->role_reported_time >
          (ri->down_after_period+SENTINEL_INFO_PERIOD*2)))
    {
        /* Is subjectively down */
        if ((ri->flags & SRI_S_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+sdown",ri,"%@");
            ri->s_down_since_time = mstime();
            ri->flags |= SRI_S_DOWN;
        }
    } else {
        /* Is subjectively up */
        if (ri->flags & SRI_S_DOWN) {
            sentinelEvent(LL_WARNING,"-sdown",ri,"%@");
            ri->flags &= ~(SRI_S_DOWN|SRI_SCRIPT_KILL_SENT);
        }
    }
}
  • Check whether the command connection needs to be closed.
  • Check whether the Pub/Sub connection needs to be closed.
  • Update the SDOWN flag. Two rules apply. First, the instance has given no valid reply within the configured time (30 s by default). Second, an instance that Sentinel believes to be a master has been reporting the slave role for longer than down_after_period+SENTINEL_INFO_PERIOD*2 (30 s + 20 s with the defaults); in that case it is also considered subjectively down.
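
The two SDOWN rules can be condensed into a single predicate. This is a simplified sketch, not the Redis function: is_subjectively_down and its parameters are hypothetical names, and INFO_PERIOD_MS stands in for SENTINEL_INFO_PERIOD under the assumption that it is 10 s as stated earlier.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define INFO_PERIOD_MS 10000  /* assumed to mirror SENTINEL_INFO_PERIOD (10 s) */

/* Simplified SDOWN test; all times are in milliseconds.
 * elapsed:           time since the oldest unanswered PING (0 if none pending)
 * down_after:        the configured down-after-milliseconds
 * is_master:         we believe the instance is a master
 * reports_slave:     the instance currently reports the slave role
 * role_reported_for: how long it has been reporting that role */
static bool is_subjectively_down(int64_t elapsed, int64_t down_after,
                                 bool is_master, bool reports_slave,
                                 int64_t role_reported_for) {
    /* Rule 1: no valid reply for longer than down_after. */
    if (elapsed > down_after) return true;
    /* Rule 2: a supposed master keeps reporting itself as a slave. */
    return is_master && reports_slave &&
           role_reported_for > down_after + INFO_PERIOD_MS * 2;
}
```

With the default down_after of 30000 ms, rule 2 only fires once the role mismatch has lasted more than 50000 ms, which matches the 30 s + 20 s figure above.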

Objective down (ODOWN)

When a Sentinel detects that the master is unreachable and has marked it SRI_S_DOWN in its own state, it cannot act on that alone. It is part of a Sentinel cluster, and each node may use a different interval to judge the master down, so it has to ask the other Sentinel nodes whether they also consider the monitored master down. That raises some questions:

  1. How does it broadcast a command asking the other Sentinel nodes for their view of a given master?
  2. How are the answers counted?
  3. How do all nodes end up with a consistent view of the master's state?

These three questions expose two problems common to distributed solutions:

  • Broadcasting a command.
  • Reaching consensus so that state is eventually consistent.

While looking at how Sentinel addresses this, it is worth thinking about how we would solve these problems ourselves, and whether we could do better than Sentinel.

Broadcasting the master-down query

The entry point is still the sentinelHandleRedisInstance code above:

...
 /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }

In this code, sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS); is the call through which a Sentinel, having found its monitored master subjectively down, asks the other Sentinels for their view.

/* If we think the master is down, we start sending
 * SENTINEL IS-MASTER-DOWN-BY-ADDR requests to other sentinels
 * in order to get the replies that allow to reach the quorum
 * needed to mark the master in ODOWN state and trigger a failover. */
#define SENTINEL_ASK_FORCED (1<<0)
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;

        /* If the master state from other sentinel is too old, we clear it. */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            ri->leader = NULL;
        }

        /* Only ask if master is down to other sentinels if:
         *
         * 1) We believe it is down, or there is a failover in progress.
         * 2) Sentinel is connected.
         * 3) We did not received the info within SENTINEL_ASK_PERIOD ms. */
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        if (ri->link->disconnected) continue;
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;

        /* Ask */
        ll2string(port,sizeof(port),master->addr->port);
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "SENTINEL is-master-down-by-addr %s %s %llu %s",
                    master->addr->ip, port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}

It iterates over the sentinels dict it maintains and sends SENTINEL is-master-down-by-addr to each of the other Sentinel nodes. The command format is: SENTINEL is-master-down-by-addr <master-ip> <master-port> <current_epoch> <leader_id>. The leader_id argument is * while the query only concerns objective down and no failover has been started. Next comes the reply handler, sentinelReceiveIsMasterDownReply:

/* Receive the SENTINEL is-master-down-by-addr reply, see the
 * sentinelAskMasterStateToOtherSentinels() function for more information. */
void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;

    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;

    /* Ignore every error or unexpected reply.
     * Note that if the command returns an error for any reason we'll
     * end clearing the SRI_MASTER_DOWN flag for timeout anyway. */
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        if (r->element[0]->integer == 1) {
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        if (strcmp(r->element[1]->str,"*")) {
            /* If the runid in the reply is not "*" the Sentinel actually
             * replied with a vote. */
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            ri->leader = sdsnew(r->element[1]->str);
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}

The command returns three values, for example:

127.0.0.1:26380> sentinel is-master-down-by-addr 127.0.0.1 6379 0 *
1) (integer) 0
2) "*"
3) (integer) 0
  1. <down_state>: the master's down state, 0 for not down and 1 for down. When the reply is 1, the SRI_MASTER_DOWN flag (bit 5 of flags) is set on the instance that represents the replying Sentinel.
  2. <leader_runid>: the runid of the leader Sentinel. For a plain objective-down query it is *, because the request was sent with <leader_id> set to *.
  3. <leader_epoch>: the epoch of the vote. When the runid is *, this value is always 0.
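
To make the meaning of the three reply fields concrete, here is a small sketch that models them as a plain C struct. The struct and the two helper functions are hypothetical names for illustration; the real handler reads the same information from a hiredis redisReply, as shown in sentinelReceiveIsMasterDownReply above.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical plain-C view of the three reply elements. */
typedef struct {
    long long down_state;      /* 1 if the replier sees the master as down */
    const char *leader_runid;  /* "*" when the reply carries no vote */
    long long leader_epoch;    /* epoch of the vote, 0 when there is none */
} is_master_down_reply;

/* True when the replying Sentinel considers the master down. */
static bool reply_says_down(const is_master_down_reply *r) {
    return r->down_state == 1;
}

/* True when the reply carries a vote, i.e. the runid is not "*". */
static bool reply_carries_vote(const is_master_down_reply *r) {
    return strcmp(r->leader_runid, "*") != 0;
}
```

For the sample reply shown above (0, "*", 0), both helpers return false: the replier does not see the master as down and has not voted.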

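The same command is used for leader election, which is described later. After a Sentinel has asked the others about the master's state, the objective-down check runs on the next cycle. Before that, one more piece of logic needs analysis: how a Sentinel handles an incoming sentinel is-master-down-by-addr command. Recall the command table loaded at initialization in the previous chapter; the relevant part of the Sentinel command handler sentinelCommand is: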

...
 else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         *
         * ip and port are the ip and port of the master we want to be
         * checked by Sentinel. Note that the command will not check by
         * name but just by master, in theory different Sentinels may monitor
         * different masters with the same name.
         *
         * current-epoch is needed in order to understand if we are allowed
         * to vote for a failover leader or not. Each Sentinel can vote just
         * one time per epoch.
         *
         * runid is "*" if we are not seeking for a vote from the Sentinel
         * in order to elect the failover leader. Otherwise it is set to the
         * runid we want the Sentinel to vote if it did not already voted.
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0;
        char *leader = NULL;
        long port;
        int isdown = 0;

        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                              != C_OK)
            return;
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);

        /* It exists? Is actually a master? Is subjectively down? It's down.
         * Note: if we are in tilt mode we always reply with "0". */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1;

        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }
...

As the code shows, when a Sentinel receives a master-down query from another Sentinel, it simply reads the flags stored in the corresponding master instance. It does not trigger a fresh probe or any other action.

Checking the master's objective-down state

Only master nodes are checked for objective down. The code is:
sentinelCheckObjectivelyDown

/* Is this instance down according to the configured quorum?
 *
 * Note that ODOWN is a weak quorum, it only means that enough Sentinels
 * reported in a given time range that the instance was not reachable.
 * However messages can be delayed so there are no strong guarantees about
 * N instances agreeing at the same time about the down state. */
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;

    if (master->flags & SRI_S_DOWN) {
        /* Is down for enough sentinels? */
        quorum = 1; /* the current sentinel. */
        /* Count all the other sentinels. */
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *ri = dictGetVal(de);

            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
        if (quorum >= master->quorum) odown = 1;
    }

    /* Set the flag accordingly to the outcome. */
    if (odown) {
        if ((master->flags & SRI_O_DOWN) == 0) {
            sentinelEvent(LL_WARNING,"+odown",master,"%@ #quorum %d/%d",
                quorum, master->quorum);
            master->flags |= SRI_O_DOWN;
            master->o_down_since_time = mstime();
        }
    } else {
        if (master->flags & SRI_O_DOWN) {
            sentinelEvent(LL_WARNING,"-odown",master,"%@");
            master->flags &= ~SRI_O_DOWN;
        }
    }
}

Once the replies have been recorded in the corresponding sentinels structures, Sentinel can check whether the number of Sentinels that consider the master down reaches the quorum configured in sentinel.conf. If the count is equal to or greater than the quorum, the master is considered objectively down, its SRI_O_DOWN flag is set to 1, and the leader election that follows formally begins.
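
The quorum count can be sketched without the Redis dict machinery. In this simplified version the other Sentinels' answers are passed in as a bool array; is_objectively_down and its parameters are hypothetical names, not part of Redis.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* others_say_down[i] is true when the i-th other Sentinel last replied
 * that the master is down (its SRI_MASTER_DOWN flag is set). */
static bool is_objectively_down(bool self_sdown, const bool *others_say_down,
                                size_t n_others, unsigned int quorum) {
    if (!self_sdown) return false;   /* ODOWN requires our own SDOWN first */
    unsigned int agree = 1;          /* count the current Sentinel */
    for (size_t i = 0; i < n_others; i++)
        if (others_say_down[i]) agree++;
    return agree >= quorum;
}
```

With two other Sentinels of which one agrees, the count is 2, so a quorum of 2 is met and a quorum of 3 is not. As the comment in the Redis source notes, this is a weak quorum: the answers were collected over a time window, not at a single instant.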

Question: On the way from subjective down to objective down, Sentinel uses no special algorithm for broadcasting the command or for agreeing that the master is objectively down. Does leader election then start only after all Sentinel nodes have reached the same state?
Answer: Judging from the code, Sentinel makes almost no deliberate effort to synchronize the master's objective state across the cluster. Objective down can only arise after enough nodes (at least quorum of them) have each judged the master subjectively down through their own periodic checks. It is a consensus that forms gradually through repeated command exchanges, and while it forms, sentinel is-master-down-by-addr commands flood the Sentinel network. As soon as one Sentinel node's objective-down condition is met, the election of the failover leader begins. Each Sentinel's condition is configured by hand, with no algorithm to tune it automatically; this is something that could be learned from blockchain systems, and I think this trigger condition is itself an important factor in how fast the cluster reaches consensus. A final point: when leader election starts, some Sentinel nodes in the cluster may not yet have judged the master down.

Electing the leader node

When a node in the Sentinel cluster has recognized that the master is objectively down, it starts a vote to elect a leader. The entry point is again the sentinelHandleRedisInstance code above:

...
 /* Only masters */
    if (ri->flags & SRI_MASTER) {
        sentinelCheckObjectivelyDown(ri);
        if (sentinelStartFailoverIfNeeded(ri))
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        sentinelFailoverStateMachine(ri);
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }

This time the focus is sentinelStartFailoverIfNeeded. As the name suggests, it decides whether a failover should start, and if so begins the leader election. The code:

/* This function checks if there are the conditions to start the failover,
 * that is:
 *
 * 1) Master must be in ODOWN condition.
 * 2) No failover already in progress.
 * 3) No failover already attempted recently.
 *
 * We still don't know if we'll win the election so it is possible that we
 * start the failover but that we'll not be able to act.
 *
 * Return non-zero if a failover was started. */
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* We can't failover if the master is not in O_DOWN state. */
    if (!(master->flags & SRI_O_DOWN)) return 0;

    /* Failover already in progress? */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;

    /* Last failover attempt started too little time ago? */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];

            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }

    sentinelStartFailover(master);
    return 1;
}
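  • The master must be objectively down.
  • No failover is already in progress for the master.
  • The last failover attempt must not have started less than twice the failover timeout ago (the default timeout is 3 minutes), so after a timed-out failover the next round starts at least six minutes later by default.
  • If all three conditions hold, sentinelStartFailover is called.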
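
The three preconditions reduce to a short predicate. This is a sketch with hypothetical names (can_start_failover and its parameters); it leaves out the logging that the real function performs when the retry delay has not yet elapsed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All times are in milliseconds. */
static bool can_start_failover(bool odown, bool failover_in_progress,
                               int64_t now, int64_t last_failover_start,
                               int64_t failover_timeout) {
    if (!odown) return false;                /* 1) master must be ODOWN */
    if (failover_in_progress) return false;  /* 2) none already running */
    /* 3) last attempt must be at least 2 * failover_timeout old */
    if (now - last_failover_start < failover_timeout * 2) return false;
    return true;
}
```

With the default timeout of 180000 ms, an attempt that started at time 0 blocks new attempts until time 360000, which is the six-minute figure above.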

Before reading sentinelStartFailover, two more questions arise that the code analysis below should answer:

  1. When is the failover-in-progress state set? Is it synchronized across all Sentinels in the cluster, or is it internal to the single Sentinel node that gets elected?
  2. What does the failover timeout refer to, what causes it, and does it lead to a new leader election?

With these questions in mind, here is sentinelStartFailover:

/* Setup the master state to start a failover. */
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);

    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}

It mainly initializes the state for a new failover epoch:

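  • Sets the state machine to SENTINEL_FAILOVER_STATE_WAIT_START.
  • Sets SRI_FAILOVER_IN_PROGRESS in flags to mark a failover as under way, so the periodic event does not start it again.
  • Increments the current epoch and records it as the failover epoch.
  • Sets the failover start time, adding a random offset below SENTINEL_MAX_DESYNC (1000 ms). The constant's name suggests that the aim is to desynchronize Sentinels so they are less likely to start elections at the same instant.
  • Sets the time of the last state-machine change.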

This method answers the first question: SRI_FAILOVER_IN_PROGRESS is set here, and it means that the current Sentinel node is carrying out a failover. Now back to sentinelAskMasterStateToOtherSentinels, the method that asks other Sentinels about the master's state. This time it is called with flags=SENTINEL_ASK_FORCED, so the Sentinel sends SENTINEL is-master-down-by-addr to the other nodes again, but the <runId> argument is no longer * and carries the current Sentinel's runId instead, asking the other nodes to elect it leader.

Leader election rules

The command is handled just as in the objective-down case, except that sentinelVoteLeader is also called:

...
 else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
   ...
        /* Vote for the master (or fetch the previous vote) if the request
         * includes a runid, otherwise the sender is not seeking for a vote. */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                            c->argv[5]->ptr,
                                            &leader_epoch);
        }
...

sentinelVoteLeader

/* Vote for the sentinel with 'req_runid' or return the old vote if already
 * voted for the specified 'req_epoch' or one greater.
 *
 * If a vote is not available returns NULL, otherwise return the Sentinel
 * runid and populate the leader_epoch with the epoch of the vote. */
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    if (req_epoch > sentinel.current_epoch) {
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }

    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        master->leader = sdsnew(req_runid);
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        /* If we did not voted for ourselves, set the master failover start
         * time to now, in order to force a delay before we can start a
         * failover for the same master. */
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }

    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}
  • Synchronize the voting epoch.
  • If no vote has been recorded for the master in the requested epoch, set the leader to the runId carried in the request; that runId may be the Sentinel's own.
  • If the runId is not its own, set the failover start time.
  • Every such state change is saved to the configuration file.
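
The one-vote-per-epoch rule is easier to see in isolation. The sketch below keeps only the epoch and leader bookkeeping of sentinelVoteLeader; vote_state and vote_leader are hypothetical names, and the config flush, event logging, and failover_start_time update are deliberately omitted.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t current_epoch;  /* this Sentinel's current epoch */
    uint64_t leader_epoch;   /* epoch of the vote recorded for the master */
    char leader[41];         /* runid voted for; empty string if none */
} vote_state;

/* Record a vote for req_runid unless a vote already exists for req_epoch.
 * Returns the runid this Sentinel has voted for, which may be an older vote. */
static const char *vote_leader(vote_state *s, uint64_t req_epoch,
                               const char *req_runid) {
    /* Adopt a newer epoch seen in the request. */
    if (req_epoch > s->current_epoch) s->current_epoch = req_epoch;
    /* Vote only if nothing was recorded for this epoch yet. */
    if (s->leader_epoch < req_epoch && s->current_epoch <= req_epoch) {
        strncpy(s->leader, req_runid, sizeof(s->leader) - 1);
        s->leader[sizeof(s->leader) - 1] = '\0';
        s->leader_epoch = s->current_epoch;
    }
    return s->leader;
}
```

A second request in the same epoch from a different candidate gets the first candidate's runid back, which is how a requester learns that the vote has already gone elsewhere.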

Here we see again how the failover start time is aligned across the Sentinel cluster: a non-leader Sentinel treats the failover as starting when it receives the vote request. The voting exchange goes as follows:

  • A Sentinel node sends SENTINEL is-master-down-by-addr asking the receiver to set it as leader. Two cases follow: 1) the receiver has not set a leader in the current epoch, so it sets the requester as leader; 2) the receiver has already set a leader in the current epoch, so it returns that leader.
  • On receiving the reply, the Sentinel stores the leader runid in the sentinelRedisInstance of the replying Sentinel, for the later vote count.
  • After one round of queries, the asking Sentinel has the votes of the other Sentinels.
  • It enters the SENTINEL_FAILOVER_STATE_WAIT_START state of the state machine to count the votes.
void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
    char *leader;
    int isleader;

    /* Check if we are the leader for the failover epoch. */
    leader = sentinelGetLeader(ri, ri->failover_epoch);
    isleader = leader && strcasecmp(leader,sentinel.myid) == 0;
    sdsfree(leader);

    /* If I'm not the leader, and it is not a forced failover via
     * SENTINEL FAILOVER, then I can't continue with the failover. */
    if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
        int election_timeout = SENTINEL_ELECTION_TIMEOUT;

        /* The election timeout is the MIN between SENTINEL_ELECTION_TIMEOUT
         * and the configured failover timeout. */
        if (election_timeout > ri->failover_timeout)
            election_timeout = ri->failover_timeout;
        /* Abort the failover if I'm not the leader after some time. */
        if (mstime() - ri->failover_start_time > election_timeout) {
            sentinelEvent(LL_WARNING,"-failover-abort-not-elected",ri,"%@");
            sentinelAbortFailover(ri);
        }
        return;
    }
    sentinelEvent(LL_WARNING,"+elected-leader",ri,"%@");
    if (sentinel.simfailure_flags & SENTINEL_SIMFAILURE_CRASH_AFTER_ELECTION)
        sentinelSimFailureCrash();
    ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
    ri->failover_state_change_time = mstime();
    sentinelEvent(LL_WARNING,"+failover-state-select-slave",ri,"%@");
}

sentinelGetLeader

/* Scan all the Sentinels attached to this master to check if there
 * is a leader for the specified epoch.
 *
 * To be a leader for a given epoch, we should have the majority of
 * the Sentinels we know (ever seen since the last SENTINEL RESET) that
 * reported the same instance as leader for the same epoch. */
char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
    dict *counters;
    dictIterator *di;
    dictEntry *de;
    unsigned int voters = 0, voters_quorum;
    char *myvote;
    char *winner = NULL;
    uint64_t leader_epoch;
    uint64_t max_votes = 0;

    serverAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));
    counters = dictCreate(&leaderVotesDictType,NULL);

    voters = dictSize(master->sentinels)+1; /* All the other sentinels and me.*/

    /* Count other sentinels votes */
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
            sentinelLeaderIncr(counters,ri->leader);
    }
    dictReleaseIterator(di);

    /* Check what's the winner. For the winner to win, it needs two conditions:
     * 1) Absolute majority between voters (50% + 1).
     * 2) And anyway at least master->quorum votes. */
    di = dictGetIterator(counters);
    while((de = dictNext(di)) != NULL) {
        uint64_t votes = dictGetUnsignedIntegerVal(de);

        if (votes > max_votes) {
            max_votes = votes;
            winner = dictGetKey(de);
        }
    }
    dictReleaseIterator(di);

    /* Count this Sentinel vote:
     * if this Sentinel did not voted yet, either vote for the most
     * common voted sentinel, or for itself if no vote exists at all. */
    if (winner)
        myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
    else
        myvote = sentinelVoteLeader(master,epoch,sentinel.myid,&leader_epoch);

    if (myvote && leader_epoch == epoch) {
        uint64_t votes = sentinelLeaderIncr(counters,myvote);

        if (votes > max_votes) {
            max_votes = votes;
            winner = myvote;
        }
    }

    voters_quorum = voters/2+1;
    if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
        winner = NULL;

    winner = winner ? sdsnew(winner) : NULL;
    sdsfree(myvote);
    dictRelease(counters);
    return winner;
}
  • First tally the votes, using Redis's own dict type leaderVotesDictType (essentially a key-value map), grouping the votes by runid.
  • Pick the runid with the most votes.
  • Then cast this Sentinel's own vote: for the winner if there is one, otherwise for itself.
  • The winner stands only if its votes reach both an absolute majority of all voters (voters/2+1) and the quorum configured for the master. That winner is the elected leader; otherwise there is no winner.

NOTE: At the very start of an election a Sentinel node does not vote for itself, but it can vote for itself later. A Sentinel has two opportunities to vote (when it receives a vote request, and when it counts votes in sentinelGetLeader), yet each node can vote only once per epoch. Canvassing is asynchronous, so some queries may not be answered in time.

From the election process above, producing a leader requires several conditions:

  • The candidate must collect at least voters/2+1 votes (its own included), and if the configured quorum is larger than voters/2+1, at least quorum votes.
  • Only the node with the most votes can become leader, and only with that level of support.
  • Votes count only within the current epoch.
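
The winning threshold can be written as a single check. This is a sketch of the final test in sentinelGetLeader under hypothetical names (has_won_election and its parameters); it assumes the votes for the current epoch have already been tallied.

```c
#include <assert.h>
#include <stdbool.h>

/* max_votes: votes received by the most-voted runid in this epoch.
 * voters:    all Sentinels known for the master, including this one.
 * quorum:    the quorum configured for the master. */
static bool has_won_election(unsigned int max_votes, unsigned int voters,
                             unsigned int quorum) {
    unsigned int majority = voters / 2 + 1;  /* absolute majority */
    /* Both thresholds must be met, so the larger of the two decides. */
    return max_votes >= majority && max_votes >= quorum;
}
```

For example, with 5 voters and quorum 2, three votes win; with 5 voters and quorum 4, three votes are a majority but still not enough.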

Because canvassing is asynchronous, if nodes are offline or the most-voted node cannot meet the requirements above, no leader emerges in the current epoch. The Sentinel can only wait for the timeout and then start a new epoch, until a failover leader is elected and the state machine moves to its next state.

Leader election summary

The code above explained the Sentinel leader election flow. To sum up how Sentinel reaches consensus:
For all nodes in a cluster to reach consensus they must exchange cluster-level state. Sentinel nodes do this by sending SENTINEL is-master-down-by-addr and reading the other nodes' answers. Each Sentinel node maintains its own per-master data structure, which in some respects can be thought of as a routing table, with SENTINEL is-master-down-by-addr as the exchange protocol. A peculiarity of the Sentinel network is that Sentinel nodes discover each other indirectly, by subscribing to the hello channel of the master they share. The master acts as the medium, even though master nodes and Sentinel nodes serve entirely different functions.
In short, the basic problems that a highly available distributed cluster must solve, in the abstract, are:

  • Building the communication network between cluster nodes
  • How protocol messages propagate between cluster nodes
  • The protocol the cluster nodes exchange

Beyond these basics, the Sentinel system has to reach consensus. Both the master's objective down and the election of a failover leader are consensus processes. Naturally we want the rules and criteria for consensus to be the same on every node (here, every Sentinel node) so that all nodes are treated fairly. Sentinel has a manually configured quorum, which I regard as an important factor in how the whole Sentinel cluster reaches consensus. Unfortunately it is configurable per node rather than for the cluster as a whole, which gives a single node a great deal of power and somewhat undermines the cluster's stability.
Back to the point. Every Sentinel node runs its own periodic probing of the master. When one node finds its monitored master unreachable and marks it subjectively down, the Sentinel system's first round of consensus begins: that node keeps asking the other nodes whether they also consider the master down. Each reply that says so updates the flags of the corresponding sentinelRedisInstance, and over time the node learns of more and more nodes that judge the master down, until a threshold is reached. Seen from another angle, every node in the cluster gradually arrives at objective down through its own periodic probing. The cluster does not obtain this state by mutual notification; each node senses it for itself, and a node learning the others' state can be seen as polling. When some node is the first to meet the objective-down condition (>= quorum), the second round of consensus, "calling a vote", begins. As mentioned earlier, there is no sharp boundary between the two rounds across the cluster: each node reaches them on its own, unaffected by the state of other nodes. It is also worth noting that every Sentinel node has a vote, even one that has not yet confirmed that the master is down. This round has one condition: the failover leader must win more than half of the votes in the cluster. The election also has a time limit; if no leader emerges in time, another round is held, until a leader is obtained within the limit and consensus is reached. Each round of voting is therefore governed by an epoch, somewhat like a version number. From these two rounds we can abstract the points a Sentinel cluster needs in order to reach consensus:

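  • Every node can stand for election and vote, with exactly one vote per epoch. This ensures that every node computes the same result for a given round.
  • Nodes exchange the election results.
  • A trigger rule decides when the election result counts as consensus (voters/2+1, i.e. a majority).
  • The election must reach consensus within a time limit.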
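The rules above can be sketched as a small tally function. This is a minimal sketch, not the Redis implementation: `pick_leader` and its array-based vote representation are hypothetical, and the extra requirement that the winner also holds at least `quorum` votes reflects my reading of the Redis source and should be treated as an assumption.

```c
#include <stddef.h>

/* Hypothetical, simplified tally for one epoch.
 * votes[i] is the index of the candidate that voter i chose,
 * or -1 if voter i has not voted. Candidates are the voters themselves. */
int pick_leader(const int *votes, size_t voters, int quorum) {
    int best = -1, best_count = 0;

    for (size_t c = 0; c < voters; c++) {
        int count = 0;
        for (size_t v = 0; v < voters; v++)
            if (votes[v] == (int)c) count++;
        if (count > best_count) {
            best_count = count;
            best = (int)c;
        }
    }

    /* The winner needs an absolute majority of all voters (voters/2+1)
     * and, by assumption, at least `quorum` votes as well.
     * Otherwise this epoch ends without a leader. */
    int majority = (int)(voters / 2) + 1;
    if (best_count < majority || best_count < quorum) return -1;
    return best;
}
```

A return value of -1 corresponds to "no leader in this epoch", in which case the epoch is bumped and a new round starts after the election timeout.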

Other failover operations

State machine

Once the leader election is over, the elected leader node starts the actual failover. As the code shown earlier revealed, Sentinel drives the failover through a state machine.

void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    serverAssert(ri->flags & SRI_MASTER);

    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;

    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START:
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE:
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            sentinelFailoverReconfNextSlave(ri);
            break;
    }
}

The overall state transitions are:

SENTINEL_FAILOVER_STATE_WAIT_START
                ||
                \/
SENTINEL_FAILOVER_STATE_SELECT_SLAVE
                ||
                \/
SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE
                ||
                \/
SENTINEL_FAILOVER_STATE_WAIT_PROMOTION
                ||
                \/
SENTINEL_FAILOVER_STATE_RECONF_SLAVES
                ||
                \/
SENTINEL_FAILOVER_STATE_UPDATE_CONFIG
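The linear progression in the diagram can be written down as a tiny transition function. This is only an illustrative sketch: the `failover_state` enum and `failover_next` are hypothetical names, and the real code changes state inside each handler rather than through one central function.

```c
/* Hypothetical mirror of the failover states in the diagram. */
typedef enum {
    FAILOVER_NONE,
    FAILOVER_WAIT_START,
    FAILOVER_SELECT_SLAVE,
    FAILOVER_SEND_SLAVEOF_NOONE,
    FAILOVER_WAIT_PROMOTION,
    FAILOVER_RECONF_SLAVES,
    FAILOVER_UPDATE_CONFIG
} failover_state;

/* Advance to the next state when the current step succeeded.
 * If the step has not succeeded yet, stay in place so the same
 * step is retried on the next timer tick. */
failover_state failover_next(failover_state s, int step_ok) {
    if (!step_ok) return s;
    switch (s) {
    case FAILOVER_WAIT_START:         return FAILOVER_SELECT_SLAVE;
    case FAILOVER_SELECT_SLAVE:       return FAILOVER_SEND_SLAVEOF_NOONE;
    case FAILOVER_SEND_SLAVEOF_NOONE: return FAILOVER_WAIT_PROMOTION;
    case FAILOVER_WAIT_PROMOTION:     return FAILOVER_RECONF_SLAVES;
    case FAILOVER_RECONF_SLAVES:      return FAILOVER_UPDATE_CONFIG;
    default:                          return FAILOVER_NONE;
    }
}
```

Because each call advances at most one step, a full failover spans several timer cycles, which matches the asynchronous behavior described in this article.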

Selecting a slave

The selection rules are implemented along the call chain sentinelFailoverSelectSlave -> sentinelSelectSlave:

sentinelRedisInstance *sentinelSelectSlave(sentinelRedisInstance *master) {
    sentinelRedisInstance **instance =
        zmalloc(sizeof(instance[0])*dictSize(master->slaves));
    sentinelRedisInstance *selected = NULL;
    int instances = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t max_master_down_time = 0;

    if (master->flags & SRI_S_DOWN)
        max_master_down_time += mstime() - master->s_down_since_time;
    max_master_down_time += master->down_after_period * 10;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);
        mstime_t info_validity_time;

        if (slave->flags & (SRI_S_DOWN|SRI_O_DOWN)) continue;
        if (slave->link->disconnected) continue;
        if (mstime() - slave->link->last_avail_time > SENTINEL_PING_PERIOD*5) continue;
        if (slave->slave_priority == 0) continue;

        /* If the master is in SDOWN state we get INFO for slaves every second.
         * Otherwise we get it with the usual period so we need to account for
         * a larger delay. */
        if (master->flags & SRI_S_DOWN)
            info_validity_time = SENTINEL_PING_PERIOD*5;
        else
            info_validity_time = SENTINEL_INFO_PERIOD*3;
        if (mstime() - slave->info_refresh > info_validity_time) continue;
        if (slave->master_link_down_time > max_master_down_time) continue;
        instance[instances++] = slave;
    }
    dictReleaseIterator(di);
    if (instances) {
        qsort(instance,instances,sizeof(sentinelRedisInstance*),
            compareSlavesForPromotion);
        selected = instance[0];
    }
    zfree(instance);
    return selected;
}
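The filtering part of the function above can be condensed into a single predicate. The sketch below is a simplification under stated assumptions: `slave_view` and `slave_is_candidate` are hypothetical, and the period constants (1000 ms and 10000 ms) are the values I assume for SENTINEL_PING_PERIOD and SENTINEL_INFO_PERIOD.

```c
#include <stdbool.h>
#include <stdint.h>

#define PING_PERIOD_MS 1000   /* assumed value of SENTINEL_PING_PERIOD */
#define INFO_PERIOD_MS 10000  /* assumed value of SENTINEL_INFO_PERIOD */

/* Hypothetical flattened view of the fields the filter looks at. */
typedef struct {
    bool down;                    /* SRI_S_DOWN or SRI_O_DOWN is set */
    bool disconnected;            /* link->disconnected */
    int64_t last_avail_ms;        /* time of the last valid ping reply */
    int priority;                 /* slave_priority, 0 = never promote */
    int64_t info_refresh_ms;      /* time of the last INFO reply */
    int64_t master_link_down_ms;  /* master_link_down_time from INFO */
} slave_view;

bool slave_is_candidate(const slave_view *s, int64_t now_ms,
                        bool master_sdown, int64_t max_master_down_ms) {
    /* INFO is polled every second while the master is in SDOWN,
     * so a much shorter validity window applies in that case. */
    int64_t info_validity =
        master_sdown ? PING_PERIOD_MS * 5 : INFO_PERIOD_MS * 3;

    if (s->down || s->disconnected) return false;
    if (now_ms - s->last_avail_ms > PING_PERIOD_MS * 5) return false;
    if (s->priority == 0) return false;
    if (now_ms - s->info_refresh_ms > info_validity) return false;
    if (s->master_link_down_ms > max_master_down_ms) return false;
    return true;
}
```

Here `max_master_down_ms` stands for the value the original code computes as the time the master has been in SDOWN plus `down_after_period * 10`.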

Selection strategy:

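  • Exclude slaves already flagged subjectively or objectively down.
  • Exclude slaves whose link is disconnected.
  • Exclude slaves that have not returned a valid ping reply for more than 5*SENTINEL_PING_PERIOD milliseconds (i.e. 5s).
  • Exclude slaves with priority 0.
  • If the master is in the SRI_S_DOWN state, Sentinel sends INFO to the slaves every 1s, so slaves whose last INFO reply is older than SENTINEL_PING_PERIOD*5 (i.e. 5s) are excluded. Otherwise slaves whose last INFO reply is older than 3*SENTINEL_INFO_PERIOD (i.e. 30s) are excluded.
  • Exclude slaves whose link to the master has been down (master_link_down_time) for longer than the time the master has been subjectively down plus master->down_after_period * 10. This makes it as likely as possible that the slave was still connected to the master right before the master went down.
  • One of the remaining slaves is then chosen as follows: first by priority (the smallest non-zero slave_priority value wins), then by replication offset, preferring the largest offset because that slave's data is closest to the master's, and finally by runid, choosing the lexicographically smallest one.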
int compareSlavesForPromotion(const void *a, const void *b) {
    sentinelRedisInstance **sa = (sentinelRedisInstance **)a,
                          **sb = (sentinelRedisInstance **)b;
    char *sa_runid, *sb_runid;

    if ((*sa)->slave_priority != (*sb)->slave_priority)
        return (*sa)->slave_priority - (*sb)->slave_priority;

    /* If priority is the same, select the slave with greater replication
     * offset (processed more data from the master). */
    if ((*sa)->slave_repl_offset > (*sb)->slave_repl_offset) {
        return -1; /* a < b */
    } else if ((*sa)->slave_repl_offset < (*sb)->slave_repl_offset) {
        return 1; /* a > b */
    }

    /* If the replication offset is the same select the slave with that has
     * the lexicographically smaller runid. Note that we try to handle runid
     * == NULL as there are old Redis versions that don't publish runid in
     * INFO. A NULL runid is considered bigger than any other runid. */
    sa_runid = (*sa)->runid;
    sb_runid = (*sb)->runid;
    if (sa_runid == NULL && sb_runid == NULL) return 0;
    else if (sa_runid == NULL) return 1;  /* a > b */
    else if (sb_runid == NULL) return -1; /* a < b */
    return strcasecmp(sa_runid, sb_runid);
}
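To see the ordering in action, here is a small self-contained sketch that sorts a few hypothetical candidates with `qsort`. The `candidate` struct and `best_candidate` are illustrative names, and plain `strcmp` is swapped in for the case-insensitive `strcasecmp` used by the original so that the example only needs standard C headers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical reduced view of a slave for ranking purposes. */
typedef struct {
    int priority;            /* lower value is preferred */
    long long repl_offset;   /* larger value is preferred */
    const char *runid;       /* smaller string is preferred, may be NULL */
} candidate;

static int compare_candidates(const void *a, const void *b) {
    const candidate *ca = a, *cb = b;

    if (ca->priority != cb->priority)
        return ca->priority < cb->priority ? -1 : 1;
    if (ca->repl_offset != cb->repl_offset)
        return ca->repl_offset > cb->repl_offset ? -1 : 1;

    /* A NULL runid sorts after any real runid. */
    if (ca->runid == NULL && cb->runid == NULL) return 0;
    if (ca->runid == NULL) return 1;
    if (cb->runid == NULL) return -1;
    return strcmp(ca->runid, cb->runid);
}

/* Sorts the array in place and returns the best candidate,
 * or NULL when there are no candidates. */
const candidate *best_candidate(candidate *c, size_t n) {
    if (n == 0) return NULL;
    qsort(c, n, sizeof(c[0]), compare_candidates);
    return &c[0];
}
```

With equal priorities, a slave at offset 900 beats one at offset 500, and between two slaves at offset 900 the one with the smaller runid is picked.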

Sending the command that promotes the slave to master

Call chain: sentinelFailoverSendSlaveOfNoOne -> sentinelSendSlaveOf

void sentinelFailoverSendSlaveOfNoOne(sentinelRedisInstance *ri) {
    int retval;

    /* We can't send the command to the promoted slave if it is now
     * disconnected. Retry again and again with this state until the timeout
     * is reached, then abort the failover. */
    if (ri->promoted_slave->link->disconnected) {
        if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
            sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
            sentinelAbortFailover(ri);
        }
        return;
    }

    /* Send SLAVEOF NO ONE command to turn the slave into a master.
     * We actually register a generic callback for this command as we don't
     * really care about the reply. We check if it worked indirectly observing
     * if INFO returns a different role (master instead of slave). */
    retval = sentinelSendSlaveOf(ri->promoted_slave,NULL,0);
    if (retval != C_OK) return;
    sentinelEvent(LL_NOTICE, "+failover-state-wait-promotion",
        ri->promoted_slave,"%@");
    ri->failover_state = SENTINEL_FAILOVER_STATE_WAIT_PROMOTION;
    ri->failover_state_change_time = mstime();
}

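A SLAVEOF NO ONE command is sent to the selected slave to promote it to master. Sentinel does not inspect the reply to this command (only a generic callback is registered). Instead it learns whether the slave's role has changed from the INFO commands it sends to the slave. If the link to the selected slave turns out to be disconnected before sending, the state machine keeps retrying in this state. Only once failover_timeout has elapsed is the failover declared timed out and aborted, after which the failover is reset and a new round of voting can begin.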

Waiting for the slave to be promoted

After the SLAVEOF NO ONE command has been sent, the failover state machine enters the SENTINEL_FAILOVER_STATE_WAIT_PROMOTION state. In this state Sentinel only checks whether failover_state_change_time has timed out. If it has, the failover is declared timed out and aborted, the failover is reset, and a new round of voting begins.

/* We actually wait for promotion indirectly checking with INFO when the
 * slave turns into a master. */
void sentinelFailoverWaitPromotion(sentinelRedisInstance *ri) {
    /* Just handle the timeout. Switching to the next state is handled
     * by the function parsing the INFO command of the promoted slave. */
    if (mstime() - ri->failover_state_change_time > ri->failover_timeout) {
        sentinelEvent(LL_WARNING,"-failover-abort-slave-timeout",ri,"%@");
        sentinelAbortFailover(ri);
    }
}

As mentioned when sending SLAVEOF NO ONE, Sentinel does not act on the reply to that command. It detects the slave's role change through the periodic INFO command instead. Parsing of the INFO reply was covered in the previous chapter, so we return to that code here.

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    sds *lines;
    int numlines, j;
    int role = 0;

    /* cache full INFO output for instance */
    sdsfree(ri->info);
    ri->info = sdsnew(info);

    /* The following fields must be reset to a given value in the case they
     * are not found at all in the INFO output. */
    ri->master_link_down_time = 0;

    ...
     
    /* Handle slave -> master role switch. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_MASTER) {
        /* If this is a promoted slave we can change state to the
         * failover state machine. */
        if ((ri->flags & SRI_PROMOTED) &&
            (ri->master->flags & SRI_FAILOVER_IN_PROGRESS) &&
            (ri->master->failover_state ==
                SENTINEL_FAILOVER_STATE_WAIT_PROMOTION))
        {
            /* Now that we are sure the slave was reconfigured as a master
             * set the master configuration epoch to the epoch we won the
             * election to perform this failover. This will force the other
             * Sentinels to update their config (assuming there is not
             * a newer one already available). */
            ri->master->config_epoch = ri->master->failover_epoch;
            ri->master->failover_state = SENTINEL_FAILOVER_STATE_RECONF_SLAVES;
            ri->master->failover_state_change_time = mstime();
            sentinelFlushConfig();
            sentinelEvent(LL_WARNING,"+promoted-slave",ri,"%@");
            if (sentinel.simfailure_flags &
                SENTINEL_SIMFAILURE_CRASH_AFTER_PROMOTION)
                sentinelSimFailureCrash();
            sentinelEvent(LL_WARNING,"+failover-state-reconf-slaves",
                ri->master,"%@");
            sentinelCallClientReconfScript(ri->master,SENTINEL_LEADER,
                "start",ri->master->addr,ri->addr);
            sentinelForceHelloUpdateForMaster(ri->master);
        } else {
            /* A slave turned into a master. We want to force our view and
             * reconfigure as slave. Wait some time after the change before
             * going forward, to receive new configs if any. */
            mstime_t wait_time = SENTINEL_PUBLISH_PERIOD*4;

            if (!(ri->flags & SRI_PROMOTED) &&
                 sentinelMasterLooksSane(ri->master) &&
                 sentinelRedisInstanceNoDownFor(ri,wait_time) &&
                 mstime() - ri->role_reported_time > wait_time)
            {
                int retval = sentinelSendSlaveOf(ri,
                        ri->master->addr->ip,
                        ri->master->addr->port);
                if (retval == C_OK)
                    sentinelEvent(LL_NOTICE,"+convert-to-slave",ri,"%@");
            }
        }
    }

    /* Handle slaves replicating to a different master address. */
    if ((ri->flags & SRI_SLAVE) &&
        role == SRI_SLAVE &&
        (ri->slave_master_port != ri->master->addr->port ||
         strcasecmp(ri->slave_master_host,ri->master->addr->ip)))
    {
        mstime_t wait_time = ri->master->failover_timeout;

        /* Make sure the master is sane before reconfiguring this instance
         * into a slave. */
        if (sentinelMasterLooksSane(ri->master) &&
            sentinelRedisInstanceNoDownFor(ri,wait_time) &&
            mstime() - ri->slave_conf_change_time > wait_time)
        {
            int retval = sentinelSendSlaveOf(ri,
                    ri->master->addr->ip,
                    ri->master->addr->port);
            if (retval == C_OK)
                sentinelEvent(LL_NOTICE,"+fix-slave-config",ri,"%@");
        }
    }

    /* Detect if the slave that is in the process of being reconfigured
     * changed state. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
        (ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
    {
        /* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
        if ((ri->flags & SRI_RECONF_SENT) &&
            ri->slave_master_host &&
            strcmp(ri->slave_master_host,
                    ri->master->promoted_slave->addr->ip) == 0 &&
            ri->slave_master_port == ri->master->promoted_slave->addr->port)
        {
            ri->flags &= ~SRI_RECONF_SENT;
            ri->flags |= SRI_RECONF_INPROG;
            sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
        }

        /* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
        if ((ri->flags & SRI_RECONF_INPROG) &&
            ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
        {
            ri->flags &= ~SRI_RECONF_INPROG;
            ri->flags |= SRI_RECONF_DONE;
            sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
        }
    }
}

Only the code related to the slave -> master switch is excerpted here.

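  • First check whether the slave monitored by this Sentinel is the promoted slave of a failover in progress (its master is in SENTINEL_FAILOVER_STATE_WAIT_PROMOTION).
  • If so, update the config epoch, move to the next state SENTINEL_FAILOVER_STATE_RECONF_SLAVES, update the state change time, and flush everything to the config file.
  • Call the client reconfiguration script.
  • Call sentinelForceHelloUpdateForMaster -> sentinelForceHelloUpdateDictOfRedisInstances so that the hello msg is broadcast in the next cycle.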
/* Reset last_pub_time in all the instances in the specified dictionary
 * in order to force the delivery of an Hello update ASAP. */
void sentinelForceHelloUpdateDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;

    di = dictGetSafeIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);
        if (ri->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
            ri->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
    }
    dictReleaseIterator(di);
}

/* This function forces the delivery of an "Hello" message (see
 * sentinelSendHello() top comment for further information) to all the Redis
 * and Sentinel instances related to the specified 'master'.
 *
 * It is technically not needed since we send an update to every instance
 * with a period of SENTINEL_PUBLISH_PERIOD milliseconds, however when a
 * Sentinel upgrades a configuration it is a good idea to deliever an update
 * to the other Sentinels ASAP. */
int sentinelForceHelloUpdateForMaster(sentinelRedisInstance *master) {
    if (!(master->flags & SRI_MASTER)) return C_ERR;
    if (master->last_pub_time >= (SENTINEL_PUBLISH_PERIOD+1))
        master->last_pub_time -= (SENTINEL_PUBLISH_PERIOD+1);
    sentinelForceHelloUpdateDictOfRedisInstances(master->sentinels);
    sentinelForceHelloUpdateDictOfRedisInstances(master->slaves);
    return C_OK;
}
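The trick in the two functions above is to move `last_pub_time` back by one publish period plus one millisecond, so that the next periodic check sees the hello message as overdue. A minimal sketch of that arithmetic follows. `hello_due` and `force_hello` are hypothetical helpers, and both the 2000 ms period and the "strictly greater than" check are assumptions about how the periodic publisher decides to send.

```c
#include <stdbool.h>
#include <stdint.h>

#define PUBLISH_PERIOD_MS 2000 /* assumed value of SENTINEL_PUBLISH_PERIOD */

/* Assumed shape of the periodic check: publish when more than one
 * full period has passed since the last publication. */
bool hello_due(int64_t now_ms, int64_t last_pub_ms) {
    return now_ms - last_pub_ms > PUBLISH_PERIOD_MS;
}

/* Rewind the timestamp by PERIOD+1 ms so hello_due() fires on the
 * next check. The guard avoids producing a negative timestamp. */
int64_t force_hello(int64_t last_pub_ms) {
    if (last_pub_ms >= PUBLISH_PERIOD_MS + 1)
        last_pub_ms -= PUBLISH_PERIOD_MS + 1;
    return last_pub_ms;
}
```

Even when a hello message was published in the current millisecond, rewinding by PERIOD+1 is enough to make the next check succeed.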
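
Reconfiguring the slaves

Once the INFO command shows that the selected slave has successfully become a master, the failover enters the SENTINEL_FAILOVER_STATE_RECONF_SLAVES state and the remaining slaves are reconfigured.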

/* Send SLAVE OF <new master address> to all the remaining slaves that
 * still don't appear to have the configuration updated. */
void sentinelFailoverReconfNextSlave(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    int in_progress = 0;

    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG))
            in_progress++;
    }
    dictReleaseIterator(di);

    di = dictGetIterator(master->slaves);
    while(in_progress < master->parallel_syncs &&
          (de = dictNext(di)) != NULL)
    {
        sentinelRedisInstance *slave = dictGetVal(de);
        int retval;

        /* Skip the promoted slave, and already configured slaves. */
        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;

        /* If too much time elapsed without the slave moving forward to
         * the next state, consider it reconfigured even if it is not.
         * Sentinels will detect the slave as misconfigured and fix its
         * configuration later. */
        if ((slave->flags & SRI_RECONF_SENT) &&
            (mstime() - slave->slave_reconf_sent_time) >
            SENTINEL_SLAVE_RECONF_TIMEOUT)
        {
            sentinelEvent(LL_NOTICE,"-slave-reconf-sent-timeout",slave,"%@");
            slave->flags &= ~SRI_RECONF_SENT;
            slave->flags |= SRI_RECONF_DONE;
        }

        /* Nothing to do for instances that are disconnected or already
         * in RECONF_SENT state. */
        if (slave->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)) continue;
        if (slave->link->disconnected) continue;

        /* Send SLAVEOF <new master>. */
        retval = sentinelSendSlaveOf(slave,
                master->promoted_slave->addr->ip,
                master->promoted_slave->addr->port);
        if (retval == C_OK) {            
            slave->flags |= SRI_RECONF_SENT;
            slave->slave_reconf_sent_time = mstime();
            sentinelEvent(LL_NOTICE,"+slave-reconf-sent",slave,"%@");
            in_progress++;
        }
    }
    dictReleaseIterator(di);

    /* Check if all the slaves are reconfigured and handle timeout. */
    sentinelFailoverDetectEnd(master);
}

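The SLAVEOF <new master address> command is sent to the other slaves. The reply to this command is not used as a status signal either: whether each slave has been reconfigured to the new master is again learned by probing with the INFO command.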
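The throttling by `parallel_syncs` in sentinelFailoverReconfNextSlave can be isolated in a small sketch. Everything here is a simplification: `reconf_tick` and the `EX_` flag names are hypothetical, sending the command is modeled as always succeeding, and the timeout and disconnected-link branches are left out.

```c
#include <stddef.h>

/* Hypothetical flag values, laid out like the SRI_* bits. */
enum {
    EX_PROMOTED      = 1 << 7,
    EX_RECONF_SENT   = 1 << 8,
    EX_RECONF_INPROG = 1 << 9,
    EX_RECONF_DONE   = 1 << 10
};

/* One timer tick: mark as many slaves as allowed with EX_RECONF_SENT.
 * Returns the number of slaves that were newly sent the command. */
size_t reconf_tick(int *flags, size_t n, int parallel_syncs) {
    int in_progress = 0;
    size_t sent = 0;

    /* Count slaves that are already being reconfigured. */
    for (size_t i = 0; i < n; i++)
        if (flags[i] & (EX_RECONF_SENT | EX_RECONF_INPROG)) in_progress++;

    for (size_t i = 0; i < n && in_progress < parallel_syncs; i++) {
        if (flags[i] & (EX_PROMOTED | EX_RECONF_DONE)) continue;
        if (flags[i] & (EX_RECONF_SENT | EX_RECONF_INPROG)) continue;
        flags[i] |= EX_RECONF_SENT; /* stands in for a successful SLAVEOF */
        in_progress++;
        sent++;
    }
    return sent;
}
```

With `parallel_syncs` set to 1, only one slave at a time resynchronizes with the new master, so the others can keep serving (possibly stale) reads in the meantime.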

/* Process the INFO output from masters. */
void sentinelRefreshInstanceInfo(sentinelRedisInstance *ri, const char *info) {
    sds *lines;
    int numlines, j;
    int role = 0;

    /* cache full INFO output for instance */
    sdsfree(ri->info);
    ri->info = sdsnew(info);

    /* The following fields must be reset to a given value in the case they
     * are not found at all in the INFO output. */
    ri->master_link_down_time = 0;

    ...

    /* Detect if the slave that is in the process of being reconfigured
     * changed state. */
    if ((ri->flags & SRI_SLAVE) && role == SRI_SLAVE &&
        (ri->flags & (SRI_RECONF_SENT|SRI_RECONF_INPROG)))
    {
        /* SRI_RECONF_SENT -> SRI_RECONF_INPROG. */
        if ((ri->flags & SRI_RECONF_SENT) &&
            ri->slave_master_host &&
            strcmp(ri->slave_master_host,
                    ri->master->promoted_slave->addr->ip) == 0 &&
            ri->slave_master_port == ri->master->promoted_slave->addr->port)
        {
            ri->flags &= ~SRI_RECONF_SENT;
            ri->flags |= SRI_RECONF_INPROG;
            sentinelEvent(LL_NOTICE,"+slave-reconf-inprog",ri,"%@");
        }

        /* SRI_RECONF_INPROG -> SRI_RECONF_DONE */
        if ((ri->flags & SRI_RECONF_INPROG) &&
            ri->slave_master_link_status == SENTINEL_MASTER_LINK_STATUS_UP)
        {
            ri->flags &= ~SRI_RECONF_INPROG;
            ri->flags |= SRI_RECONF_DONE;
            sentinelEvent(LL_NOTICE,"+slave-reconf-done",ri,"%@");
        }
    }
}

As the code above shows, there is in fact one more state transition at the end, in which Sentinel waits for the other slaves to switch over to the promoted slave.

SRI_RECONF_SENT -> SRI_RECONF_INPROG -> SRI_RECONF_DONE

  • SRI_RECONF_SENT: the state after SLAVEOF <new master address> has been sent.
  • SRI_RECONF_INPROG: the slave that received SLAVEOF <new master address> now reports the new master's host and port in INFO, i.e. it has been configured as a replica of the new master and is synchronizing.
  • SRI_RECONF_DONE: the slave has finished reconfiguring to the new master. The precondition for reaching this state is master_link_status:up in the slave's INFO output.
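The three-step progression can be expressed with plain bit operations. The following is a minimal sketch of the INFO-driven transition, assuming a hypothetical `slave_info` struct and `reconf_advance` function. The flag values follow the SRI_* layout quoted earlier, but the `EX_` names are illustrative.

```c
#include <stdbool.h>
#include <string.h>

enum {
    EX_RECONF_SENT   = 1 << 8,
    EX_RECONF_INPROG = 1 << 9,
    EX_RECONF_DONE   = 1 << 10
};

/* Hypothetical subset of what INFO tells Sentinel about a slave. */
typedef struct {
    int flags;
    const char *master_host; /* master host reported by the slave */
    int master_port;         /* master port reported by the slave */
    bool link_up;            /* master_link_status:up */
} slave_info;

/* Apply one INFO refresh to the reconfiguration flags. */
void reconf_advance(slave_info *s, const char *new_ip, int new_port) {
    if (!(s->flags & (EX_RECONF_SENT | EX_RECONF_INPROG))) return;

    /* SENT -> INPROG: the slave now points at the promoted slave. */
    if ((s->flags & EX_RECONF_SENT) && s->master_host &&
        strcmp(s->master_host, new_ip) == 0 &&
        s->master_port == new_port) {
        s->flags &= ~EX_RECONF_SENT;
        s->flags |= EX_RECONF_INPROG;
    }

    /* INPROG -> DONE: replication link with the new master is up. */
    if ((s->flags & EX_RECONF_INPROG) && s->link_up) {
        s->flags &= ~EX_RECONF_INPROG;
        s->flags |= EX_RECONF_DONE;
    }
}
```

Note that a single refresh can carry a slave through both transitions if INFO already reports the new master and an established link.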

Finally, sentinelFailoverReconfNextSlave calls sentinelFailoverDetectEnd to check whether all slaves have been correctly configured with the new master. If they all have, the failover moves to the next state, SENTINEL_FAILOVER_STATE_UPDATE_CONFIG.

void sentinelFailoverDetectEnd(sentinelRedisInstance *master) {
    int not_reconfigured = 0, timeout = 0;
    dictIterator *di;
    dictEntry *de;
    mstime_t elapsed = mstime() - master->failover_state_change_time;

    /* We can't consider failover finished if the promoted slave is
     * not reachable. */
    if (master->promoted_slave == NULL ||
        master->promoted_slave->flags & SRI_S_DOWN) return;

    /* The failover terminates once all the reachable slaves are properly
     * configured. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (slave->flags & (SRI_PROMOTED|SRI_RECONF_DONE)) continue;
        if (slave->flags & SRI_S_DOWN) continue;
        not_reconfigured++;
    }
    dictReleaseIterator(di);

    /* Force end of failover on timeout. */
    if (elapsed > master->failover_timeout) {
        not_reconfigured = 0;
        timeout = 1;
        sentinelEvent(LL_WARNING,"+failover-end-for-timeout",master,"%@");
    }

    if (not_reconfigured == 0) {
        sentinelEvent(LL_WARNING,"+failover-end",master,"%@");
        master->failover_state = SENTINEL_FAILOVER_STATE_UPDATE_CONFIG;
        master->failover_state_change_time = mstime();
    }

    /* If I'm the leader it is a good idea to send a best effort SLAVEOF
     * command to all the slaves still not reconfigured to replicate with
     * the new master. */
    if (timeout) {
        dictIterator *di;
        dictEntry *de;

        di = dictGetIterator(master->slaves);
        while((de = dictNext(di)) != NULL) {
            sentinelRedisInstance *slave = dictGetVal(de);
            int retval;

            if (slave->flags & (SRI_RECONF_DONE|SRI_RECONF_SENT)) continue;
            if (slave->link->disconnected) continue;
            retval = sentinelSendSlaveOf(slave,
                    master->promoted_slave->addr->ip,
                    master->promoted_slave->addr->port);
            if (retval == C_OK) {
                sentinelEvent(LL_NOTICE,"+slave-reconf-sent-be",slave,"%@");
                slave->flags |= SRI_RECONF_SENT;
            }
        }
        dictReleaseIterator(di);
    }
}
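The decision at the heart of sentinelFailoverDetectEnd boils down to "all reachable slaves are done, or the timeout has passed". A compact sketch of that rule is given below. `failover_finished` and the `EX_` flags are hypothetical, and the check on the promoted slave as well as the best-effort SLAVEOF on timeout are intentionally omitted.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum {
    EX_S_DOWN      = 1 << 3,
    EX_PROMOTED    = 1 << 7,
    EX_RECONF_DONE = 1 << 10
};

/* Returns true when the failover may move on to the final state. */
bool failover_finished(const int *slave_flags, size_t n,
                       int64_t elapsed_ms, int64_t failover_timeout_ms) {
    size_t not_reconfigured = 0;

    for (size_t i = 0; i < n; i++) {
        /* The promoted slave and finished slaves need no more work. */
        if (slave_flags[i] & (EX_PROMOTED | EX_RECONF_DONE)) continue;
        /* Unreachable slaves do not block the end of the failover. */
        if (slave_flags[i] & EX_S_DOWN) continue;
        not_reconfigured++;
    }

    /* On timeout the failover is forced to end regardless. */
    if (elapsed_ms > failover_timeout_ms) return true;
    return not_reconfigured == 0;
}
```

Slaves skipped here because they are down, or left behind by a timeout, are expected to be fixed later by the "+fix-slave-config" path shown in the INFO handling code.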
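
Updating the master address

After all slaves have been switched over, the failover is in the SENTINEL_FAILOVER_STATE_UPDATE_CONFIG state. This state is handled in the periodic handler rather than in the state machine. The main reason is that a master in this state is about to be replaced by the promoted slave, which only requires changing the original master's address. The periodic handler is also recursive, so if this state were not handled at the very end of the periodic processing of the original master and its slave and sentinel nodes, it could cause unnecessary problems. The reset code is shown below: it rebuilds the slaves dict and adds the original master node to it as a slave. The address of the original master's sentinelRedisInstance is replaced, and its state and connections are reset by sentinelResetMaster.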

/* Perform scheduled operations for all the instances in the dictionary.
 * Recursively call the function against dictionaries of slaves. */
void sentinelHandleDictOfRedisInstances(dict *instances) {
    dictIterator *di;
    dictEntry *de;
    sentinelRedisInstance *switch_to_promoted = NULL;

    /* There are a number of things we need to perform against every master. */
    di = dictGetIterator(instances);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *ri = dictGetVal(de);

        sentinelHandleRedisInstance(ri);
        if (ri->flags & SRI_MASTER) {
            sentinelHandleDictOfRedisInstances(ri->slaves);
            sentinelHandleDictOfRedisInstances(ri->sentinels);
            if (ri->failover_state == SENTINEL_FAILOVER_STATE_UPDATE_CONFIG) {
                switch_to_promoted = ri;
            }
        }
    }
    if (switch_to_promoted)
        sentinelFailoverSwitchToPromotedSlave(switch_to_promoted);
    dictReleaseIterator(di);
}

sentinelFailoverSwitchToPromotedSlave->sentinelResetMasterAndChangeAddress

/* Reset the specified master with sentinelResetMaster(), and also change
 * the ip:port address, but take the name of the instance unmodified.
 *
 * This is used to handle the +switch-master event.
 *
 * The function returns C_ERR if the address can't be resolved for some
 * reason. Otherwise C_OK is returned.  */
int sentinelResetMasterAndChangeAddress(sentinelRedisInstance *master, char *ip, int port) {
    sentinelAddr *oldaddr, *newaddr;
    sentinelAddr **slaves = NULL;
    int numslaves = 0, j;
    dictIterator *di;
    dictEntry *de;

    newaddr = createSentinelAddr(ip,port);
    if (newaddr == NULL) return C_ERR;

    /* Make a list of slaves to add back after the reset.
     * Don't include the one having the address we are switching to. */
    di = dictGetIterator(master->slaves);
    while((de = dictNext(di)) != NULL) {
        sentinelRedisInstance *slave = dictGetVal(de);

        if (sentinelAddrIsEqual(slave->addr,newaddr)) continue;
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(slave->addr->ip,
                                                 slave->addr->port);
    }
    dictReleaseIterator(di);

    /* If we are switching to a different address, include the old address
     * as a slave as well, so that we'll be able to sense / reconfigure
     * the old master. */
    if (!sentinelAddrIsEqual(newaddr,master->addr)) {
        slaves = zrealloc(slaves,sizeof(sentinelAddr*)*(numslaves+1));
        slaves[numslaves++] = createSentinelAddr(master->addr->ip,
                                                 master->addr->port);
    }

    /* Reset and switch address. */
    sentinelResetMaster(master,SENTINEL_RESET_NO_SENTINELS);
    oldaddr = master->addr;
    master->addr = newaddr;
    master->o_down_since_time = 0;
    master->s_down_since_time = 0;

    /* Add slaves back. */
    for (j = 0; j < numslaves; j++) {
        sentinelRedisInstance *slave;
        slave = createSentinelRedisInstance(NULL,SRI_SLAVE,slaves[j]->ip,
                    slaves[j]->port, master->quorum, master);
        releaseSentinelAddr(slaves[j]);
        if (slave) sentinelEvent(LL_NOTICE,"+slave",slave,"%@");
    }
    zfree(slaves);

    /* Release the old address at the end so we are safe even if the function
     * gets the master->addr->ip and master->addr->port as arguments. */
    releaseSentinelAddr(oldaddr);
    sentinelFlushConfig();
    return C_OK;
}

The failover is finally over, but one question remains: only the elected leader performs the failover, so how do the other Sentinel nodes learn about the new master and update their own structures? Recall that when INFO handling sees the role change from slave to master, the code forces the hello msg publish period to expire so that the hello msg is broadcast as soon as possible. So we go back to part of the hello msg handling code.

/* Process an hello message received via Pub/Sub in master or slave instance,
 * or sent directly to this sentinel via the (fake) PUBLISH command of Sentinel.
 *
 * If the master name specified in the message is not known, the message is
 * discarded. */
void sentinelProcessHelloMessage(char *hello, int hello_len) {
    /* Format is composed of 8 tokens:
     * 0=ip,1=port,2=runid,3=current_epoch,4=master_name,
     * 5=master_ip,6=master_port,7=master_config_epoch. */
    int numtokens, port, removed, master_port;
    uint64_t current_epoch, master_config_epoch;
    char **token = sdssplitlen(hello, hello_len, ",", 1, &numtokens);
    sentinelRedisInstance *si, *master;

    if (numtokens == 8) {
        /* Obtain a reference to the master this hello message is about */
        master = sentinelGetMasterByName(token[4]);
        if (!master) goto cleanup; /* Unknown master, skip the message. */

        /* First, try to see if we already have this sentinel. */
        port = atoi(token[1]);
        master_port = atoi(token[6]);
        si = getSentinelRedisInstanceByAddrAndRunID(
                        master->sentinels,token[0],port,token[2]);
        current_epoch = strtoull(token[3],NULL,10);
        master_config_epoch = strtoull(token[7],NULL,10);
        ...
        /* Update master info if received configuration is newer. */
        if (si && master->config_epoch < master_config_epoch) {
            master->config_epoch = master_config_epoch;
            if (master_port != master->addr->port ||
                strcmp(master->addr->ip, token[5]))
            {
                sentinelAddr *old_addr;

                sentinelEvent(LL_WARNING,"+config-update-from",si,"%@");
                sentinelEvent(LL_WARNING,"+switch-master",
                    master,"%s %s %d %s %d",
                    master->name,
                    master->addr->ip, master->addr->port,
                    token[5], master_port);

                old_addr = dupSentinelAddr(master->addr);
                sentinelResetMasterAndChangeAddress(master, token[5], master_port);
                sentinelCallClientReconfScript(master,
                    SENTINEL_OBSERVER,"start",
                    old_addr,master->addr);
                releaseSentinelAddr(old_addr);
            }
        }

        /* Update the state of the Sentinel. */
        if (si) si->last_hello_time = mstime();
    }

cleanup:
    sdsfreesplitres(token,numtokens);
}

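When another node finds that its config epoch for the master is lower than the one in the broadcast, and the master's ip or port differs, it resets the master using the same sentinelResetMasterAndChangeAddress analyzed above. With that the last puzzle is solved: the monitoring state of the other Sentinels gets updated as well. Note from the code that the master name is very important, since it stays the same when a slave is promoted.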
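The "adopt the newer configuration" rule from the hello handler can be sketched on its own. The code below is a simplified illustration: `master_view` and `apply_hello` are hypothetical, the message is assumed to be already parsed, and the original's extra condition that the sender is a known Sentinel is left out.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical local view of a monitored master. */
typedef struct {
    char ip[64];
    int port;
    uint64_t config_epoch;
} master_view;

/* Apply the master fields of a hello message.
 * Returns true when the master address was switched. */
bool apply_hello(master_view *m, const char *hello_ip, int hello_port,
                 uint64_t hello_config_epoch) {
    /* Ignore configurations that are not strictly newer than ours. */
    if (m->config_epoch >= hello_config_epoch) return false;
    m->config_epoch = hello_config_epoch;

    /* Same address: only the epoch moves forward. */
    if (hello_port == m->port && strcmp(m->ip, hello_ip) == 0)
        return false;

    /* Different ip or port: switch to the advertised master. */
    snprintf(m->ip, sizeof(m->ip), "%s", hello_ip);
    m->port = hello_port;
    return true;
}
```

Because the leader sets the master's config epoch to the epoch in which it won the election, its hello messages carry a higher epoch than the observers hold, which is what makes them switch.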
At this point most of Sentinel's failover has been analyzed, and the basic flow has been tied together.

Summary

  1. Confirming that a node is down is split into subjective down and objective down.
  2. Subjective down means that ping replies were invalid for a period of time. The subjective-down check works the same way for all nodes. The period is configurable, per master.
  3. Objective down applies only to master nodes. It is a state of agreement reached by asking the other Sentinel nodes, via the SENTINEL is-master-down-by-addr command, whether they also consider the master down.
  4. A leader node is elected to perform the failover. When some node in the Sentinel cluster finds that a master meets the objective-down condition (the number of nodes judging the master down is at least the configured quorum), leader election is triggered.
  5. Every Sentinel node can stand for election and vote, but each node has exactly one vote per epoch. The election has a time limit (10s by default, or the failover timeout if that is configured to be less than 10s). If no leader is elected in time, the epoch is updated and a new round of election begins.
  6. Sentinel nodes also canvass for votes via the SENTINEL is-master-down-by-addr command, so this command serves two purposes during a failover.
  7. The first node to obtain votes from more than half of the nodes in the cluster is elected leader.
  8. Sentinel uses a state machine to control the failover flow. Each state is handled asynchronously, in a different timer cycle.
  9. The leader picks the most suitable slave to become the new master and sends it SLAVEOF NO ONE to promote it. It learns about the new master's role change through the INFO command.
  10. The leader sends SLAVEOF <new master address> to the other slave nodes to point them at the new master, and again uses INFO probing to check whether they have finished reconfiguring.
  11. Once the slave nodes have been reconfigured, the leader updates the master structure: it rebuilds the slaves dict and resets the master's sentinelRedisInstance, while keeping the master name unchanged.
  12. After the leader has completed the failover, the other Sentinel nodes, which subscribe to the hello channel on the same set of slave nodes, receive the hello msg broadcast by the leader and update their own master structure.
  13. Finally, having worked through Redis's Sentinel solution makes it easier to understand the Raft algorithm.