In the sentinelHandleRedisInstance function, a master instance is handled as follows:
void sentinelHandleRedisInstance(sentinelRedisInstance *ri) {
    // Omitted...
    // If the instance is a master
    if (ri->flags & SRI_MASTER) {
        // Check whether the master is objectively down
        sentinelCheckObjectivelyDown(ri);
        // Check whether a failover needs to be started
        if (sentinelStartFailoverIfNeeded(ri))
            // Force a query of the other Sentinels' view of the master
            sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
        // Run the failover state machine
        sentinelFailoverStateMachine(ri);
        // Query the other Sentinels' view of the master (rate-limited)
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
    }
}
Instance flag definitions
#define SRI_MASTER (1<<0) /* Master instance */
#define SRI_SLAVE (1<<1) /* Replica (slave) instance */
#define SRI_SENTINEL (1<<2) /* Sentinel instance */
#define SRI_S_DOWN (1<<3) /* Subjectively down */
#define SRI_O_DOWN (1<<4) /* Objectively down */
#define SRI_MASTER_DOWN (1<<5) /* Set on a Sentinel that believes its master is down */
#define SRI_FAILOVER_IN_PROGRESS (1<<6) /* A failover of this master is in progress */
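Because each flag occupies its own bit, several states can be held in one integer and tested independently. The following is a minimal sketch of that pattern; the three helper functions are hypothetical names introduced for illustration and do not exist in Redis, which applies the bit operators inline.

```c
/* Flag values copied from the definitions above. */
#define SRI_MASTER (1<<0)
#define SRI_S_DOWN (1<<3)
#define SRI_O_DOWN (1<<4)

/* Hypothetical helpers (not part of Redis). */

/* Returns 1 if the given flag bit is set in flags. */
static int flag_is_set(int flags, int flag) { return (flags & flag) != 0; }

/* Returns flags with the given flag bit turned on. */
static int flag_set(int flags, int flag) { return flags | flag; }

/* Returns flags with the given flag bit turned off. */
static int flag_clear(int flags, int flag) { return flags & ~flag; }
```

A master that is also subjectively down carries both SRI_MASTER and SRI_S_DOWN at once, which is why the Sentinel code tests flags with `&` rather than `==`.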
Objective down
sentinelCheckObjectivelyDown
The sentinelCheckObjectivelyDown function decides whether the master is objectively down:
- First check whether this Sentinel has already flagged the master as subjectively down. If so, set quorum to 1, meaning one Sentinel (this one) already considers the master down, and continue with the next step.
- master->sentinels stores the other Sentinels that monitor the same master. Iterate over them and check whether each one's flags contain SRI_MASTER_DOWN, which records that Sentinel's verdict on the master. For every Sentinel that has the flag, increment quorum by 1.
- After the iteration, check whether quorum is greater than or equal to the configured master->quorum, that is, whether enough Sentinels agree that the master is down. If so, set odown to 1.
- Act on the value of odown:
  - If the master is objectively down and master->flags does not yet contain SRI_O_DOWN, publish the +odown event and add SRI_O_DOWN to master->flags.
  - If the master is not objectively down but master->flags contains SRI_O_DOWN, publish the -odown event and clear SRI_O_DOWN from master->flags.
As this shows, the objective-down decision combines the verdicts of the other Sentinels, read from the SRI_MASTER_DOWN bit in each Sentinel's flags. Once the number of Sentinels that consider the master down reaches the master's quorum setting, the master is declared objectively down and the +odown event is published. Where SRI_MASTER_DOWN gets set is covered later.
void sentinelCheckObjectivelyDown(sentinelRedisInstance *master) {
    dictIterator *di;
    dictEntry *de;
    unsigned int quorum = 0, odown = 0;
    // Proceed only if this Sentinel has flagged the master as subjectively down (SRI_S_DOWN)
    if (master->flags & SRI_S_DOWN) {
        /* Start at 1 to count this Sentinel's own verdict */
        quorum = 1;
        /* Iterate over the other Sentinels monitoring this master. */
        di = dictGetIterator(master->sentinels);
        while((de = dictNext(di)) != NULL) {
            // Get the Sentinel instance
            sentinelRedisInstance *ri = dictGetVal(de);
            // Count it if that Sentinel reported the master as down
            if (ri->flags & SRI_MASTER_DOWN) quorum++;
        }
        dictReleaseIterator(di);
        // If the count reaches master->quorum, mark the master as objectively down
        if (quorum >= master->quorum) odown = 1;
    }
    // Objectively down
    if (odown) {
        if ((master->flags & SRI_O_DOWN) == 0) {
            // Publish the +odown event
            sentinelEvent(LL_WARNING,"+odown",master,"%@ #quorum %d/%d",
                quorum, master->quorum);
            // Flag the master as SRI_O_DOWN
            master->flags |= SRI_O_DOWN;
            // Record when the master became objectively down
            master->o_down_since_time = mstime();
        }
    } else {
        // Not objectively down: check whether the flag is still set
        if (master->flags & SRI_O_DOWN) {
            // Publish the -odown event
            sentinelEvent(LL_WARNING,"-odown",master,"%@");
            // Clear the objective-down flag
            master->flags &= ~SRI_O_DOWN;
        }
    }
}
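The counting rule can be isolated from the Redis data structures. The sketch below is a simplified model, not Redis code: is_objectively_down is a hypothetical function, and the other Sentinels are represented by a plain array of flag values instead of the master->sentinels dictionary.

```c
#include <stddef.h>

#define SRI_S_DOWN      (1<<3)
#define SRI_MASTER_DOWN (1<<5)

/* Hypothetical helper: returns 1 if the master counts as objectively down.
 * master_flags:    flags of the master as seen by this Sentinel
 * sentinel_flags:  flags of each other Sentinel monitoring the master
 * required_quorum: the configured quorum for this master */
static int is_objectively_down(int master_flags, const int *sentinel_flags,
                               size_t n, unsigned int required_quorum) {
    unsigned int votes;
    size_t i;

    /* Without a local subjective-down verdict there is nothing to count. */
    if (!(master_flags & SRI_S_DOWN)) return 0;

    votes = 1; /* this Sentinel's own verdict */
    for (i = 0; i < n; i++)
        if (sentinel_flags[i] & SRI_MASTER_DOWN) votes++;

    /* Reaching the quorum is enough; exceeding it is not required. */
    return votes >= required_quorum;
}
```

With two other Sentinels of which one reports the master down, the count is 2, so a quorum of 2 is met while a quorum of 3 is not.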
Deciding whether to start a failover
sentinelStartFailoverIfNeeded
sentinelStartFailoverIfNeeded decides whether a failover should be started. Three conditions must hold:
- The master is considered objectively down (SRI_O_DOWN).
- No failover is currently in progress (SRI_FAILOVER_IN_PROGRESS is not set).
- More than twice the configured failover timeout has elapsed since the last failover attempt started, meaning the previous attempt's window has expired and a new failover may begin.
When all three conditions hold, sentinelStartFailover is called, which moves the failover state to "waiting to start".
int sentinelStartFailoverIfNeeded(sentinelRedisInstance *master) {
    /* Return immediately if the master is not objectively down */
    if (!(master->flags & SRI_O_DOWN)) return 0;
    /* Return immediately if a failover is already in progress */
    if (master->flags & SRI_FAILOVER_IN_PROGRESS) return 0;
    /* If less than 2 * failover_timeout has elapsed since the last failover
     * attempt started, that attempt's window has not expired yet, so do not
     * start a new one */
    if (mstime() - master->failover_start_time <
        master->failover_timeout*2)
    {
        if (master->failover_delay_logged != master->failover_start_time) {
            time_t clock = (master->failover_start_time +
                            master->failover_timeout*2) / 1000;
            char ctimebuf[26];
            ctime_r(&clock,ctimebuf);
            ctimebuf[24] = '\0'; /* Remove newline. */
            master->failover_delay_logged = master->failover_start_time;
            serverLog(LL_WARNING,
                "Next failover delay: I will not start a failover before %s",
                ctimebuf);
        }
        return 0;
    }
    // Start the failover (this only updates the failover state)
    sentinelStartFailover(master);
    return 1;
}
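The three conditions reduce to a small predicate. This is a minimal sketch with the logging removed; can_start_failover is a hypothetical name, and the time values are passed in as plain millisecond counters rather than read from mstime() and the master struct.

```c
#define SRI_O_DOWN               (1<<4)
#define SRI_FAILOVER_IN_PROGRESS (1<<6)

typedef long long mstime_t; /* milliseconds */

/* Hypothetical helper: returns 1 if all three start conditions hold. */
static int can_start_failover(int flags, mstime_t now,
                              mstime_t failover_start_time,
                              mstime_t failover_timeout) {
    if (!(flags & SRI_O_DOWN)) return 0;              /* not objectively down */
    if (flags & SRI_FAILOVER_IN_PROGRESS) return 0;   /* already failing over */
    if (now - failover_start_time < failover_timeout * 2)
        return 0;                                     /* previous attempt too recent */
    return 1;
}
```

For example, with a failover timeout of 180000 ms, a new attempt is refused until 360000 ms after the previous attempt started.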
sentinelStartFailover
sentinelStartFailover does not perform the failover itself. It only updates some state:
- Sets failover_state to SENTINEL_FAILOVER_STATE_WAIT_START, the "waiting to start" state.
- Adds SRI_FAILOVER_IN_PROGRESS to the master's flags.
- Increments this Sentinel's current_epoch by 1 and stores the result in the master's failover_epoch, which is used later during leader election.
void sentinelStartFailover(sentinelRedisInstance *master) {
    serverAssert(master->flags & SRI_MASTER);
    // Move to the "waiting to start failover" state
    master->failover_state = SENTINEL_FAILOVER_STATE_WAIT_START;
    // Mark a failover as in progress
    master->flags |= SRI_FAILOVER_IN_PROGRESS;
    // Bump current_epoch and record it as the failover epoch
    master->failover_epoch = ++sentinel.current_epoch;
    sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
        (unsigned long long) sentinel.current_epoch);
    // Publish the +try-failover event
    sentinelEvent(LL_WARNING,"+try-failover",master,"%@");
    // Record the start time, with random jitter to desynchronize Sentinels
    master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    master->failover_state_change_time = mstime();
}
Getting the other Sentinels' view of the master
In sentinelHandleRedisInstance, sentinelAskMasterStateToOtherSentinels is called both when sentinelStartFailoverIfNeeded returns true and at the end of the master branch. Let's look at what sentinelAskMasterStateToOtherSentinels does:
if (ri->flags & SRI_MASTER) {
    sentinelCheckObjectivelyDown(ri);
    if (sentinelStartFailoverIfNeeded(ri))
        // Ask the other Sentinels for their view of the master; the flag passed is SENTINEL_ASK_FORCED
        sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_ASK_FORCED);
    sentinelFailoverStateMachine(ri);
    // Ask the other Sentinels for their view of the master; the flag passed is SENTINEL_NO_FLAGS
    sentinelAskMasterStateToOtherSentinels(ri,SENTINEL_NO_FLAGS);
}
Sending the is-master-down-by-addr command
sentinelAskMasterStateToOtherSentinels
sentinelAskMasterStateToOtherSentinels sends the is-master-down-by-addr command to the other Sentinels to learn their view of the master. It iterates over the other Sentinels monitoring the same master and, for each one:
- Gets the Sentinel instance.
- Computes the time elapsed since the last IS-MASTER-DOWN-BY-ADDR reply was received from that Sentinel.
- If the elapsed time exceeds 5 times SENTINEL_ASK_PERIOD, clears SRI_MASTER_DOWN from that Sentinel's flags and resets its leader.
- If this Sentinel has not flagged the master as subjectively down (SRI_S_DOWN is not set), there is nothing to ask; moves on to the next Sentinel.
- If the link used to send commands to that Sentinel is disconnected, skips it and moves on to the next Sentinel.
- If the command is not forced (the flags argument does not contain SENTINEL_ASK_FORCED) and the last reply arrived less than SENTINEL_ASK_PERIOD ago, skips it and moves on to the next Sentinel.
- Sends the is-master-down-by-addr command through redisAsyncCommand, with sentinelReceiveIsMasterDownReply as the reply callback. The arguments passed to redisAsyncCommand are:
  - The connection used to send the request: ri->link->cc
  - The callback invoked when the reply arrives: sentinelReceiveIsMasterDownReply
  - The master's IP: announceSentinelAddr(master->addr)
  - The master's port: port
  - This Sentinel's current epoch: sentinel.current_epoch
  - The run ID: when master->failover_state > SENTINEL_FAILOVER_STATE_NONE a failover is under way, so this Sentinel's myid is sent; otherwise * is sent
void sentinelAskMasterStateToOtherSentinels(sentinelRedisInstance *master, int flags) {
    dictIterator *di;
    dictEntry *de;
    // Iterate over the other Sentinels monitoring this master
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        // Get the Sentinel instance
        sentinelRedisInstance *ri = dictGetVal(de);
        // Time elapsed since the last IS-MASTER-DOWN-BY-ADDR reply from it
        mstime_t elapsed = mstime() - ri->last_master_down_reply_time;
        char port[32];
        int retval;
        /* If the last reply is older than 5 * SENTINEL_ASK_PERIOD, its content is stale: clear it */
        if (elapsed > SENTINEL_ASK_PERIOD*5) {
            // Clear the SRI_MASTER_DOWN flag
            ri->flags &= ~SRI_MASTER_DOWN;
            sdsfree(ri->leader);
            // Reset the leader to NULL
            ri->leader = NULL;
        }
        // Skip if this Sentinel does not consider the master subjectively down
        if ((master->flags & SRI_S_DOWN) == 0) continue;
        // Skip if the link is disconnected
        if (ri->link->disconnected) continue;
        // Skip if not forced (SENTINEL_ASK_FORCED) and the last reply arrived
        // less than SENTINEL_ASK_PERIOD ago
        if (!(flags & SENTINEL_ASK_FORCED) &&
            mstime() - ri->last_master_down_reply_time < SENTINEL_ASK_PERIOD)
            continue;
        ll2string(port,sizeof(port),master->addr->port);
        // Send is-master-down-by-addr; sentinelReceiveIsMasterDownReply handles the reply
        retval = redisAsyncCommand(ri->link->cc,
                    sentinelReceiveIsMasterDownReply, ri,
                    "%s is-master-down-by-addr %s %s %llu %s",
                    sentinelInstanceMapCommand(ri,"SENTINEL"),
                    announceSentinelAddr(master->addr), port,
                    sentinel.current_epoch,
                    (master->failover_state > SENTINEL_FAILOVER_STATE_NONE) ?
                    sentinel.myid : "*");
        if (retval == C_OK) ri->link->pending_commands++;
    }
    dictReleaseIterator(di);
}
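The per-Sentinel skip logic can be sketched as two small predicates. This is a simplified model under stated assumptions: should_ask and reply_is_stale are hypothetical names, and the values given here for SENTINEL_ASK_PERIOD (1000 ms) and SENTINEL_ASK_FORCED (bit 0) are assumptions for illustration that should be checked against the Redis version in use.

```c
typedef long long mstime_t; /* milliseconds */

#define SRI_S_DOWN          (1<<3)
#define SENTINEL_ASK_PERIOD 1000   /* assumed value, in ms */
#define SENTINEL_NO_FLAGS   0
#define SENTINEL_ASK_FORCED (1<<0) /* assumed value */

/* Hypothetical helper: 1 if the previous reply is too old to be trusted. */
static int reply_is_stale(mstime_t elapsed) {
    return elapsed > SENTINEL_ASK_PERIOD * 5;
}

/* Hypothetical helper: 1 if the command should be sent to this Sentinel. */
static int should_ask(int master_flags, int link_disconnected,
                      int ask_flags, mstime_t elapsed) {
    if (!(master_flags & SRI_S_DOWN)) return 0; /* master looks fine locally */
    if (link_disconnected) return 0;            /* no way to send */
    if (!(ask_flags & SENTINEL_ASK_FORCED) && elapsed < SENTINEL_ASK_PERIOD)
        return 0;                               /* rate limit unforced asks */
    return 1;
}
```

The forced flag only bypasses the rate limit. A forced ask is still skipped when the master is not subjectively down or the link is disconnected.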
Handling the is-master-down-by-addr command
sentinelCommand
The logic that runs when another Sentinel receives is-master-down-by-addr is found in the sentinelCommand function:
- Look up the master's sentinelRedisInstance using the IP and port from the request (the sender passed the master's IP and port to redisAsyncCommand).
- If this Sentinel is not in TILT mode, check that the instance is a master and is flagged as subjectively down. If both hold, this Sentinel also considers the master subjectively down and sets isdown to 1.
- Check whether the runid argument is something other than *. If so, a leader election is requested, and sentinelVoteLeader is called to vote for a Sentinel leader.
- Send the reply, returning this Sentinel's subjective-down verdict, the run ID of the leader it voted for, and the vote's epoch (leader_epoch) to the Sentinel that sent the command.
void sentinelCommand(client *c) {
    if (c->argc == 2 && !strcasecmp(c->argv[1]->ptr,"help")) {
    }
    // Other else-if branches omitted...
    // The is-master-down-by-addr subcommand
    else if (!strcasecmp(c->argv[1]->ptr,"is-master-down-by-addr")) {
        /* SENTINEL IS-MASTER-DOWN-BY-ADDR <ip> <port> <current-epoch> <runid>
         *
         * Arguments:
         * ip and port: address of the master the Sentinel is checking
         * current-epoch: the current voting epoch; each Sentinel can vote only once per epoch
         * runid: the sender's myid if it wants a failover vote, otherwise *
         */
        sentinelRedisInstance *ri;
        long long req_epoch;
        uint64_t leader_epoch = 0; // Defaults to 0
        char *leader = NULL;
        long port;
        int isdown = 0;
        if (c->argc != 6) goto numargserr;
        if (getLongFromObjectOrReply(c,c->argv[3],&port,NULL) != C_OK ||
            getLongLongFromObjectOrReply(c,c->argv[4],&req_epoch,NULL)
                                                          != C_OK)
            return;
        // Look up the monitored master instance by the IP and port in the request
        ri = getSentinelRedisInstanceByAddrAndRunID(sentinel.masters,
            c->argv[2]->ptr,port,NULL);
        /* If not in TILT mode, check that it is a master flagged as subjectively down */
        if (!sentinel.tilt && ri && (ri->flags & SRI_S_DOWN) &&
                                    (ri->flags & SRI_MASTER))
            isdown = 1; // Subjectively down confirmed
        /* If it is a master and the runid is not *, call sentinelVoteLeader to vote for a failover leader */
        if (ri && ri->flags & SRI_MASTER && strcasecmp(c->argv[5]->ptr,"*")) {
            // Vote for a leader
            leader = sentinelVoteLeader(ri,(uint64_t)req_epoch,
                                        c->argv[5]->ptr,
                                        &leader_epoch);
        }
        /* The reply has three parts:
         * down state, voted leader, epoch of the vote (leader_epoch) */
        addReplyArrayLen(c,3);
        // Down state
        addReply(c, isdown ? shared.cone : shared.czero);
        // The leader if there is one, otherwise *
        addReplyBulkCString(c, leader ? leader : "*");
        // Epoch of the vote
        addReplyLongLong(c, (long long)leader_epoch);
        if (leader) sdsfree(leader);
    }
    // Other else-if branches omitted...
    else {
        addReplySubcommandSyntaxError(c);
    }
    return;
}
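The receiver makes two independent decisions: whether to report the master as down, and whether the request asks for a vote. The sketch below models both as predicates. The function names and the found parameter (standing in for a non-NULL lookup result) are hypothetical, and strcmp is used in place of strcasecmp, which is equivalent for the literal "*".

```c
#include <string.h>

#define SRI_MASTER (1<<0)
#define SRI_S_DOWN (1<<3)

/* Hypothetical helper: 1 if the receiver should answer "master is down".
 * tilt:  non-zero while the receiver is in TILT mode
 * found: non-zero if the ip:port lookup returned an instance
 * flags: flags of the instance that was found */
static int reports_master_down(int tilt, int found, int flags) {
    return !tilt && found && (flags & SRI_S_DOWN) && (flags & SRI_MASTER);
}

/* Hypothetical helper: 1 if the request carries a real run ID,
 * meaning the sender is asking for a leader vote. */
static int requests_vote(int found, int flags, const char *runid) {
    return found && (flags & SRI_MASTER) && strcmp(runid, "*") != 0;
}
```

Note that TILT mode suppresses only the down report. A vote can still be cast while the receiver is in TILT mode, since the second check does not look at sentinel.tilt.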
Handling the is-master-down-by-addr reply
sentinelReceiveIsMasterDownReply
After processing the command, sentinelCommand sends back a reply containing the subjective-down verdict, the leader's run ID, and the vote's epoch (leader_epoch). The reply is processed in sentinelReceiveIsMasterDownReply:
- If the replying Sentinel also considers the master subjectively down, add SRI_MASTER_DOWN to that Sentinel instance's flags. This is where SRI_MASTER_DOWN gets set.
- If the returned leader run ID is not *, the replying Sentinel has voted for a leader, so update the leader and leader_epoch stored for that Sentinel instance.
// privdata points to the Sentinel instance that replied to is-master-down-by-addr
void sentinelReceiveIsMasterDownReply(redisAsyncContext *c, void *reply, void *privdata) {
    // Each monitored master keeps the other Sentinels that monitor it; privdata points to
    // the entry among those that replied to the is-master-down-by-addr command
    sentinelRedisInstance *ri = privdata;
    instanceLink *link = c->data;
    redisReply *r;
    if (!reply || !link) return;
    link->pending_commands--;
    r = reply;
    if (r->type == REDIS_REPLY_ARRAY && r->elements == 3 &&
        r->element[0]->type == REDIS_REPLY_INTEGER &&
        r->element[1]->type == REDIS_REPLY_STRING &&
        r->element[2]->type == REDIS_REPLY_INTEGER)
    {
        ri->last_master_down_reply_time = mstime();
        // The reply says the master is subjectively down
        if (r->element[0]->integer == 1) {
            // Flag the replying Sentinel as SRI_MASTER_DOWN
            ri->flags |= SRI_MASTER_DOWN;
        } else {
            ri->flags &= ~SRI_MASTER_DOWN;
        }
        // The run ID is not *
        if (strcmp(r->element[1]->str,"*")) {
            sdsfree(ri->leader);
            if ((long long)ri->leader_epoch != r->element[2]->integer)
                serverLog(LL_WARNING,
                    "%s voted for %s %llu", ri->name,
                    r->element[1]->str,
                    (unsigned long long) r->element[2]->integer);
            // Update the leader recorded for the replying Sentinel
            ri->leader = sdsnew(r->element[1]->str);
            // Update the epoch of its vote
            ri->leader_epoch = r->element[2]->integer;
        }
    }
}
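The effect of a reply on the locally stored view of a peer Sentinel can be sketched without hiredis. The struct and function below are hypothetical stand-ins: peer_view keeps only the three fields the callback touches, and a fixed 41-byte buffer replaces the sds string Redis uses for the leader run ID.

```c
#include <stdio.h>
#include <string.h>

#define SRI_MASTER_DOWN (1<<5)

/* Hypothetical, reduced stand-in for the fields of sentinelRedisInstance
 * that the reply callback updates. */
struct peer_view {
    int flags;
    char leader[41];                 /* empty string means "no leader recorded" */
    unsigned long long leader_epoch;
};

/* Hypothetical helper: applies the three reply fields to the peer's record. */
static void apply_reply(struct peer_view *p, int isdown,
                        const char *leader, unsigned long long epoch) {
    /* First field: the peer's verdict on the master. */
    if (isdown == 1) p->flags |= SRI_MASTER_DOWN;
    else p->flags &= ~SRI_MASTER_DOWN;

    /* Second and third fields: only meaningful if the peer actually voted. */
    if (strcmp(leader, "*") != 0) {
        snprintf(p->leader, sizeof(p->leader), "%s", leader);
        p->leader_epoch = epoch;
    }
}
```

A reply of (1, "*", 0) therefore changes only the flag, while a reply with a real run ID also overwrites the recorded vote.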
Failover state machine
In sentinelHandleRedisInstance, after deciding whether a failover is needed, sentinelFailoverStateMachine is called to enter the failover state machine, which dispatches to a different handler depending on failover_state. For now we focus on two states:
SENTINEL_FAILOVER_STATE_WAIT_START: the "waiting to start" state, meaning a failover is needed but has not begun. As seen in sentinelStartFailoverIfNeeded, when a failover is needed sentinelStartFailover sets the state to SENTINEL_FAILOVER_STATE_WAIT_START. Its handler is sentinelFailoverWaitStart, which checks whether the current Sentinel is the leader for this failover and, if so, moves the state to SENTINEL_FAILOVER_STATE_SELECT_SLAVE.
SENTINEL_FAILOVER_STATE_SELECT_SLAVE: the state in which a replica is selected for promotion. Being in this state means one of the master's replicas must be chosen to replace the master so the failover can proceed. Its handler is sentinelFailoverSelectSlave.
void sentinelFailoverStateMachine(sentinelRedisInstance *ri) {
    serverAssert(ri->flags & SRI_MASTER);
    // Return immediately if no failover is in progress
    if (!(ri->flags & SRI_FAILOVER_IN_PROGRESS)) return;
    // State machine
    switch(ri->failover_state) {
        case SENTINEL_FAILOVER_STATE_WAIT_START: // Waiting to start
            sentinelFailoverWaitStart(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SELECT_SLAVE: // Select the replica to promote
            sentinelFailoverSelectSlave(ri);
            break;
        case SENTINEL_FAILOVER_STATE_SEND_SLAVEOF_NOONE:
            sentinelFailoverSendSlaveOfNoOne(ri);
            break;
        case SENTINEL_FAILOVER_STATE_WAIT_PROMOTION:
            sentinelFailoverWaitPromotion(ri);
            break;
        case SENTINEL_FAILOVER_STATE_RECONF_SLAVES:
            sentinelFailoverReconfNextSlave(ri);
            break;
    }
}
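The state machine runs one handler per call and relies on the handler to advance failover_state, so progress happens across successive timer ticks. As a rough illustration of the dispatch table only, the sketch below maps each state to the name of its handler. The enum and handler_for are hypothetical; the enum simply lists the states in the order the switch above does and makes no claim about the numeric values Redis assigns.

```c
/* Hypothetical mirror of the failover states handled by the switch above. */
enum failover_state {
    STATE_NONE,
    STATE_WAIT_START,
    STATE_SELECT_SLAVE,
    STATE_SEND_SLAVEOF_NOONE,
    STATE_WAIT_PROMOTION,
    STATE_RECONF_SLAVES
};

/* Hypothetical helper: name of the handler the switch would call,
 * or an empty string if the switch has no case for the state. */
static const char *handler_for(enum failover_state s) {
    switch (s) {
    case STATE_WAIT_START:         return "sentinelFailoverWaitStart";
    case STATE_SELECT_SLAVE:       return "sentinelFailoverSelectSlave";
    case STATE_SEND_SLAVEOF_NOONE: return "sentinelFailoverSendSlaveOfNoOne";
    case STATE_WAIT_PROMOTION:     return "sentinelFailoverWaitPromotion";
    case STATE_RECONF_SLAVES:      return "sentinelFailoverReconfNextSlave";
    default:                       return "";
    }
}
```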
sentinelFailoverWaitStart
sentinelFailoverWaitStart works as follows:
- Call sentinelGetLeader to get the leader for this failover.
- Compare the returned leader's ID with the current Sentinel's myid to determine whether the current Sentinel is the failover leader.
- If the current Sentinel is not the leader and the failover is not forced (SRI_FORCE_FAILOVER is not set), the current Sentinel cannot perform the failover. If the election has also timed out, the failover is aborted.
- If the current Sentinel is the leader, move the failover state to SENTINEL_FAILOVER_STATE_SELECT_SLAVE. On the next run of the state machine, a replica will be selected for promotion to master.
void sentinelFailoverWaitStart(sentinelRedisInstance *ri) {
    char *leader;
    int isleader;
    /* Get the leader for this failover */
    leader = sentinelGetLeader(ri, ri->failover_epoch);
    // A leader exists and its ID matches this Sentinel's myid
    isleader = leader && strcasecmp(leader,sentinel.myid) == 0;
    sdsfree(leader);
    /* If this Sentinel is not the leader and the failover is not forced
     * (SRI_FORCE_FAILOVER), it cannot perform the failover */
    if (!isleader && !(ri->flags & SRI_FORCE_FAILOVER)) {
        int election_timeout = SENTINEL_ELECTION_TIMEOUT;
        if (election_timeout > ri->failover_timeout)
            election_timeout = ri->failover_timeout;
        /* Abort the failover once the election times out */
        if (mstime() - ri->failover_start_time > election_timeout) {
            sentinelEvent(LL_WARNING,"-failover-abort-not-elected",ri,"%@");
            sentinelAbortFailover(ri);
        }
        return;
    }
    sentinelEvent(LL_WARNING,"+elected-leader",ri,"%@");
    if (sentinel.simfailure_flags & SENTINEL_SIMFAILURE_CRASH_AFTER_ELECTION)
        sentinelSimFailureCrash();
    // Move to SENTINEL_FAILOVER_STATE_SELECT_SLAVE; the next run of the state
    // machine selects the replica to promote
    ri->failover_state = SENTINEL_FAILOVER_STATE_SELECT_SLAVE;
    ri->failover_state_change_time = mstime();
    sentinelEvent(LL_WARNING,"+failover-state-select-slave",ri,"%@");
}
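The abort rule for a Sentinel that has not been elected uses the smaller of two timeouts. A minimal sketch, assuming SENTINEL_ELECTION_TIMEOUT is 10000 ms (an assumed value for illustration; check the Redis version in use) and using the hypothetical name election_expired:

```c
typedef long long mstime_t; /* milliseconds */

#define SENTINEL_ELECTION_TIMEOUT 10000 /* assumed value, in ms */

/* Hypothetical helper: 1 if a non-elected Sentinel should abort the failover. */
static int election_expired(mstime_t now, mstime_t failover_start_time,
                            mstime_t failover_timeout) {
    /* The election window is capped by the configured failover timeout. */
    mstime_t election_timeout = SENTINEL_ELECTION_TIMEOUT;
    if (election_timeout > failover_timeout)
        election_timeout = failover_timeout;
    return now - failover_start_time > election_timeout;
}
```

With a failover timeout of 3000 ms, the window shrinks to 3000 ms, so a Sentinel that started at 5000 ms and is still not elected at 9000 ms gives up.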
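Leader election
sentinelGetLeader
sentinelGetLeader returns the leader for the given voting epoch. To become leader, a Sentinel must win a majority of the votes. The logic is:
- Create a counters dictionary that records how many votes each Sentinel has received, keyed by the Sentinel's run ID.
- Fill counters: iterate over master->sentinels to visit the other Sentinels. For each one whose recorded leader is not NULL and whose leader_epoch equals sentinel.current_epoch, add one vote for that leader in counters.
- Find the entry in counters with the most votes. Record it as winner and its vote count as max_votes.
- Cast this Sentinel's own vote by calling sentinelVoteLeader, and record the result as myvote:
  - If winner is not NULL, the candidate passed in is winner.
  - If winner is NULL, the candidate passed in is this Sentinel itself (sentinel.myid).
  - In both cases sentinelVoteLeader returns the leader recorded on the master, which may be an earlier vote if this Sentinel has already voted in this epoch.
- If myvote is not NULL and its leader_epoch equals the epoch passed to sentinelGetLeader, add one vote for myvote. If myvote's count is now greater than max_votes, update winner to myvote.
- At this point winner holds the Sentinel with the most votes and max_votes holds its vote count. To become leader, two more conditions must hold, which ensure the elected leader has a majority of the Sentinels behind it:
  - Condition 1: max_votes is at least voters_quorum, where voters_quorum is half the number of Sentinels plus 1, so a majority of Sentinels must have voted for it.
  - Condition 2: max_votes is at least master->quorum, the value configured in sentinel.conf.
- The election ends, and the run ID recorded in winner is returned as the leader (or NULL if either condition fails).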
char *sentinelGetLeader(sentinelRedisInstance *master, uint64_t epoch) {
    dict *counters;
    dictIterator *di;
    dictEntry *de;
    unsigned int voters = 0, voters_quorum;
    char *myvote;
    char *winner = NULL;
    uint64_t leader_epoch;
    uint64_t max_votes = 0;
    serverAssert(master->flags & (SRI_O_DOWN|SRI_FAILOVER_IN_PROGRESS));
    // Create the vote-count dictionary
    counters = dictCreate(&leaderVotesDictType,NULL);
    // Total number of Sentinels, including this one
    voters = dictSize(master->sentinels)+1;
    /* Iterate over the other Sentinels */
    di = dictGetIterator(master->sentinels);
    while((de = dictNext(di)) != NULL) {
        // Get the Sentinel instance
        sentinelRedisInstance *ri = dictGetVal(de);
        // Count its vote if it recorded a leader for the current epoch
        if (ri->leader != NULL && ri->leader_epoch == sentinel.current_epoch)
            sentinelLeaderIncr(counters,ri->leader); // One more vote for ri->leader
    }
    dictReleaseIterator(di);
    di = dictGetIterator(counters);
    while((de = dictNext(di)) != NULL) {
        // Get the vote count
        uint64_t votes = dictGetUnsignedIntegerVal(de);
        // Track the run ID with the most votes in winner
        if (votes > max_votes) {
            max_votes = votes;
            winner = dictGetKey(de);
        }
    }
    dictReleaseIterator(di);
    /* Cast this Sentinel's own vote: for the current winner if there is one,
     * otherwise for itself. sentinelVoteLeader returns the leader actually
     * recorded on the master, which may be an earlier vote. */
    if (winner)
        myvote = sentinelVoteLeader(master,epoch,winner,&leader_epoch);
    else
        myvote = sentinelVoteLeader(master,epoch,sentinel.myid,&leader_epoch);
    // Count this Sentinel's vote only if it belongs to the requested epoch
    if (myvote && leader_epoch == epoch) {
        // One more vote for myvote
        uint64_t votes = sentinelLeaderIncr(counters,myvote);
        // If it now has more than max_votes
        if (votes > max_votes) {
            max_votes = votes; // Update max_votes
            winner = myvote;   // Update winner
        }
    }
    // voters_quorum is half the number of Sentinels plus 1
    voters_quorum = voters/2+1;
    // No leader if max_votes is below the majority or below the configured quorum
    if (winner && (max_votes < voters_quorum || max_votes < master->quorum))
        winner = NULL;
    winner = winner ? sdsnew(winner) : NULL;
    sdsfree(myvote);
    dictRelease(counters);
    // Return the winner, i.e. the leader (or NULL)
    return winner;
}
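The final acceptance test is pure arithmetic and is easy to check by hand. A minimal sketch of just that test, using the hypothetical name has_enough_votes:

```c
/* Hypothetical helper: 1 if max_votes is enough to become leader.
 * voters:    total number of Sentinels, including the caller
 * max_votes: votes received by the leading candidate
 * quorum:    the quorum configured for the master */
static int has_enough_votes(unsigned int voters, unsigned int max_votes,
                            unsigned int quorum) {
    /* Integer division: 5 voters need 3 votes, 4 voters also need 3. */
    unsigned int voters_quorum = voters / 2 + 1;
    return max_votes >= voters_quorum && max_votes >= quorum;
}
```

With 5 Sentinels and a quorum of 2, a candidate with 3 votes wins and one with 2 votes does not, even though 2 meets the configured quorum. Raising the quorum to 4 makes 3 votes insufficient, so the stricter of the two thresholds always applies.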
Voting
sentinelVoteLeader
The parameters of sentinelVoteLeader are the master instance, the candidate's run ID (req_runid), and the candidate's voting epoch (req_epoch). The master instance records the failover leader that the current Sentinel has voted for, stored in master->leader. sentinelVoteLeader decides between the recorded leader and the candidate:
- If the candidate's req_epoch is greater than this Sentinel's current_epoch, set current_epoch to req_epoch.
- If the leader_epoch recorded on the master is less than the candidate's req_epoch, and this Sentinel's current_epoch is less than or equal to req_epoch, change the leader recorded on the master to the candidate.
- Return the leader recorded on the master.
char *sentinelVoteLeader(sentinelRedisInstance *master, uint64_t req_epoch, char *req_runid, uint64_t *leader_epoch) {
    // If the requested epoch is greater than this Sentinel's current epoch
    if (req_epoch > sentinel.current_epoch) {
        // Advance current_epoch to the requested epoch
        sentinel.current_epoch = req_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+new-epoch",master,"%llu",
            (unsigned long long) sentinel.current_epoch);
    }
    // If the master's recorded leader_epoch is older than the requested epoch
    // and this Sentinel's epoch is not newer than the requested epoch
    if (master->leader_epoch < req_epoch && sentinel.current_epoch <= req_epoch)
    {
        sdsfree(master->leader);
        // Record the requested run ID as the leader
        master->leader = sdsnew(req_runid);
        // Update the master's leader_epoch
        master->leader_epoch = sentinel.current_epoch;
        sentinelFlushConfig();
        sentinelEvent(LL_WARNING,"+vote-for-leader",master,"%s %llu",
            master->leader, (unsigned long long) master->leader_epoch);
        // If the vote went to another Sentinel, reset the failover start time
        if (strcasecmp(master->leader,sentinel.myid))
            master->failover_start_time = mstime()+rand()%SENTINEL_MAX_DESYNC;
    }
    *leader_epoch = master->leader_epoch;
    return master->leader ? sdsnew(master->leader) : NULL;
}
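The key property of this function is that a Sentinel votes at most once per epoch: a second request in the same epoch gets the earlier vote back. The sketch below is a reduced model, not Redis code. vote_state and vote_leader are hypothetical, the config flush and events are omitted, and a fixed buffer replaces the sds string.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, reduced state: the Sentinel's epoch plus the vote
 * recorded on the master. */
struct vote_state {
    uint64_t current_epoch;
    uint64_t leader_epoch;
    char leader[41];        /* empty string means "no vote recorded" */
};

/* Hypothetical helper: returns the recorded leader after considering
 * the request, or NULL if no vote has ever been recorded. */
static const char *vote_leader(struct vote_state *s, uint64_t req_epoch,
                               const char *req_runid) {
    /* Catch up with a newer epoch seen in the request. */
    if (req_epoch > s->current_epoch) s->current_epoch = req_epoch;

    /* Vote only if no vote has been recorded for this epoch yet. */
    if (s->leader_epoch < req_epoch && s->current_epoch <= req_epoch) {
        snprintf(s->leader, sizeof(s->leader), "%s", req_runid);
        s->leader_epoch = s->current_epoch;
    }
    return s->leader[0] ? s->leader : NULL;
}
```

Starting from epoch 1 with no vote, a request for "aaa" in epoch 2 is granted. A later request for "bbb" in the same epoch returns "aaa" again, and only a request in epoch 3 can change the vote.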
References
Geek Time: Redis Source Code Analysis and Practice (Jiang Dejun)
Redis version: redis-6.2.5