State
/* Persistent state on all servers:
(Updated on stable storage before responding to RPCs)
*/
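currentTerm //the current term (the latest term this server has seen)
votedFor //candidateId that received this server's vote in the current term; null if none
log[] //log entries
//Volatile state on all servers
commitIndex //index of the highest log entry known to be committed; initially 0
lastApplied //index of the highest log entry applied to the state machine; initially 0
//Volatile state on leaders (reinitialized after election); maintained only on the leader
nextIndex[] //for each server, index of the next log entry to send to that server; initialized to the leader's last log index + 1
matchIndex[] //for each server, index of the highest log entry known to be replicated on that server; initially 0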
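The state listed above could be held in a structure like the following. This is only a minimal sketch: the names `LogEntry`, `RaftState`, `last_log_index` and `become_leader` are made up for illustration, and the sentinel at log index 0 is an assumption that keeps real entries 1-based.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class LogEntry:
    term: int            # term in which the entry was created
    command: Any = None  # opaque state-machine command


@dataclass
class RaftState:
    # Persistent state (must reach stable storage before answering RPCs)
    current_term: int = 0
    voted_for: Optional[int] = None
    # Assumption: index 0 holds a sentinel so real entries start at index 1.
    log: List[LogEntry] = field(default_factory=lambda: [LogEntry(term=0)])
    # Volatile state on all servers
    commit_index: int = 0
    last_applied: int = 0
    # Volatile state on leaders (reinitialized after each election)
    next_index: Dict[int, int] = field(default_factory=dict)
    match_index: Dict[int, int] = field(default_factory=dict)

    def last_log_index(self) -> int:
        return len(self.log) - 1

    def become_leader(self, peers: List[int]) -> None:
        # nextIndex starts just past the leader's last entry, matchIndex at 0.
        self.next_index = {p: self.last_log_index() + 1 for p in peers}
        self.match_index = {p: 0 for p in peers}
```

Persisting `current_term`, `voted_for` and `log` to stable storage is left out of the sketch.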
RequestVote RPC
//Invoked by candidates to gather votes (§5.2).
Arguments:
term //candidate's term
candidateId //candidate's id
lastLogIndex //index of the candidate's last log entry
lastLogTerm //term of the candidate's last log entry
Results:
term
voteGranted //true means the receiver voted for this candidate
Receiver implementation:
1. Reply false if term < currentTerm
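//A vote is granted only if the candidate's log is at least as up-to-date as the receiver's: the terms of the last entries are compared first, and the higher term wins;
//if the last terms are equal, the log with the larger last index is more up-to-date.
//This election restriction is a safety measure: it ensures that committed entries are never overwritten.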
2. If votedFor is null or candidateId, and candidate’s log is at least as up-to-date as receiver’s log, grant vote
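The two receiver rules above can be sketched as a pure function. The function name, the flat parameter list and the `log_terms` list (with a sentinel at index 0) are assumptions for illustration. The term update on a newer term comes from the "All Servers" rule, not from RequestVote itself.

```python
from typing import List, Optional, Tuple


def handle_request_vote(
    current_term: int,
    voted_for: Optional[int],
    log_terms: List[int],  # assumption: log_terms[i] is the term of entry i, index 0 is a sentinel
    term: int,
    candidate_id: int,
    last_log_index: int,
    last_log_term: int,
) -> Tuple[int, bool, Optional[int]]:
    """Return (term, voteGranted, new votedFor)."""
    # Rule 1: reject candidates from a stale term.
    if term < current_term:
        return current_term, False, voted_for
    # A newer term is adopted and clears any vote cast in the old term.
    if term > current_term:
        current_term, voted_for = term, None
    my_last_index = len(log_terms) - 1
    my_last_term = log_terms[my_last_index]
    # Up-to-date check: compare last terms first, then last indexes.
    up_to_date = (last_log_term, last_log_index) >= (my_last_term, my_last_index)
    # Rule 2: grant the vote if we have not voted for someone else.
    if voted_for in (None, candidate_id) and up_to_date:
        return current_term, True, candidate_id
    return current_term, False, voted_for
```

Note that a candidate with a longer log but an older last term is still rejected, because the term comparison takes priority.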
AppendEntries RPC
//Invoked by leader to replicate log entries; also used as heartbeat
Arguments:
term //leader's current term
leaderId //lets a follower redirect client requests to the leader
prevLogIndex //index of the log entry immediately preceding the new ones
prevLogTerm //term of the entry at prevLogIndex
entries[] //log entries to store; empty for a heartbeat, and may hold more than one entry
leaderCommit //leader's commitIndex
Results:
term //current term
success //determined by the rules below
Receiver implementation:
//A term smaller than currentTerm means the RPC is stale or the leader has since changed
1. Reply false if term < currentTerm
//No entry at prevLogIndex matches prevLogTerm; typically the node was down for a while and has fallen behind the leader
//The leader then resends the RPC with an earlier prevLogIndex and prevLogTerm
2. Reply false if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm
// Steps 3-5 run only after both checks pass; the reply is then true
// If an existing entry conflicts with one in entries, delete it and every entry after it
3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it
// Append the new entries to the log
4. Append any new entries not already in the log
//If leaderCommit is greater than this server's commitIndex, set commitIndex to min(leaderCommit, index of the last new entry)
5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry)
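The five receiver steps above can be sketched as follows. The function name, the `(term, command)` tuple layout and the sentinel entry at index 0 are assumptions made for this example. The log is mutated in place.

```python
from typing import List, Tuple

Entry = Tuple[int, str]  # assumption: an entry is (term, command)


def handle_append_entries(
    current_term: int,
    log: List[Entry],  # assumption: log[0] is a sentinel (0, ""), real entries start at 1
    commit_index: int,
    term: int,
    prev_log_index: int,
    prev_log_term: int,
    entries: List[Entry],
    leader_commit: int,
) -> Tuple[bool, int]:
    """Return (success, new commitIndex)."""
    # Step 1: stale leader.
    if term < current_term:
        return False, commit_index
    # Step 2: no entry at prevLogIndex with a matching term.
    if prev_log_index >= len(log) or log[prev_log_index][0] != prev_log_term:
        return False, commit_index
    # Steps 3 and 4: truncate on the first conflict, then append what is missing.
    for offset, entry in enumerate(entries):
        index = prev_log_index + 1 + offset
        if index < len(log):
            if log[index][0] == entry[0]:
                continue  # entry already present
            del log[index:]  # conflict: drop this entry and all that follow
        log.append(entry)
    # Step 5: advance commitIndex, bounded by the last new entry.
    if leader_commit > commit_index:
        commit_index = min(leader_commit, prev_log_index + len(entries))
    return True, commit_index
```

Entries that already match are skipped instead of truncated, so a delayed or duplicated RPC does not delete entries that a later RPC appended.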
InstallSnapshot RPC
//Invoked by leader to send chunks of a snapshot to a follower. Leaders always send chunks in order.
Arguments:
term //leader's current term
leaderId //leader's id
lastIncludedIndex //index of the last log entry covered by the snapshot
lastIncludedTerm //term of the last log entry covered by the snapshot
offset //byte offset of this chunk within the snapshot file; a snapshot can be large, so it is sent in several pieces, each called a chunk
data[] //snapshot data, usually the current state of the state machine
done //true if this is the last chunk
Results:
term //currentTerm
Receiver implementation:
1. Reply immediately if term < currentTerm
//If this is the first chunk, create a new snapshot file
2. Create new snapshot file if first chunk (offset is 0)
//Write data at the corresponding position in the snapshot file
3. Write data into snapshot file at given offset
//If done is false, reply and wait for the next chunk; steps 1-3 repeat for each chunk until the last one arrives
4. Reply and wait for more data chunks if done is false
//Save the snapshot file and discard older snapshots
5. Save snapshot file, discard any existing or partial snapshot with a smaller index
// Handling of the existing log
6. If existing log entry has same index and term as snapshot’s last included entry, retain log entries following it and reply
// Otherwise discard the old log
7. Discard the entire log
// Reset the state machine from the snapshot contents
8. Reset state machine using snapshot contents (and load snapshot’s cluster configuration)
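Two parts of the receiver logic above are sketched below: chunk assembly (steps 2-5) and the log handling of steps 6-7. The snapshot is kept in memory rather than in a file, and `SnapshotReceiver`, `receive_chunk` and `trim_log` are hypothetical names. The log is modeled as a dict from index to term because its first index is no longer 1 after compaction.

```python
from typing import Dict, Optional


class SnapshotReceiver:
    """Assembles snapshot chunks that arrive in order (in-memory stand-in for a file)."""

    def __init__(self) -> None:
        self.partial = bytearray()

    def receive_chunk(self, offset: int, data: bytes, done: bool) -> Optional[bytes]:
        if offset == 0:
            self.partial = bytearray()  # step 2: first chunk starts a new snapshot
        self.partial[offset:offset + len(data)] = data  # step 3: write at offset
        if not done:
            return None  # step 4: reply and wait for more chunks
        return bytes(self.partial)  # step 5: snapshot is complete


def trim_log(
    log_terms: Dict[int, int],  # assumption: maps log index -> term
    last_included_index: int,
    last_included_term: int,
) -> Dict[int, int]:
    # Step 6: the snapshot describes a prefix of our log, so keep the suffix.
    if log_terms.get(last_included_index) == last_included_term:
        return {i: t for i, t in log_terms.items() if i > last_included_index}
    # Step 7: the log disagrees with the snapshot, so discard all of it.
    return {}
```

Resetting the state machine (step 8) depends on the application and is omitted from the sketch.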
Rules for Servers
All Servers:
// commitIndex > lastApplied means the entries from lastApplied + 1 through commitIndex can be applied to the state machine
• If commitIndex > lastApplied: increment lastApplied, apply log[lastApplied] to state machine
// On seeing a newer term, adopt it
• If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower
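The two rules for all servers can be sketched as small helpers. The names `apply_committed` and `observe_term`, the role strings and the sentinel at log index 0 are assumptions for this example.

```python
from typing import Any, Callable, List, Tuple


def apply_committed(
    log: List[Any],  # assumption: log[0] is a sentinel, real entries start at 1
    commit_index: int,
    last_applied: int,
    apply: Callable[[Any], None],
) -> int:
    """Apply every committed but unapplied entry in order; return the new lastApplied."""
    while commit_index > last_applied:
        last_applied += 1
        apply(log[last_applied])
    return last_applied


def observe_term(current_term: int, role: str, rpc_term: int) -> Tuple[int, str]:
    """Adopt a newer term seen in any RPC request or response and step down."""
    if rpc_term > current_term:
        return rpc_term, "follower"
    return current_term, role
```

For example, with `commit_index = 2` and `last_applied = 0`, entries 1 and 2 are applied in that order and the function returns 2.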
Followers:
//Respond to RPCs from leaders and candidates, following the AppendEntries and RequestVote rules
• Respond to RPCs from candidates and leaders
// If no AppendEntries arrives from the current leader and no vote is granted before the election timeout elapses, become a candidate
• If election timeout elapses without receiving AppendEntries RPC from current leader or granting vote to candidate: convert to candidate
Candidates:
//Start an election: increment own term, vote for self, reset the election timer, send RequestVote to the other servers
• On conversion to candidate, start election:
  • Increment currentTerm
  • Vote for self
  • Reset election timer
  • Send RequestVote RPCs to all other servers
//On receiving votes from a majority of servers, become leader
• If votes received from majority of servers: become leader
//Receiving AppendEntries proves that a new leader exists, so convert to follower
• If AppendEntries RPC received from new leader: convert to follower
//On timeout, start the next election round
• If election timeout elapses: start new election
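The candidate steps above can be sketched as building the RequestVote arguments and checking for a majority. `start_election` and `has_majority` are hypothetical names, and the timer reset and the actual RPC transport are left out.

```python
from typing import Dict, List, Tuple


def start_election(
    current_term: int,
    self_id: int,
    log_terms: List[int],  # assumption: log_terms[i] is the term of entry i, index 0 is a sentinel
) -> Tuple[int, int, Dict[str, int]]:
    """Return (new currentTerm, new votedFor, RequestVote arguments)."""
    new_term = current_term + 1  # increment currentTerm
    request = {
        "term": new_term,
        "candidateId": self_id,
        "lastLogIndex": len(log_terms) - 1,
        "lastLogTerm": log_terms[-1],
    }
    return new_term, self_id, request  # votedFor = self (vote for self)


def has_majority(cluster_size: int, votes: int) -> bool:
    """votes includes the candidate's own vote."""
    return votes > cluster_size // 2
```

In a five-server cluster, the candidate's own vote plus two granted votes is enough to become leader, while in a four-server cluster two votes are not.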
Leaders:
//Send heartbeats to all servers so that they do not time out and start a new election
• Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts
//On a client request, append the entry to the log; reply to the client once the entry is committed and applied
• If command received from client: append entry to local log, respond after entry applied to state machine
//If the leader's last log index is at or beyond a follower's nextIndex, send AppendEntries to that follower
• If last log index ≥ nextIndex for a follower: send AppendEntries RPC with log entries starting at nextIndex
//On success, update that follower's entries in nextIndex and matchIndex
  • If successful: update nextIndex and matchIndex for follower
//If it fails because the logs are inconsistent, decrement that follower's nextIndex and retry;
//if nextIndex falls back to entries the leader has already compacted into a snapshot, the leader sends an InstallSnapshot RPC instead
  • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry
//After updating matchIndex, check whether commitIndex can advance: the new value must be replicated on a majority and the entry's term must be the current term
• If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N.
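The leader's bookkeeping after an AppendEntries reply, and the commitIndex rule above, can be sketched like this. The helper names are hypothetical. Two assumptions are made: `match_index` holds followers only, and the leader counts itself as having replicated every entry in its own log.

```python
from typing import Dict, List


def on_append_reply(
    next_index: Dict[int, int],
    match_index: Dict[int, int],
    peer: int,
    success: bool,
    sent_prev_index: int,  # prevLogIndex of the request being answered
    sent_count: int,       # number of entries in that request
) -> None:
    if success:
        match_index[peer] = sent_prev_index + sent_count
        next_index[peer] = match_index[peer] + 1
    else:
        # Log inconsistency: back off by one entry and retry.
        next_index[peer] = max(1, next_index[peer] - 1)


def advance_commit_index(
    log_terms: List[int],  # assumption: log_terms[i] is the term of entry i, index 0 is a sentinel
    match_index: Dict[int, int],
    commit_index: int,
    current_term: int,
) -> int:
    cluster_size = len(match_index) + 1  # followers plus the leader
    # Try the largest N first so the highest qualifying index is chosen.
    for n in range(len(log_terms) - 1, commit_index, -1):
        if log_terms[n] != current_term:
            continue  # only entries from the current term are committed by counting
        replicated = 1 + sum(1 for m in match_index.values() if m >= n)
        if replicated * 2 > cluster_size:
            return n
    return commit_index
```

Entries from earlier terms are never committed by counting replicas; they become committed indirectly once a current-term entry after them is committed.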