The Raft protocol is divided into three main parts: Leader election, Log replication, and Safety.
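Raft assigns each server node one of three roles: Leader, Candidate, or Follower. The coordinator is called the Leader and the participants are called Followers. Compared with other consensus protocols, Raft gives the Leader a stronger role, which shows in the following ways: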
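- The Leader is unique: there is at most one Leader in any given term.
- Log entries flow only from the Leader to the other servers. Followers never initiate requests; they only respond to requests from the Leader and from Candidates.
- Clients interact only with the Leader. If a client first connects to a Follower, the Follower forwards it to the Leader.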
- Raft is also distinctive in that it uses randomized timers for timeouts during Leader election. In addition, Raft provides a joint consensus algorithm to handle membership changes.
In raft, Progress represents a follower's progress as seen by the leader. Three state types are used to track a follower.
// file: raft/tracker/state.go

// StateType is the state of a tracked follower.
type StateType uint64

const (
	// StateProbe indicates a follower whose last index isn't known. Such a
	// follower is "probed" (i.e. an append sent periodically) to narrow down
	// its last index. In the ideal (and common) case, only one round of probing
	// is necessary as the follower will react with a hint. Followers that are
	// probed over extended periods of time are often offline.
	StateProbe StateType = iota
	// StateReplicate is the steady state in which a follower eagerly receives
	// log entries to append to its log.
	StateReplicate
	// StateSnapshot indicates a follower that needs log entries not available
	// from the leader's Raft log. Such a follower needs a full snapshot to
	// return to StateReplicate.
	StateSnapshot
)
The sets of Learners and Voters never intersect.
Joint consensus
joint config
term is a logical clock in Raft.
quorum
Every node in the Raft protocol keeps a local log; etcd represents this local log with raftLog.
// file: raft/log.go

type raftLog struct {
	// storage contains all stable entries since the last snapshot.
	storage Storage

	// unstable contains all unstable entries and snapshot.
	// they will be saved into storage.
	unstable unstable

	// committed is the highest log position that is known to be in
	// stable storage on a quorum of nodes.
	committed uint64
	// applied is the highest log position that the application has
	// been instructed to apply to its state machine.
	// Invariant: applied <= committed
	applied uint64

	logger Logger

	// maxNextEntsSize is the maximum aggregate byte size of the messages
	// returned from calls to nextEnts.
	maxNextEntsSize uint64
}
applied <= committed
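A small sketch may make the invariant concrete. The `toyLog` type and its two methods below are hypothetical and track only the two indexes; they are not the `raftLog` implementation, only an illustration of how an update path can refuse any change that would break applied <= committed.

```go
package main

import "fmt"

// toyLog is a hypothetical stand-in that tracks only the two indexes.
type toyLog struct {
	committed uint64
	applied   uint64
}

// commitTo advances committed; the commit index never moves backwards.
func (l *toyLog) commitTo(i uint64) {
	if i > l.committed {
		l.committed = i
	}
}

// appliedTo advances applied. It rejects any index that would move applied
// backwards or past committed, so applied <= committed always holds.
func (l *toyLog) appliedTo(i uint64) error {
	if i < l.applied || i > l.committed {
		return fmt.Errorf("applied(%d) is out of range [%d, %d]",
			i, l.applied, l.committed)
	}
	l.applied = i
	return nil
}

func main() {
	l := &toyLog{}
	l.commitTo(5)
	fmt.Println(l.appliedTo(3)) // within [0, 5], accepted
	fmt.Println(l.appliedTo(7)) // beyond committed, rejected
}
```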
Deep Dive: etcd
Leader election
- Candidate, Follower, Leader
- Term
- Election
- Heartbeat
Log replication
- Only leader manages the replicated logs.
- The leader only appends to its log.
- Leader keeps trying to replicate its logs to followers.
- Committed index
- Applied index (never greater than the committed index)
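One way to picture how the committed index follows from replication: the leader tracks, for each voter, the highest index known to be replicated there, and the committed index is the largest index present on a majority. The sketch below assumes that simplified model. `committedIndex` is a hypothetical helper, and it deliberately omits Raft's additional rule that a leader only commits entries from its own term by counting replicas.

```go
package main

import (
	"fmt"
	"sort"
)

// committedIndex returns the largest index stored on a majority of voters.
// match holds, per voter (leader included), the highest replicated index.
func committedIndex(match []uint64) uint64 {
	if len(match) == 0 {
		return 0
	}
	// Sort a copy in descending order so the caller's slice is untouched.
	sorted := append([]uint64(nil), match...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] > sorted[j] })
	quorum := len(sorted)/2 + 1
	// The quorum-th largest value is present on at least quorum voters.
	return sorted[quorum-1]
}

func main() {
	fmt.Println(committedIndex([]uint64{7, 5, 3}))       // 5
	fmt.Println(committedIndex([]uint64{9, 9, 4, 2, 1})) // 4
}
```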
Raft implementation
- Minimalistic design for flexibility, determinism, and performance
  - Raft package does not implement network transport between peers.
  - Raft package does not implement storage to persist log and state.
- Raft is modeled as a state machine
  - State
  - Input, output
  - Transition between states
Server's handling loop
for {
	select {
	...
	case rd := <-r.Ready():
		r.storage.Save(rd.HardState, rd.Entries, rd.Snapshot)
		r.transport.Send(rd.Messages)
		s.Apply(rd.CommittedEntries)
	....
	}
}
Request lifecycle
- Send proposal to Raft: r.Propose(ctx, data)
- If successfully committed, data will appear in rd.CommittedEntries
- Apply committed entries to MVCC
- Return apply result to client
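The handling loop and the request lifecycle can be tied together in a self-contained toy. Everything here is hypothetical: the `Ready` struct is a trimmed-down stand-in with string entries, and `server` merely records what it would persist, send, and apply. It only illustrates the order of operations (persist, then send, then apply), not the real package API.

```go
package main

import "fmt"

// Ready is a hypothetical, trimmed-down batch of work handed to the server.
type Ready struct {
	Entries          []string // new entries to persist
	Messages         []string // messages to send to peers
	CommittedEntries []string // entries that are safe to apply
}

// server records what it persisted, sent, and applied.
type server struct {
	stable  []string
	sent    []string
	applied []string
}

// handle processes one batch: persist first, then send, then apply.
func (s *server) handle(rd Ready) {
	s.stable = append(s.stable, rd.Entries...)
	s.sent = append(s.sent, rd.Messages...)
	s.applied = append(s.applied, rd.CommittedEntries...)
}

// run drains the channel until it is closed.
func (s *server) run(readyc <-chan Ready) {
	for rd := range readyc {
		s.handle(rd)
	}
}

func main() {
	readyc := make(chan Ready, 2)
	// First batch: the proposal is appended and replicated.
	readyc <- Ready{Entries: []string{"put k=v"}, Messages: []string{"append to follower"}}
	// Second batch: the entry has been committed and can be applied.
	readyc <- Ready{CommittedEntries: []string{"put k=v"}}
	close(readyc)

	s := &server{}
	s.run(readyc)
	fmt.Println(s.stable, s.sent, s.applied)
}
```

The proposed data shows up in a later batch than the one that persisted it, which is why the client's result can only be returned after the apply step.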
Add/Remove a node
When the Leader receives a Configuration Change message, it sends the new configuration (called C-new below; the old one is C-old) to the other Followers as a special Raft entry. Any node that receives this entry starts using C-new immediately. Once the C-new log entry is committed, the Configuration Change is complete. TiKV and etcd, however, do not work this way: a node learns of the latest configuration only after the C-new entry has been both committed and applied. This approach is simpler, but a few points need attention:
- While the log contains a Configuration Change that has not yet been committed, new Configuration Change requests are not accepted, mainly to prevent a multi-Leader situation.
- If there are only two nodes and one of them must be removed, and the other node crashes after the Leader has issued the command, the system cannot recover.
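The first point, allowing at most one uncommitted Configuration Change at a time, can be sketched as a simple guard. The `leader` type, its fields, and `proposeConfChange` are hypothetical names for this example; the check follows the rule as stated above (compare against the committed index) and is not taken from any particular implementation.

```go
package main

import (
	"errors"
	"fmt"
)

var errPendingConfChange = errors.New("a configuration change is still uncommitted")

// leader is a hypothetical model that tracks only what the rule needs.
type leader struct {
	committed        uint64
	lastIndex        uint64
	pendingConfIndex uint64 // index of the latest proposed configuration change
}

// proposeConfChange appends a configuration change entry unless an earlier
// one has not been committed yet.
func (l *leader) proposeConfChange() error {
	if l.pendingConfIndex > l.committed {
		return errPendingConfChange
	}
	l.lastIndex++
	l.pendingConfIndex = l.lastIndex
	return nil
}

func main() {
	l := &leader{}
	fmt.Println(l.proposeConfChange()) // accepted
	fmt.Println(l.proposeConfChange()) // rejected: first one is uncommitted
	l.committed = l.lastIndex          // the first change commits
	fmt.Println(l.proposeConfChange()) // accepted again
}
```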
WAL
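To keep data safe (that is, recoverable after a crash or an outage), storage systems generally use a write-ahead log (WAL), and etcd is no exception. Every transactional operation in etcd (that is, every write) is first written to the log file.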
Snapshot
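As a highly available KV store, etcd cannot rely on log replay alone to recover data, so it also provides a snapshot feature. A snapshot periodically saves the entire database into a separate snapshot file. This shortens log replay time and also reduces the amount of WAL that must be stored, because WAL files older than the snapshot can be deleted.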
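To illustrate why a snapshot lets old WAL data go away, here is a minimal sketch that drops every log entry already covered by a snapshot. The `entry` type and the `compact` helper are hypothetical and operate on an in-memory slice; real WAL cleanup works on files and is more involved.

```go
package main

import "fmt"

// entry is a hypothetical log record.
type entry struct {
	Index uint64
	Data  string
}

// compact returns a copy of the entries with Index > snapIndex, i.e. those
// not yet covered by the snapshot. It assumes entries are sorted by Index.
func compact(entries []entry, snapIndex uint64) []entry {
	for i, e := range entries {
		if e.Index > snapIndex {
			return append([]entry(nil), entries[i:]...)
		}
	}
	return nil
}

func main() {
	wal := []entry{{1, "a"}, {2, "b"}, {3, "c"}, {4, "d"}}
	// A snapshot taken at index 2 makes entries 1 and 2 redundant.
	rest := compact(wal, 2)
	fmt.Println(len(rest), rest[0].Index) // 2 3
}
```

After a restart, recovery would load the snapshot and replay only the remaining entries.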
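Suppose a cluster has 3 nodes and a new node joins. With 4 voting members the quorum rises to 3, so if another Follower goes down while the Leader is still sending a Snapshot to the new Follower, only 2 nodes are up to date and the whole system cannot make progress. It recovers only after the new Follower has received the complete Snapshot. To solve this problem, we can introduce a Learner state: a newly added Learner node does not count toward the Quorum and cannot vote. Only after the Leader confirms that the Learner has received the whole Snapshot and can keep up with the Raft Log does it consider promoting the Learner to a normal node that can vote.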
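The arithmetic behind this scenario is easy to check. The sketch below assumes a simple majority quorum over voters only; `quorum` and `canMakeProgress` are hypothetical helper names for this example.

```go
package main

import "fmt"

// quorum returns the majority size for a given number of voters.
func quorum(voters int) int {
	return voters/2 + 1
}

// canMakeProgress reports whether enough voters are reachable and up to date.
func canMakeProgress(voters, healthyVoters int) bool {
	return healthyVoters >= quorum(voters)
}

func main() {
	// New node joins as a voter: 4 voters, but the new node is still
	// receiving the snapshot and one old follower is down, leaving 2.
	fmt.Println(canMakeProgress(4, 2)) // false
	// New node joins as a learner: still 3 voters, 2 of them healthy.
	fmt.Println(canMakeProgress(3, 2)) // true
}
```

Keeping the new node out of the voter set leaves the quorum at 2, so the cluster tolerates the loss of one Follower while the Learner catches up.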