0. Intro
0.1 raft
raft 是一種分布式一致性算法.
簡(jiǎn)單來(lái)說, raft 的使用場(chǎng)景是 log replication.
關(guān)于分布式一致性算法, paxos, raft 等概念, 網(wǎng)上有大量的文章, 這里不再多做說明.
raft in go
比較成熟的 raft golang 實(shí)現(xiàn)有 hashicorp/raft 和 etcd/raft 兩個(gè)版本.
二者分別被用在 hashicorp/consul 和 coreos/etcd 中, 均有大量生產(chǎn)環(huán)境的使用案例.
有很多優(yōu)秀的開源項(xiàng)目使用二者之一, 比如
- hashicorp/raft:
- etcd/raft:
0.2 etcd/raft
相比而言, hashicorp/raft
比較容易上手; 而 etcd/raft
則基于簡(jiǎn)潔的抽象, 提供了更多可能性.
很多 etcd/raft
的使用者選擇自行實(shí)現(xiàn)高度定制的網(wǎng)絡(luò)傳輸層和持久化層等組件.
etcd/raft README 的 Usage 段落用了很大篇幅來(lái)描述user has a few responsibilities
, 這些定制化的組件也就是在這里做文章
可以參考 cockroach 的開發(fā)博客 Scaling Raft . (文中 Multi-Raft 的鏈接已經(jīng)失效, 這是因?yàn)?cockroach 的開發(fā)者發(fā)現(xiàn)這套實(shí)現(xiàn)很難從使用應(yīng)用中解耦出來(lái) etcd/issues/4932 )
TiDB 開發(fā)過程中遇到了類似的問題, 因此他們的底層存儲(chǔ) tikv 也選擇參考
etcd/raft
的實(shí)現(xiàn).
我們先簡(jiǎn)單介紹一下 etcd/raft
.
raftpb
raftpb 使用 protobuf 定義了基礎(chǔ)數(shù)據(jù)結(jié)構(gòu).
Node
這個(gè)接口定義基本說明了使用者能做哪些事情...
// Node represents a node in a raft cluster.
type Node interface {
// Tick increments the internal logical clock for the Node by a single tick. Election
// timeouts and heartbeat timeouts are in units of ticks.
Tick()
// Campaign causes the Node to transition to candidate state and start campaigning to become leader.
Campaign(ctx context.Context) error
// Propose proposes that data be appended to the log.
Propose(ctx context.Context, data []byte) error
// ProposeConfChange proposes config change.
// At most one ConfChange can be in the process of going through consensus.
// Application needs to call ApplyConfChange when applying EntryConfChange type entry.
ProposeConfChange(ctx context.Context, cc pb.ConfChange) error
// Step advances the state machine using the given message. ctx.Err() will be returned, if any.
Step(ctx context.Context, msg pb.Message) error
// Ready returns a channel that returns the current point-in-time state.
// Users of the Node must call Advance after retrieving the state returned by Ready.
//
// NOTE: No committed entries from the next Ready may be applied until all committed entries
// and snapshots from the previous one have finished.
Ready() <-chan Ready
// Advance notifies the Node that the application has saved progress up to the last Ready.
// It prepares the node to return the next available Ready.
//
// The application should generally call Advance after it applies the entries in last Ready.
//
// However, as an optimization, the application may call Advance while it is applying the
// commands. For example. when the last Ready contains a snapshot, the application might take
// a long time to apply the snapshot data. To continue receiving Ready without blocking raft
// progress, it can call Advance before finishing applying the last ready.
Advance()
// ApplyConfChange applies config change to the local node.
// Returns an opaque ConfState protobuf which must be recorded
// in snapshots. Will never return nil; it returns a pointer only
// to match MemoryStorage.Compact.
ApplyConfChange(cc pb.ConfChange) *pb.ConfState
// TransferLeadership attempts to transfer leadership to the given transferee.
TransferLeadership(ctx context.Context, lead, transferee uint64)
// ReadIndex request a read state. The read state will be set in the ready.
// Read state has a read index. Once the application advances further than the read
// index, any linearizable read requests issued before the read request can be
// processed safely. The read state will have the same rctx attached.
ReadIndex(ctx context.Context, rctx []byte) error
// Status returns the current status of the raft state machine.
Status() Status
// ReportUnreachable reports the given node is not reachable for the last send.
ReportUnreachable(id uint64)
// ReportSnapshot reports the status of the sent snapshot.
ReportSnapshot(id uint64, status SnapshotStatus)
// Stop performs any necessary termination of the Node.
Stop()
}
Storage
這是日志持久化層
但是實(shí)際上可以看到, 沒有要求提供寫的方法.
言下之意是 我只需要讀, 至于該怎么存, 存哪里, 請(qǐng)?jiān)?Node.Ready() 中自行解決
// Storage is an interface that may be implemented by the application
// to retrieve log entries from storage.
//
// If any Storage method returns an error, the raft instance will
// become inoperable and refuse to participate in elections; the
// application is responsible for cleanup and recovery in this case.
type Storage interface {
// InitialState returns the saved HardState and ConfState information.
InitialState() (pb.HardState, pb.ConfState, error)
// Entries returns a slice of log entries in the range [lo,hi).
// MaxSize limits the total size of the log entries returned, but
// Entries returns at least one entry if any.
Entries(lo, hi, maxSize uint64) ([]pb.Entry, error)
// Term returns the term of entry i, which must be in the range
// [FirstIndex()-1, LastIndex()]. The term of the entry before
// FirstIndex is retained for matching purposes even though the
// rest of that entry may not be available.
Term(i uint64) (uint64, error)
// LastIndex returns the index of the last entry in the log.
LastIndex() (uint64, error)
// FirstIndex returns the index of the first log entry that is
// possibly available via Entries (older entries have been incorporated
// into the latest Snapshot; if storage only contains the dummy entry the
// first log entry is not available).
FirstIndex() (uint64, error)
// Snapshot returns the most recent snapshot.
// If snapshot is temporarily unavailable, it should return ErrSnapshotTemporarilyUnavailable,
// so raft state machine could know that Storage needs some time to prepare
// snapshot and call Snapshot later.
Snapshot() (pb.Snapshot, error)
}
網(wǎng)絡(luò)傳輸
etcd/raft
沒有提供任何網(wǎng)絡(luò)傳輸層的接口定義.
與日志的持久化類似, 我只告訴你哪些 message 需要發(fā)出, 怎么發(fā), 發(fā)往哪里請(qǐng)自行解決 ??.
總得有一些開箱即用的東西...
對(duì)于日志持久化層, etcd/raft
提供了一個(gè)內(nèi)存版本的 Storage
實(shí)現(xiàn) MemoryStorage
, 通過 wal 落盤.
而 rafthttp 則提供了節(jié)點(diǎn)尋址和基于 http 的網(wǎng)絡(luò)傳輸能力...
可以參考一下 etcd 官方提供的 demo .
港真, 我在用 hashicorp/raft
寫了一些基本能用的小玩具之后看這個(gè) demo, 還是把我繞暈了.
0.3 dgraph
dgraph 是一款使用 go 語(yǔ)言開發(fā)的分布式圖數(shù)據(jù)庫(kù).
dgraph zero
zero 節(jié)點(diǎn)用于管理 dgraph 集群, 維護(hù)成員信息, 數(shù)據(jù)的 sharding 和 rebalancing.
我們借著閱讀 zero 的實(shí)現(xiàn)代碼來(lái)看一看 etcd/raft
的使用, 以及它的周邊組件的實(shí)現(xiàn)方式.