https://github.com/etcd-io/etcd/blob/master/Documentation/op-guide/
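1. Common commands
# etcdctl v3 syntax; on etcd 3.3.x run "export ETCDCTL_API=3" first
# List the cluster members (--endpoints takes the client URLs of existing members)
etcdctl --endpoints=${EXISTING_CLIENT_URLS} member list
# Scale out at runtime (add a member)
etcdctl --endpoints=${EXISTING_CLIENT_URLS} member add infra4 --peer-urls=${NEW_MEMBER_PEER_URLS}
# Scale in at runtime (remove a member by its member ID, as shown by member list)
etcdctl --endpoints=${EXISTING_CLIENT_URLS} member remove ${MEMBER_ID}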
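A small wrapper can keep the client endpoints and the API version in one place. This is only a sketch: the function name ectl and the ENDPOINTS and DRY_RUN variables are hypothetical, not part of etcd. It assumes the v3 command syntax used above, which on etcd 3.3.x has to be selected with ETCDCTL_API=3 because the 3.3 etcdctl defaults to the v2 API.

```shell
#!/bin/sh
# Sketch of a hypothetical etcdctl wrapper (ectl is not an etcd tool).
# ENDPOINTS: comma-separated *client* URLs of existing members (assumption).
# DRY_RUN=1 prints the command instead of running it.
ectl() {
  : "${ENDPOINTS:?set ENDPOINTS to the client URLs of existing members}"
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS $*"
  else
    # Force the v3 API so member add/remove accept the syntax shown above.
    ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" "$@"
  fi
}

# Example usage (placeholder URLs):
#   export ENDPOINTS=http://127.0.0.1:2379,http://127.0.0.1:22379
#   ectl member list
#   ectl member add infra4 --peer-urls=http://127.0.0.1:42380
```

Running with DRY_RUN=1 first is a cheap way to double-check the endpoints before changing membership.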
2. Key startup parameters
2.1. --initial-cluster-state
Initial cluster state ("new" or "existing"). Set to new for all members present during initial static or DNS bootstrapping.
If this option is set to existing, etcd will attempt to join the existing cluster. If the wrong value is set,
etcd will attempt to start but fail safely.
default: "new"
env variable: ETCD_INITIAL_CLUSTER_STATE
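- existing: the other members must be alive (reachable on their peer ports) when this node starts, otherwise startup fails. Use it when starting a new instance that was added with member add (scale-out).
- new: use it when starting the members listed in the cluster's initial configuration.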
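Since a node started with existing fails when its peers are down, a pre-flight check before starting can save a failed attempt. The sketch below assumes plain-http peer URLs as used throughout this document; other_peer_urls and check_peers are hypothetical helper names, and curl is used only to test that something answers on each peer address (any HTTP status counts as "up").

```shell
#!/bin/sh
# Print the peer URLs of every member except this one.
# $1: this node's name, $2: ETCD_INITIAL_CLUSTER ("name=url,name=url,...")
other_peer_urls() {
  printf '%s\n' "$2" | tr ',' '\n' | while IFS='=' read -r name url; do
    [ "$name" = "$1" ] || echo "$url"
  done
}

# Return non-zero if any other member's peer URL does not answer.
# Assumes plain http; without -f, curl exits 0 on any HTTP response.
check_peers() {
  rc=0
  for url in $(other_peer_urls "$1" "$2"); do
    if curl -s -o /dev/null --max-time 2 "$url"; then
      echo "peer up:   $url"
    else
      echo "peer DOWN: $url" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Example usage before starting with --initial-cluster-state existing:
#   check_peers "$ETCD_NAME" "$ETCD_INITIAL_CLUSTER" && etcd ...
```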
3. Common operations
3.1. How to scale in?
Use the member remove command, passing the ID of the member to remove.
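member remove expects the hexadecimal member ID, not the member name. The sketch below looks the ID up by name; it assumes the default "simple" output of ETCDCTL_API=3 etcdctl member list, whose fields are separated by ", " with the name in the third column. The helper names and the ENDPOINTS variable are hypothetical.

```shell
#!/bin/sh
# Assumed v3 simple format, one member per line:
#   <id>, <status>, <name>, <peer URLs>, <client URLs>[, <is learner>]
# stdin: member list output; $1: member name. Prints the matching ID.
member_id_by_name() {
  awk -F', ' -v name="$1" '$3 == name { print $1 }'
}

# Remove a member by name, using the client URLs in ENDPOINTS.
remove_member_by_name() {
  id=$(ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" member list | member_id_by_name "$1")
  if [ -z "$id" ]; then
    echo "no member named '$1'" >&2
    return 1
  fi
  ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" member remove "$id"
}
```

Refusing to continue on an empty ID avoids calling member remove with a missing argument.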
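3.2. How to scale out?
- Use the member add command. The console prints output like the following, i.e. the key startup parameters the new node needs to join the cluster:
# Key startup parameters for the new node; start it with exactly these values: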
ETCD_NAME="infra1"
ETCD_INITIAL_CLUSTER="infra3=http://127.0.0.1:32380,infra2=http://127.0.0.1:22380,infra1=http://127.0.0.1:12380"
ETCD_INITIAL_ADVERTISE_PEER_URLS="http://127.0.0.1:12380"
ETCD_INITIAL_CLUSTER_STATE="existing"
- When starting the new instance, --name, --initial-advertise-peer-urls, --initial-cluster-state and --initial-cluster must match the console output, otherwise startup fails. Example:
etcd \
--name ${ETCD_NAME} \
--listen-client-urls http://127.0.0.1:42379 \
--advertise-client-urls http://127.0.0.1:42379 \
--listen-peer-urls ${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-advertise-peer-urls ${ETCD_INITIAL_ADVERTISE_PEER_URLS} \
--initial-cluster-state ${ETCD_INITIAL_CLUSTER_STATE} \
--initial-cluster ${ETCD_INITIAL_CLUSTER}
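3.3. What if the data directory is lost or accidentally deleted, and the node fails to start or reports errors when joining the cluster?
3.3.1. Procedure
Member information is persisted to disk, so a node whose data was lost must rejoin as a new member. Follow these steps strictly:
- Remove the failed node: use member remove to evict the failed member, keeping the rest of the cluster healthy.
- Wipe the data directory completely: stop the failed node, then delete its data dir so that no stale member information remains.
- Scale out: use member add to add back the node from step 1. See 3.2.
- Restart: start the node from step 1 as described in 3.2.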
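The order of the steps above matters, so it can help to script it as a plan that is printed before anything runs. This is a dry-run sketch, not a tested tool: recover_member and run are hypothetical names, it assumes it runs on the failed node, that etcd there is managed by a systemd unit called etcd, and that ENDPOINTS holds the client URLs of the healthy members. Set DRY_RUN=0 to actually execute.

```shell
#!/bin/sh
# Print each command; execute it only when DRY_RUN=0 (default: dry run).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" != 1 ]; then "$@" || exit 1; fi
}

# $1: failed member's ID, $2: its name, $3: its peer URL, $4: its data dir
recover_member() {
  member_id=$1 name=$2 peer_url=$3 data_dir=${4:?data dir required}
  : "${ENDPOINTS:?set ENDPOINTS to the client URLs of the healthy members}"
  # Step 1: evict the failed member so the remaining members stay healthy.
  run env ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" member remove "$member_id"
  # Step 2: stop etcd on the failed node (assumed systemd unit), wipe its data dir.
  run systemctl stop etcd
  run rm -rf "$data_dir"
  # Step 3: re-add the node; it gets a new member ID.
  run env ETCDCTL_API=3 etcdctl --endpoints="$ENDPOINTS" member add "$name" --peer-urls="$peer_url"
  # Step 4: start etcd with the ETCD_* values printed by step 3 (see 3.2).
  echo "# now start $name with --initial-cluster-state existing"
}
```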
3.3.2. Common error logs when the steps are not followed correctly
- After data loss, starting with --initial-cluster-state="new" produces the log below, reporting: member ddd67b312462fd7b has already been bootstrapped
2019-07-09 00:24:55.880988 I | etcdmain: etcd Version: 3.3.10
2019-07-09 00:24:55.881077 I | etcdmain: Git SHA: 27fc7e2
2019-07-09 00:24:55.881082 I | etcdmain: Go Version: go1.10.4
2019-07-09 00:24:55.881089 I | etcdmain: Go OS/Arch: darwin/amd64
2019-07-09 00:24:55.881093 I | etcdmain: setting maximum number of CPUs to 8, total number of available CPUs is 8
2019-07-09 00:24:55.881099 N | etcdmain: failed to detect default host (default host not supported on darwin_amd64)
2019-07-09 00:24:55.881106 W | etcdmain: no data-dir provided, using default data-dir ./infra1.etcd
2019-07-09 00:24:55.881236 I | embed: listening for peers on http://127.0.0.1:12380
2019-07-09 00:24:55.881254 I | embed: pprof is enabled under /debug/pprof
2019-07-09 00:24:55.881299 I | embed: listening for client requests on 127.0.0.1:2380
2019-07-09 00:24:55.883626 C | etcdmain: member ddd67b312462fd7b has already been bootstrapped
- After data loss, starting with --initial-cluster-state="existing" produces the log below, reporting: Was the raft log corrupted, truncated, or lost?
tocommit(10) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
panic: tocommit(10) is out of range [lastIndex(0)]. Was the raft log corrupted, truncated, or lost?
goroutine 135 [running]:
github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog.(*PackageLogger).Panicf(0xc42000a660, 0x1c0cad8, 0x5d, 0xc42000a160, 0x2, 0x2)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/pkg/capnslog/pkg_logger.go:75 +0x162
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raftLog).commitTo(0xc420277500, 0xa)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/log.go:191 +0x15c
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).handleHeartbeat(0xc420244300, 0x8, 0xddd67b312462fd7b, 0x9e737febb6b99eee, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1194 +0x54
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.stepFollower(0xc420244300, 0x8, 0xddd67b312462fd7b, 0x9e737febb6b99eee, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:1140 +0x3ff
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*raft).Step(0xc420244300, 0x8, 0xddd67b312462fd7b, 0x9e737febb6b99eee, 0x5, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/raft.go:868 +0x12f1
github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft.(*node).run(0xc4201df320, 0xc420244300)
/tmp/etcd-release-3.3.10/etcd/release/etcd/gopath/src/github.com/coreos/etcd/cmd/vendor/github.com/coreos/etcd/raft/node.go:323 +0x1059
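- Steps 1 and 3 of 3.3.1 were performed correctly, but step 2 was skipped and the node was started incorrectly in between, leaving stale member information on disk. The error log is as follows: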
2019-07-09 01:24:19.311630 E | rafthttp: failed to find member 9e737febb6b99eee in cluster 73841b4a9097c907
2019-07-09 01:24:19.311710 E | rafthttp: failed to find member 628170c800dbcee in cluster 73841b4a9097c907
2019-07-09 01:24:19.410573 E | rafthttp: failed to find member 9e737febb6b99eee in cluster 73841b4a9097c907
2019-07-09 01:24:19.410616 E | rafthttp: failed to find member 628170c800dbcee in cluster 73841b4a9097c907
2019-07-09 01:24:19.410678 E | rafthttp: failed to find member 9e737febb6b99eee in cluster 73841b4a9097c907
2019-07-09 01:24:19.410767 E | rafthttp: failed to find member 628170c800dbcee in cluster 73841b4a9097c907
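The three failure signatures above are distinctive enough to match automatically. The sketch below is a hypothetical triage helper: it only searches for the message fragments quoted in this section and maps each to the cause described here; it is not an etcd feature and does not recognize any other error.

```shell
#!/bin/sh
# Read etcd log text on stdin; print "<code>: <hint>" for the 3.3.2 patterns.
classify_etcd_log() {
  log=$(cat)
  case $log in
    *"has already been bootstrapped"*)
      echo "already-bootstrapped: data dir lost, restarted with --initial-cluster-state=new; redo 3.3.1" ;;
    *"raft log corrupted, truncated, or lost"*)
      echo "raft-log-lost: data dir lost, restarted with existing without remove/re-add; redo 3.3.1" ;;
    *"failed to find member"*)
      echo "stale-member: old member info left on disk; stop etcd, wipe the data dir, then re-add (3.3.1)" ;;
    *)
      echo "unknown: not one of the 3.3.2 patterns" ;;
  esac
}
```

For example, classify_etcd_log < /path/to/etcd.log (the path is a placeholder). When several patterns occur in one log, the first matching branch wins.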