PD Recover Quick Guide

Introduction

PD Recover is a disaster-recovery tool for PD. It restores a PD cluster that can no longer start or serve requests. PD Recover is downloaded together with tidb-ansible and is located at resources/bin/pd-recover.

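Quick Start

Get the Cluster ID

The Cluster ID can usually be found in the PD, TiKV, or TiDB logs. You can either run an ansible ad-hoc command from the control machine or read the logs directly on the servers.

(Recommended) Get [info] cluster id from the PD log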

ansible -i inventory.ini pd_servers -m shell -a 'cat {{deploy_dir}}/log/pd.log | grep "init cluster id" | head -10'

10.0.1.13 | CHANGED | rc=0 >>
[2019/10/14 10:35:38.880 +00:00] [INFO] [server.go:212] ["init cluster id"] [cluster-id=6747551640615446306]
……

Alternatively, get it from the TiDB or TiKV logs.

Get [info] cluster id from the TiDB log

ansible -i inventory.ini tidb_servers -m shell -a 'cat {{deploy_dir}}/log/tidb*.log | grep "init cluster id" | head -10'

10.0.1.15 | CHANGED | rc=0 >>
2019/10/14 19:23:04.688 client.go:161: [info] [pd] init cluster id 6747551640615446306
……

Get [info] PD cluster from the TiKV log

ansible -i inventory.ini tikv_servers -m shell -a 'cat {{deploy_dir}}/log/tikv* | grep "PD cluster" | head -10'

10.0.1.15 | CHANGED | rc=0 >>
[2019/10/14 07:06:35.278 +00:00] [INFO] [tikv-server.rs:464] ["connect to PD cluster 6747551640615446306"]
……
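
When the grep output is long, it can help to reduce it to the bare ID. The following is a minimal sketch and not part of PD Recover or tidb-ansible: the function name extract_cluster_ids is made up for illustration, and the patterns assume the three log line shapes shown above, which may differ in other versions.

```shell
# Print every distinct cluster ID found in log lines read from stdin.
# Assumes the PD, TiDB, and TiKV line formats shown in the samples above.
extract_cluster_ids() {
  sed -n -E \
    -e 's/.*\[cluster-id=([0-9]+)\].*/\1/p' \
    -e 's/.*init cluster id ([0-9]+).*/\1/p' \
    -e 's/.*connect to PD cluster ([0-9]+).*/\1/p' \
  | sort -u
}

# Usage (hypothetical): append "| extract_cluster_ids" to any of the
# ansible ad-hoc commands above.
```

More than one line of output means the logs contain more than one Cluster ID.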

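Get the Alloc ID (TiKV StoreID)

The value passed to alloc-id must be larger than the largest Alloc ID currently allocated. You can either run an ansible ad-hoc command from the control machine or read the logs directly on the servers.

(Recommended) Get [info] allocates id from the PD log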

ansible -i inventory.ini pd_servers -m shell -a 'cat {{deploy_dir}}/log/pd* | grep "allocates" | tail -10'

10.0.1.13 | CHANGED | rc=0 >>
[2019/10/15 03:15:05.824 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=3000]
[2019/10/15 08:55:01.275 +00:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=4000]
……

Alternatively, get it from the TiKV logs.

Get [info] alloc store id from the TiKV log

ansible -i inventory.ini tikv_servers -m shell -a 'cat {{deploy_dir}}/log/tikv* | grep "alloc store" | head -10'

10.0.1.13 | CHANGED | rc=0 >>
[2019/10/14 07:06:35.516 +00:00] [INFO] [node.rs:229] ["alloc store id 4 "]

10.0.1.14 | CHANGED | rc=0 >>
[2019/10/14 07:06:35.734 +00:00] [INFO] [node.rs:229] ["alloc store id 5 "]

10.0.1.15 | CHANGED | rc=0 >>
[2019/10/14 07:06:35.418 +00:00] [INFO] [node.rs:229] ["alloc store id 1 "]

10.0.1.21 | CHANGED | rc=0 >>
[2019/10/15 03:15:05.826 +00:00] [INFO] [node.rs:229] ["alloc store id 2001 "]

10.0.1.20 | CHANGED | rc=0 >>
[2019/10/15 03:15:05.987 +00:00] [INFO] [node.rs:229] ["alloc store id 2002 "]
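
Picking the value by eye is error-prone when many stores are listed, so the comparison can be scripted. This is a rough sketch under stated assumptions: max_alloc_id and suggest_alloc_id are hypothetical helper names, the patterns match only the two log formats shown above, and the default margin of 1000 is arbitrary. The only requirement given here is that the value exceeds the current maximum.

```shell
# Print the largest ID found in PD "alloc-id" or TiKV "alloc store id"
# log lines read from stdin.
max_alloc_id() {
  sed -n -E \
    -e 's/.*\[alloc-id=([0-9]+)\].*/\1/p' \
    -e 's/.*alloc store id ([0-9]+).*/\1/p' \
  | sort -n | tail -n 1
}

# Print a candidate -alloc-id: the given maximum plus a margin.
# $1 = current maximum, $2 = margin (assumed default: 1000).
suggest_alloc_id() {
  echo $(( $1 + ${2:-1000} ))
}
```

With the sample logs above, max_alloc_id prints 4000, so any larger value is acceptable, including the 10000 used in the pd-recover command later in this guide.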

Deploy a new PD cluster

ansible-playbook bootstrap.yml --tags=pd

ansible-playbook deploy.yml --tags=pd

ansible-playbook start.yml --tags=pd

Alternatively, reuse the old cluster: delete its data.pd directory, then restart the PD service.

Use pd-recover

pd-recover is located in the .../tidb-ansible/resources/bin directory on the control machine.

./pd-recover -endpoints http://10.0.1.13:2379 -cluster-id 6747551640615446306 -alloc-id 10000

Restart the PD cluster

ansible-playbook rolling_update.yml --tags=pd

Restart TiDB/TiKV

ansible-playbook rolling_update.yml --tags=tidb,tikv

Common Issues

Multiple Cluster IDs are found
Creating a new PD cluster generates a new Cluster ID. You can tell which Cluster ID belongs to the old cluster from the logs.
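
One way to tell the IDs apart is to look at when each was first logged. The sketch below is illustrative only: first_seen_per_cluster is a made-up name, it assumes the bracketed PD log format shown earlier, and it compares timestamps as plain text without normalizing time zones.

```shell
# Read PD log lines from stdin and print "<cluster-id> <first timestamp>"
# for each distinct ID, oldest first.
first_seen_per_cluster() {
  sed -n -E 's/^\[([^]]+)\].*\[cluster-id=([0-9]+)\].*/\2 \1/p' \
  | LC_ALL=C sort -k2,3 \
  | awk '!seen[$1]++'
}
```

If the old entries are still in the log, the first line printed is the ID that was logged earliest, which would normally be the old cluster's.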

pd-recover fails with dial tcp 10.0.1.13:2379: connect: connection refused
pd-recover needs a running PD service. Deploy and start the PD cluster first.
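
A quick reachability check before running pd-recover can catch this early. The sketch below is an assumption-laden example rather than an official procedure: pd_is_up is a hypothetical helper, it relies on curl being installed, and it assumes PD answers on the /pd/api/v1/members path of its HTTP API.

```shell
# Return 0 if the PD endpoint answers an HTTP request within 3 seconds,
# non-zero otherwise (connection refused, timeout, or HTTP error).
# $1 = endpoint, for example http://10.0.1.13:2379
pd_is_up() {
  curl -sf --max-time 3 "$1/pd/api/v1/members" > /dev/null
}

# Usage (hypothetical):
#   pd_is_up http://10.0.1.13:2379 && ./pd-recover -endpoints http://10.0.1.13:2379 ...
```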

Last edited: 2020.04.07 13:25:12
