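TiDB Startup Issue Notes

Case 1

[Problem Description]

The following error is reported while the TiDB cluster is starting:

[FATAL] [main.go:111] ["run server failed"] [error="listen tcp 192.xxx.73.101:2380: bind: cannot assign requested address"]

[Root Cause Analysis]

Network problem. "bind: cannot assign requested address" generally means that the configured listen IP is not assigned to any network interface on this host.

[Solution]

1. Use the ping command to check whether the IP is reachable.

2. Use the telnet command to test whether the port is reachable.

3. Avoid mixing internal and external IPs in a TiDB cluster wherever possible.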
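
The checks above can also be scripted. The following is a minimal Python sketch, not part of the original case: the function names `can_bind` and `can_connect` are made up for illustration. `can_bind` binds to port 0 (an ephemeral port), so a failure points at the address itself rather than at a port that is already in use. `can_connect` is a rough equivalent of the telnet test.

```python
import socket


def can_bind(ip: str, port: int = 0) -> bool:
    """Return True if a TCP socket can be bound to ip:port on this host.

    With port=0 the kernel picks a free port, so a failure usually means
    the IP is not assigned to a local interface.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((ip, port))
            return True
        except OSError:
            return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (telnet-style check)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, calling `can_bind` with the IP from the PD configuration on the PD host, and `can_connect` with that IP and port 2380 from another node, separates "address not local" from "port not reachable".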

[Reference Cases]

PD port fails to start

https://asktug.com/t/pd/638

[Further Reading]

The ping command

https://blog.csdn.net/hebbely/article/details/54965989

The telnet command

https://blog.csdn.net/swazer_z/article/details/64442730

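Case 2

[Problem Description]

PD reports the following error during startup:

[PANIC] [server.go:446] ["failed to recover v3 backend from snapshot"]

[error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]

[Root Cause Analysis]

The server lost power, which caused data loss at the operating system level.

[Solution]

1. After a power loss, directories may have become read-only. Ask the operations team to recover the read-only files at the operating system level.

2. If a PD node in the TiDB cluster cannot start, it is recommended to use the pd-recover tool to recover it.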

https://pingcap.com/docs-cn/stable/reference/tools/pd-recover/#pd-recover-%25E4%25BD%25BF%25E7%2594%25A8%25E6%2596%2587%25E6%25A1%25A3
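
Before handing the problem to operations, it can help to confirm whether the PD data directory really is read-only. The following Python sketch is only an illustration: the helper name `is_writable_dir` is hypothetical, and it simply tries to create and delete a temporary file in the given directory.

```python
import os
import tempfile


def is_writable_dir(path: str) -> bool:
    """Return True if a temporary file can be created and removed in path."""
    try:
        # mkstemp fails with OSError on a read-only or missing directory.
        fd, name = tempfile.mkstemp(prefix=".rw_probe_", dir=path)
    except OSError:
        return False
    os.close(fd)
    os.remove(name)
    return True
```

A result of False for the PD data directory is consistent with the read-only condition described in step 1, although a missing directory or a permission problem produces the same result.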

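[Reference Cases]

After a power outage, the TiDB cluster was restarted once power returned. Starting the PD nodes reported errors, and two of the three PD nodes failed.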

https://asktug.com/t/tidb-pd-3-pd/1369

[Further Reading]

EXT4 file system study (5): repairing a mount failure on reboot caused by power-loss data corruption. For reference only, not a standard procedure; a failed fsck may cause data corruption.

https://blog.csdn.net/TSZ0000/article/details/84664865

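Case 3

[Problem Description]

TiKV reports the following error during startup:

ERRO tikv-server.rs:155: failed to create kv engine: "Corruption: Sst file size mismatch: /data/tidb/deploy/data/db/67704904.sst. Size recorded in manifest 325143, actual size 0 "]

[Root Cause Analysis]

The server was restarted before the data had been synced to disk, so the SST file on disk is empty while the manifest still records its expected size.

[Solution]

Take the node offline and then scale out again. See the scale-out and scale-in steps: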

https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/

This issue will be fixed in a future release:

https://github.com/tikv/tikv/pull/4807
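
To see how many files are affected before taking the node offline, the data directory can be scanned for SST files with an actual size of 0, which is the symptom in the error above. This is a minimal Python sketch. The function name `find_empty_sst` is hypothetical, and the scan only reads file sizes; it does not compare them with the RocksDB manifest.

```python
import os
from typing import List


def find_empty_sst(db_dir: str) -> List[str]:
    """Return the sorted paths of all zero-byte .sst files under db_dir."""
    empty = []
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            if not name.endswith(".sst"):
                continue
            full_path = os.path.join(root, name)
            if os.path.getsize(full_path) == 0:
                empty.append(full_path)
    return sorted(empty)
```

With the path from the log, the call would be `find_empty_sst("/data/tidb/deploy/data/db")`.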

[Reference Cases]

TiKV node fails to start

https://asktug.com/t/tikv/1375

[Further Reading]

RocksDB - MANIFEST

http://www.reibang.com/p/d1b38ce0d966

Case 4

[Problem Description]

TiKV reports the following error during startup:

2019/04/30 18:11:27.625 ERRO panic_hook.rs:104: thread 'raftstore-11' panicked '[region 125868]323807 to_commit 181879 is out of range [last_index 181878]' at "/home/jenkins/.cargo/git/checkouts/raft-rs-841f8a6db665c5c0/b10d74c/src/raft_log.rs:248" stack backtrace:

[Root Cause Analysis]

"to_commit out of range" means that this peer wants to commit a log entry that does not exist. This indicates that the most recent raft log was lost, either through some manual operation or through an abnormal event.
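
The numbers in the panic message show how far the peer has fallen behind. The following Python sketch pulls them out with a regular expression. The function name `parse_to_commit_panic` is hypothetical, and reading the number after `[region N]` as the peer ID is an assumption based on the message layout, not something the original case states.

```python
import re
from typing import Dict, Optional

# Matches e.g. "[region 125868]323807 to_commit 181879 is out of range [last_index 181878]"
_PANIC_RE = re.compile(
    r"\[region (\d+)\]\s*(\d+) to_commit (\d+) is out of range \[last_index (\d+)\]"
)


def parse_to_commit_panic(line: str) -> Optional[Dict[str, int]]:
    """Extract region, peer (assumed), to_commit and last_index from a panic line."""
    match = _PANIC_RE.search(line)
    if match is None:
        return None
    region_id, peer_id, to_commit, last_index = (int(g) for g in match.groups())
    return {
        "region_id": region_id,
        "peer_id": peer_id,  # assumption: the second number is the peer ID
        "to_commit": to_commit,
        "last_index": last_index,
        # Number of log entries the peer is asked to commit but does not have.
        "missing_entries": to_commit - last_index,
    }
```

For the message in this case, the peer is asked to commit index 181879 while its log ends at 181878, so one entry is missing.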

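[Solution]

1. Use the tikv-ctl tool to locate the damaged region, specifying the db directory (the data directory of the damaged TiKV node).

2. Use tikv-ctl to repair the data.

2.1 If the repair fails with an error like the following:

set_region_tombstone: StringError("The peer is still in target peers")

Setting a region to tombstone with tikv-ctl first checks the region's peer on the damaged node. This error means that the peer is still listed among the region's peers, so it has to be cleaned up manually: remove the abnormal peer.

2.2 Then run the tikv-ctl repair again.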

[Reference Cases]

What causes the TiKV error "ERRO panic_hook.rs:104"

https://asktug.com/t/tikv-erro-panic-hook-rs-104/165

After a TiKV node went down, startup reports "[region 32] 33 to_commit 405937 is out of range [last_index 405933]"

https://asktug.com/t/tikv-region-32-33-to-commit-405937-is-out-of-range-last-index-405933/1922

[Further Reading]

Raft log replication

http://www.reibang.com/p/b28e73eefa88

Case 5

[Problem Description]

PD reports the following error during startup:

FAILED - RETRYING: wait until the PD health page is available (12 retries left).

[Root Cause Analysis]

The IP address is abnormal.

[Solution]

1. Check whether a mix of internal and external IPs is preventing connectivity.

2. Check whether the PD IP address has been changed. If so, handle PD by scaling out and scaling in.
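
The Ansible task in the error simply polls the PD health page until it responds. The same check can be reproduced by hand to see which address actually answers. The following Python sketch is an illustration under assumptions: the names `pd_health_ok` and `wait_until` are hypothetical, and the health URL (for example `http://<pd-ip>:2379/pd/api/v1/health`) is assumed and should be verified against your PD version.

```python
import http.client
import time
import urllib.request
from typing import Callable


def pd_health_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if an HTTP GET on url answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError.
        return False


def wait_until(probe: Callable[[], bool], retries: int = 12, delay: float = 5.0) -> bool:
    """Call probe up to `retries` times, sleeping `delay` seconds between attempts."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(delay)
    return False
```

A call such as `wait_until(lambda: pd_health_ok(url))` can be run once with the internal IP and once with the external IP. If only one of them succeeds, that supports the mixed-IP explanation in step 1.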

[Reference Cases]

How to update the cluster after a node's IP changes

https://asktug.com/t/ip/1106

TiDB cluster fails to start

https://asktug.com/t/tidb/1563

[Further Reading]

TiDB Best Practices Series (2): PD scheduling best practices

https://pingcap.com/blog-cn/best-practice-pd/

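Case 6

[Problem Description]

TiDB cannot start, and tidb_stderr.log reports:

fatal error: runtime: out of memory

[Root Cause Analysis]

vm.overcommit_memory had been set to 2 (strict accounting, where the kernel refuses allocations beyond the commit limit): echo 2 > /proc/sys/vm/overcommit_memory

[Solution]

Set it back to 0: echo 0 > /proc/sys/vm/overcommit_memory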
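
To confirm the current policy before and after the change, the value can be read back from /proc. The following Python sketch is a small illustration. The helper names are hypothetical, and the mode descriptions are summarized from the Linux kernel's overcommit documentation.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit, never refuse",
    2: "strict accounting, refuse allocations beyond the commit limit",
}


def describe_overcommit(raw: str) -> str:
    """Turn the raw content of overcommit_memory into a readable description."""
    mode = int(raw.strip())
    return f"{mode}: {OVERCOMMIT_MODES.get(mode, 'unknown mode')}"


def read_overcommit(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read and describe the current setting (Linux only)."""
    with open(path, "r", encoding="ascii") as f:
        return describe_overcommit(f.read())
```

On the affected host, `read_overcommit()` would be expected to report mode 2 before the fix and mode 0 afterwards. Note that a value written with echo does not persist across reboots unless it is also set in the sysctl configuration.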

[Reference Cases]

Changing the memory usage policy caused TiDB to go offline automatically and then fail to start

https://asktug.com/t/tidb/1716

[Further Reading]

The overcommit_memory issue on Linux

https://blog.csdn.net/houjixin/article/details/46412557

Case 7

[Problem Description]

The following error is reported while the TiDB cluster is starting:

Ansible FAILED! => playbook: start.yml; TASK: Check grafana API Key list; message: {"changed": false, "connection": "close", "content": "{\"message\":\"Invalid username or password\"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Wed, 25 Dec 2019 02:22:44 GMT", "json": {"message": "Invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "redirected": false, "status": 401, "url": "http://192.168.179.112:3000/api/auth/keys"}

[Root Cause Analysis]

The Grafana password had been changed.

[Solution]

Update the Grafana username and password configured in inventory.ini to match the new credentials.
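
Before rerunning the playbook, the new credentials can be tested against the same endpoint that appears in the error. The following Python sketch uses HTTP Basic authentication. The function names and the sample credentials are hypothetical, and whether `/api/auth/keys` is still available depends on the Grafana version.

```python
import base64
import urllib.error
import urllib.request


def grafana_keys_request(base_url: str, user: str, password: str) -> urllib.request.Request:
    """Build a GET request for /api/auth/keys with a Basic auth header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(base_url.rstrip("/") + "/api/auth/keys")
    request.add_header("Authorization", "Basic " + token)
    return request


def credentials_accepted(request: urllib.request.Request, timeout: float = 5.0) -> bool:
    """Return True on HTTP 200 and False on HTTP 401; other errors propagate."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return resp.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 401:
            return False
        raise
```

If `credentials_accepted` returns False for the values currently in inventory.ini and True for the new password, the file is the only thing that needs to change.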

[Reference Cases]

Error when starting the TiDB cluster

https://asktug.com/t/topic/2253

[Further Reading]

A complete breakdown of Grafana

http://www.reibang.com/p/7e7e0d06709b

Case 8

[Problem Description]