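Case 1
[Problem Description]
The TiDB cluster reports an error during startup:
[FATAL] [main.go:111] ["run server failed"] [error="listen tcp 192.xxx.73.101:2380: bind: cannot assign requested address"]
[Root Cause Analysis]
Network problem: the listen address in the error cannot be bound on this host.
[Solution]
1. Use the ping command to check whether the IP is reachable.
2. Use the telnet command to test whether the port is reachable.
3. Avoid mixing internal and external IPs within the TiDB cluster wherever possible.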
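The reachability checks above can also be scripted. The following is a minimal sketch, not an official TiDB tool: the helper names `can_connect` and `can_bind` are made up for illustration, and any host or port you pass in is your own assumption about the deployment.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds.

    This is roughly what `telnet host port` tests.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def can_bind(host: str, port: int) -> bool:
    """Return True if this machine can bind a TCP socket to host:port.

    False means the bind was refused. One possible reason is that the IP
    is not configured on any local interface ("cannot assign requested
    address"); others are a port already in use or missing privileges.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        return False
    finally:
        sock.close()
```

Running `can_bind` on the PD host with the address from the inventory file shows whether the failing bind can be reproduced outside PD, and running `can_connect` from another node tests the port once the service is up.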
[Reference Case]
PD port fails to start
[Further Learning]
The ping command
https://blog.csdn.net/hebbely/article/details/54965989
The telnet command
https://blog.csdn.net/swazer_z/article/details/64442730
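Case 2
[Problem Description]
PD reports an error during startup:
[PANIC] [server.go:446] ["failed to recover v3 backend from snapshot"]
[error="failed to find database snapshot file (snap: snapshot file doesn't exist)"]
[Root Cause Analysis]
The server lost power, which caused data loss at the operating-system level.
[Solution]
1. After a power loss, directories may become read-only. Ask the operations staff to help recover the read-only files at the operating-system level.
2. If a PD node in the TiDB cluster cannot start, the pd-recover tool is the recommended way to recover it.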
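To check step 1 before calling in operations staff, one option is to look for file systems that the kernel has remounted read-only. The sketch below assumes a Linux host and the standard `/proc/mounts` layout; `find_readonly_mounts` is a hypothetical helper name, not part of any TiDB tooling.

```python
from typing import List


def find_readonly_mounts(mounts_text: str) -> List[str]:
    """Return the mount points whose option list contains 'ro'.

    Expects text in the /proc/mounts format:
    device mount_point fs_type options dump pass
    """
    readonly = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        options = fields[3].split(",")
        if "ro" in options:  # exact match, so "errors=remount-ro" is ignored
            readonly.append(fields[1])
    return readonly
```

A typical use is `find_readonly_mounts(open("/proc/mounts").read())`. Some pseudo file systems are read-only by design, so only entries covering the PD deploy or data directory are relevant here.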
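[Reference Case]
After a power outage, the TiDB cluster was restarted once power returned; PD nodes reported errors on startup, and two of the three PD nodes failed.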
https://asktug.com/t/tidb-pd-3-pd/1369
[Further Learning]
EXT4 file system study (5): data corruption after power loss, mount failure on reboot, and repair. For reference only, not a standard procedure; a failed fsck may itself corrupt data.
https://blog.csdn.net/TSZ0000/article/details/84664865
Case 3
[Problem Description]
TiKV reports an error during startup:
ERRO tikv-server.rs:155: failed to create kv engine: "Corruption: Sst file size mismatch: /data/tidb/deploy/data/db/67704904.sst. Size recorded in manifest 325143, actual size 0"
[Root Cause Analysis]
The server was restarted before the data had been synced to disk.
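The log above reports an SST file whose actual size is 0. To see how many files are affected, a short scan for zero-byte `.sst` files can help. This is a diagnostic sketch only: `find_empty_sst_files` is a hypothetical helper, it does not read the RocksDB MANIFEST, and it detects only completely empty files rather than every possible size mismatch.

```python
import os
from typing import List


def find_empty_sst_files(db_dir: str) -> List[str]:
    """Return the sorted paths of all zero-byte .sst files under db_dir."""
    empty = []
    for root, _dirs, files in os.walk(db_dir):
        for name in files:
            if not name.endswith(".sst"):
                continue
            path = os.path.join(root, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)
```

For the node in this case the call would be `find_empty_sst_files("/data/tidb/deploy/data/db")`, using the directory that appears in the error message.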
[Solution]
Take the node offline and then scale out again, following the scale-in and scale-out procedure:
https://pingcap.com/docs-cn/stable/how-to/scale/with-ansible/
This issue is expected to be fixed in a future release:
https://github.com/tikv/tikv/pull/4807
[Reference Case]
https://asktug.com/t/tikv/1375
[Further Learning]
RocksDB - MANIFEST
http://www.reibang.com/p/d1b38ce0d966
Case 4
[Problem Description]
TiKV reports an error during startup:
2019/04/30 18:11:27.625 ERRO panic_hook.rs:104: thread 'raftstore-11' panicked '[region 125868]323807 to_commit 181879 is out of range [last_index 181878]' at "/home/jenkins/.cargo/git/checkouts/raft-rs-841f8a6db665c5c0/b10d74c/src/raft_log.rs:248" stack backtrace:
[Root Cause Analysis]
"to_commit out of range" means that this peer is trying to commit a log entry that does not exist. It indicates that the most recent raft log entries were lost, either through a manual operation or through an abnormal event.
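When many TiKV logs need to be checked, the identifiers in this panic message can be pulled out with a regular expression. The sketch below is based only on the two message samples quoted in this case; `parse_to_commit_panic` is a hypothetical helper, and reading the number after the region tag as a peer ID is an assumption.

```python
import re
from typing import Dict, Optional

# Matches e.g. "[region 125868]323807 to_commit 181879 is out of range [last_index 181878]"
# and the variant with a space after the closing bracket.
PANIC_RE = re.compile(
    r"\[region (\d+)\]\s*(\d+) to_commit (\d+) is out of range \[last_index (\d+)\]"
)


def parse_to_commit_panic(line: str) -> Optional[Dict[str, int]]:
    """Extract the numbers from a 'to_commit ... out of range' panic line.

    Returns None if the line does not contain the message.
    """
    match = PANIC_RE.search(line)
    if match is None:
        return None
    region, peer, to_commit, last_index = (int(g) for g in match.groups())
    return {
        "region": region,
        "peer": peer,  # assumption: the number after the region tag is the peer ID
        "to_commit": to_commit,
        "last_index": last_index,
        "missing_entries": to_commit - last_index,
    }
```

The `region` value is the one to look up when locating the damaged region, and `missing_entries` shows how far the local raft log lags behind the index it was asked to commit.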
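[Solution]
1. Use the tikv-ctl tool to locate the corrupted region, specifying the db directory (the data directory of the damaged TiKV node).
2. Use tikv-ctl to repair the data.
2.1 The repair may fail with an error such as:
set_region_tombstone: StringError("The peer is still in target peers")
Setting a region to tombstone with tikv-ctl requires first checking that region's peer on the damaged node and cleaning it up manually: remove the abnormal peer.
2.2 Then run the tikv-ctl repair again.
[Reference Case]
TiKV reports ERRO panic_hook.rs:104: what is the cause?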
https://asktug.com/t/tikv-erro-panic-hook-rs-104/165
After a TiKV node went down, startup fails with "[region 32] 33 to_commit 405937 is out of range [last_index 405933]"
https://asktug.com/t/tikv-region-32-33-to-commit-405937-is-out-of-range-last-index-405933/1922
[Further Learning]
Raft log replication
http://www.reibang.com/p/b28e73eefa88
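Case 5
[Problem Description]
PD reports an error during startup:
FAILED - RETRYING: wait until the PD health page is available (12 retries left).
[Root Cause Analysis]
The IP address configuration is abnormal.
[Solution]
1. Check whether a mix of internal and external IPs is preventing connectivity.
2. Check whether the PD IP address was changed. If so, handle the PD node by scaling it in and then scaling it out again.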
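After the IP configuration is fixed, the "wait until the PD health page is available" behaviour can be imitated by hand with a small polling loop. This is a sketch of the retry pattern only, not the playbook's implementation: `wait_until_healthy` and `http_ok` are hypothetical helpers, the default of 12 retries is taken from the log message, and the health URL must be supplied by you.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET on url answers with status 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(check: Callable[[], bool],
                       retries: int = 12,
                       delay: float = 10.0) -> bool:
    """Call check() up to `retries` times, sleeping `delay` seconds between tries."""
    for attempt in range(1, retries + 1):
        if check():
            return True
        if attempt < retries:
            time.sleep(delay)
    return False
```

For example, `wait_until_healthy(lambda: http_ok(pd_health_url))`, where `pd_health_url` is a placeholder for the health URL that your deployment polls. If the call never succeeds from the control machine but does succeed on the PD host itself, that points back to the internal versus external IP problem described above.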
[Reference Case]
https://asktug.com/t/tidb/1563
[Further Learning]
TiDB Best Practices Series (2): Best practices for PD scheduling
https://pingcap.com/blog-cn/best-practice-pd/
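Case 6
[Problem Description]
TiDB cannot start, and tidb_stderr.log reports:
fatal error: runtime: out of memory
[Root Cause Analysis]
The kernel memory overcommit policy had been changed with echo 2 > /proc/sys/vm/overcommit_memory. In mode 2 the kernel refuses allocations that exceed the commit limit.
[Solution]
Restore the default policy with echo 0 > /proc/sys/vm/overcommit_memory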
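Before changing the setting, it can be useful to confirm which mode the host is actually in. The sketch below reads the same `/proc` file that the commands above write to; the helper names are hypothetical, and the descriptions are a paraphrase of the kernel's three documented modes.

```python
OVERCOMMIT_MODES = {
    0: "heuristic overcommit (kernel default)",
    1: "always overcommit",
    2: "strict accounting: allocations beyond the commit limit fail",
}


def describe_overcommit(raw: str) -> str:
    """Translate the content of overcommit_memory into a description."""
    mode = int(raw.strip())
    return OVERCOMMIT_MODES.get(mode, f"unknown mode {mode}")


def read_overcommit(path: str = "/proc/sys/vm/overcommit_memory") -> str:
    """Read the current mode from the running kernel (Linux only)."""
    with open(path) as f:
        return describe_overcommit(f.read())
```

If `read_overcommit()` reports strict accounting on a TiDB host, that matches the cause described in this case. A value written with echo lasts only until the next reboot unless it is also persisted in the system's sysctl configuration.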
[Reference Case]
Changing the memory usage policy caused TiDB to go offline automatically and then fail to start
https://asktug.com/t/tidb/1716
[Further Learning]
The overcommit_memory issue on Linux
https://blog.csdn.net/houjixin/article/details/46412557
Case 7
[Problem Description]
The TiDB cluster reports an error during startup:
Ansible FAILED! => playbook: start.yml; TASK: Check grafana API Key list; message: {"changed": false, "connection": "close", "content": "{\"message\":\"Invalid username or password\"}", "content_length": "42", "content_type": "application/json; charset=UTF-8", "date": "Wed, 25 Dec 2019 02:22:44 GMT", "json": {"message": "Invalid username or password"}, "msg": "Status code was 401 and not [200]: HTTP Error 401: Unauthorized", "redirected": false, "status": 401, "url": "http://192.168.179.112:3000/api/auth/keys"}
[Root Cause Analysis]
The Grafana password had been changed.
[Solution]
The user name and password configured in inventory.ini must also be updated to match the new credentials.
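To see which credentials the playbook will send, the values can be read out of inventory.ini and compared with what Grafana now expects. The sketch below assumes the variables are named `grafana_admin_user` and `grafana_admin_password` and are written as `key = "value"` lines. Both the key names and the helper `read_grafana_credentials` are assumptions, so check them against your own file.

```python
import re
from typing import Dict

# Assumed variable names; adjust if your inventory.ini uses different ones.
CREDENTIAL_KEYS = ("grafana_admin_user", "grafana_admin_password")

LINE_RE = re.compile(
    r'^\s*(' + "|".join(CREDENTIAL_KEYS) + r')\s*=\s*"?([^"\n]*)"?\s*$',
    re.MULTILINE,
)


def read_grafana_credentials(inventory_text: str) -> Dict[str, str]:
    """Return the Grafana credential variables found in the inventory text.

    Commented-out lines (starting with '#') are ignored.
    """
    return {key: value.strip() for key, value in LINE_RE.findall(inventory_text)}
```

If the password returned here differs from the one set in the Grafana UI, update inventory.ini and rerun start.yml.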
[Reference Case]
https://asktug.com/t/topic/2253
[Further Learning]
A complete breakdown of Grafana
http://www.reibang.com/p/7e7e0d06709b
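Case 8
[Problem Description]
The TiDB log reports an error while the TiDB cluster is starting:
[error="[global:3]critical error write binlog failed, the last error no avaliable pump to write binlog"]
[Root Cause Analysis]
The failure is caused by the state of the Pump and Drainer components.
[Solution]
Pump reported the error: fail to notify all living drainer: notify drainer. Start the Drainer, then take it offline properly; after that, start.yml runs successfully.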
[Reference Case]
The tidb service has started, but "wait until the TiDB port is up" fails
https://asktug.com/t/topic/2606
[Further Learning]
TiDB Binlog overview
https://pingcap.com/docs-cn/stable/reference/tidb-binlog/overview/