前言
在使用 K8S 部署 RSS 全套自托管解決方案- RssHub + Tiny Tiny Rss, 我介紹了將 RssHub + Tiny Tiny RSS 部署到 K8s 集群中的方案. 其中 TTRSS 會用到 Postgres 存儲數據, 也一并部署到 K8s 容器中.
但是最近, 由于一次錯誤操作, 導致 Postgres 數據庫的 WAL 損壞, Postgres 的 Pod 頻繁 CrashBackoffLoop. 具體報錯如下:
Postgres shutdown exit code 1:
2023-09-27 02:32:17.127 UTC [1] LOG: received fast shutdown request
2023-09-27 02:32:17.181 UTC [1] LOG: aborting any active transactions
2023-09-27 02:32:17.434 UTC [1] LOG: background worker "logical replication launcher" (PID 26) exited with exit code 1
2023-09-27 02:32:17.481 UTC [21] LOG: shutting down
2023-09-27 02:32:17.880 UTC [1] LOG: database system is shut down
Postgres "invalid resource manager ID in primary checkpoint record" and "could not locate a valid checkpoint record"
2023-09-27 02:33:23.189 UTC [1] LOG: starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 02:33:23.190 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 02:33:23.190 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-09-27 02:33:23.199 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 02:33:23.210 UTC [21] LOG: database system was shut down at 2023-09-27 02:32:22 UTC
2023-09-27 02:33:23.210 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2023-09-27 02:33:23.210 UTC [21] PANIC: could not locate a valid checkpoint record
2023-09-27 02:33:24.657 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2023-09-27 02:33:24.657 UTC [1] LOG: aborting startup due to startup process failure
2023-09-27 02:33:24.659 UTC [1] LOG: database system is shut down
如上, WAL文件已損壞, 應該如何恢復?
恢復步驟
??Warning:
目的是啟動 Postgres 恢復應用的正常運行. 數據可能存在丟失.
這是一個 TTRSS feed 應用, 只供我自己使用, 只要能啟動起來, 丟失一點數據無所謂.
首先, Postgres Pod 在 CrashBackoffLoop, 無法進行任何操作, 首要任務是使 Pod 啟動起來, 不要關閉. 這里通過在 Deployment 添加一些命令來實現. 如下:
apiVersion: apps/v1
kind: Deployment
metadata:
...
spec:
...
template:
spec:
containers:
- image: postgres:13-alpine
imagePullPolicy: IfNotPresent
name: postgres
command: ["sh"]
args: ["-c", "tail -f /dev/null"]
...
如上, 通過 sh -c tail -f /dev/null
實現 Pod 運行. 也可以通過類似 while true; do sleep 30; done;
等類似命令來實現.
Pod 穩(wěn)定運行后, 通過 kubectl exec -it
進入該Pod:
k3s kubectl exec -it database-postgres-56cff865bb-92pcx -n rsshub -- /bin/sh
并切換到 postgres
用戶:
su - postgres
??Warning:
切換到
postgres
用戶方可執(zhí)行下面命令.
接下來就順利了, 使用 pg_reset_wal
恢復 WAL:
先用 --dry-run
看看運行結果:
pg_resetwal --dry-run /var/lib/postgresql/data/
如果結果符合預期, 再運行:
pg_resetwal /var/lib/postgresql/data/
Write-ahead log reset
成功后, 退出 Pod. 并移除 Deploy 的 command
和 args
后, postgres 即可正常啟動. 如下:
2023-09-27 04:03:25.172 UTC [1] LOG: starting PostgreSQL 13.5 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2023-09-27 04:03:25.173 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2023-09-27 04:03:25.173 UTC [1] LOG: listening on IPv6 address "::", port 5432
2023-09-27 04:03:25.179 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2023-09-27 04:03:25.187 UTC [20] LOG: database system was shut down at 2023-09-27 04:02:42 UTC
2023-09-27 04:03:25.210 UTC [1] LOG: database system is ready to accept connections
完成??????
三人行, 必有我?guī)? 知識共享, 天下為公. 本文由東風微鳴技術博客 EWhisper.cn 編寫.