環(huán)境說明
一主兩從:
角色 | 主機(jī) | 組件 |
---|---|---|
pa-pg-1 | 172.22.138.220 | etcd趣避、patroni睹簇、postgresql |
pa-pg-2 | 172.22.138.219 | etcd、patroni、postgresql |
pa-pg-3 | 172.22.138.218 | etcd、patroni创肥、postgresql |
安裝部署好Patroni高可用后,主庫pa-pg-1和從庫pa-pg-3均啟動(dòng)正常值朋,而pa-pg-2的patroni日志一直輸出以下報(bào)錯(cuò):
...
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,419 INFO: Lock owner: pa-pg-1; I am pa-pg-2
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,421 INFO: Local timeline=3 lsn=0/DB000110
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 10:09:56 DB-test2 patroni: Traceback (most recent call last):
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 10:09:56 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 10:09:56 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 10:09:56 DB-test2 patroni: return next(self.gen)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 10:09:56 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 10:09:56 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 10:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 10:09:56 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 10:09:56 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
...
排查思路
1.首先一直關(guān)注的報(bào)錯(cuò)是"ERROR: Exception when working with leader"所以按照這個(gè)報(bào)錯(cuò)背蟆,首先去排查了patroni各節(jié)點(diǎn)的狀態(tài)难咕,但是通過節(jié)點(diǎn)狀態(tài)并沒有發(fā)現(xiàn)什么明顯異常窑业。
postgres # patronictl -c /etc/patroni.yml list
+ Cluster: pgsql (7400977424401590005) ----------+----+-----------+-----------------+-----------------------------+
| Member | Host | Role | State | TL | Lag in MB | Pending restart | Pending restart reason |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+
| pa-pg-1 | 172.22.138.220 | Leader | running | 3 | | | |
| pa-pg-2 | 172.22.138.219 | Replica | streaming | 3 | 0 | | |
| pa-pg-3 | 172.22.138.218 | Replica | streaming | 3 | 0 | * | max_connections: 10000->100 |
+---------+----------------+---------+-----------+----+-----------+-----------------+-----------------------------+
2.排查主庫的pg_stat_replication和從庫的pg_wal_stat_reciver恋沃,檢查后未見明顯異常
postgres=# select * from pg_stat_replication;
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
| pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | backend_xmin | state | sent_lsn | write_lsn | flush_lsn | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state | reply_time |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
| 12368 | 16384 | replica | pa-pg-2 | 172.22.138.219 | | 47170 | 2024-08-15 10:23:36.860733+08 | | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 | | | | 0 | async | 2024-08-15 11:10:18.855529+08 |
| 11101 | 16384 | replica | pa-pg-3 | 172.22.138.218 | | 55800 | 2024-08-15 09:57:31.073827+08 | | streaming | 0/DC000000 | 0/DC000000 | 0/DC000000 | 0/DC000000 | | | | 0 | async | 2024-08-15 11:10:19.992588+08 |
+-------+----------+---------+------------------+----------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------+
...
postgres=# select * from pg_stat_wal_receiver;
-[ RECORD 1 ]---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
pid | 18720
status | streaming
receive_start_lsn | 0/DB000000
receive_start_tli | 3
written_lsn | 0/DC000000
flushed_lsn | 0/DC000000
received_tli | 3
last_msg_send_time | 2024-08-15 11:11:08.623733+08
last_msg_receipt_time | 2024-08-15 11:11:08.623328+08
latest_end_lsn | 0/DC000000
latest_end_time | 2024-08-15 11:09:52.14472+08
slot_name | pa_pg_2
sender_host | 172.22.138.220
sender_port | 5432
conninfo | user=replica passfile=/home/postgres/pgpass channel_binding=prefer dbname=replication host=172.22.138.220 port=5432 application_name=pa-pg-2 fallback_application_name=pgsql sslmode=prefer sslcompression=0 sslsni=1 ssl_min_protocol_version=TLSv1.2 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
3.查看postgresql的server日志搪锣,更是一點(diǎn)信息都沒有休建。
2024-08-15 11:07:18 CST--- :LOG: redo starts at 0/DA0025B8
2024-08-15 11:07:18 CST--- :LOG: consistent recovery state reached at 0/DB000110
2024-08-15 11:07:18 CST--- :LOG: invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:07:18 CST--- :LOG: database system is ready to accept read-only connections
2024-08-15 11:07:18 CST--- :LOG: started streaming WAL from primary at 0/DB000000 on timeline 3
2024-08-15 11:10:06 CST--- :LOG: received fast shutdown request
2024-08-15 11:10:06 CST--- :LOG: aborting any active transactions
2024-08-15 11:10:06 CST-postgres-127.0.0.1-Patroni heartbeat :FATAL: terminating connection due to administrator command
2024-08-15 11:10:06 CST--- :FATAL: terminating walreceiver process due to administrator command
2024-08-15 11:10:06 CST--- :LOG: shutting down
2024-08-15 11:10:06 CST--- :LOG: database system is shut down
2024-08-15 11:10:10 CST--- :LOG: starting PostgreSQL 14.13 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44), 64-bit
2024-08-15 11:10:10 CST--- :LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-08-15 11:10:10 CST--- :LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2024-08-15 11:10:10 CST--- :LOG: database system was shut down in recovery at 2024-08-15 11:10:06 CST
2024-08-15 11:10:10 CST--- :LOG: entering standby mode
2024-08-15 11:10:10 CST--- :LOG: redo starts at 0/DA0025B8
2024-08-15 11:10:10 CST--- :LOG: consistent recovery state reached at 0/DB000110
2024-08-15 11:10:10 CST--- :LOG: invalid record length at 0/DB000110: wanted 24, got 0
2024-08-15 11:10:10 CST--- :LOG: database system is ready to accept read-only connections
2024-08-15 11:10:10 CST--- :LOG: started streaming WAL from primary at 0/DB000000 on timeline 3
4.隨后尋求外援盛杰,通過官方github的issues翼闽,有類似報(bào)錯(cuò)禽捆,建議為復(fù)制用戶replica添加postgres庫的connect權(quán)限,司馬當(dāng)活馬醫(yī)了笙什。去添加了相關(guān)的權(quán)限。
postgres=# grant CONNECT on database postgres to replica;
GRANT
5.重啟pa-pg-2服務(wù)胚想,查看patroni的日志發(fā)現(xiàn)琐凭,報(bào)錯(cuò)是一點(diǎn)沒變。此時(shí)把報(bào)錯(cuò)的重點(diǎn)放在了這條上面"psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs""
Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,422 ERROR: Exception when working with leader
Aug 15 11:09:56 DB-test2 patroni: Traceback (most recent call last):
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/rewind.py", line 80, in check_leader_is_not_in_recovery
Aug 15 11:09:56 DB-test2 patroni: with get_connection_cursor(connect_timeout=3, options='-c statement_timeout=2000', **conn_kwargs) as cur:
Aug 15 11:09:56 DB-test2 patroni: File "/usr/lib64/python3.6/contextlib.py", line 81, in __enter__
Aug 15 11:09:56 DB-test2 patroni: return next(self.gen)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/postgresql/connection.py", line 157, in get_connection_cursor
Aug 15 11:09:56 DB-test2 patroni: conn = psycopg.connect(**kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib/python3.6/site-packages/patroni/psycopg.py", line 104, in connect
Aug 15 11:09:56 DB-test2 patroni: ret = _connect(*args, **kwargs)
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/__init__.py", line 126, in connect
Aug 15 11:09:56 DB-test2 patroni: File "/usr/local/lib64/python3.6/site-packages/psycopg2/extensions.py", line 175, in make_dsn
Aug 15 11:09:56 DB-test2 patroni: psycopg2.ProgrammingError: invalid dsn: invalid connection option "target_session_attrs"
Aug 15 11:09:56 DB-test2 patroni: 2024-08-15 11:09:56,425 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
6.通過官方github的issues浊服,建議將psycopg2模塊的版本修改為psycopg2==2.8.6统屈,但是我的機(jī)服務(wù)器上胚吁,沒有這個(gè)模塊,正常啟動(dòng)的兩臺(tái)也沒有這個(gè)模塊愁憔。但是當(dāng)我檢查這個(gè)相關(guān)模塊時(shí)發(fā)現(xiàn)了問題腕扶。
#pa-pg-1和pa-pg-3
root # pip3 list|grep psycopg2
psycopg2-binary 2.9.8
#pa-pg-2
psycopg2-binary 2.9.5
7.通過上方發(fā)現(xiàn)patroni的所需要的模塊psycopg2-binary,三臺(tái)節(jié)點(diǎn)唯有啟動(dòng)報(bào)錯(cuò)的pa-pg-2版本與其他兩臺(tái)不一致吨掌。所以決定在pa-pg-2節(jié)點(diǎn)重裝這個(gè)模塊半抱。重裝后pa-pg-2的patroni日志不再輸出報(bào)錯(cuò)。
root # pip3 uninstall psycopg2-binary
root # pip3 install psycopg2-binary==2.9.8
...
Aug 15 11:23:14 DB-test2 systemd: Started patroni - a high-availability PostgreSQL.
Aug 15 11:23:14 DB-test2 systemd-logind: New session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd: Started Session 886 of user root.
Aug 15 11:23:14 DB-test2 systemd-logind: Removed session 886.
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter listen_addresses=0.0.0.0 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: postgresql parameter port=5432 failed validation, defaulting to None
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,225 INFO: No PostgreSQL configuration items changed, nothing to reload.
Aug 15 11:23:15 DB-test2 patroni: localhost:5432 - accepting connections
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,233 INFO: establishing a new patroni heartbeat connection to postgres
Aug 15 11:23:15 DB-test2 patroni: 2024-08-15 11:23:15,254 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
Aug 15 11:23:16 DB-test2 systemd-logind: New session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd: Started Session 887 of user root.
Aug 15 11:23:16 DB-test2 systemd-logind: Removed session 887.
Aug 15 11:23:16 DB-test2 patroni: 2024-08-15 11:23:16,423 INFO: no action. I am (pa-pg-2), a secondary, and following a leader (pa-pg-1)
總結(jié)
在進(jìn)行集群相關(guān)的應(yīng)用部署時(shí)膜宋,只是檢查好相關(guān)模塊安裝成功與否還不夠窿侈,還需要更細(xì)致的檢查相關(guān)的版本號(hào)是否一致。