最近又遇到了雙主的主庫(kù),備庫(kù)只讀经窖,監(jiān)控發(fā)現(xiàn)主庫(kù)大量延遲的情況坡垫。這里用主庫(kù)表示讀寫庫(kù)梭灿,用備庫(kù)表示只讀庫(kù)。存在問(wèn)題在于主庫(kù)的mi->rli->last_master_timestamp得不到更新冰悠,因?yàn)闆](méi)有event進(jìn)行修改胎源,而這個(gè)值長(zhǎng)期不更新,一般只有在從庫(kù)binlog切換的時(shí)候才會(huì)更新屿脐。那么只要show slave status滿足條件就會(huì)進(jìn)入計(jì)算延遲環(huán)節(jié)且只要計(jì)算延遲就很高,但是關(guān)鍵是什么時(shí)候回滿足條件宪卿,這是關(guān)鍵的诵。
這里,主庫(kù)會(huì)接受來(lái)自從庫(kù)的event佑钾,也就是事物的event會(huì)回傳西疤,但是event會(huì)過(guò)濾掉,但是主庫(kù)的pos是需要更新的休溶,這包含IO線程和SQL線程都需要更新自身的POS代赁,也就是主庫(kù)的show slave status看到的POS,這些POS需要跟上從庫(kù)的binlog位點(diǎn)兽掰。也就是在主庫(kù)大概有3個(gè)任務(wù)需要完成
1芭碍、主庫(kù)的IO線程會(huì)過(guò)濾掉從庫(kù)回傳的event,但是特殊的event不過(guò)濾
2孽尽、主庫(kù)的IO線程會(huì)更新讀取到的pos
3窖壕、主庫(kù)的SQL線程會(huì)更新應(yīng)用到的pos
那么IO線程更新和SQL線程更新是一個(gè)異步通知的過(guò)程,這是根本問(wèn)題杉女,很容易模擬瞻讽,就是將SQL線程更新應(yīng)用的pos線程給他停止掉這個(gè)時(shí)候show slave status就能出現(xiàn)主庫(kù)延遲,其中我們watch mi->master_log_pos觀察io線程更改pos熏挎,同時(shí)gdb 斷點(diǎn)exec_relay_log_event速勇,理論上就是不讓SQL線程更新pos,然后io線程更新了pos坎拐,這個(gè)時(shí)候會(huì)進(jìn)入延遲計(jì)算環(huán)節(jié)烦磁,開(kāi)啟gdb線程調(diào)試,如下哼勇,
- 斷點(diǎn)
3 hw watchpoint keep y *((my_off_t *) 0x65364a8)
breakpoint already hit 2 times
4 hw watchpoint keep y *((my_off_t *) 0x65364a8)
5 breakpoint keep y 0x0000000001848850 in exec_relay_log_event(THD*, Relay_log_info*) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_slave.cc:5161
- IO線程更新了pos watch mi->master_log_pos
Old value = 376
New value = 416
Hardware watchpoint 4: *((my_off_t *) 0x65364a8)
Old value = 376
New value = 416
Master_info::set_master_log_pos (this=0x65341f0, log_pos=416) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_mi.h:312
312 }
(gdb) c
Continuing.
Hardware watchpoint 3: *((my_off_t *) 0x65364a8)
Old value = 416
New value = 447
Hardware watchpoint 4: *((my_off_t *) 0x65364a8)
Old value = 416
New value = 447
Master_info::set_master_log_pos (this=0x65341f0, log_pos=447) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_mi.h:312
312 }
(gdb) c
Continuing.
- SQL線程不更新POS
(gdb) info threads
Id Target Id Frame
49 Thread 0x7fff385a3700 (LWP 9745) "mysqld" (running)
48 Thread 0x7fff385e5700 (LWP 9732) "mysqld" (running)
47 Thread 0x7fff50059700 (LWP 9731) "mysqld" (running)
46 Thread 0x7fff593f9700 (LWP 9730) "mysqld" (running)
45 Thread 0x7fff5943b700 (LWP 9729) "mysqld" (running)
44 Thread 0x7fff5947d700 (LWP 9728) "mysqld" exec_relay_log_event (thd=0x7ffed4000950, rli=0x653cda0) at /opt/percona-server-locks-detail-5.7.22/sql/rpl_slave.cc:5161
* 43 Thread 0x7fff594bf700 (LWP 9727) "mysqld" (running)
42 Thread 0x7fff59501700 (LWP 9674) "mysqld" (running)
41 Thread 0x7fff59543700 (LWP 9669) "mysqld" (running)
- show slave 計(jì)算延遲線程
(gdb) n
3836 max(0L, time_diff) : 0));
(gdb) p time_diff
$6 = 131
這里就是延遲的131秒
輸出顯示如下个初,
那么結(jié)合模擬的過(guò)程來(lái)看,實(shí)際上這種情況是不可避免的猴蹂,并且院溺,
- 如果庫(kù)的事務(wù)越多壓力越大主庫(kù)出現(xiàn)延遲的可能性就越高
- 如果主庫(kù)的壓力越大主庫(kù)出現(xiàn)延遲的可能性就越高