本案例是我真實(shí)遇到過的一個坑再来,也在前文中不止一次的提到畏纲,當(dāng)時也是非常納悶独柑,其實(shí)知道原因后只能說為什么會這么坑。
一缝其、觸發(fā)條件
本案列我測試過4個版本
percona Mysql 5.7.14
官方社區(qū) Mysql 5.7.17
percona Mysql 5.7.19
percona Mysql 5.7.15
其中percona Mysql 5.7.14和官方社區(qū) Mysql 5.7.17有這個問題挎塌。其他版本未知
- 已知percona Mysql 5.7.14或者官方社區(qū) Mysql 5.7.17。
- mysqldump備份沒有使用 -F, --flush-logs選項(xiàng)
- Gtid打開内边。
二榴都、故障描述
本故障主要是新搭建的Gtid主從庫,運(yùn)行一段時間后重啟主從必然報錯如下:
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
三漠其、故障分析
為什么重啟后會報錯找到不事務(wù)呢嘴高,然后發(fā)現(xiàn)這個Gtid事務(wù)在主庫的binlog中已經(jīng)沒有了,應(yīng)該是很久以前的和屎。其實(shí)這個問題我們要回到mysqldump出來的文件如何進(jìn)行Gtid的初始化以及mysql.gtid_executed表中拴驮。
在mysqldump不使用--set-gtid-purged的時候必然會在dump出來的腳本中包含
-- GTID state at the beginning of the backup
SET @@GLOBAL.GTID_PURGED='e859a28b-b66d-11e7-8371-000c291f347d:1-41';
這樣一個設(shè)置GTID_PURGED的語句,它包含了主庫上已經(jīng)執(zhí)行的全部Gtid事務(wù)柴信。從第五節(jié)的源碼和總結(jié)部分我們知道這個語句至少做了三個更改(DBA可見的只有三個):
- mysql.gtid_executed表的寫入
- gtid_executed變量的修改
- gtid_purged變量的修改
而完成了這一步實(shí)際上mysql.gtid_executed表是包含了全部的執(zhí)行過的Gtid事務(wù)的套啤,但是隨后我們看到dump腳本包含了如下語句
210 -- Table structure for table `gtid_executed`
211 --
212
213 DROP TABLE IF EXISTS `gtid_executed`;
214 /*!40101 SET @saved_cs_client = @@character_set_client */;
215 /*!40101 SET character_set_client = utf8 */;
216 CREATE TABLE `gtid_executed` (
217 `source_uuid` char(36) NOT NULL COMMENT 'uuid of the source where the transaction was originally executed.',
218 `interval_start` bigint(20) NOT NULL COMMENT 'First number of interval.',
219 `interval_end` bigint(20) NOT NULL COMMENT 'Last number of interval.',
220 PRIMARY KEY (`source_uuid`,`interval_start`)
221 ) ENGINE=InnoDB DEFAULT CHARSET=utf8;
222 /*!40101 SET character_set_client = @saved_cs_client */;
223
224 --
225 -- Dumping data for table `gtid_executed`
226 --
227
228 LOCK TABLES `gtid_executed` WRITE;
229 /*!40000 ALTER TABLE `gtid_executed` DISABLE KEYS */;
230 INSERT INTO `gtid_executed` VALUES ('e859a28b-b66d-11e7-8371-000c291f347d',1,32);
231 /*!40000 ALTER TABLE `gtid_executed` ENABLE KEYS */;
232 UNLOCK TABLES;
顯然這里我們在source的時候從庫的mysql.gtid_executed將被重新初始化為:
'e859a28b-b66d-11e7-8371-000c291f347d',1,32
而實(shí)際的已經(jīng)執(zhí)行過的Gtid是:
'e859a28b-b66d-11e7-8371-000c291f347d:1-41';
如前文第五節(jié)我們通過源碼分析后總結(jié)如下:
mysql.gtid_executed表修改時機(jī)
在binlog發(fā)生切換(rotate)的時候保存直到上一個binlog文件執(zhí)行過的全部Gtid,它不是實(shí)時更新的颠印。
因此此時表中并沒有完全包含全部執(zhí)行過的Gtid事務(wù)纲岭,而在前文第六節(jié)的源碼分析中我們知道在Gtid模塊啟動的時候必須要讀取兩個Gtid持久化的介質(zhì):
- mysql.gtid_executed
- binlog
來判斷Gtid的集合,顯然從庫不可能在binlog包含這個Gtid事務(wù)线罕,所以這樣的操作步驟就導(dǎo)致了數(shù)據(jù)庫從庫后的報錯止潮,而這里的正確的步驟是壓根不進(jìn)行mysql.gtid_executed的重建和導(dǎo)入,我發(fā)現(xiàn)在percona Mysql 5.7.15和percona Mysql 5.7.19正是這樣的钞楼。但是為了防范這個問題喇闸,我在搭建的Gtid從庫導(dǎo)完數(shù)據(jù)后加入了兩個個步驟如下:
reset master;
set global gtid_purged='e859a28b-b66d-11e7-8371-000c291f347d:1-41';
這兩步也就是為了從新初始化mysql.gtid_executed表询件,讓其正確燃乍。
此問題還可以在mysqldump的時候加入-F, --flush-logs選項(xiàng)規(guī)避,但是-F會加入如下的MDL LOCK:
2017-12-18T08:16:39.580985Z 6 Query FLUSH /*!40101 LOCAL */ TABLES
2017-12-18T08:16:39.612598Z 6 Query FLUSH TABLES WITH READ LOCK
2017-12-18T08:16:39.613406Z 6 Refresh
/root/mysql5.7.14/percona-server-5.7.14-7/sql/mysqld, Version: 5.7.14-7-debug-log (Source distribution). started with:
Tcp port: 13001 Unix socket: /root/mysql5.7.14/percona-server-5.7.14-7/mysql-test/var/tmp/mysqld.1.sock
Time Id Command Argument
2017-12-18T08:16:39.965847Z 6 Query SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ
2017-12-18T08:16:39.966298Z 6 Query START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */
2017-12-18T08:16:39.966792Z 6 Query SHOW VARIABLES LIKE 'gtid\_mode'
2017-12-18T08:16:39.969587Z 6 Query SELECT @@GLOBAL.GTID_EXECUTED
2017-12-18T08:16:39.970216Z 6 Query SHOW STATUS LIKE 'binlog_snapshot_%'
2017-12-18T08:16:39.975242Z 6 Query UNLOCK TABLES
這把鎖是GLOBAL級別的MDL_SHARED(S)鎖宛琅,它會等到你說有的SELECT/DML/DDL語句結(jié)束后才能獲得刻蟹,同時會堵塞全部的SELECT/DML/DDL雖然持有時間很短如下:
mysql> flush tables with read lock;
Query OK, 0 rows affected (0.01 sec)
2017-08-03T18:19:11.603911Z 3 [Note] (acquire_lock)THIS MDL LOCK acquire ok!
2017-08-03T18:19:11.603947Z 3 [Note] (>MDL PRINT) Thread id is 3:
2017-08-03T18:19:11.603971Z 3 [Note] (--->MDL PRINT) Namespace is:GLOBAL
2017-08-03T18:19:11.603994Z 3 [Note] (----->MDL PRINT) Mdl type is:MDL_SHARED(S)
2017-08-03T18:19:11.604045Z 3 [Note] (------>MDL PRINT) Mdl duration is:MDL_EXPLICIT
2017-08-03T18:19:11.604073Z 3 [Note] (------->MDL PRINT) Mdl status is:EMPTY
2017-08-03T18:19:11.604133Z 3 [Note] (acquire_lock)THIS MDL LOCK acquire ok!
當(dāng)然要了解MDL LOCK的朋友可以參考我的文章
http://blog.itpub.net/7728585/viewspace-2143093/
MYSQL METADATA LOCK(MDL LOCK)學(xué)習(xí)(1) 理論知識和加鎖類型測試
四、故障模擬
知道了原因后也是很好模擬我使用的版本是社區(qū)版5.7.17嘿辟,搭建過程就是前面說的步驟舆瘪。只是導(dǎo)完數(shù)據(jù)后不使用reset master和set gtid_purged表進(jìn)行重新初始化mysql.gtid_executed表。搭建完成后做幾個事務(wù)狀態(tài)正常如下:
mysql> show slave status \G
*************************** 1. row ***************************
Master_Log_File: binlog.000002
Read_Master_Log_Pos: 5077
Relay_Log_File: test1-relay-bin.000002
Relay_Log_Pos: 2498
Relay_Master_Log_File: binlog.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 5077
Relay_Log_Space: 2705
Last_IO_Errno: 0
Last_IO_Error:
Seconds_Behind_Master: 0
Retrieved_Gtid_Set: e859a28b-b66d-11e7-8371-000c291f347d:42-49
Executed_Gtid_Set: e859a28b-b66d-11e7-8371-000c291f347d:1-49
Auto_Position: 1
但是這個時候我們發(fā)現(xiàn)mysql.gtid_executed表已經(jīng)出現(xiàn)了問題如下:
mysql> select * from gtid_executed;
+--------------------------------------+----------------+--------------+
| source_uuid | interval_start | interval_end |
+--------------------------------------+----------------+--------------+
| e859a28b-b66d-11e7-8371-000c291f347d | 1 | 32 |
| e859a28b-b66d-11e7-8371-000c291f347d | 42 | 42 |
| e859a28b-b66d-11e7-8371-000c291f347d | 43 | 43 |
| e859a28b-b66d-11e7-8371-000c291f347d | 44 | 44 |
| e859a28b-b66d-11e7-8371-000c291f347d | 45 | 45 |
| e859a28b-b66d-11e7-8371-000c291f347d | 46 | 46 |
| e859a28b-b66d-11e7-8371-000c291f347d | 47 | 47 |
| e859a28b-b66d-11e7-8371-000c291f347d | 48 | 48 |
| e859a28b-b66d-11e7-8371-000c291f347d | 49 | 49 |
+--------------------------------------+----------------+--------------+
很容易發(fā)現(xiàn)33-41之間是沒有持久化的红伦。如果這個時候如果我們使用purge binary logs to 來清理掉主庫的日志那么必將出現(xiàn)問題英古,如果不清理也會出現(xiàn)Gtid事物重新執(zhí)行的情況。我們做清理模擬線上錯誤昙读。主庫執(zhí)行:
mysql> show binary logs;
+---------------+-----------+
| Log_name | File_size |
+---------------+-----------+
| binlog.000001 | 9974 |
| binlog.000002 | 5121 |
| binlog.000003 | 194 |
+---------------+-----------+
3 rows in set (0.01 sec)
mysql> purge binary logs to 'binlog.000003';
Query OK, 0 rows affected (0.04 sec)
mysql> show binary logs;
+---------------+-----------+
| Log_name | File_size |
+---------------+-----------+
| binlog.000003 | 194 |
+---------------+-----------+
1 row in set (0.00 sec)
備庫重啟后錯誤重現(xiàn):
mysql> show slave status \G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 192.168.190.62
Master_User: repl
Master_Port: 3308
Connect_Retry: 60
Master_Log_File: binlog.000003
Read_Master_Log_Pos: 194
Relay_Log_File: test1-relay-bin.000005
Relay_Log_Pos: 4
Relay_Master_Log_File: binlog.000003
Slave_IO_Running: No
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: 194
Relay_Log_Space: 194
Last_IO_Errno: 1236
Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Retrieved_Gtid_Set: e859a28b-b66d-11e7-8371-000c291f347d:42-49
Executed_Gtid_Set: e859a28b-b66d-11e7-8371-000c291f347d:1-32:42-49
Auto_Position: 1
我們發(fā)現(xiàn)I/O thread 試圖獲取主庫的33-41的Gtid事務(wù)的事務(wù)召调,已經(jīng)不能獲取,實(shí)際上即使能獲取也會造成事務(wù)的重新執(zhí)行蛮浑,我們看到Executed_Gtid_Set已經(jīng)出現(xiàn)了兩個連續(xù)的區(qū)間:
Executed_Gtid_Set: e859a28b-b66d-11e7-8371-000c291f347d:1-32:42-49
五唠叛、總結(jié)
前文已經(jīng)描述過mysql.gtid_executed表的作用和其更改時機(jī),如果我們對其有深刻的了解這個案例也是很容易分析的沮稚,當(dāng)然解決辦法在第八節(jié)主從搭建的步驟中我已經(jīng)給出了玻墅,也就是在搭建完成后進(jìn)行reset master和set global gtid_pruged兩步重新初始化一下mysql.gtid_executed表。