一、背景
- 在業(yè)務(wù)龐大溯香,運(yùn)維流程長(zhǎng)且復(fù)雜的場(chǎng)景下鲫构,為了提高運(yùn)維效率,減小系統(tǒng)對(duì)業(yè)務(wù)工人的依賴(耦合)玫坛,系統(tǒng)引入消息隊(duì)列
- 系統(tǒng)基于MySql實(shí)現(xiàn)的消息隊(duì)列 oxen-queue
- 在高并發(fā)的情況下结笨,消息隊(duì)列由于設(shè)計(jì)原因,有死鎖情況出現(xiàn)
- 造成的影響:入隊(duì)操作失敗湿镀,導(dǎo)致單據(jù)創(chuàng)建失斂晃;重新入隊(duì)不及時(shí)勉痴,導(dǎo)致業(yè)務(wù)處理慢
二赫模、環(huán)境
- 建表語(yǔ)句
create table Queue ( FId bigint unsigned auto_increment primary key, FBatchId bigint unsigned null, FJobType varchar(256) not null, FCreatedTs datetime default CURRENT_TIMESTAMP not null comment '創(chuàng)建時(shí)間', FStartedTs datetime null comment '開(kāi)始時(shí)間', FBody varchar(8096) null, FStatus varchar(256) default 'waiting' null, FResult text null, FRecovered int default 0 null, FRunningTime int null, FUniqueKey bigint null, FPriority bigint null, FRetryCount bigint default 0 not null, FCreateTime datetime default CURRENT_TIMESTAMP not null, constraint i_u_FUniqueKey unique (FUniqueKey) ) charset = utf8; create index i_FBatchId_FPriority on devops.Queue (FBatchId, FPriority); create index i_FCreatedTs on devops.Queue (FCreatedTs); create index i_FJobType_FBatchId_FStatus_FPriority on devops.Queue (FJobType, FBatchId, FStatus, FPriority); create index i_FStartedTs_FJobType_FStatus on devops.Queue (FStartedTs, FJobType, FStatus); create index i_FStatus on devops.Queue (FStatus); INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL);
- Queue相關(guān)DB操作
// 入隊(duì) INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL); // 重新入隊(duì) UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 1; // 超時(shí)任務(wù)處理 UPDATE T_DevOpsQueue SET FStatus='waiting', FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus = 'processing' AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeStatusQueue';
三、問(wèn)題定位
起因
用戶反饋單據(jù)創(chuàng)建異常蒸矛,通過(guò)相關(guān)信息瀑罗,定位到異常產(chǎn)生的原因是消息隊(duì)列死鎖導(dǎo)致-
系統(tǒng)異常日志
INSERT INTO `Queue` (`FBody`, `FCreateTime`, `FCreatedTs`, `FJobType`, `FPriority`, `FRetryCount`, `FStartedTs`, `FUniqueKey`) VALUES ('body', '2022-08-22 10:59:59', '2022-08-23 14:30:00', 'syncChangeStatusQueue', 1661137199157, 0, '2022-08-22 10:59:59', NULL) ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
首次入隊(duì)失敗,由于沒(méi)有回退流程雏掠,導(dǎo)致數(shù)據(jù)單據(jù)數(shù)據(jù)不一致
第二條錯(cuò)誤日志斩祭,這里是我通過(guò)在日志中發(fā)現(xiàn)的, -
統(tǒng)計(jì)系統(tǒng)死鎖(以
ER_LOCK_DEADLOCK
為關(guān)鍵字)
通過(guò)ER_LOCK_DEADLOCK
查找乡话,發(fā)現(xiàn)不只是Insert摧玫, 覺(jué)大部分為UpdateUPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody`= 'body' WHERE `FId` = 1 ER_LOCK_DEADLOCK: Deadlock found when trying to get lock; try restarting transaction
問(wèn):為什么異常發(fā)生頻率這么高,卻一直沒(méi)有被發(fā)現(xiàn)
答:Insert會(huì)引起業(yè)務(wù)異常(首次入隊(duì)失敗蚊伞,由于沒(méi)有回退流程席赂,導(dǎo)致數(shù)據(jù)單據(jù)數(shù)據(jù)不一致)吮铭,更容易關(guān)注到时迫,Updata(是重新入隊(duì))會(huì)被后臺(tái)Worker補(bǔ)償,所以沒(méi)有業(yè)務(wù)影響 初步分析-業(yè)務(wù)層面
系統(tǒng)錯(cuò)誤日志已經(jīng)明確谓晌,試圖從業(yè)務(wù)層面去分區(qū)死鎖并解決掠拳,
猜測(cè):由于業(yè)務(wù)處理復(fù)雜,整個(gè)流程全部在一個(gè)事務(wù)中處理(包含多次網(wǎng)絡(luò)請(qǐng)求)纸肉,初步認(rèn)為是事務(wù)使用不當(dāng)導(dǎo)致死鎖
解決:拆解事務(wù)溺欧,網(wǎng)絡(luò)請(qǐng)求等耗時(shí)操作移出事務(wù)
結(jié)果:并沒(méi)有改善喊熟,所以不是這里引起的(事后分析,入隊(duì)操作并沒(méi)有使用事務(wù)姐刁,所以一定不是這里引起的)
四芥牌、死鎖分析與處理
-
死鎖分析(一)
-
找DBA拉取死鎖日志(
show engine innodb status
)// 事物一 *** (1) TRANSACTION: TRANSACTION 264237645, ACTIVE 2 sec starting index read mysql tables in use 2, locked 2 LOCK WAIT 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53640626,client host: x.x.x.177 MySQL thread id 53640626, OS thread handle 139684660885248, query id 1839137253 x.x.x.177 x.x.x.177 db Sending data // 死鎖SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 6835010934527921, FStatus = "processing", FStartedTs = NOW() // 事物一聂使,等待的鎖 *** (1) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主鍵的記錄X鎖(排他鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237645 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二 *** (2) TRANSACTION: TRANSACTION 264237646, ACTIVE 1 sec starting index read mysql tables in use 2, locked 2 128 lock struct(s), heap size 24784, 265 row lock(s) // 事物ID:53950278壁拉,client host: x.x.x.198 MySQL thread id 53950278, OS thread handle 139684121917184, query id 1839137525 x.x.x.198 x.x.x.198 db Sending data // 死鎖SQL UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW() // 事物二,持有的鎖 *** (2) HOLDS THE LOCK(S): // 等待主鍵的記錄S鎖(共享鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock mode S locks rec but not gap Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二柏靶,等待的鎖 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: // 等待主鍵的記錄X鎖(排他鎖) RECORD LOCKS space id 217 page no 346233 n bits 96 index PRIMARY of table `db`.`Queue` trx id 264237646 lock_mode X locks rec but not gap waiting Record lock, heap no 25 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 // 事物二弃理,回滾 *** WE ROLL BACK TRANSACTION (2)
-
日志分析
疑惑:為什么業(yè)務(wù)系統(tǒng)異常日志信息和MySql死鎖日志信息不一致?
推測(cè): MySql中引起死鎖的Sql會(huì)影響業(yè)務(wù)系統(tǒng)中Sql屎蜓,導(dǎo)致的異常痘昌;不止一個(gè)死鎖先解決MySql死鎖日志中的: 死鎖語(yǔ)句是同一個(gè),但是是不同的Client執(zhí)行(不同的Host)炬转,系統(tǒng)的服務(wù)辆苔,部署在不同的機(jī)器(集群)并發(fā)引起的
- 死鎖SQL
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW()
-
通過(guò)日志,分析得到返吻,倆個(gè)事務(wù)持有姑子、等待鎖的情況
事務(wù)1 事務(wù)2 持有鎖 ... PRIMARY的記錄S鎖 等待鎖 PRIMARY的記錄X鎖 PRIMARY的記錄X鎖 -
雖然日志中沒(méi)有記錄事務(wù)1持有鎖的情況,但是不難推斷出测僵,是PRIMARY的記錄S鎖
-
分析一句Sql是怎么形成上面這樣的死鎖的
-
首先分析這條Sql會(huì)加哪些鎖(
select * from performance_schema.data_locks
)
sql鎖情況
- Sql語(yǔ)句中有子查詢街佑,并強(qiáng)制使用i_FJobType_FBatchId_FStatus_FPriority索引,所以這里獲取了二級(jí)索引(i_FJobType_FBatchId_FStatus_FPriority)的Next-Key共享鎖捍靠,
- 由于索引的機(jī)制沐旨,這里會(huì)回表,也就是通過(guò)二級(jí)索引榨婆,會(huì)獲取聚簇索引的記錄(
S,REC_NOT_GAP
)共享鎖磁携, - 通過(guò)主鍵Update,所以再獲取聚簇索引的記錄(
X,REC_NOT_GAP
)排他鎖
-
真實(shí)的死鎖原因
先加了聚簇索引的記錄共享鎖良风,然后再獲取聚簇索引的記錄排他鎖谊迄,在并發(fā)的情況下,同時(shí)持有共享鎖烟央,再獲取排他鎖時(shí)统诺,相互等待,形成死鎖
-
-
如何解決
- 問(wèn)題的根因已經(jīng)明確疑俭,這里主要共享鎖相互兼容粮呢,導(dǎo)致相同的Sql語(yǔ)句同時(shí)持有共享鎖導(dǎo)致,所以我們這里通過(guò)將共享鎖升級(jí)為排他鎖,解決啄寡,添加
FOR UPDATE
UPDATE Queue AS main INNER JOIN ( SELECT FId FROM Queue FORCE INDEX (i_FJobType_FBatchId_FStatus_FPriority) WHERE FBatchId IS NULL AND FStatus = "waiting" AND `FJobType` = 'syncChangeStatusQueue' AND FCreatedTs <= NOW() ORDER BY FPriority ASC LIMIT 10 FOR UPDATE ) sub ON sub.FId = main.FId SET `FBatchId` = 2026729855303369, FStatus = "processing", FStartedTs = NOW();
調(diào)整后Sql鎖情況
通過(guò)FOR UPDATE將共享鎖升級(jí)為排他鎖,順序加鎖,解決問(wèn)題
- 問(wèn)題的根因已經(jīng)明確疑俭,這里主要共享鎖相互兼容粮呢,導(dǎo)致相同的Sql語(yǔ)句同時(shí)持有共享鎖導(dǎo)致,所以我們這里通過(guò)將共享鎖升級(jí)為排他鎖,解決啄寡,添加
-
解決效果
通過(guò)統(tǒng)計(jì)觀察指定異常豪硅,發(fā)現(xiàn)這里的異常已經(jīng)消除
-
-
死鎖分析(二)
上面的異常是消除了,但是并沒(méi)有解決掉我發(fā)現(xiàn)系統(tǒng)日志的異常信息挺物,也就側(cè)面驗(yàn)證了我們?cè)谏厦娴囊苫罄粮。到y(tǒng)不止一個(gè)死鎖所以這里調(diào)整了日志統(tǒng)計(jì)信息,看所有的死鎖問(wèn)題
show engine innodb status
只顯示最近一次發(fā)生的死鎖日志识藤,如果死鎖頻繁嵌溢,可以通過(guò)系統(tǒng)全局變量innodb_print_all_deadlocks
置為ON,通過(guò)日志可以看到所有的死鎖信息-
找DBA拉取死鎖日志(
show engine innodb status
)蹋岩,發(fā)現(xiàn)上面的死鎖確實(shí)換成了另外一個(gè)*** (1) TRANSACTION: TRANSACTION 278046801, ACTIVE 0 sec starting index read mysql tables in use 1, locked 1 LOCK WAIT 5 lock struct(s), heap size 1136, 3 row lock(s) MySQL thread id 54720832, OS thread handle 139687080482560, query id 1895336202 x.x.x.177 x.x.x.177 mmpaydevopsweb Searching rows for update UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' *** (1) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046801 lock_mode X locks rec but not gap waiting Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) TRANSACTION: TRANSACTION 278046800, ACTIVE 0 sec updating or deleting mysql tables in use 1, locked 1 3 lock struct(s), heap size 1136, 2 row lock(s), undo log entries 1 MySQL thread id 55659707, OS thread handle 139684629968640, query id 1895336203 x.x.x.62 x.x.x.62 mmpaydevopsweb updating update `Queue` set `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 26846, `FBody` = '{\"job_type\":\"syncChangeStatusJobType\",\"payload\":{\"deploymentId\":41582},\"start_time\":10000}' where `FId` = 7446396 *** (2) HOLDS THE LOCK(S): RECORD LOCKS space id 217 page no 325711 n bits 104 index PRIMARY of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap Record lock, heap no 31 PHYSICAL RECORD: n_fields 16; compact format; info bits 0 *** (2) WAITING FOR THIS LOCK TO BE GRANTED: RECORD LOCKS space id 217 page no 435270 n bits 864 index i_FStatus of table `db`.`Queue` trx id 278046800 lock_mode X locks rec but not gap waiting Record lock, heap no 767 PHYSICAL RECORD: n_fields 2; compact format; info bits 0 *** WE ROLL BACK TRANSACTION (2)
-
日志分析
-
死鎖SQL
============================= 事物一 ====================================== UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'syncChangeGrayTimeQueue' ============================= 事物二 ====================================== UPDATE `Queue` SET `FStatus` = 'waiting', `FCreatedTs` = '2022-09-19 09:37:46.766', `FStartedTs` = '2022-09-19 09:37:46.766', `FBatchId` = NULL, `FRetryCount` = 1, `FBody` = 'body' WHERE `FId` = 150; ```
通過(guò)日志赖草,分析得到,倆個(gè)事務(wù)持有剪个、等待鎖的情況
事務(wù)1 事務(wù)2 持有鎖 ... PRIMARY的記錄X鎖 等待鎖 PRIMARY的記錄X鎖 二級(jí)索引FStatus的記錄X鎖 -
-
分析Sql是怎么形成上面這樣的死鎖的
- 首先分析這條Sql會(huì)加哪些鎖(
select * from performance_schema.data_locks
)
sql鎖情況
Sql-1
Sql-2
這里Sql2好像并沒(méi)有二級(jí)索引FStatus的X鎖秧骑,難道推斷出了問(wèn)題?并不是扣囊,該語(yǔ)句更新了FStatus乎折,而FStatus是一個(gè)二級(jí)索引(二級(jí)索引所在的列產(chǎn)生實(shí)際變化的更新),所以在更新前侵歇,需要對(duì)FStatus的二級(jí)索引對(duì)應(yīng)的記錄加鎖(隱式鎖)
所以實(shí)際的加鎖情況是
這種情況主要是不同的sql骂澄,加鎖的順序不一致,更新的條件是主鍵惕虑、二級(jí)索引坟冲,很容易產(chǎn)生死鎖
- 首先分析這條Sql會(huì)加哪些鎖(
-
如何解決
倆個(gè)語(yǔ)句如果加鎖順序一致,就可以避免這樣的問(wèn)題發(fā)生// 先查詢 SELECT FId FROM ${this.db_table} WHERE FStatus="processing" AND FStartedTs < (NOW() - INTERVAL ${this.job_timeout_seconds} SECOND) AND ? LIMIT 10 // where條件加上主鍵 UPDATE Queue SET FStatus="waiting", FBatchId = NULL, FStartedTs = NULL, FRecovered = 1 WHERE FId IN(1,2) //上面的ID FStatus="processing" AND FStartedTs < (NOW() - INTERVAL 60 SECOND) AND `FJobType` = 'jobType'
這里where條件為什么不去掉后面的條件溃蔫,只通過(guò)主鍵修改健提?原因是可能形成數(shù)據(jù)覆蓋
-
解決效果
-
五、疑惑
-
上面并沒(méi)有分析關(guān)于引發(fā)系統(tǒng)異常的Insert語(yǔ)句的死鎖伟叛,具體Insert為什么也會(huì)有死鎖的問(wèn)題呢私痹?
-
處理超時(shí)消息
-
首次入隊(duì)(Insert)
-
真實(shí)執(zhí)行鎖情況
這里可能有疑問(wèn)卷玉,事物二(Insert)為什么會(huì)有二級(jí)索引的X鎖绷雏,這個(gè)就要說(shuō)說(shuō)隱式鎖
為了減少鎖的數(shù)量,提高M(jìn)ysql的性能汞贸,在一些情況下侥蒙,可以不不加鎖暗膜,等判斷可能發(fā)生沖突時(shí)才加鎖,也就是延遲加鎖辉哥, -
結(jié)果:倆次解決效果
熱點(diǎn)行鎖桦山,影響系統(tǒng)性能
六、總結(jié)
排查分析的過(guò)程比較曲折醋旦,剛開(kāi)始只通過(guò)系統(tǒng)日志中的一句異常信息恒水,從系統(tǒng)上分析死鎖的原因,結(jié)果浪費(fèi)了很多時(shí)間饲齐,系統(tǒng)日志與MySql死鎖日志不一致也就糾結(jié)了好久钉凌,最后發(fā)現(xiàn),死鎖問(wèn)題不單單只有一個(gè)捂人,所以最后采用逐個(gè)擊破的方式御雕,最后解決了該問(wèn)題
記錄主要注意以下五點(diǎn),在開(kāi)發(fā)中可以很大程度的避免死鎖滥搭,
- 事務(wù)時(shí)間要短酸纲,原則是耗時(shí)操作不要放到事務(wù)中處理,最好將影響并發(fā)度的鎖,放到時(shí)候最后
- 遇到死鎖也不要慌瑟匆,通過(guò)死鎖日志分析是最有效的闽坡,但是要先了解清楚Mysql的鎖機(jī)制
- 一個(gè)表加太多索引,很容易導(dǎo)致死鎖愁溜,加索引前要仔細(xì)思考是否有必要疾嗅,不能無(wú)腦加,同時(shí)在寫Sql的時(shí)候也要考慮冕象,是不是存在死鎖的問(wèn)題代承,和那些會(huì)形成
- 死鎖有時(shí)候在并發(fā)度高的情況才會(huì)出現(xiàn),對(duì)業(yè)務(wù)壓測(cè)還是很有必要的
- 通過(guò)mysql工具可以清晰的分析Sql鎖的情況渐扮,很好的幫助我們定位問(wèn)題