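Kafka version: 0.10.2
Problem description
Our production Kafka producer throws the exception below roughly once a week and then stops; a restart brings it back. Monitoring shows that NIC traffic was steady before the exception, with no jitter. (Strange, isn't it?)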
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.
Tracing the logs
server.log
org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition
state-change.log
[2019-02-25 15:01:03,023] TRACE Broker 2 received LeaderAndIsr request PartitionState(controllerEpoch=8, leader=3, leaderEpoch=10, isr=[1, 2, 3], zkVersion=19, replicas=[1, 2, 3]) correlation id 202 from controller 1 epoch 8 for partition [__consumer_offsets,36] (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 handling LeaderAndIsr request correlationId 202 from controller 1 epoch 8 starting the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,023] TRACE Broker 2 stopped fetchers as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 truncated logs and checkpointed recovery boundaries for partition __consumer_offsets-36 as part of become-follower request with correlation id 202 from controller 1 epoch 8 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 started fetcher to new leader as part of become-follower request from controller 1 epoch 8 with correlation id 202 for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,044] TRACE Broker 2 completed LeaderAndIsr request correlationId 202 from controller 1 epoch 8 for the become-follower transition for partition __consumer_offsets-36 (state.change.logger)
[2019-02-25 15:01:03,045] TRACE Broker 2 cached leader info (LeaderAndIsrInfo:(Leader:3,ISR:1,2,3,LeaderEpoch:10,ControllerEpoch:8),ReplicationFactor:3),AllReplicas:1,2,3) for partition __consumer_offsets-36 in response to UpdateMetadata request sent by controller 1 epoch 8 with correlation id 203 (state.change.logger)
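Following the keywords received --> handling --> stopped --> truncated --> started --> completed LeaderAndIsr request --> cached leader info, we can see that the partition leader election is very fast, at the millisecond level (about 22 ms from the first log entry to the last).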
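The millisecond figure can be checked directly from the log timestamps. The following is a minimal sketch, assuming the timestamp layout shown in the log lines above; the class and method names are hypothetical and chosen only for illustration.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class TransitionDuration {
    // Timestamp layout used by the log lines quoted above, e.g. "2019-02-25 15:01:03,023".
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");

    /** Returns the milliseconds elapsed between two log timestamps. */
    public static long elapsedMillis(String start, String end) {
        LocalDateTime from = LocalDateTime.parse(start, LOG_TIME);
        LocalDateTime to = LocalDateTime.parse(end, LOG_TIME);
        return Duration.between(from, to).toMillis();
    }

    public static void main(String[] args) {
        // First ("received") and last ("cached leader info") entries in the log excerpt.
        long ms = elapsedMillis("2019-02-25 15:01:03,023", "2019-02-25 15:01:03,045");
        System.out.println(ms + " ms"); // prints "22 ms"
    }
}
```

Note that this only measures the become-follower transition as seen by broker 2 in this excerpt.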
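Error analysis
Our producer code did not set the retries parameter, so each message was sent only once (the default in this version is retries=0). When a leader election happens, the producer cannot find the leader, the send fails, and the program stops.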
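Solution
Add retries=3 on the producer side so that a failed send is retried up to three times (by default a retry happens every 100 ms, controlled by retry.backoff.ms).
If you also need to preserve message ordering, remember to add max.in.flight.requests.per.connection = 1. This parameter limits the number of unacknowledged requests the client may have outstanding on a single connection; a value of 1 means the client cannot send another request to the same broker until the broker has responded to the previous one. (Note: this setting is for scenarios that require strictly ordered consumption, such as binlog data.)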
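The settings above can be sketched as a producer configuration. This is a minimal sketch, not our actual producer code: the class name, the build method, the broker address, and the choice of StringSerializer are all assumptions made for illustration. Only the three configuration keys discussed above come from the fix itself.

```java
import java.util.Properties;

public class ProducerRetryConfig {
    /**
     * Builds producer properties with retries enabled.
     * bootstrapServers is a placeholder; strictOrdering toggles the in-flight limit.
     */
    public static Properties build(String bootstrapServers, boolean strictOrdering) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        // Serializers are an assumption; use whatever matches your message types.
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        // Retry a failed send up to three times, e.g. while a new leader is being elected.
        props.put("retries", "3");
        // Wait 100 ms between retries (this is also the default value).
        props.put("retry.backoff.ms", "100");
        if (strictOrdering) {
            // Allow only one unacknowledged request per connection so retries cannot reorder messages.
            props.put("max.in.flight.requests.per.connection", "1");
        }
        return props;
    }

    public static void main(String[] args) {
        // "localhost:9092" is a hypothetical broker address.
        Properties props = build("localhost:9092", true);
        System.out.println("retries=" + props.getProperty("retries"));
        System.out.println("max.in.flight.requests.per.connection="
                + props.getProperty("max.in.flight.requests.per.connection"));
        // With kafka-clients on the classpath, these properties would be passed to
        // new KafkaProducer<String, String>(props).
    }
}
```

Limiting in-flight requests to 1 lowers throughput, so it is only worth enabling for topics where ordering really matters.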