Fixing the Redisson WatchDog Failure After Reconnection

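Redisson's distributed lock provides a WatchDog feature: if you acquire a lock without specifying a lease time, Redisson sets a default lease time for you and keeps renewing it until you release the lock yourself. This guarantees that the code guarded by the lock is not executed by other threads while the lock is held, and it also prevents deadlocks.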

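While working on a project recently, however, I found that the WatchDog stopped working after my Redisson client disconnected and reconnected. I traced through the Redisson code, found the cause, and am sharing it here.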

Reproducing the problem

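String name = "REDIS_LOCK";
try {
    // tryLock() with no arguments sets no lease time, so the WatchDog is used
    if (!redissonClient.getLock(name).tryLock()) {
        return;
    }
    doSomething();
} catch (Exception e) {
    // Release the lock only if it is still held by the current thread
    RLock rLock = redissonClient.getLock(name);
    if (rLock.isLocked() && rLock.isHeldByCurrentThread()) {
        rLock.unlock();
    }
}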

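The project uses tryLock(): threads keep trying to acquire the lock, and a thread that gets it starts running the business code. While a thread holds the lock without releasing it, the WatchDog is active and keeps extending the lock's expiration. If the network is then cut for a while, Redisson reports the error below. Because Redis is unreachable, the WatchDog stops renewing, and the lock expires once the default timeout elapses.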

2020-11-06 14:56:53.682 [redisson-timer-4-1] ERROR org.redisson.RedissonLock - Can't update lock REDIS_LOCK expiration
org.redisson.client.RedisResponseTimeoutException: Redis server response timeout (3000 ms) occured after 3 retry attempts. Increase nettyThreads and/or timeout settings. Try to define pingConnectionInterval setting. Command: null, params: null, channel: [id: 0x1e676dd8, L:/192.168.20.49:58477 - R:/192.168.2.21:6379]
    at org.redisson.command.RedisExecutor$3.run(RedisExecutor.java:333)
    at io.netty.util.HashedWheelTimer$HashedWheelTimeout.expire(HashedWheelTimer.java:672)
    at io.netty.util.HashedWheelTimer$HashedWheelBucket.expireTimeouts(HashedWheelTimer.java:747)
    at io.netty.util.HashedWheelTimer$Worker.run(HashedWheelTimer.java:472)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)

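After the network recovers and the program runs the code above again, the thread that holds REDIS_LOCK no longer keeps it for as long as it needs: the lock expires automatically after 30 seconds, which means the WatchDog is no longer renewing it. Repeated tests gave the same result. This may be a bug in Redisson. I am using redisson 3.13.6, the latest version at the time of writing, and a future release may not have this problem.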

Analyzing the cause

Download the Redisson source, open the RedissonLock class, and find the tryLock method we are using:

    @Override
    public boolean tryLock() {
        return get(tryLockAsync());
    }

Following the calls, tryLock() ends up in tryAcquireOnceAsync() (lock() follows a very similar path), so let's look at the logic of that method:

private RFuture<Boolean> tryAcquireOnceAsync(long waitTime, long leaseTime, TimeUnit unit, long threadId) {
        // check whether a lease time was set (-1 means it was not)
        if (leaseTime != -1) {
            // run the Redis lock script asynchronously
            return tryLockInnerAsync(waitTime, leaseTime, unit, threadId, RedisCommands.EVAL_NULL_BOOLEAN);
        }
        // run the Redis lock script asynchronously and use the async result to decide whether the lock was acquired
        RFuture<Boolean> ttlRemainingFuture = tryLockInnerAsync(waitTime,
                                                    commandExecutor.getConnectionManager().getCfg().getLockWatchdogTimeout(), // the configured watchdog timeout is used as the lock's lease time
                                                    TimeUnit.MILLISECONDS, threadId, RedisCommands.EVAL_NULL_BOOLEAN);
        ttlRemainingFuture.onComplete((ttlRemaining, e) -> {
            if (e != null) {
                return;
            }
   
            // lock acquired
            // if the Redis script succeeded, schedule the watchdog renewal task
            if (ttlRemaining) {
                scheduleExpirationRenewal(threadId);
            }
        });
        return ttlRemainingFuture;
    }

When no lease time is set and the lock is acquired successfully, the scheduleExpirationRenewal(threadId) method is called.

  private void scheduleExpirationRenewal(long threadId) {
        ExpirationEntry entry = new ExpirationEntry();
        ExpirationEntry oldEntry = EXPIRATION_RENEWAL_MAP.putIfAbsent(getEntryName(), entry);
        
        if (oldEntry != null) {
            oldEntry.addThreadId(threadId);
        } else {
            entry.addThreadId(threadId);
            // renewal logic: start the renewal timer
            renewExpiration();
        }
    }

The WatchDog renewal logic

private void renewExpiration() {
        ExpirationEntry ee = EXPIRATION_RENEWAL_MAP.get(getEntryName());
        if (ee == null) {
            return;
        }
        
        Timeout task = commandExecutor.getConnectionManager().newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) throws Exception {
                ExpirationEntry ent = EXPIRATION_RENEWAL_MAP.get(getEntryName());
                if (ent == null) {
                    return;
                }
                Long threadId = ent.getFirstThreadId();
                if (threadId == null) {
                    return;
                }
                
                RFuture<Boolean> future = renewExpirationAsync(threadId);
                future.onComplete((res, e) -> {
                    if (e != null) {
                        // on error the timer is not rescheduled
                        log.error("Can't update lock " + getName() + " expiration", e);
                        return;
                    }
                    
                    if (res) {
                        // reschedule itself
                        renewExpiration();
                    }
                });
            }
        }, internalLockLeaseTime / 3, TimeUnit.MILLISECONDS);

        ee.setTimeout(task);
    }

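The core of renewExpiration() is a timer task: after each run, the next run is delayed by one third of the configured watchdog timeout. With the default watchdog timeout of 30000 ms, that means it runs every ten seconds. Note, however, that this timer task does not keep running forever.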
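
To make the arithmetic concrete, here is a minimal sketch of the delay calculation. The class and method names are made up for illustration and are not part of Redisson; the only thing taken from the source above is the `internalLockLeaseTime / 3` expression.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not a Redisson class: it only mirrors the
// "internalLockLeaseTime / 3" delay used by renewExpiration().
final class WatchdogInterval {

    /** Delay between two renewal attempts for a given watchdog timeout. */
    static long renewalDelayMillis(long lockWatchdogTimeoutMillis) {
        return lockWatchdogTimeoutMillis / 3;
    }

    public static void main(String[] args) {
        // 30000 is the default mentioned in this article; 150000 is the
        // larger value tried later in the "Solutions" part.
        long[] timeouts = {30_000L, 150_000L};
        for (long timeout : timeouts) {
            long seconds = TimeUnit.MILLISECONDS.toSeconds(renewalDelayMillis(timeout));
            System.out.println(timeout + " ms timeout -> renew every " + seconds + " s");
        }
        // prints:
        // 30000 ms timeout -> renew every 10 s
        // 150000 ms timeout -> renew every 50 s
    }
}
```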

       if (e != null) {
          // on error the timer is not rescheduled
           log.error("Can't update lock " + getName() + " expiration", e);
           return;
       }
                    
       if (res) {
            // reschedule itself
           renewExpiration();
       }

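When the previous Redis renewal script fails with an exception, the task is not scheduled again. That is the error we saw at the beginning of this article: ERROR org.redisson.RedissonLock - Can't update lock REDIS_LOCK expiration. The design is reasonable because it avoids wasting resources, so one would expect the timer task to be restarted the next time the program's tryLock() succeeds. In fact it is not, because scheduleExpirationRenewal contains a check: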

        ExpirationEntry entry = new ExpirationEntry();
        ExpirationEntry oldEntry = EXPIRATION_RENEWAL_MAP.putIfAbsent(getEntryName(), entry);
        // if an ExpirationEntry already exists in EXPIRATION_RENEWAL_MAP, only the thread ID is added; the renewal logic is not run again
        if (oldEntry != null) {
            oldEntry.addThreadId(threadId);
        } else {
            entry.addThreadId(threadId);
            // renewal logic: start the renewal timer
            renewExpiration();
        }

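The check hinges on getEntryName(), whose value is used as the key in EXPIRATION_RENEWAL_MAP. If getEntryName() never changes and the old entry stays in the map, renewExpiration() is never called again. This is most likely where the problem lies.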

 public RedissonLock(CommandAsyncExecutor commandExecutor, String name) {
        super(commandExecutor, name);
        this.commandExecutor = commandExecutor;
        this.id = commandExecutor.getConnectionManager().getId();
        this.internalLockLeaseTime = commandExecutor.getConnectionManager().getCfg().getLockWatchdogTimeout();
        this.entryName = id + ":" + name;
        this.pubSub = commandExecutor.getConnectionManager().getSubscribeService().getLockPubSub();
    }

    protected String getEntryName() {
        return entryName;
    }

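As you can see, this.entryName = id + ":" + name;, where id is a UUID generated when the RedissonClient is created and name is the name we gave the lock. We usually register a singleton RedissonClient in the Spring container, so id does not change once Spring Boot has started, and every distributed lock we use has a fixed name. In other words, as long as the lock name stays the same, entryName stays the same too. When Redisson acquires the lock again and finds that entryName already exists, it does not start renewal again.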
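
The stability of that key can be illustrated with a small stand-alone sketch. It does not use Redisson: `EntryNameDemo` and its `clientId` field are hypothetical stand-ins for the client and for the value returned by getConnectionManager().getId().

```java
import java.util.UUID;

// Hypothetical stand-in for a RedissonClient: one random id per client instance.
final class EntryNameDemo {

    // Assumption: plays the role of the UUID generated when the client is created.
    private final String clientId = UUID.randomUUID().toString();

    /** Same composition as "id + ':' + name" in the RedissonLock constructor. */
    String entryName(String lockName) {
        return clientId + ":" + lockName;
    }

    public static void main(String[] args) {
        EntryNameDemo singletonClient = new EntryNameDemo();
        // Two lock objects created from the same client with the same name
        // map to the same key, so they share one renewal entry.
        String first = singletonClient.entryName("REDIS_LOCK");
        String second = singletonClient.entryName("REDIS_LOCK");
        System.out.println(first.equals(second));
    }
}
```

Because the client is a singleton in a typical Spring setup, every getLock("REDIS_LOCK") call in the process resolves to this same key for the lifetime of the application.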

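To summarize: whether you call tryLock() or lock(), Redisson sets up a watchdog to renew a given lock and caches the renewal information. Normally, calling unlock() clears this cache. But once the client loses its connection to Redis and logs "Can't update lock " + getName() + " expiration", the watchdog stops. After the client reconnects, a new tryLock() or lock() call finds the cached entry for this lock and therefore does not run the watchdog renewal logic again.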
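
The sequence in this summary can be replayed with a plain map, without Redis or Redisson. The sketch below is a simplified model under stated assumptions: `StaleEntryDemo`, `renewalFailed` and `isRenewing` are invented names, and a Boolean value stands in for "the timer task is still rescheduling itself".

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of EXPIRATION_RENEWAL_MAP; not Redisson code.
final class StaleEntryDemo {

    // entry name -> whether a renewal timer is still running (assumed model)
    private final Map<String, Boolean> renewalMap = new ConcurrentHashMap<>();

    /** Mirrors scheduleExpirationRenewal(): true only if a new timer was started. */
    boolean scheduleRenewal(String entryName) {
        Boolean old = renewalMap.putIfAbsent(entryName, Boolean.TRUE);
        return old == null;
    }

    /** Mirrors the failure branch: the timer stops, but the entry is left behind. */
    void renewalFailed(String entryName) {
        renewalMap.replace(entryName, Boolean.FALSE);
    }

    boolean isRenewing(String entryName) {
        return Boolean.TRUE.equals(renewalMap.get(entryName));
    }

    public static void main(String[] args) {
        StaleEntryDemo demo = new StaleEntryDemo();
        String key = "client-id:REDIS_LOCK";              // hypothetical entry name
        System.out.println(demo.scheduleRenewal(key));    // first lock: timer started
        demo.renewalFailed(key);                          // network outage
        System.out.println(demo.scheduleRenewal(key));    // lock again after reconnect
        System.out.println(demo.isRenewing(key));         // is anything renewing now?
    }
}
```

In this model the second scheduleRenewal call finds the leftover entry and starts nothing, which corresponds to the lock silently expiring after the watchdog timeout.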

Solutions

1. Increase the watchdog timeout

   @Bean(destroyMethod = "shutdown")
    public RedissonClient redisson(RedissonProperties properties) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        String jsonString = mapper.writeValueAsString(properties);
        Config config = Config.fromJSON(jsonString);
        config.setLockWatchdogTimeout(150000);
        return Redisson.create(config);
    }

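The default watchdog timeout is 30000 ms, and renewal runs every 30000/3 ms. So if the connection is restored within roughly 1 to 10000 ms of the disconnect, the watchdog's next run no longer fails. We can raise the default 30000 ms to 150000 ms, which improves the chances of surviving a disconnect and reconnect. This does not fully solve the problem, though.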

2. Modify the Redisson source

 private void renewExpiration() {
        ExpirationEntry ee = EXPIRATION_RENEWAL_MAP.get(getEntryName());
        if (ee == null) {
            return;
        }
        
        Timeout task = commandExecutor.getConnectionManager().newTimeout(new TimerTask() {
            @Override
            public void run(Timeout timeout) throws Exception {
                ExpirationEntry ent = EXPIRATION_RENEWAL_MAP.get(getEntryName());
                if (ent == null) {
                    return;
                }
                Long threadId = ent.getFirstThreadId();
                if (threadId == null) {
                    return;
                }
                
                RFuture<Boolean> future = renewExpirationAsync(threadId);
                future.onComplete((res, e) -> {
                    if (e != null) {
                        log.error("Can't update lock " + getName() + " expiration", e);
                        EXPIRATION_RENEWAL_MAP.remove(getEntryName()); // added: remove the cached entry on error
                        return;
                    }
                    
                    if (res) {
                        // reschedule itself
                        renewExpiration();
                    }
                });
            }
        }, internalLockLeaseTime / 3, TimeUnit.MILLISECONDS);
        
        ee.setTimeout(task);
    }

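Modify the renewExpiration() method in the RedissonLock class: inside the if (e != null) {} check, add EXPIRATION_RENEWAL_MAP.remove(getEntryName()) to clear the cached entry. After a disconnect and reconnect, the stale cache entry then no longer prevents renewExpiration() from being called.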
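
The effect of that one added line can be checked against a plain-map model. This is only a sketch under assumptions: `RemoveOnFailureDemo` and its methods are invented for illustration, and the map stands in for EXPIRATION_RENEWAL_MAP.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model of the patched behaviour; not Redisson code.
final class RemoveOnFailureDemo {

    // entry name -> marker that a renewal timer exists (assumed model)
    private final Map<String, Boolean> renewalMap = new ConcurrentHashMap<>();

    /** Mirrors scheduleExpirationRenewal(): true only if a new timer was started. */
    boolean scheduleRenewal(String entryName) {
        return renewalMap.putIfAbsent(entryName, Boolean.TRUE) == null;
    }

    /** Mirrors the patched failure branch: the entry is removed along with the timer. */
    void renewalFailed(String entryName) {
        renewalMap.remove(entryName);
    }

    public static void main(String[] args) {
        RemoveOnFailureDemo demo = new RemoveOnFailureDemo();
        String key = "client-id:REDIS_LOCK";              // hypothetical entry name
        System.out.println(demo.scheduleRenewal(key));    // first lock: timer started
        demo.renewalFailed(key);                          // outage: entry cleaned up
        System.out.println(demo.scheduleRenewal(key));    // after reconnect: timer started again
    }
}
```

With the entry removed on failure, putIfAbsent sees an empty slot on the next acquisition, so a fresh renewal timer is started.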

The change above has been submitted to Redisson as a PR, and the latest release, Redisson 3.14.0, no longer has this problem.