Problem Description
A client that uses Lettuce.io to connect to Azure Redis experiences timeout exceptions lasting up to 15 minutes.
Answer
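Azure Redis is a PaaS service, so platform upgrade operations can trigger a failover. If the Redis client is deployed on a Linux server, a failover can leave the client unable to reconnect for up to 15 minutes.
The default TCP settings in some Linux versions can cause Redis server connections to fail for 13 minutes or longer. These defaults can prevent the client application from detecting closed connections, and from restoring them automatically when the connection was not closed gracefully.
Reestablishing the connection can fail if the network connection is interrupted or the Redis server goes offline for unplanned maintenance.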
This is a known issue in the Lettuce community: when the server goes away without sending an RST, Lettuce needs 15+ minutes to recover on its own. https://github.com/lettuce-io/lettuce-core/issues/2082
The approach currently known to work is to lower the Linux tcp_retries2 parameter: https://docs.azure.cn/zh-cn/azure-cache-for-redis/cache-best-practices-connection#tcp-settings-for-linux-hosted-client-applications
The Lettuce community also offers some workarounds of its own: https://github.com/lettuce-io/lettuce-core/issues/2082#issuecomment-1407609439
Appendix: Connection does not re-establish for 15 minutes when running on Linux
Connection stalls lasting for 15 minutes like this are often caused by very optimistic default TCP settings in some Linux distros (confirmed on CentOS so far). When a server stops responding without gracefully closing the connection, the client TCP stack will continue retransmitting packets for 15 minutes before declaring the connection dead and allowing the StackExchange.Redis reconnect logic to kick in.
With Azure Cache for Redis, it's fairly easy to reproduce this by rebooting nodes as mentioned above. In this case, the machine goes down abruptly and the Redis server isn't able to transmit a FIN packet to the client. The client TCP stack continues retransmitting on the same socket hoping the server will come back up. Even when the node has rebooted and come back, it has no record of that connection so it continues ignoring the client. If the client gave up and created a NEW connection, it would be able to resume communication with the server much sooner than 15 minutes.
As you found, there are TCP settings you can change on the client machine to force it to time out the connection sooner and allow for reconnect. In addition to tcp_retries2, you can try tuning the keepalive settings as discussed here: lettuce-io/lettuce-core#1428 (comment). It should be safe to reduce these timeouts to more realistic durations machine-wide unless you have systems that actually depend on the unusually long retransmits.
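The 15-minute figure can be checked with simple arithmetic. The sketch below assumes the Linux retransmission timeout starts at 200 ms, doubles after every unanswered attempt, and is capped at 120 seconds. The real timeout also depends on the measured round-trip time, so treat the result as a lower bound rather than an exact value. The class and method names are made up for illustration.

```java
// Sketch: estimates how long Linux keeps retransmitting before it declares
// a connection dead. Assumes an initial RTO of 200 ms and a 120 s cap;
// the effective timeout on a real host may be longer.
final class RetransmitTimeout {
    static final long RTO_MIN_MS = 200;
    static final long RTO_MAX_MS = 120_000;

    /** Approximate total wait in milliseconds for a given tcp_retries2 value. */
    static long totalMillis(int tcpRetries2) {
        long total = 0;
        long rto = RTO_MIN_MS;
        // The original transmission and each retransmission wait one RTO interval.
        for (int attempt = 0; attempt <= tcpRetries2; attempt++) {
            total += rto;
            rto = Math.min(rto * 2, RTO_MAX_MS); // exponential backoff with a cap
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalMillis(15) / 1000.0); // 924.6 seconds, about 15.4 minutes
        System.out.println(totalMillis(5) / 1000.0);  // 12.6 seconds
    }
}
```

Under these assumptions the default tcp_retries2 of 15 gives roughly 15 minutes, while a value such as 5 brings the wait down to seconds, which is why lowering the setting lets the client reconnect much sooner.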
An additional approach is using the ForceReconnect pattern recommended in the Azure best practices. If you're seeing issues like this, it's perfectly appropriate to trigger reconnect on RedisTimeoutExceptions in addition to RedisConnectionExceptions. Just don't be too aggressive with it because an overloaded server can also result in persistent RedisTimeoutExceptions. Recreating connections in that situation can cause additional server load and a cascade failure.
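The warning about not being too aggressive can be made concrete with a small gate that decides when a reconnect is allowed. This is a minimal sketch, not the Azure sample and not a Lettuce or StackExchange.Redis API. The class name, method names, and both thresholds are hypothetical. It lets a reconnect through only when errors have persisted for a while and the previous reconnect is not too recent.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical reconnect gate; names and thresholds are illustrative only.
final class ReconnectGate {
    private final Duration minInterval;    // minimum gap between two reconnects
    private final Duration errorThreshold; // errors must persist this long first
    private Instant firstError;            // start of the current error streak, or null
    private Instant lastReconnect;         // time of the last reconnect, or null

    ReconnectGate(Duration minInterval, Duration errorThreshold) {
        this.minInterval = minInterval;
        this.errorThreshold = errorThreshold;
    }

    /** Call when a command succeeds: the error streak is over. */
    synchronized void onSuccess() {
        firstError = null;
    }

    /** Call on a timeout or connection error; true means "reconnect now". */
    synchronized boolean onError(Instant now) {
        if (firstError == null) {
            firstError = now;
        }
        boolean persistent =
                Duration.between(firstError, now).compareTo(errorThreshold) >= 0;
        boolean cooledDown = lastReconnect == null
                || Duration.between(lastReconnect, now).compareTo(minInterval) >= 0;
        if (persistent && cooledDown) {
            lastReconnect = now;
            firstError = null; // start a fresh streak after reconnecting
            return true;
        }
        return false;
    }
}
```

The caller would invoke `onError` from its exception handler and recreate the client connection only when it returns true. The right thresholds depend on the application's own load and latency patterns.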
Unfortunately there's not much the StackExchange.Redis library can do about this situation, because the Linux TCP stack is hiding the lost connection. Detecting the stall at the library level would require making assumptions that would almost certainly lead to false positives in some scenarios. Instead, it's better for the client application to implement some detection/reconnection logic based on what it knows about its load and latency patterns.
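One way an application might detect such a stall itself is to track how long requests have been outstanding without any response arriving. The sketch below is an assumption about what that logic could look like. The class name, method names, and threshold are hypothetical, and a suitable threshold depends on the application's normal latency.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical stall detector: flags a connection as stalled when requests
// are pending and nothing has come back for longer than maxSilence.
final class StallDetector {
    private final Duration maxSilence;
    private int pending;         // requests sent but not yet answered
    private Instant silentSince; // start of the current silent period, or null

    StallDetector(Duration maxSilence) {
        this.maxSilence = maxSilence;
    }

    synchronized void onRequestSent(Instant now) {
        if (pending == 0) {
            silentSince = now; // silence is measured from the first pending request
        }
        pending++;
    }

    synchronized void onResponse(Instant now) {
        if (pending > 0) {
            pending--;
        }
        // Any response proves the connection is alive, so restart the clock.
        silentSince = (pending == 0) ? null : now;
    }

    synchronized boolean isStalled(Instant now) {
        return pending > 0
                && Duration.between(silentSince, now).compareTo(maxSilence) > 0;
    }
}
```

A periodic task could poll `isStalled` and hand a positive result to whatever reconnect logic the application already has.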