RTO(Retransmission TimeOut)即重傳超時(shí)時(shí)間
TCP超時(shí)與重傳中一個(gè)最重要的部分是對(duì)一個(gè)給定連接的往返時(shí)間(RTT)的測(cè)量吕漂。由于網(wǎng)絡(luò)流量的變化既绩,這個(gè)時(shí)間會(huì)相應(yīng)地發(fā)生改變邻遏,TCP需要跟蹤這些變化并動(dòng)態(tài)調(diào)整超時(shí)時(shí)間RTO。
RFC2988中是這樣描述RTO的:
“The Transmission Control Protocol(TCP) uses a retransmission timer to ensure data
delivery in the absence of ?any feedback from the remote data receiver.The duration of
this timer is referred to as RTO(retransmission timeout).”
RTT(Round Trip Time)由三部分組成:鏈路的傳播時(shí)間(propagation delay),末端系統(tǒng)的處理時(shí)間萝衩,路由器緩存中的排隊(duì)和處理時(shí)間(queuing delay)劳曹。
其中,前兩個(gè)部分的值對(duì)于一個(gè)TCP連接相對(duì)固定蔬螟,路由器緩存中的排隊(duì)和處理時(shí)間會(huì)隨著整個(gè)網(wǎng)絡(luò)擁塞程度的變化而變化此迅。所以RTT的變化在一定程度上反應(yīng)了網(wǎng)絡(luò)的擁塞程度。
平均偏差
平均偏差(mean deviation),簡(jiǎn)寫為mdev旧巾。
It is the mean of the distance between each value and the mean.It gives us an idea of how spread out from the center the set of values is.
Here's the formula.
通過計(jì)算平均偏差耸序,可以知道一組數(shù)據(jù)的波動(dòng)情況。
在這里鲁猩,平均偏差可以用來衡量RTT的抖動(dòng)情況坎怪。
RTT測(cè)量原理
RTT的測(cè)量可以采用兩種方法
(1)TCP Timestamp選項(xiàng)
TCP時(shí)間戳選項(xiàng)可以用來精確的測(cè)量RTT.
RTT=當(dāng)前時(shí)間- 數(shù)據(jù)包中Timestamp選項(xiàng)的回顯時(shí)間
這個(gè)回顯時(shí)間是該數(shù)據(jù)包發(fā)出去的時(shí)間,知道了數(shù)據(jù)包的接收時(shí)間(當(dāng)前時(shí)間)和發(fā)送時(shí)間(回顯時(shí)間),就可以輕松的得到RTT的一個(gè)測(cè)量值廓握。
(2)重傳隊(duì)列中數(shù)據(jù)包的TCP控制塊
在TCP重傳隊(duì)列中保存著發(fā)送而未被確認(rèn)的數(shù)據(jù)包搅窿,數(shù)據(jù)包skb中的TCP控制塊包含著一個(gè)變量,tcp_skb_cb->when隙券,記錄了該數(shù)據(jù)包的第一次發(fā)送時(shí)間男应。
RTT=當(dāng)前時(shí)間 - when
有人可能會(huì)問:既然不用TCP Timestamp選項(xiàng)就能測(cè)量出RTT,為啥還要多此一舉?
這是因?yàn)榉椒ㄒ槐确椒ǘ墓δ芨訌?qiáng)大娱仔,它們是有區(qū)別的沐飘。
“TCP must use Karn's algorithm for taking RTT samples.That is,RTT samples MUST NOT be made using segments that were retransmitted(and thus for which it is ambiguious whether the reply was for the first instance of the packet or a later instance).The only case when TCP can safely take RTT samples from retransmitted segments is when the TCP timestamp option is employed, since the timestamp option removes the ambiguity regarding which instance of the data segment triggered the acknowledgement.”
對(duì)于重傳的數(shù)據(jù)包的響應(yīng),方法1可以用它來采集一個(gè)新的RTT測(cè)量樣本牲迫,而方法二則不能耐朴。因?yàn)門CP Timestamp選項(xiàng)可以區(qū)分這個(gè)響應(yīng)是原始數(shù)據(jù)包還是重傳數(shù)據(jù)包觸發(fā)的,從而計(jì)算出準(zhǔn)確的RTT值盹憎。
RTT測(cè)量實(shí)現(xiàn)
發(fā)送方每接收到一個(gè)ACK筛峭,都會(huì)調(diào)用tcp_ack()來處理。
tcp_ack()中會(huì)調(diào)用tcp_clean_rtx_queue()來刪除重傳隊(duì)列中已經(jīng)被確認(rèn)的數(shù)據(jù)段陪每。
在tcp_clean_rtx_queue()中:
如果ACK確認(rèn)了重傳的數(shù)據(jù)包影晓,則seq_rtt=-1;
否則镰吵,seq_rtt = now - scb->when;
然后調(diào)用tcp_ack_update_rtt(sk,flag,seq_rtt)來更新RTT和RTO.
[java]
static inline void tcp_ack_update_rtt(struct sock *sk,const in flag,const s32 seq_rtt)
{
? ? ? ?const struct tcp_sock *tp = tcp_sk(sk);
/*Note that peer MAY send zero echo.In this case it is ignored.(rfc1323)*/
/*如果有啟用TCP Timestamp選項(xiàng)俯艰,且接收方的回顯不為0*/
if(tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr)
tcp_ack_saw_tstamp(sk,flag);/*方法一*/
else if(seq_rtt >= 0) ?/*不能是重傳數(shù)據(jù)包的ACK */
tcp_ack_no_tstamp(sk,seq_rtt,flag);/*方法二*/
}
方法一:tcp_ack_saw_tstamp()
[java]
/* Read draft-ietf-tcplw-high-performance before mucking with this code.
*?(Supersedes?RFC1323)
*/
staticvoidtcp_ack_saw_tstamp(struct?sock?*sk,intflag)
{
/*?RTTM?Rule?:?A?TSecr?value?received?in?a?segment?is?used?to?update?the
*?averaged?RTT?measurement?only?if?the?segment?acknowledges?some?new
*?data,?i.e.,?only?if?it?advances?the?left?edge?of?the?send?window.
*
*?See?draft-ietf-tcplw-high-performance-00,?section?3.3.
*?1998/04/10?Andrey?V.?Savochkin?saw@msu.ru
*
*?Changed?:?reset?backoff?as?soon?as?we?see?the?first?valid?sample.
*?If?we?do?not,?we?get?strongly?overestimated?rto.?With?timestamps
*?samples?are?accepted?even?from?very?old?segments:?f.e.,?when
*?rtt?=?1?increases?to?8,?we?retransmit?5?times?and?after?8?seconds
*?delayed?answer?arrives?rto?becomes?120?seconds!?If?at?least?one
*?of?segments?in?window?is?lost...?Volia.
*?——ANK(010210)
*/
struct?tcp_sock?*tp?=?tcp_sk(sk);
/*?RTT采樣值:now?-?rcv_tsecr?*/
tcp_valid_rtt_meas(sk,?tcp_time_stamp?-?tp->rx_opt.rcv_tsecr);
}
方法二:tcp_ack_no_tstamp()
[java]
staticvoidtcp_ack_no_tstamp(struct sock *sk, u32 seq_rtt,intflag)
{
/*?We?don't?have?a?timestsamp.?Can?only?use?packets?that?are?not
*?retransmitted?to?determine?rtt?estimates.?Also,?we?must?not?reset
*?the?backoff?for?rto?until?we?get?a?non-retransmitted?packet.?This
*?allows?us?to?deal?with?a?situation?where?the?network?delay?has
*?increased?suddenly.?I.e.?Karn's?algorithm.?(SIGCOMM?'87,?p5.)
*/
if(flag?&?FLAG_RETRANS_DATA_ACKED)
return;/*?如果ACK確認(rèn)的是重傳的數(shù)據(jù)包捡遍,則不進(jìn)行RTT采樣*/
/*?RTT采樣值:seq_rtt锌订,這個(gè)值是在tcp_clean_rtx_queue()中計(jì)算得到的竹握。*/
tcp_valid_rtt_meas(sk,?seq_rtt);
}
OK,到這邊RTT的測(cè)量已經(jīng)結(jié)束了,接下來就是RTO值得計(jì)算
RTO計(jì)算原理
涉及到的變量
[java]
struct tcp_sock {
...
/*?RTT?measurement?*/
u32?srtt;/*?smoothed?round?trip?time?<<?3?*/
u32?mdev;/*?medium?deviation?*/
u32?mdev_max;/*?maximal?mdev?for?the?last?rtt?period?*/
u32?rttvar/*?smoothed?mdev_max?*/
u32?rtt_seq;/*?sequence?number?to?update?rttvar?*/
...
}
srtt為經(jīng)過平滑后的RTT值辆飘,它代表著當(dāng)前的RTT值啦辐,每收到一個(gè)ACK更新一次。
為了避免浮點(diǎn)運(yùn)算蜈项,它是實(shí)際RTT值的8倍芹关。
mdev為RTT的平均偏差,用來衡量RTT的抖動(dòng)紧卒,每收到一個(gè)ACK更新一次侥衬。
mdev_max為上一個(gè)RTT內(nèi)的最大mdev,代表上個(gè)RTT內(nèi)時(shí)延的波動(dòng)情況跑芳,有效期為一個(gè)RTT.
rttvar為mdev_max的平滑值轴总,可升可降,代表著連接的抖動(dòng)情況博个,在連接斷開前都有效怀樟。
“To compute the current RTO,a TCP sender maintains two state variables,SRTT(smoothed round-trip time) and RTTVAR(round-trip time variation).”
實(shí)際上盆佣,RTO = srtt>>3+rttvar.
rtt表示新的RTT測(cè)量值往堡。
old_srtt表示srtt更新前的srtt>>3,即舊的srtt值共耍。
new_srtt表示srtt更新后的srtt>>3,即新的srtt值虑灰。
old_mdev表示舊的mdev。
new_mdev表示更新后的mdev痹兜。
(1)獲得第一個(gè)RTT測(cè)量值
srtt = rtt<<3;
mdev=rtt<<1;
mdev_max = rttvar = max(mdev,rto_min);
所以穆咐,獲得第一個(gè)RTT測(cè)量值后的RTO = rtt+rttvar,如果mdev=2*rtt>rto_min,
那么RTO = 3*rtt;否則 RTO=rtt+rto_min.
(2)獲得第一個(gè)RTT測(cè)量值
srtt = rtt<<3;
mdev = rtt<<1;
mdev_max=rttvar=max(mdev,rto_min);
所以獲得第一個(gè)RTT測(cè)量值后的RTO=rtt+rttvar,如果mdev = 2*rtt>rto_min,
那么RTO=3*rtt;否則佃蚜,RTO=rtt+rto_min庸娱。
(2)獲得第n個(gè)RTT測(cè)量值(n>=2)
srtt的更新:new_srtt = 7/8 old_srtt+1/8 rtt
mdev的更新:
err=rtt-old_srtt
當(dāng)RTT變小時(shí),即err<0時(shí)
1)如果|err|>1/4 old_mdev谐算,則new_mdev = 31/32 old_mdev + 1/8|err|
此時(shí):old_mdev<new_mdev<3/4 old_mdev + |err|
new_mdev有稍微變大熟尉,但是增大得不多。由于RTT是變小洲脂,所以RTO也要變小斤儿,如果
new_mdev增大很多(比如:new_mdev = 3/4 old_mdev+|err|),就會(huì)導(dǎo)致RTO變大剧包,不符合我們的預(yù)期
“This is similar to one of Eifel findings.Eifel blocks mdev updates when rtt decreases.
This solution is a bit different:we use finer gain for mdev in this case(alpha *beta).
Like Eifel it also prevents growth of rto,but also it limits too fast rto decreases,happening in pure Eifel.”
2)如果|err|<=1/4 old_mdev,則new_mdev=3/4 old_mdev + |err|
此時(shí):new_mdev < old_mdev
new_mdev變小,會(huì)導(dǎo)致RTO變小往果,符合我們的預(yù)期疆液。
當(dāng)RTT變大時(shí),即err>0時(shí)
new_mdev = 3/4 old_mdev + |err|
此時(shí):new_mdev > old_mdev
new_mdev變大陕贮,會(huì)導(dǎo)致RTO變小堕油,這也符合我們的預(yù)期。
mdev_max和rttvar的更新
在每個(gè)RTT開始時(shí)肮之,mdev_max = rto_min
如果在此RTO內(nèi)掉缺,有更大的mdev,則更新mdev_max戈擒。
如果mdev_max ?> rttvar,則rttvar = mdev_max;
否則眶明,本RTT結(jié)束后,rttvar -=(rttvar - mdev_max)>>2筐高。
這樣一來搜囱,就可以通過mdev_max來調(diào)節(jié)rttvar,間接的調(diào)節(jié)RTO。
RTO計(jì)算實(shí)現(xiàn)
不管是方法一還是方法二柑土,最終都調(diào)用tcp_valid_rtt_means()來更新RTT和RTO.
[java]
/* seq_rtt為此次得到的RTT測(cè)量值蜀肘。*/
voidtcp_valid_rtt_meas(struct?sock?*sk,?u32?seq_rtt)
{
tcp_rtt_estimator(sk,?seq_rtt);/*?更新相關(guān)值*/
tcp_set_rto(sk);/*設(shè)置新的RTO*/
inet_csk(sk)->icsk_backoff?=0;/*?清零退避指數(shù)*/
}
RTO = srtt>>8+rttvar。而srtt和rttvar的更新都是在tcp_rtt_estimator()來進(jìn)行冰单。
[java]
/* Called to compute a smoothed rtt estimate. The data fed to this
*?routine?either?comes?from?timestamps,?or?from?segments?that?were
*?known?_not_?to?have?been?retransmitted?[see?Karn/Partridge?Proceedings
*?SIGCOMM?87].?The?algorithm?is?from?the?SIGCOMM?88?piece?by?Van
*?Jacobson.
*?NOTE?:?the?next?three?routines?used?to?be?one?big?routine.
*?To?save?cycles?in?the?RFC?1323?implementation?it?was?better?to?break?it
*?up?into?three?procedures.?——erics
*/
staticvoidtcp_rtt_estimator?(struct?sock?*sk,const__u32?mrtt)
{
struct?tcp_sock?*tp?=?tcp_sk(sk);
longm?=?mrtt;/*此為得到的新的RTT測(cè)量值*/
/*?The?following?amusing?code?comes?from?Jacobson's?article?in
*?SIGCOMM?'88.?Note?that?rtt?and?mdev?are?scaled?versions?of?rtt?and
*?mean?deviation.?This?is?designed?to?be?as?fast?as?possible
*?m?stands?for?"measurement".
*
*?On?a?1990?paper?the?rto?value?is?changed?to?:
*?RTO?=?rtt?+?4?*?mdev
*
*?Funny.?This?algorithm?seems?to?be?very?broken.
*?These?formulae?increase?RTO,?when?it?should?be?decreased,?increase
*?too?slowly,?when?it?should?be?increased?quickly,?decrease?too?quickly
*?etc.?I?guess?in?BSD?RTO?takes?ONE?value,?so?that?it?is?absolutely?does
*?not?matter?how?to?calculate?it.?Seems,?it?was?trap?that?VJ?failed?to
*?avoid.?8)
*/
if(m?==0)
m?=1;/*?RTT的采樣值不能為0?*/
/*?不是得到第一個(gè)RTT采樣*/
if(tp->srtt?!=0)?{
m?-=?(tp->srtt?>>3);/*?m?is?now?error?in?rtt?est?*/
tp->srtt?+=?m;/*?rtt?=?7/8?rtt?+?1/8?new?幌缝,更新srtt*/
if(m?<0)?{/*RTT變小*/
m?=?-m;/*?m?is?now?abs(error)?*/
m?-=?(tp->mdev?>>2);/*?similar?update?on?mdev?*/
/*?This?is?similar?to?one?of?Eifel?findings.
*?Eifel?blocks?mdev?updates?when?rtt?decreases.
*?This?solution?is?a?bit?different?:?we?use?finer?gain
*?mdev?in?this?case?(alpha?*?beta).
*?Like?Eifel?it?also?prevents?growth?of?rto,?but?also?it
*?limits?too?fast?rto?decreases,?happening?in?pure?Eifel.
*/
if(m?>0)/*?|err|?>?1/4?mdev?*/
m?>>=3;
}else{/*?RTT變大?*/
m?-=?(tp->mdev?>>2);/*?similar?update?on?mdev?*/
}
tp->mdev?+=?m;/*?mdev?=?3/4?mdev?+?1/4?new,更新mdev?*/
/*?更新mdev_max和rttvar?*/
if(tp->mdev?>?tp->mdev_max)?{
tp->mdev_max?=?tp->mdev;
if(tp->mdev_max?>?tp->rttvar?)
tp->rttvar?=?tp->mdev_max;
}
/* 過了一個(gè)RTT了诫欠,更新mdev_max和rttvar */
if(after(tp->snd_una, tp->rtt_seq)) {
if(tp->mdev_max < tp->rttvar)/*減小rttvar */
tp->rttvar -= (tp->rttvar - tp->mdev_max) >>2;
tp->rtt_seq = tp->snd_nxt;
tp->mdev_max = tcp_rto_min(sk);/*重置mdev_max */
}
if(after(tp->snd_una, tp->rtt_seq)) {
if(tp->mdev_max < tp->rttvar)/*減小rttvar */
tp->rttvar -= (tp->rttvar - tp->mdev_max) >>2;
tp->rtt_seq = tp->snd_nxt;
tp->mdev_max = tcp_rto_min(sk);/*重置mdev_max */
}
}else{
/* 獲得第一個(gè)RTT采樣*/
/* no previous measure. */
tp->srtt = m <<3;/* take the measured time to be rtt */
tp->mdev = m <<1;/* make sure rto = 3 * rtt */
tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
tp->rtt_seq = tp->snd_nxt;/*設(shè)置更新mdev_max的時(shí)間*/
}
}
rto_min的取值如下:
[java]
/*?最大的RTO為120s涵卵,指數(shù)退避時(shí)不能超過這個(gè)值?*/
#define?TCP_RTO_MAX?((unsigned)?(120*HZ))
/*?最小的RTO為200ms,rttvar不能低于這個(gè)值?*/
#define?TCP_RTO_MIN?((unsigned)?(HZ?/5))
/*?還沒有計(jì)算出RTO值前的RTO初始值荒叼,為1s?*/
#define?TCP_TIMEOUT_INIT?((unsigned)?(1*?HZ))
/*?Compute?the?actual?rto_min?value?*/
staticinline?u32?tcp_rto_min?(struct?sock?*sk)
{
conststruct?dst_entry?*dst?=?__sk_dst_get(sk);
u32?rto_min?=?TCP_RTO_MIN;
/*如果路由緩存中存在RTO_MIN轿偎,則取其為最小RTO*/
if(dst?&&?dst_metric_locked(dst,?RTAX_RTO_MIN))
rto_min?=?dst_metric_rtt(dst,?RTAX_RTO_MIN));
returnrto_min;
}
RTO的設(shè)置
[java]
/* Calculate rto without backoff. This is the second half of Van Jacobson's
*?routine?referred?to?above.
*/
staticinlinevoidtcp_set_rto(struct?sock?*sk)
{
conststruct?tcp_sock?*tp?=?tcp_sk(sk);
inet_csk(sk)->icsk_rto?=?__tcp_set_rto(tp);
tcp_bound_rto(sk);
}
staticinline?u32?__tcp_set_rto(conststruct?tcp_sock?*tp)
{
return(tp->srtt?>>3)?+?tp->rttvar;
}
staticinlinevoidtcp_bound_rto(conststruct?sock?*sk)
{
if(inet_csk(sk)->icsk_rto?>?TCP_RTO_MAX)
inet_csk(sk)->icsk_rto?=?TCP_RTO_MAX;
}
函數(shù)調(diào)用
以上涉及到的函數(shù)調(diào)用關(guān)系如下:
總結(jié)
早期的RTT的測(cè)量是采用粗粒度的定時(shí)器(Coarse grained timer),這會(huì)有比較大的誤差。
現(xiàn)在由于TCP Timestamp選項(xiàng)的使用被廓,能夠更精確的測(cè)量RTT,從而計(jì)算出更加準(zhǔn)確的RTO