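While a TCP connection is being established, the Linux kernel maintains two queues:
1. The half-connection queue (SYN queue).
2. The full-connection queue (accept queue).
When the server receives a SYN, the kernel allocates a connection request block (req), records the connection in the half-connection queue, and sends a SYN+ACK segment. When the client's ACK arrives, the kernel creates a new transmission control block (child), sets req->sk = child, and places req on the full-connection queue inside icsk_accept_queue. This wakes any process blocked in accept, and the connection is established.
Reading the 4.9 kernel source, I found that the kernel does not keep two clearly separate queues: both live in icsk_accept_queue. The half-connection queue is only a counter, so adding or removing an entry increments or decrements icsk_accept_queue.qlen, whereas full-connection entries are linked into the list headed by icsk_accept_queue.rskq_accept_head. The kernel code below shows this, using IPv6 as the example.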
1. Receiving the SYN request
The SYN segment is handled in tcp_conn_request, which is shared by the IPv4 and IPv6 paths:
/** net/ipv4/tcp_input.c*/
int tcp_conn_request(struct request_sock_ops *rsk_ops,
                     const struct tcp_request_sock_ops *af_ops,
                     struct sock *sk, struct sk_buff *skb)
{
    ......
    /* Check whether the half-connection queue is full; the key helper is
     * inet_csk_reqsk_queue_is_full, examined in detail below. */
    if ((net->ipv4.sysctl_tcp_syncookies == 2 ||
         inet_csk_reqsk_queue_is_full(sk)) && !isn) {
        want_cookie = tcp_syn_flood_action(sk, skb, rsk_ops->slab_name);
        if (!want_cookie)
            goto drop;
    }
    /* Check whether the full-connection queue is full; the key helper is
     * sk_acceptq_is_full, examined in detail below. */
    if (sk_acceptq_is_full(sk)) {
        NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS);
        goto drop;
    }
    /* Allocate the connection request block req; the code below fills in its fields. */
    req = inet_reqsk_alloc(rsk_ops, sk, !want_cookie);
    if (!req)
        goto drop;
    tcp_rsk(req)->af_specific = af_ops;
    tcp_clear_options(&tmp_opt);
    tmp_opt.mss_clamp = af_ops->mss_clamp;
    tmp_opt.user_mss = tp->rx_opt.user_mss;
    tcp_parse_options(skb, &tmp_opt, 0, want_cookie ? NULL : &foc);
    ......
    if (!dst) {
        dst = af_ops->route_req(sk, &fl, req, NULL);
        if (!dst)
            goto drop_and_free;
    }
    tcp_ecn_create_request(req, skb, sk, dst);
    tcp_rsk(req)->snt_isn = isn;
    tcp_rsk(req)->txhash = net_tx_rndhash();
    tcp_openreq_init_rwin(req, sk, dst);
    ......
    if (fastopen_sk) {
        af_ops->send_synack(fastopen_sk, dst, &fl, req,
                            &foc, TCP_SYNACK_FASTOPEN);
        /* Add the child socket directly into the accept queue */
        if (!inet_csk_reqsk_queue_add(sk, req, fastopen_sk)) {
            reqsk_fastopen_remove(fastopen_sk, req, false);
            bh_unlock_sock(fastopen_sk);
            sock_put(fastopen_sk);
            reqsk_put(req);
            goto drop;
        }
        sk->sk_data_ready(sk);
        bh_unlock_sock(fastopen_sk);
        sock_put(fastopen_sk);
    } else {
        tcp_rsk(req)->tfo_listener = false;
        if (!want_cookie)
            /* Half-connection queue: qlen is incremented by 1 here. */
            inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
        af_ops->send_synack(sk, dst, &fl, req, &foc,
                            !want_cookie ? TCP_SYNACK_NORMAL :
                                           TCP_SYNACK_COOKIE);
        if (want_cookie) {
            reqsk_free(req);
            return 0;
        }
    }
    reqsk_put(req);
    return 0;
drop_and_release:
    dst_release(dst);
drop_and_free:
    reqsk_free(req);
drop:
    tcp_listendrop(sk);
    return 0;
}
EXPORT_SYMBOL(tcp_conn_request);
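As the code shows, on receiving a new SYN the kernel first checks whether the half-connection queue is full; if it is, the request is dropped unless tcp_syn_flood_action decides to answer with a SYN cookie. It then checks whether the full-connection queue is full and drops the request if so. If there is room, it allocates the connection request block req and, after a series of initialization steps, adds the new connection to the half-connection queue. The three queue helper functions are examined in detail below.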
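The admission logic just described can be condensed into a small userspace model. This is only a sketch: every `toy_` name is hypothetical, the `isn` (TIME_WAIT recycling) condition is ignored, and it assumes a cookie is sent whenever the syncookies setting is non-zero, which is my reading of tcp_syn_flood_action rather than something shown in the excerpt.

```c
#include <stdbool.h>

/* Hypothetical names throughout: a userspace model, not kernel code. */
enum toy_syn_verdict { TOY_DROP, TOY_SEND_COOKIE, TOY_QUEUE_REQ };

struct toy_listener {
    int qlen;             /* half-open requests (icsk_accept_queue.qlen) */
    int ack_backlog;      /* completed, not yet accepted (sk_ack_backlog) */
    int max_ack_backlog;  /* limit set via listen() (sk_max_ack_backlog) */
    int syncookies;       /* 0 = off, 1 = on overflow, 2 = always */
};

static enum toy_syn_verdict toy_on_syn(struct toy_listener *l)
{
    bool want_cookie = false;

    /* Step 1: half-connection queue check (note the >= comparison). */
    if (l->syncookies == 2 || l->qlen >= l->max_ack_backlog) {
        /* Assumption: cookies are used whenever the setting is non-zero. */
        want_cookie = (l->syncookies != 0);
        if (!want_cookie)
            return TOY_DROP;
    }

    /* Step 2: full-connection queue check (note the > comparison). */
    if (l->ack_backlog > l->max_ack_backlog)
        return TOY_DROP;

    if (want_cookie)
        return TOY_SEND_COOKIE;  /* nothing is stored in the queue */

    l->qlen++;                   /* req recorded in the half-connection queue */
    return TOY_QUEUE_REQ;
}
```

With a limit of 2 and cookies disabled, the first two calls queue a request and the third is dropped; enabling cookies turns that third verdict into a cookie reply without touching qlen.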
inet_csk_reqsk_queue_is_full
static inline int inet_csk_reqsk_queue_is_full(const struct sock *sk)
{
    return inet_csk_reqsk_queue_len(sk) >= sk->sk_max_ack_backlog;
}
----------------------------------------------------------------------------
static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
{
    return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
}
----------------------------------------------------------------------------
static inline int reqsk_queue_len(const struct request_sock_queue *queue)
{
    return atomic_read(&queue->qlen);
}
As the code shows, this function checks whether icsk_accept_queue.qlen >= sk->sk_max_ack_backlog. sk_max_ack_backlog is determined by the listen system call; if you are not familiar with listen, see the article "Linux kernel TCP connection establishment: server-side socket state changes and movement between hash tables, explained in detail" on Jianshu (jianshu.com).
sk_acceptq_is_full
static inline bool sk_acceptq_is_full(const struct sock *sk)
{
    return sk->sk_ack_backlog > sk->sk_max_ack_backlog;
}
This function checks whether the current length of the full-connection queue already exceeds the maximum queue length; if it does, the request is dropped.
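The two checks use different comparisons (`>=` for the half-connection queue, `>` for the full-connection queue), which affects how many entries each admits for the same limit. The sketch below counts that directly; the `toy_` functions are hypothetical stand-ins that only copy the comparisons from the two kernel helpers.

```c
#include <stdbool.h>

/* Same comparison as inet_csk_reqsk_queue_is_full(). */
static bool toy_synq_is_full(int len, int max_backlog)
{
    return len >= max_backlog;
}

/* Same comparison as sk_acceptq_is_full(). */
static bool toy_acceptq_is_full(int len, int max_backlog)
{
    return len > max_backlog;
}

/* Count how many entries are admitted before is_full() starts returning true. */
static int toy_capacity(bool (*is_full)(int, int), int max_backlog)
{
    int admitted = 0;

    while (!is_full(admitted, max_backlog))
        admitted++;
    return admitted;
}
```

For a limit of 128, toy_capacity(toy_synq_is_full, 128) yields 128, while toy_capacity(toy_acceptq_is_full, 128) yields 129: with the `>` comparison the full-connection queue admits one entry more than the configured limit.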
inet_csk_reqsk_queue_hash_add
void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
                                   unsigned long timeout)
{
    reqsk_queue_hash_req(req, timeout);
    inet_csk_reqsk_queue_added(sk);
}
EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
----------------------------------------------------------------------------
static inline void inet_csk_reqsk_queue_added(struct sock *sk)
{
    reqsk_queue_added(&inet_csk(sk)->icsk_accept_queue);
}
----------------------------------------------------------------------------
static inline void reqsk_queue_added(struct request_sock_queue *queue)
{
    atomic_inc(&queue->young);
    atomic_inc(&queue->qlen);
}
As the code shows, the main effect of this function is to increment icsk_accept_queue.qlen (together with young) by 1.
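Because the half-connection queue is just a pair of counters, its bookkeeping is easy to mirror in userspace. The following is a minimal sketch using C11 atomics; the `toy_` names are hypothetical, the removal function mirrors reqsk_queue_removed as it appears later in this article, and the description of `young` in the comment is my reading rather than something the excerpt states.

```c
#include <stdatomic.h>

/* Hypothetical userspace stand-in for the two counters in request_sock_queue. */
struct toy_counters {
    atomic_int young;  /* assumed: requests whose SYN+ACK was never retransmitted */
    atomic_int qlen;   /* all half-open requests */
};

/* Mirrors reqsk_queue_added(): a new request bumps both counters. */
static void toy_queue_added(struct toy_counters *q)
{
    atomic_fetch_add(&q->young, 1);
    atomic_fetch_add(&q->qlen, 1);
}

/* Mirrors reqsk_queue_removed(): young drops only if nothing timed out yet. */
static void toy_queue_removed(struct toy_counters *q, int num_timeout)
{
    if (num_timeout == 0)
        atomic_fetch_sub(&q->young, 1);
    atomic_fetch_sub(&q->qlen, 1);
}

/* Mirrors reqsk_queue_len(): the "length" is simply the counter value. */
static int toy_queue_len(struct toy_counters *q)
{
    return atomic_load(&q->qlen);
}
```

No list or array is involved: the length check in step 1 only ever reads this counter, which is why the half-connection queue does not appear as a separate data structure.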
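2. Receiving the ACK
After receiving the ACK segment, the server compares fields in the skb with the corresponding fields of the req stored in the hash table. If they match, it creates the child transmission control block (child); at that point req is removed from the ehash table and child is inserted in its place. For the detailed flow, see the article "Linux kernel TCP connection establishment: server-side socket state changes and movement between hash tables, explained in detail" on Jianshu (jianshu.com).
All subsequent data is transferred through this control block. The main function involved is tcp_check_req.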
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
                           struct request_sock *req,
                           bool fastopen)
{
    struct tcp_options_received tmp_opt;
    struct sock *child;
    const struct tcphdr *th = tcp_hdr(skb);
    __be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
    bool paws_reject = false;
    bool own_req;
    tmp_opt.saw_tstamp = 0;
    if (th->doff > (sizeof(struct tcphdr)>>2)) {
        tcp_parse_options(skb, &tmp_opt, 0, NULL);
        if (tmp_opt.saw_tstamp) {
            tmp_opt.ts_recent = req->ts_recent;
            /* We do not store true stamp, but it is not required,
             * it can be estimated (approximately)
             * from another data.
             */
            tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->num_timeout);
            paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
        }
    }
    /* Check for pure retransmitted SYN. */
    if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
        flg == TCP_FLAG_SYN &&
        !paws_reject) {
        if (!tcp_oow_rate_limited(sock_net(sk), skb,
                                  LINUX_MIB_TCPACKSKIPPEDSYNRECV,
                                  &tcp_rsk(req)->last_oow_ack_time) &&
            !inet_rtx_syn_ack(sk, req)) {
            unsigned long expires = jiffies;
            expires += min(TCP_TIMEOUT_INIT << req->num_timeout,
                           TCP_RTO_MAX);
            if (!fastopen)
                mod_timer_pending(&req->rsk_timer, expires);
            else
                req->rsk_timer.expires = expires;
        }
        return NULL;
    }
    if ((flg & TCP_FLAG_ACK) && !fastopen &&
        (TCP_SKB_CB(skb)->ack_seq !=
         tcp_rsk(req)->snt_isn + 1))
        return sk;
    /* Also, it would be not so bad idea to check rcv_tsecr, which
     * is essentially ACK extension and too early or too late values
     * should cause reset in unsynchronized states.
     */
    /* RFC793: "first check sequence number". */
    if (paws_reject || !tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
                                      tcp_rsk(req)->rcv_nxt, tcp_rsk(req)->rcv_nxt + req->rsk_rcv_wnd)) {
        /* Out of window: send ACK and drop. */
        if (!(flg & TCP_FLAG_RST) &&
            !tcp_oow_rate_limited(sock_net(sk), skb,
                                  LINUX_MIB_TCPACKSKIPPEDSYNRECV,
                                  &tcp_rsk(req)->last_oow_ack_time))
            req->rsk_ops->send_ack(sk, skb, req);
        if (paws_reject)
            __NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
        return NULL;
    }
    /* In sequence, PAWS is OK. */
    if (tmp_opt.saw_tstamp && !after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
        req->ts_recent = tmp_opt.rcv_tsval;
    if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
        /* Truncate SYN, it is out of window starting
           at tcp_rsk(req)->rcv_isn + 1. */
        flg &= ~TCP_FLAG_SYN;
    }
    /* RFC793: "second check the RST bit" and
     * "fourth, check the SYN bit"
     */
    if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) {
        __TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
        goto embryonic_reset;
    }
    /* ACK sequence verified above, just make sure ACK is
     * set. If ACK not set, just silently drop the packet.
     *
     * XXX (TFO) - if we ever allow "data after SYN", the
     * following check needs to be removed.
     */
    if (!(flg & TCP_FLAG_ACK))
        return NULL;
    /* For Fast Open no more processing is needed (sk is the
     * child socket).
     */
    if (fastopen)
        return sk;
    /* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
    if (req->num_timeout < inet_csk(sk)->icsk_accept_queue.rskq_defer_accept &&
        TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
        inet_rsk(req)->acked = 1;
        __NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
        return NULL;
    }
    /* OK, ACK is valid, create big socket and
     * feed this segment to it. It will repeat all
     * the tests. THIS SEGMENT MUST MOVE SOCKET TO
     * ESTABLISHED STATE. If it will be dropped after
     * socket is created, wait for troubles.
     */
    child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL,
                                                     req, &own_req);
    if (!child)
        goto listen_overflow;
    sock_rps_save_rxhash(child, skb);
    tcp_synack_rtt_meas(child, req);
    return inet_csk_complete_hashdance(sk, child, req, own_req);
listen_overflow:
    if (!sysctl_tcp_abort_on_overflow) {
        inet_rsk(req)->acked = 1;
        return NULL;
    }
embryonic_reset:
    if (!(flg & TCP_FLAG_RST)) {
        /* Received a bad SYN pkt - for TFO We try not to reset
         * the local connection unless it's really necessary to
         * avoid becoming vulnerable to outside attack aiming at
         * resetting legit local connections.
         */
        req->rsk_ops->send_reset(sk, skb);
    } else if (fastopen) { /* received a valid RST pkt */
        reqsk_fastopen_remove(sk, req, true);
        tcp_reset(sk);
    }
    if (!fastopen) {
        inet_csk_reqsk_queue_drop(sk, req);
        __NET_INC_STATS(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
    }
    return NULL;
}
EXPORT_SYMBOL(tcp_check_req);
The first part of the function checks whether the ACK segment is valid; an invalid segment is rejected. If it is valid, the child transmission control block is eventually created via child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL, req, &own_req);. If child is created successfully, inet_csk_complete_hashdance then updates the half-connection and full-connection queues.
The operations inside inet_csk_complete_hashdance are analyzed below.
/** net/ipv4/inet_connection_sock.c*/
struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
                                         struct request_sock *req, bool own_req)
{
    if (own_req) {
        inet_csk_reqsk_queue_drop(sk, req);
        reqsk_queue_removed(&inet_csk(sk)->icsk_accept_queue, req);
        if (inet_csk_reqsk_queue_add(sk, req, child))
            return child;
    }
    /* Too bad, another child took ownership of the request, undo. */
    bh_unlock_sock(child);
    sock_put(child);
    return NULL;
}
EXPORT_SYMBOL(inet_csk_complete_hashdance);
If child was created successfully above, own_req is true here, so inet_csk_reqsk_queue_drop runs first. That function checks whether req is still in the ehash table. As noted where child is created, once child exists req has already been removed from the ehash table and child inserted in its place, so req is not found there and inet_csk_reqsk_queue_drop does nothing. Execution then reaches reqsk_queue_removed, shown below:
/** include/net/request_sock.h*/
static inline void reqsk_queue_removed(struct request_sock_queue *queue,
                                       const struct request_sock *req)
{
    if (req->num_timeout == 0)
        atomic_dec(&queue->young);
    atomic_dec(&queue->qlen);
}
As shown, queue->qlen is decremented by 1, which shrinks the half-connection queue by one.
Execution then enters inet_csk_reqsk_queue_add:
/** net/ipv4/inet_connection_sock.c*/
struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
                                      struct request_sock *req,
                                      struct sock *child)
{
    struct request_sock_queue *queue = &inet_csk(sk)->icsk_accept_queue;
    spin_lock(&queue->rskq_lock);
    if (unlikely(sk->sk_state != TCP_LISTEN)) {
        inet_child_forget(sk, req, child);
        child = NULL;
    } else {
        req->sk = child;
        req->dl_next = NULL;
        if (queue->rskq_accept_head == NULL)
            queue->rskq_accept_head = req;
        else
            queue->rskq_accept_tail->dl_next = req;
        queue->rskq_accept_tail = req;
        sk_acceptq_added(sk);
    }
    spin_unlock(&queue->rskq_lock);
    return child;
}
EXPORT_SYMBOL(inet_csk_reqsk_queue_add);
This function operates on the full-connection queue: it appends req to the accept list in icsk_accept_queue, and sk_acceptq_added increments sk->sk_ack_backlog.
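The append logic above is an ordinary singly linked FIFO with head and tail pointers. The sketch below models it in userspace with hypothetical `toy_` types and no locking. The removal function is an assumption: the article does not show the accept() side, so it simply pops from the head, which is what a FIFO consumer would do.

```c
#include <stddef.h>

struct toy_req {
    struct toy_req *dl_next;  /* next completed request */
    int child_id;             /* stands in for req->sk */
};

struct toy_acceptq {
    struct toy_req *head;     /* like rskq_accept_head */
    struct toy_req *tail;     /* like rskq_accept_tail */
    int ack_backlog;          /* like sk->sk_ack_backlog */
};

/* Append at the tail, as inet_csk_reqsk_queue_add() does. */
static void toy_acceptq_add(struct toy_acceptq *q, struct toy_req *req, int child_id)
{
    req->child_id = child_id;
    req->dl_next = NULL;
    if (q->head == NULL)
        q->head = req;
    else
        q->tail->dl_next = req;
    q->tail = req;
    q->ack_backlog++;
}

/* Assumed consumer side: pop the oldest entry, or return NULL if empty. */
static struct toy_req *toy_acceptq_remove(struct toy_acceptq *q)
{
    struct toy_req *req = q->head;

    if (req == NULL)
        return NULL;
    q->head = req->dl_next;
    if (q->head == NULL)
        q->tail = NULL;
    q->ack_backlog--;
    return req;
}
```

Unlike the half-connection queue, this one holds real entries, so its length is tracked separately in ack_backlog and compared against the listen limit by sk_acceptq_is_full.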
Process scheduling: the operating system broadly schedules processes by time slice, and a process is in one of three scheduling states: Running, Ready, or Block. A process waiting for a resource is put in the Block state (for example, a process blocked in accept). Processes whose resources are ready and that can run at any time sit in each CPU's scheduling queue: the one currently holding the CPU's time slice is Running, and those waiting for a time slice are Ready.
Interrupt context: when the kernel handles a hardware interrupt, it preempts whichever Running process is executing (even one inside a system call), performs the necessary memory copies and state updates (such as processing the TCP handshake), and resumes the interrupted process once interrupt handling ends.
Wait Queue: the key data structure the Linux kernel uses to wake processes. An event source usually has one wait queue; processes or subsystems interested in that event source supply a callback and register themselves on its wait queue. When the event occurs, the callbacks registered on the wait queue are invoked. A typical callback sets the interested process's scheduling state to Ready, so that the process runs again when the scheduler next hands out CPU time. This is how a process sleeps until a resource is ready and is then woken.
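The register-then-wake pattern can be illustrated with a deliberately simplified model. This is not the kernel's wait queue API: all `toy_` names are hypothetical, there is no locking or real scheduling, and the queue is a small fixed array.

```c
#include <stddef.h>

enum toy_state { TOY_RUNNING, TOY_READY, TOY_BLOCK };

struct toy_task {
    enum toy_state state;
};

typedef void (*toy_wake_fn)(void *arg);

struct toy_waiter {
    toy_wake_fn fn;   /* callback to run when the event fires */
    void *arg;        /* usually the waiting task */
};

#define TOY_MAX_WAITERS 8

struct toy_wait_queue {
    struct toy_waiter waiters[TOY_MAX_WAITERS];
    size_t count;
};

/* Register interest in the event; returns -1 if the queue is full. */
static int toy_wait_register(struct toy_wait_queue *wq, toy_wake_fn fn, void *arg)
{
    if (wq->count == TOY_MAX_WAITERS)
        return -1;
    wq->waiters[wq->count].fn = fn;
    wq->waiters[wq->count].arg = arg;
    wq->count++;
    return 0;
}

/* The event occurred: invoke every registered callback once. */
static void toy_wake_all(struct toy_wait_queue *wq)
{
    for (size_t i = 0; i < wq->count; i++)
        wq->waiters[i].fn(wq->waiters[i].arg);
    wq->count = 0;  /* waiters are one-shot in this model */
}

/* The typical callback: mark the blocked task as Ready. */
static void toy_make_ready(void *arg)
{
    ((struct toy_task *)arg)->state = TOY_READY;
}
```

A task in the Block state registers toy_make_ready on the queue; calling toy_wake_all when the event arrives flips it to Ready, after which the scheduler would be free to pick it.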
How does the TCP connection setup described above (network I/O) travel from the NIC into the kernel and finally notify the process so it can act? The steps are:
- The NIC receives a SYN and raises an interrupt, which preempts the currently running process. The CPU runs the interrupt handling logic (the NAPI and softirq details are not covered here), eventually stores the SYN's connection information in the half-connection queue of the corresponding listen socket, sends a SYN-ACK to the peer, and then resumes the interrupted process.
- The process finishes its current work and calls the (blocking) accept system call to handle new connections. accept finds no new connection in the connection queue, so it registers a callback that wakes the calling process on the listen socket's wait queue; the kernel then puts the process in the Block state and yields the CPU to other Ready processes.
- The NIC receives the ACK and raises another interrupt. The kernel completes the standard three-way handshake and moves the connection from the half-connection queue to the connection queue. The listen socket now has a readable event, so the kernel invokes the wake-up callback on its wait queue, which sets the previously blocked accept process to the Ready state.
- When the next CPU scheduling window arrives, the Ready accept process is selected to run, finds a new connection in the connection queue, reads the connection information, and finally returns it to the user-space process.
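The steps above imply that the kernel finishes the handshake on its own and the application only collects the result. A small userspace experiment can make that visible; this is a sketch that assumes a Linux host with a working loopback interface, and the function name is hypothetical. The client's connect() returns before the server ever calls accept(), because the connection is already sitting in the full-connection queue.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 0 if connect() succeeded before accept() was called and accept()
 * then returned the already-queued connection; returns -1 on any error. */
static int demo_connect_before_accept(void)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int lfd, cfd, afd = -1, rc = -1;

    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0)
        return -1;
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0) {
        close(lfd);
        return -1;
    }

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  /* let the kernel choose a free port */

    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;
    if (listen(lfd, 8) < 0)  /* 8 is an arbitrary backlog for the demo */
        goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &len) < 0)
        goto out;

    /* The kernel completes the three-way handshake here; the server side
     * has not called accept() yet. */
    if (connect(cfd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        goto out;

    /* The connection is already queued, so this does not have to wait. */
    afd = accept(lfd, NULL, NULL);
    if (afd >= 0)
        rc = 0;
out:
    if (afd >= 0)
        close(afd);
    close(cfd);
    close(lfd);
    return rc;
}
```

Calling demo_connect_before_accept() from a single thread is enough: no second process or thread is needed for the handshake, since the kernel performs it in the listening socket's name.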