Quoted from http://hustcat.github.io/qos-in-roce/
Overview
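The TCP/IP protocol stack cannot meet the demands of modern IDC workloads, for two main reasons: (1) sending and receiving packets in the kernel consumes a large amount of CPU; (2) TCP cannot meet applications' low-latency requirements. On one hand, the kernel protocol stack adds tens of milliseconds of latency; on the other hand, TCP's congestion control algorithm and timeout-based retransmission both add latency.
RDMA implements the transport protocol inside the NIC, so it does not have the first problem; in addition, zero-copy and kernel bypass avoid the kernel-level latency.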
Unlike TCP, RDMA needs a lossless network. For example, a switch must not drop packets because of buffer overflow. To achieve this, RoCE uses PFC (Priority-based Flow Control) for flow control. As soon as the receive queue of a switch port exceeds a certain threshold, the switch sends a PFC pause frame to its peer, telling the sender to stop transmitting. Once the receive queue drops below a second threshold, the switch sends a pause with zero duration, telling the sender to resume transmitting.
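The two-threshold behaviour described above can be sketched as a toy model. This is only an illustration: the class name, the `xoff`/`xon` parameter names and the packet-count units are assumptions made for the sketch, not values taken from any switch.

```python
from dataclasses import dataclass, field


@dataclass
class PfcIngressQueue:
    """Toy model of one ingress queue on a switch port (hypothetical)."""
    xoff: int                 # pause threshold, in packets (assumed unit)
    xon: int                  # resume threshold, must be below xoff
    depth: int = 0
    paused: bool = False
    events: list = field(default_factory=list)

    def enqueue(self, n: int = 1) -> None:
        self.depth += n
        # Crossing the upper threshold triggers a pause frame to the peer.
        if not self.paused and self.depth > self.xoff:
            self.paused = True
            self.events.append("PAUSE")

    def dequeue(self, n: int = 1) -> None:
        self.depth = max(0, self.depth - n)
        # Falling below the lower threshold triggers a zero-duration pause.
        if self.paused and self.depth < self.xon:
            self.paused = False
            self.events.append("RESUME")


q = PfcIngressQueue(xoff=8, xon=4)
for _ in range(9):
    q.enqueue()
for _ in range(6):
    q.dequeue()
print(q.events)  # ['PAUSE', 'RESUME']
```

Using two thresholds rather than one gives hysteresis, so the port does not flip between pause and resume on every packet.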
PFC divides traffic into classes and assigns a different priority to each class. For example, RoCE traffic and other traffic such as TCP/IP can be given different priorities. For details, see Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters.
Network Flow Classification
For IP/Ethernet, there are two ways to classify network traffic:
- By using PCP bits on the VLAN header
- By using DSCP bits on the IP header
For details, see Understanding QoS Configuration for RoCE.
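As a rough illustration of where the two classification fields live, the sketch below extracts the PCP from a 16-bit 802.1Q Tag Control Information (TCI) value and the DSCP from an IPv4 ToS byte. The function names are made up for this example.

```python
def pcp_from_tci(tci: int) -> int:
    """PCP is the top 3 bits of the 16-bit 802.1Q TCI (PCP | DEI | VID)."""
    return (tci >> 13) & 0x7


def dscp_from_tos(tos: int) -> int:
    """DSCP is the upper 6 bits of the IPv4 ToS byte; the low 2 bits are ECN."""
    return (tos >> 2) & 0x3F


# VLAN 100 tagged with PCP 3: PCP(3) << 13 | DEI(0) << 12 | VID(100)
tci = (3 << 13) | 100
print(pcp_from_tci(tci))    # 3
print(dscp_from_tos(0x68))  # 26
```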
Traffic Control Mechanisms
RoCE has two mechanisms for traffic control: Flow Control (PFC) and Congestion Control (DCQCN). The two can work together or separately.
- Flow Control (PFC)
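PFC is a link-layer protocol that can only apply flow control per port (per priority class on that port), not per flow, so its granularity is coarse. Once congestion occurs, all traffic in the paused class on that port stops, including flows that did not cause the congestion. This is unreasonable; see Understanding RoCEv2 Congestion Management. For this reason, RoCE introduces Congestion Control.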
- Congestion Control (DCQCN)
DC-QCN is the congestion control protocol used by RoCE. It is based on Explicit Congestion Notification (ECN), which is described in detail later.
PFC
As described above, there are two ways to classify network traffic, so PFC also has two implementations.
VLAN-based PFC
- VLAN tag
The Priority Code Point (PCP, 3 bits) in the VLAN tag defines 8 priorities.
- VLAN-based PFC
In case of L2 network, PFC uses the priority bits within the VLAN tag (IEEE 802.1p) to differentiate up to eight types of flows that can be subject to flow control (each one independently).
- RoCE with VLAN-based PFC
HowTo Run RoCE and TCP over L2 Enabled with PFC.
## map skb priorities 0-7 to VLAN priority 3
for i in {0..7}; do ip link set dev eth1.100 type vlan egress-qos-map $i:3 ; done
## enable PFC on priority 3
mlnx_qos -i eth1 -f 0,0,0,1,0,0,0,0
For example:
[root@node1 ~]# cat /proc/net/vlan/eth1.100
eth1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: eth1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings:
[root@node1 ~]# for i in {0..7}; do ip link set dev eth1.100 type vlan egress-qos-map $i:3 ; done
[root@node1 ~]# cat /proc/net/vlan/eth1.100
eth1.100 VID: 100 REORDER_HDR: 1 dev->priv_flags: 1001
total frames received 0
total bytes received 0
Broadcast/Multicast Rcvd 0
total frames transmitted 0
total bytes transmitted 0
Device: eth1
INGRESS priority mappings: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0
EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3
See HowTo Set Egress Priority VLAN on Linux.
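To check the result of `egress-qos-map` from a script, the mapping line printed in `/proc/net/vlan/<dev>` can be parsed into a dictionary. The helper below is a hypothetical sketch that only assumes the `skb_prio:vlan_prio` pair format visible in the transcript above.

```python
def parse_priority_map(line: str) -> dict:
    """Parse '... priority mappings: 0:3 1:3 ...' into {skb_prio: vlan_prio}."""
    _, _, pairs = line.partition("mappings:")
    result = {}
    for pair in pairs.split():
        skb_prio, vlan_prio = pair.split(":")
        result[int(skb_prio)] = int(vlan_prio)
    return result


after = "EGRESS priority mappings: 0:3 1:3 2:3 3:3 4:3 5:3 6:3 7:3"
mapping = parse_priority_map(after)
# Every skb priority should now leave the host tagged with VLAN priority 3.
print(all(vlan_prio == 3 for vlan_prio in mapping.values()))  # True
```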
- Problems
VLAN-based PFC has two main problems: (1) the switches have to operate in trunk mode; (2) there is no standard way to carry the VLAN PCP across an L3 network (VLAN is an L2 protocol).
DSCP-based PFC solves both problems by using the DSCP field of the IP header.
DSCP-based PFC
DSCP-based PFC requires both NICs and switches to classify and queue packets based on the DSCP value instead of the VLAN tag.
- DSCP vs TOS
The type of service (ToS) field in the IPv4 header has had various purposes over the years, and has been defined in different ways by five RFCs. The modern redefinition of the ToS field is a six-bit Differentiated Services Code Point (DSCP) field and a two-bit Explicit Congestion Notification (ECN) field. While Differentiated Services is somewhat backwards compatible with ToS, ECN is not.
For details, see:
Some problems with the PFC mechanism
The PFC mechanism used by RDMA can cause a number of problems:
- RDMA transport livelock
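Although PFC prevents packet loss caused by buffer overflow, other causes, such as FCS errors, can still make the network drop packets. With RDMA's go-back-0 algorithm, every packet loss causes all packets of the whole message to be retransmitted, which leads to livelock. TCP has the SACK algorithm, but because the RDMA transport layer is implemented in the NIC and hardware resources are limited, it is hard for a NIC to implement SACK. The go-back-N algorithm can be used to avoid this problem.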
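The difference between the two retransmission strategies can be illustrated with a small deterministic simulation. It is a simplified sketch under stated assumptions: every `drop_period`-th transmission on the wire is lost, loss is detected immediately, and the function name and parameters are invented for the example.

```python
def send_message(n_packets, drop_period, restart_from_zero, max_tx=100_000):
    """Return the number of wire transmissions needed to deliver the
    message, or None if it is still undelivered after max_tx transmissions."""
    next_seq = 0  # next packet the receiver expects
    tx = 0        # total transmissions on the wire
    while next_seq < n_packets:
        if tx >= max_tx:
            return None
        tx += 1
        if tx % drop_period == 0:
            # This transmission is lost. go-back-0 restarts the message,
            # go-back-N resumes from the lost packet.
            if restart_from_zero:
                next_seq = 0
            continue
        next_seq += 1
    return tx


# A 1000-packet message with one loss every 256 transmissions.
print(send_message(1000, 256, restart_from_zero=True))   # None
print(send_message(1000, 256, restart_from_zero=False))  # 1003
```

With go-back-0 the message never completes because no run of 1000 consecutive transmissions is loss-free, which is the livelock described above; with go-back-N only the lost packets cost extra transmissions.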
- PFC Deadlock
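When PFC interacts with Ethernet's broadcast mechanism, a PFC deadlock can occur. Put simply, PFC makes the affected port stop sending, while Ethernet broadcast packets can create new PFC pause dependencies (for example, when the server at the other end of a port goes down), which results in a circular dependency. Broadcast and multicast are very dangerous for lossless traffic, and it is recommended not to place them in lossless classes.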
- NIC PFC pause frame storm
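Because PFC pause propagates from hop to hop, it can easily cause a pause frame storm. For example, if a bug in a NIC causes its receive buffer to fill up, the NIC keeps sending pause frames to the network. A watchdog mechanism is needed on both the NIC side and the switch side to prevent pause storms.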
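The watchdog idea can be sketched as a timer that gives up on lossless behaviour once a port has been held in pause for too long. This is only an assumed design for illustration: the class name, the sampling interface and the 100 ms timeout are hypothetical, and re-enabling lossless mode is left out.

```python
class PauseWatchdog:
    """Hypothetical per-port watchdog against pause frame storms."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.paused_since = None  # start of the current continuous pause
        self.lossless = True

    def observe(self, now_ms: int, pause_active: bool) -> bool:
        """Feed one sample; return whether the port is still lossless."""
        if not pause_active:
            self.paused_since = None
            return self.lossless
        if self.paused_since is None:
            self.paused_since = now_ms
        if now_ms - self.paused_since >= self.timeout_ms:
            # Paused for too long: stop generating/honouring pause frames.
            self.lossless = False
        return self.lossless


wd = PauseWatchdog(timeout_ms=100)
print(wd.observe(0, True))    # True
print(wd.observe(99, True))   # True
print(wd.observe(100, True))  # False
```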
- The Slow-receiver symptom
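Because NIC resources are limited, the NIC keeps most of its data structures, such as the QPC (Queue Pair Context) and WQE (Work Queue Element), in host memory and caches only a subset of these objects. Whenever a cache miss occurs, the NIC's processing speed drops, so the NIC becomes a slow receiver and may itself generate pause frames.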
ECN
ECN with TCP/IP
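ECN is an end-to-end congestion notification mechanism that does not require dropping packets. ECN is an optional feature: the endpoints must enable ECN support, and the underlying network must support it as well.
A traditional TCP/IP network signals congestion by dropping packets, and routers, switches and servers all do so. A router that supports ECN instead sets the ECN field (2 bits) in the IP header when congestion occurs; the receiver then returns a congestion notification (an echo of the congestion indication) to the sender, and the sender lowers its sending rate.
Because the sending rate is controlled by the transport layer (TCP), ECN requires the TCP and IP layers to cooperate.
RFC 3168 defines ECN for TCP/IP.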
ECN with IP
The IP header has a 2-bit ECN field:
- 00 – Non ECN-Capable Transport, Non-ECT
- 10 – ECN Capable Transport, ECT(0)
- 01 – ECN Capable Transport, ECT(1)
- 11 – Congestion Encountered, CE.
If an endpoint supports ECN, it sets the field in its packets to ECT(0) or ECT(1).
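The codepoints listed above can be turned into a small sketch of what a congested ECN-capable router does to the ToS byte. The function names are invented for the example, and the model is simplified: a real router may still drop ECN-capable packets, for instance when its queue is completely full.

```python
# The four ECN codepoints (low 2 bits of the ToS byte).
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11


def ecn_of(tos: int) -> int:
    """Return the 2-bit ECN codepoint of a ToS byte."""
    return tos & 0b11


def mark_on_congestion(tos: int):
    """Return the ToS byte after congestion marking, or None when the
    packet is not ECN-capable and would be dropped instead."""
    if ecn_of(tos) == NOT_ECT:
        return None
    return tos | CE  # set both ECN bits, leave the DSCP bits untouched


print(hex(mark_on_congestion(0x6A)))  # 0x6b (DSCP 26, ECT(0) -> CE)
print(mark_on_congestion(0x68))       # None (DSCP 26, Not-ECT)
```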
ECN with TCP
To support ECN, TCP uses 3 flags in the TCP header: Nonce Sum (NS), ECN-Echo (ECE) and Congestion Window Reduced (CWR).
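For reference, the three flags sit in bytes 12 and 13 of the TCP header: NS is the lowest bit of byte 12, while CWR and ECE are the two highest bits of byte 13. A minimal sketch that reads them from a raw header (the function name is made up for this example):

```python
def tcp_ecn_flags(tcp_header: bytes) -> dict:
    """Extract the NS, CWR and ECE flags from a raw TCP header."""
    b12, b13 = tcp_header[12], tcp_header[13]
    return {
        "NS": bool(b12 & 0x01),
        "CWR": bool(b13 & 0x80),
        "ECE": bool(b13 & 0x40),
    }


# 20-byte header with data offset 5 and flags SYN + ECE + CWR (0xC2).
header = bytes(12) + bytes([0x50, 0xC2]) + bytes(6)
print(tcp_ecn_flags(header))  # {'NS': False, 'CWR': True, 'ECE': True}
```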
ECN in RoCEv2
RoCEv2 introduces an ECN mechanism to implement congestion control, namely RoCEv2 Congestion Management (RCM). With RCM, as soon as the network becomes congested, the sender is notified to lower its sending rate. Similar to TCP, RoCEv2 uses the FECN flag in the transport-layer Base Transport Header (BTH) to indicate congestion.
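RoCEv2 HCAs that implement RCM must follow these rules:
(1) On receiving a packet whose IP.ECN is 11, the HCA generates a RoCEv2 CNP (Congestion Notification Packet) and returns it to the sender;
(2) On receiving a RoCEv2 CNP, the HCA reduces the sending rate of the corresponding QP;
(3) Once a configured amount of time or number of bytes has passed since the last RoCEv2 CNP was received, the HCA may increase the sending rate of the corresponding QP.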
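The three rules can be sketched as two small pieces of logic, one on the receiving HCA and one per sending QP. This is not the DCQCN algorithm: the halving factor, the additive step, the quiet interval and all names are hypothetical values chosen for illustration, and only the time-based part of rule (3) is modelled.

```python
from dataclasses import dataclass

CE = 0b11  # IP.ECN value meaning Congestion Encountered


def should_send_cnp(ip_ecn: int) -> bool:
    """Rule 1 (receiver side): answer CE-marked packets with a CNP."""
    return ip_ecn == CE


@dataclass
class SenderQp:
    rate: float             # current sending rate (assumed unit: Gb/s)
    line_rate: float        # upper bound for the rate
    cut: float = 0.5        # hypothetical multiplicative decrease
    step: float = 1.0       # hypothetical additive increase
    quiet_ms: float = 1.0   # hypothetical CNP-free interval before increase
    last_cnp_ms: float = 0.0

    def on_cnp(self, now_ms: float) -> None:
        """Rule 2: reduce the rate of this QP when a CNP arrives."""
        self.rate *= self.cut
        self.last_cnp_ms = now_ms

    def on_timer(self, now_ms: float) -> None:
        """Rule 3: raise the rate again after a CNP-free interval."""
        if now_ms - self.last_cnp_ms >= self.quiet_ms:
            self.rate = min(self.line_rate, self.rate + self.step)


qp = SenderQp(rate=40.0, line_rate=40.0)
if should_send_cnp(0b11):
    qp.on_cnp(now_ms=10.0)
print(qp.rate)  # 20.0
qp.on_timer(now_ms=10.5)
print(qp.rate)  # 20.0
qp.on_timer(now_ms=11.0)
print(qp.rate)  # 21.0
```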
- Some RCM terminology
Term | Description |
---|---|
RP (Injector) | Reaction Point - the end node that performs rate limitation to prevent congestion |
NP | Notification Point - the end node that receives the packets from the injector and sends back notifications to the injector for indications regarding the congestion situation |
CP | Congestion Point - the switch queue in which congestion happens |
CNP | The RoCEv2 Congestion Notification Packet - The notification message an NP sends to the RP when it receives CE marked packets. |
- ECN example for RoCEv2
- ECN configuration
Refs
Some references on PFC
- RDMA over Commodity Ethernet at Scale
- Network Considerations for Global Pause, PFC and QoS with Mellanox Switches and Adapters
- Understanding QoS Configuration for RoCE
- HowTo Run RoCE and TCP over L2 Enabled with PFC
- Revisiting Network Support for RDMA
- RoCE v2 Considerations