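Overview
RDMA (Remote Direct Memory Access) is a new-generation network communication technology that lets computers transfer data directly from one machine's memory to another's, without the operating system or CPU handling the data. In large-scale distributed training, RDMA reduces the latency that server-side data processing adds to network transfers, which gives high-throughput, low-latency communication and improves training efficiency.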
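Environment Preparation
A cluster has been created, and it contains at least 2 GPU instances with an RDMA network.
The GPU instance image includes the OFED and NVIDIA drivers. We recommend the GPU images provided by Baidu AI Cloud, which already include the OFED driver, so no manual installation is needed.
The cluster has the following cloud-native AI components installed: CCE RDMA Device Plugin, CCE GPU Manager, CCE AI Job Scheduler, and CCE Deep Learning Frameworks Operator.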
Verification
Log in to a GPU node in the cluster that has an RDMA network, and run the following commands to verify the host environment.
$ ofed_info -s # OFED (RoCE) driver version
MLNX_OFED_LINUX-*.*-*.*.*.*:
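The check above can be scripted. The following is a minimal sketch, assuming the `ofed_info -s` output keeps the `MLNX_OFED_LINUX-<version>:` form shown above. The function name `ofed_version` is hypothetical and not part of OFED.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `ofed_info -s` output on stdin and prints the
# version part of a line shaped like "MLNX_OFED_LINUX-<version>:".
# Prints nothing and returns non-zero when no such line is found.
ofed_version() {
  sed -n 's/^MLNX_OFED_LINUX-\(.*\):[[:space:]]*$/\1/p' | grep .
}
```

A possible use on the node is `ofed_info -s | ofed_version || echo "OFED driver not detected" >&2`, which prints the version string, or a warning when the driver is missing.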
Verify the NVIDIA GPU driver
$ nvidia-smi # NVIDIA GPU driver
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:53:00.0 Off |                    0 |
| N/A   29C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:59:00.0 Off |                    0 |
| N/A   32C    P0    61W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:6E:00.0 Off |                    0 |
| N/A   33C    P0    67W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:73:00.0 Off |                    0 |
| N/A   29C    P0    60W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   4  NVIDIA A100-SXM...  On   | 00000000:8D:00.0 Off |                    0 |
| N/A   29C    P0    60W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   5  NVIDIA A100-SXM...  On   | 00000000:92:00.0 Off |                    0 |
| N/A   32C    P0    65W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   6  NVIDIA A100-SXM...  On   | 00000000:C9:00.0 Off |                    0 |
| N/A   33C    P0    64W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+
|   7  NVIDIA A100-SXM...  On   | 00000000:CF:00.0 Off |                    0 |
| N/A   28C    P0    62W / 400W |      0MiB / 81251MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
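To confirm that every GPU on the node is visible to the driver without reading the full table, the devices can be counted from `nvidia-smi -L`, which prints one `GPU <index>: ...` line per device. This is a minimal sketch; the helper name `gpu_count` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `nvidia-smi -L` output on stdin and prints the
# number of lines that start with "GPU <index>:", one per detected GPU.
gpu_count() {
  grep -c '^GPU [0-9][0-9]*:'
}
```

On the node shown above, `nvidia-smi -L | gpu_count` should print 8, matching the eight GPUs listed in the table.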
Query the RDMA NICs
$ show_gids
DEV     PORT    INDEX   GID                                     IPv4            VER     DEV
---     ----    -----   ---                                     ------------    ---     ---
mlx5_0  1       0       fe80:0000:0000:0000:f820:20ff:fe28:c769                 v1      eth0
mlx5_0  1       1       fe80:0000:0000:0000:f820:20ff:fe28:c769                 v2      eth0
mlx5_0  1       2       0000:0000:0000:0000:0000:ffff:0a00:3c03 10.0.60.3       v1      eth0
mlx5_0  1       3       0000:0000:0000:0000:0000:ffff:0a00:3c03 10.0.60.3       v2      eth0
mlx5_1  1       0       fe80:0000:0000:0000:eaeb:d3ff:fecc:c920                 v1      eth1
mlx5_1  1       1       fe80:0000:0000:0000:eaeb:d3ff:fecc:c920                 v2      eth1
mlx5_1  1       2       0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2     v1      eth1
mlx5_1  1       3       0000:0000:0000:0000:0000:ffff:190b:8002 25.11.128.2     v2      eth1
mlx5_2  1       0       fe80:0000:0000:0000:eaeb:d3ff:fecc:c921                 v1      eth2
mlx5_2  1       1       fe80:0000:0000:0000:eaeb:d3ff:fecc:c921                 v2      eth2
mlx5_2  1       2       0000:0000:0000:0000:0000:ffff:190b:8022 25.11.128.34    v1      eth2
mlx5_2  1       3       0000:0000:0000:0000:0000:ffff:190b:8022 25.11.128.34    v2      eth2
mlx5_3  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d2                 v1      eth3
mlx5_3  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d2                 v2      eth3
mlx5_3  1       2       0000:0000:0000:0000:0000:ffff:190b:8042 25.11.128.66    v1      eth3
mlx5_3  1       3       0000:0000:0000:0000:0000:ffff:190b:8042 25.11.128.66    v2      eth3
mlx5_4  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d3                 v1      eth4
mlx5_4  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe6c:51d3                 v2      eth4
mlx5_4  1       2       0000:0000:0000:0000:0000:ffff:190b:8062 25.11.128.98    v1      eth4
mlx5_4  1       3       0000:0000:0000:0000:0000:ffff:190b:8062 25.11.128.98    v2      eth4
mlx5_5  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe33:1366                 v1      eth5
mlx5_5  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe33:1366                 v2      eth5
mlx5_5  1       2       0000:0000:0000:0000:0000:ffff:190b:8082 25.11.128.130   v1      eth5
mlx5_5  1       3       0000:0000:0000:0000:0000:ffff:190b:8082 25.11.128.130   v2      eth5
mlx5_6  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe33:1367                 v1      eth6
mlx5_6  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe33:1367                 v2      eth6
mlx5_6  1       2       0000:0000:0000:0000:0000:ffff:190b:80a2 25.11.128.162   v1      eth6
mlx5_6  1       3       0000:0000:0000:0000:0000:ffff:190b:80a2 25.11.128.162   v2      eth6
mlx5_7  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe6c:68ae                 v1      eth7
mlx5_7  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe6c:68ae                 v2      eth7
mlx5_7  1       2       0000:0000:0000:0000:0000:ffff:190b:80c2 25.11.128.194   v1      eth7
mlx5_7  1       3       0000:0000:0000:0000:0000:ffff:190b:80c2 25.11.128.194   v2      eth7
mlx5_8  1       0       fe80:0000:0000:0000:eaeb:d3ff:fe6c:68af                 v1      eth8
mlx5_8  1       1       fe80:0000:0000:0000:eaeb:d3ff:fe6c:68af                 v2      eth8
mlx5_8  1       2       0000:0000:0000:0000:0000:ffff:190b:80e2 25.11.128.226   v1      eth8
mlx5_8  1       3       0000:0000:0000:0000:0000:ffff:190b:80e2 25.11.128.226   v2      eth8
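Each device in the listing above exposes four GIDs, and only the entries that have an IPv4 address and version `v2` are RoCE v2 GIDs bound to an IP. The following is a minimal sketch for filtering them out, assuming the whitespace-separated seven-column layout shown above (rows without an IPv4 address have only six fields). The helper name `rocev2_gids` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `show_gids` output on stdin and prints
# "<device> <gid index> <ipv4> <netdev>" for every RoCE v2 GID that has an
# IPv4 address. Header, separator, and link-local rows are skipped because
# they either lack the IPv4 field or do not carry "v2" in the VER column.
rocev2_gids() {
  awk 'NF == 7 && $6 == "v2" { print $1, $3, $5, $7 }'
}
```

For the first device in the listing, `show_gids | rocev2_gids` would print `mlx5_0 3 10.0.60.3 eth0`. In this sample every device reports its IPv4 RoCE v2 GID at index 3.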
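Using NCCL
NCCL is NVIDIA's collective communication library. It supports both collective and point-to-point communication and has RDMA communication built in. NCCL also selects an optimal communication path on its own, based on the NIC types and topology in the environment. Most mainstream distributed training frameworks already support NCCL.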
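NCCL normally picks the NICs and path by itself, so no settings are required. When debugging, its documented environment variables can pin the choice explicitly. The following is a minimal sketch; the helper name `set_nccl_env` is hypothetical, and the values in the usage note are assumptions read from the sample `show_gids` output (RoCE v2 GID at index 3, `mlx5_1` to `mlx5_8` on the RDMA network, `eth0` for bootstrap traffic), not values confirmed for any particular cluster.

```shell
#!/usr/bin/env bash
# Hypothetical helper: exports NCCL settings in the current shell.
#   $1  comma-separated RDMA devices for NCCL_IB_HCA
#   $2  GID index for NCCL_IB_GID_INDEX
#   $3  network interface for NCCL_SOCKET_IFNAME (bootstrap/socket traffic)
set_nccl_env() {
  export NCCL_DEBUG=INFO          # log which transport and NICs NCCL selects
  export NCCL_IB_HCA="$1"
  export NCCL_IB_GID_INDEX="$2"
  export NCCL_SOCKET_IFNAME="$3"
}
```

For example, `set_nccl_env mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_6,mlx5_7,mlx5_8 3 eth0` before launching the training process. With `NCCL_DEBUG=INFO`, the NCCL log shows which transport and devices are actually used, which is a way to check that RDMA is in effect.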