0. Before installing
- The list of prerequisites for running NVIDIA Container Toolkit is described below:
GNU/Linux x86_64 with kernel version > 3.10
Docker >= 19.03 (recommended, but some distributions may include older versions of Docker. The minimum supported version is 1.12)
NVIDIA GPU with Architecture >= Kepler (or compute capability 3.0)
NVIDIA Linux drivers >= 418.81.07 (Note that older driver releases or branches are unsupported.)
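The minimum versions above can be checked mechanically. The following is a minimal sketch, assuming GNU `sort -V` is available; the helper names `version_ge` and `check_min` are made up for this illustration. It tests "greater than or equal", which is slightly looser than the strict "> 3.10" stated for the kernel.

```shell
# Sketch only: version_ge and check_min are hypothetical helper names.

# version_ge A B: succeeds (exit status 0) when version A >= version B.
version_ge() {
    # sort -V puts the smaller version first; A >= B when that one is B.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# check_min NAME FOUND REQUIRED: print one status line for a prerequisite.
check_min() {
    if version_ge "$2" "$3"; then
        echo "OK   $1 $2 (>= $3)"
    else
        echo "FAIL $1 $2 (< $3)"
    fi
}

# Example calls (they need the real tools on the host, so they are left
# commented out here):
#   check_min kernel "$(uname -r | cut -d- -f1)" 3.10
#   check_min driver "$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)" 418.81.07
```

The functions only compare strings, so they can be tried with literal version numbers before running them against a real host.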
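Install Docker 19.03 or later
Docker 19.03 and later support GPUs natively through the --gpus option, so the separate nvidia-docker2 wrapper is no longer required. The NVIDIA Container Toolkit (nvidia-container-toolkit) still has to be installed on the host. The installation goes as follows:
Install Docker:
I installed Docker 20.10; the detailed steps are not repeated here.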
[root@localhost home]# yum install docker-ce-20.10.17
Running transaction
Installing : 2:container-selinux-2.119.2-1.911c772.el7_8.noarch 1/4
Installing : containerd.io-1.6.8-3.1.el7.x86_64 2/4
Installing : 3:docker-ce-20.10.17-3.el7.x86_64 3/4
Installing : docker-ce-rootless-extras-20.10.18-3.el7.x86_64 4/4
Verifying : docker-ce-rootless-extras-20.10.18-3.el7.x86_64 1/4
Verifying : 2:container-selinux-2.119.2-1.911c772.el7_8.noarch 2/4
Verifying : containerd.io-1.6.8-3.1.el7.x86_64 3/4
Verifying : 3:docker-ce-20.10.17-3.el7.x86_64 4/4
Installed:
docker-ce.x86_64 3:20.10.17-3.el7
Dependency Installed:
container-selinux.noarch 2:2.119.2-1.911c772.el7_8 containerd.io.x86_64 0:1.6.8-3.1.el7
docker-ce-rootless-extras.x86_64 0:20.10.18-3.el7
Complete!
At this point only Docker is installed; nvidia-docker2 is not.
[root@localhost home]# docker --version
Docker version 20.10.18, build b40c2f6
[root@localhost home]# systemctl start docker
[root@localhost home]# systemctl enable docker
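Since the native GPU option depends on the Docker version, it can help to read the version out of `docker --version` in a script. This is a rough sketch; the function names `docker_version_number` and `supports_gpus_flag` are invented here, and the parsing assumes the "Docker version X.Y.Z, build ..." format shown above.

```shell
# Sketch only: both function names are hypothetical.

# docker_version_number: read `docker --version` output on stdin and
# print just the version number.
docker_version_number() {
    sed -n 's/^Docker version \([0-9][0-9.]*\).*/\1/p'
}

# supports_gpus_flag VERSION: succeeds when VERSION is 19.03 or later.
# Assumes VERSION has at least a major and a minor part (e.g. 20.10.18).
supports_gpus_flag() {
    major=${1%%.*}       # text before the first dot
    rest=${1#*.}         # text after the first dot
    minor=${rest%%.*}    # second component
    minor=${minor#0}     # "03" -> "3"
    [ "$major" -gt 19 ] || { [ "$major" -eq 19 ] && [ "${minor:-0}" -ge 3 ]; }
}

# Example (needs Docker on the host, so it is left commented out):
#   v=$(docker --version | docker_version_number)
#   supports_gpus_flag "$v" && echo "Docker $v has the --gpus option"
```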
1. Installing nvidia-docker on Ubuntu
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/experimental/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-docker2
# After the default runtime has been set, restart the Docker daemon to complete the installation:
sudo systemctl restart docker
# At this point the setup can be tested by running a base CUDA container:
sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
This should produce console output like the following:
Thu Sep 29 12:30:53 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   41C    P0    47W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
If the output matches what nvidia-smi prints when run directly on the host, the installation succeeded.
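Comparing the two outputs by eye works, but the driver version in the header is the part that has to agree. A small sketch of that comparison follows; `driver_version` and `same_driver` are hypothetical names, and the pattern assumes the "Driver Version: X" header shown in the output above.

```shell
# Sketch only: both function names are hypothetical.

# driver_version: read nvidia-smi output on stdin, print the driver version.
driver_version() {
    sed -n 's/.*Driver Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# same_driver HOST_OUTPUT CONTAINER_OUTPUT: succeeds when both outputs
# report the same, non-empty driver version.
same_driver() {
    host=$(printf '%s\n' "$1" | driver_version)
    cont=$(printf '%s\n' "$2" | driver_version)
    [ -n "$host" ] && [ "$host" = "$cont" ]
}

# Example (needs a GPU host with Docker, so it is left commented out):
#   same_driver "$(nvidia-smi)" \
#       "$(sudo docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi)" \
#       && echo "driver versions match"
```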
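If nvidia-docker fails with a permission error on a host where SELinux is enforcing, switching SELinux to permissive mode with setenforce 0 resolves it (the setting lasts only until the next reboot):
[root@localhost]# nvidia-docker version
NVIDIA Docker: 2.11.0
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Permission denied
/usr/bin/nvidia-docker: line 34: /usr/bin/docker: Success
[root@localhost]# setenforce 0
[root@localhost]# nvidia-docker version
NVIDIA Docker: 2.11.0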
2. Installing nvidia-docker on CentOS 7
Docker is already installed (version 20.10).
Install nvidia-container-toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
sudo yum install -y nvidia-container-toolkit
sudo yum install -y nvidia-docker2
sudo systemctl restart docker
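The `distribution` variable in the commands above is just the `ID` and `VERSION_ID` fields of `/etc/os-release` glued together, and it decides which repository file is downloaded. A minimal sketch for inspecting it before touching the yum configuration; the function names are invented for this example.

```shell
# Sketch only: distribution_id and nvidia_docker_repo_url are hypothetical names.

# distribution_id [FILE]: print $ID$VERSION_ID from an os-release style file
# (defaults to /etc/os-release).
distribution_id() {
    # Source the file in a subshell so its variables do not leak out.
    ( . "${1:-/etc/os-release}" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# nvidia_docker_repo_url DISTRIBUTION: print the repo URL used above.
nvidia_docker_repo_url() {
    printf 'https://nvidia.github.io/nvidia-docker/%s/nvidia-docker.repo\n' "$1"
}

# Example: show which repo file would be fetched on this machine.
#   nvidia_docker_repo_url "$(distribution_id)"
```

On a CentOS 7 host this should resolve to the `centos7` repository path.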
Start a container:
[root@localhost ]# docker run --gpus all nvidia/cuda:10.0-base /bin/sh -c "while true; do echo hello world; sleep 1; done"
hello world
hello world
hello world
Verification:
- Check that the --gpus option is available:
[root@localhost]# docker run --help | grep -i gpus
--gpus gpu-request GPU devices to add to the container ('all' to pass all GPUs)
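Since Docker 19.03, running a container that needs GPUs only requires adding --gpus all, which exposes every GPU. To use two GPUs, pass --gpus 2. To pick specific cards, pass --gpus '"device=1,2"' (details follow).
--gpus '"device=1,2"' exposes the host GPUs with indices 1 and 2 to the container. Indices start at 0, so these are the second and third cards.
Each of the following settings makes every GPU on the host available inside the container (NVIDIA_VISIBLE_DEVICES is an environment variable, passed with -e together with --runtime=nvidia):
--gpus all
NVIDIA_VISIBLE_DEVICES=all
--runtime=nvidia
NVIDIA_VISIBLE_DEVICES=2 exposes only the GPU with index 2. The value is a list of GPU indices or UUIDs, not a count; to expose two GPUs, list both, for example NVIDIA_VISIBLE_DEVICES=1,2.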
Examples of selecting GPUs
- Use all GPUs
$ docker run --rm --gpus all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
$ docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
- Use specific GPUs
$ docker run --rm --gpus '"device=1,2"' nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
$ docker run --rm --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=1,2 nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
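The nested quoting in the device form is easy to get wrong: Docker has to receive the double quotes literally, because the comma would otherwise be read as a field separator. One way to avoid typing it by hand is a small helper that builds the value; `gpus_device_arg` is a hypothetical name used only for this sketch.

```shell
# Sketch only: gpus_device_arg is a hypothetical helper name.

# gpus_device_arg INDEX...: print the value for --gpus that selects the
# given GPU indices, including the literal double quotes Docker expects.
gpus_device_arg() {
    list=$(printf '%s,' "$@")            # join with commas: "1,2,"
    printf '"device=%s"\n' "${list%,}"   # drop the trailing comma
}

# Example (needs a GPU host with Docker, so it is left commented out):
#   docker run --rm --gpus "$(gpus_device_arg 1 2)" \
#       nvidia/cuda:11.0.3-base-ubuntu20.04 nvidia-smi
```

The command substitution yields the same string as writing '"device=1,2"' directly.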
That completes the basic environment setup for GPU-accelerated computing with NVIDIA cards under Docker.
- Run an image provided by NVIDIA and execute nvidia-smi to check that the GPU status table is displayed:
[root@localhost]# docker run --rm --gpus all nvidia/cuda:10.0-base nvidia-smi
Thu Sep 29 12:52:00 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.141.03   Driver Version: 470.141.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100 80G...  Off  | 00000000:00:0C.0 Off |                    0 |
| N/A   41C    P0    47W / 300W |      0MiB / 80994MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
3. Entering the container
Taking CentOS 7 as an example, when you run:
[root@localhost ~]# docker run -it --rm --runtime=nvidia --gpus all nvidia/cuda:9.0-base /bin/bash
docker: Error response from daemon: Unknown runtime specified nvidia.
# This error appears because nvidia-docker2 is not installed. Install it, restart Docker, and run the command again.
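Before passing --runtime=nvidia it is possible to check whether the daemon knows that runtime at all. The sketch below scans the text printed by `docker info`; it assumes that output contains a "Runtimes:" line listing the registered runtimes separated by spaces, as Docker 20.10 prints it, and `has_nvidia_runtime` is a name made up for this example.

```shell
# Sketch only: has_nvidia_runtime is a hypothetical helper name, and the
# "Runtimes:" line format is an assumption about `docker info` output.

# has_nvidia_runtime: read `docker info` output on stdin; succeed when the
# "Runtimes:" line lists a runtime called nvidia.
has_nvidia_runtime() {
    grep -E '^ *Runtimes:' | tr ' ' '\n' | grep -x 'nvidia' >/dev/null
}

# Example (needs Docker on the host, so it is left commented out):
#   if docker info 2>/dev/null | has_nvidia_runtime; then
#       echo "nvidia runtime is registered"
#   else
#       echo "nvidia runtime missing: install nvidia-docker2 and restart docker"
#   fi
```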
Enter the container with docker exec and run nvidia-smi again; the output is the same as on the host.
Inside the container, the system turns out to be Ubuntu:
root@c2c7d583633f:/home/Python-3.8.13# cat /etc/issue
Ubuntu 16.04.7 LTS \n \l
4. Verification
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
>>> import torch
>>> torch.cuda.is_available()
True
If the output is True, the environment is working and the GPU can be used.
5. Docker images
Official site: link
# Professional edition
# CentOS
docker pull nvidia/cuda:11.1.1-cudnn8-devel-centos7
docker pull nvidia/cuda:11.1.1-cudnn8-devel-centos8
# Ubuntu
docker pull nvidia/cuda:11.1.1-cudnn8-devel-ubuntu18.04
docker pull nvidia/cuda:11.1.1-cudnn8-devel-ubuntu20.04
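The tags above follow one pattern: CUDA version, cuDNN major version, image flavor, and base OS. A short sketch that assembles such a name from its parts; `cuda_image` is a hypothetical helper, and not every combination it can produce necessarily exists in the registry.

```shell
# Sketch only: cuda_image is a hypothetical helper name. It only builds a
# string; whether the resulting tag exists has to be checked separately.

# cuda_image CUDA_VERSION CUDNN_MAJOR FLAVOR OS: print an nvidia/cuda image
# name in the same pattern as the tags listed above.
cuda_image() {
    printf 'nvidia/cuda:%s-cudnn%s-%s-%s\n' "$1" "$2" "$3" "$4"
}

# Example: pull the two Ubuntu images listed above (needs Docker, so it is
# left commented out).
#   for os in ubuntu18.04 ubuntu20.04; do
#       docker pull "$(cuda_image 11.1.1 8 devel "$os")"
#   done
```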
Reference (official site):