1. Installing the NVIDIA CUDA Driver on the Ubuntu Host
This section follows the NVIDIA Driver Installation Quickstart Guide :: NVIDIA Tesla Documentation.
It describes how to install the NVIDIA driver with the package manager on the Ubuntu 16.04 LTS and Ubuntu 18.04 LTS distributions.
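- At install time the NVIDIA driver depends on the Linux kernel headers and development packages that match the running kernel version. For example, if the Linux kernel is 4.4.0, then linux-headers-4.4.0 must be installed.
  $ sudo apt-get install linux-headers-$(uname -r)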
- Make sure packages from the CUDA repository take priority over those from the Canonical repository
  $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID | sed -e 's/\.//g')
  $ wget https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/cuda-$distribution.pin
  $ sudo mv cuda-$distribution.pin /etc/apt/preferences.d/cuda-repository-pin-600
- Install the GPG public key of the CUDA repository
  $ sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7fa2af80.pub
- Add the CUDA repository
  $ echo "deb http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /" | sudo tee /etc/apt/sources.list.d/cuda.list
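- Update the APT cache and install the driver from the CUDA repository. The --no-install-recommends option can be added to install a slimmed-down driver without any X dependencies, which is especially useful for headless installations on cloud instances.
  $ sudo apt-get update
  $ sudo apt-get -y install cuda-drivers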
- Verify the NVIDIA driver installation
  $ nvidia-smi
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
  | 30%   28C    P8    17W / 250W |      0MiB / 11019MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  |   1  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
  | 30%   25C    P8    12W / 250W |      0MiB / 11019MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |  No running processes found                                                 |
  +-----------------------------------------------------------------------------+
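The repository steps in this section all rely on the same distribution string (for example ubuntu1804), built from /etc/os-release with the dots stripped out of the version. The sketch below wraps that derivation in two small functions so it can be checked against any os-release file before touching APT. The function names cuda_distribution and cuda_repo_url are hypothetical helpers introduced here for illustration; they are not part of any NVIDIA tooling.

```shell
#!/usr/bin/env bash
# Sketch only: cuda_distribution and cuda_repo_url are hypothetical helper
# names, not NVIDIA commands.

# Print "$ID$VERSION_ID" with dots removed (e.g. ubuntu1804) for the given
# os-release file, defaulting to /etc/os-release.
cuda_distribution() {
    local os_release="${1:-/etc/os-release}"
    # Source the file in a subshell so ID and VERSION_ID do not leak
    # into the caller's environment.
    ( . "$os_release" && printf '%s%s\n' "$ID" "$VERSION_ID" | sed -e 's/\.//g' )
}

# Print the repository base URL that the wget and apt-key steps point at.
cuda_repo_url() {
    printf 'https://developer.download.nvidia.com/compute/cuda/repos/%s/x86_64\n' \
        "$(cuda_distribution "$1")"
}
```

Calling cuda_distribution with no argument on the target host should print the same value that the $distribution variable holds in the steps above, which makes it easy to confirm the URL before downloading the pin file.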
2. Installing Docker and the NVIDIA Container Toolkit
This section follows the Installation Guide - NVIDIA Cloud Native Technologies documentation.
- Install Docker
  $ curl -fsSL https://get.docker.com | bash -s docker --mirror Aliyun
- Add the nvidia-docker repository and its GPG public key
  $ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
  $ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
  $ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
- Install nvidia-docker2
  $ sudo apt-get update
  $ sudo apt-get install -y nvidia-docker2
- Change Docker's default runtime to nvidia-container-runtime instead of runc
  $ sudo vim /etc/docker/daemon.json
  {
    "default-runtime": "nvidia",
    "runtimes": {
      "nvidia": {
        "path": "nvidia-container-runtime",
        "runtimeArgs": []
      }
    },
    "registry-mirrors": ["https://hub-mirror.c.163.com"]
  }
- Restart Docker Engine
  $ sudo systemctl restart docker
- Verify nvidia-docker
  $ docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
  +-----------------------------------------------------------------------------+
  | NVIDIA-SMI 455.38       Driver Version: 455.38       CUDA Version: 11.1     |
  |-------------------------------+----------------------+----------------------+
  | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
  | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
  |                               |                      |               MIG M. |
  |===============================+======================+======================|
  |   0  GeForce RTX 208...  Off  | 00000000:3B:00.0 Off |                  N/A |
  | 30%   27C    P8    17W / 250W |      0MiB / 11019MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  |   1  GeForce RTX 208...  Off  | 00000000:AF:00.0 Off |                  N/A |
  | 30%   25C    P8    14W / 250W |      0MiB / 11019MiB |      0%      Default |
  |                               |                      |                  N/A |
  +-------------------------------+----------------------+----------------------+
  +-----------------------------------------------------------------------------+
  | Processes:                                                                  |
  |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
  |        ID   ID                                                   Usage      |
  |=============================================================================|
  |  No running processes found                                                 |
  +-----------------------------------------------------------------------------+
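When several GPU hosts are provisioned from a script, editing daemon.json by hand in vim is awkward. As a rough sketch, the runtime configuration from this section could instead be written from a heredoc and then checked against the running daemon. Both function names below are hypothetical helpers, and write_daemon_json is an assumption about workflow rather than the method used above: it overwrites the target file, so any existing settings would need to be merged in by hand first.

```shell
#!/usr/bin/env bash
# Sketch only: write_daemon_json and default_runtime_is_nvidia are
# hypothetical helper names.

# Write the daemon.json content from this section to the given path
# (default /etc/docker/daemon.json). WARNING: overwrites the file.
write_daemon_json() {
    local target="${1:-/etc/docker/daemon.json}"
    cat > "$target" <<'EOF'
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "registry-mirrors": ["https://hub-mirror.c.163.com"]
}
EOF
}

# Succeed only if the running Docker daemon reports "nvidia" as its
# default runtime; fails if docker is missing or not running.
default_runtime_is_nvidia() {
    [ "$(docker info --format '{{.DefaultRuntime}}' 2>/dev/null)" = "nvidia" ]
}
```

Run as root, a provisioning script might call write_daemon_json, restart Docker, and then use default_runtime_is_nvidia as a gate before running the nvidia-smi container check.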
3. Adding the Host to the KubeSphere Cluster
- Edit config-sample.yaml and add the GPU host to the configuration file
  $ vim config-sample.yaml
- Use kubekey to join the node to the KubeSphere cluster automatically, based on the configuration file
  $ ./kk add nodes -f config-sample.yaml
- Set the node labels and mark the node as a GPU node
  This is done in the graphical console; see KubeSphere - Node Management.
- Install the k8s-device-plugin plugin in the KubeSphere cluster
  See Schedule GPUs | Kubernetes.
  $ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
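Once the device plugin is running, each GPU node should advertise an nvidia.com/gpu resource that pods can request in their limits. The following sketch shows one way to check this: a function that lists the allocatable GPU count per node, and a function that prints a throwaway Pod manifest which runs nvidia-smi. The function names and the default pod name gpu-smoke-test are assumptions made for illustration; the image is the same nvidia/cuda:11.0-base used in the Docker check.

```shell
#!/usr/bin/env bash
# Sketch only: gpu_test_pod and list_node_gpus are hypothetical helper
# names, and "gpu-smoke-test" is an arbitrary pod name.

# Print a Pod manifest that requests GPUs and runs nvidia-smi once.
#   $1: pod name (default gpu-smoke-test)
#   $2: number of GPUs to request (default 1)
gpu_test_pod() {
    local name="${1:-gpu-smoke-test}"
    local gpus="${2:-1}"
    cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: ${name}
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: ${gpus}
EOF
}

# List every node together with the GPU count it reports as allocatable.
list_node_gpus() {
    kubectl get nodes \
        -o 'custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}
```

A typical check would be to run list_node_gpus, then pipe the manifest into the cluster with gpu_test_pod | kubectl apply -f - and read the result with kubectl logs gpu-smoke-test. If the pod stays Pending, the node is probably not yet advertising the resource.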