Preface
In a Kubernetes cluster, deploying a GPU node on top of an existing GPU-passthrough VM requires three extra steps compared with a regular node:
- Install the NVIDIA driver
- Install NVIDIA-Docker2
- Deploy the NVIDIA device plugin
Install the NVIDIA driver
Download the NVIDIA driver
The drivers are free of charge; pick the one that matches your GPU model from the official driver download page (see References).
Disable the nouveau driver
Edit the modprobe blacklist file:
vi /etc/modprobe.d/blacklist.conf
Append the following two lines at the end:
blacklist nouveau
options nouveau modeset=0
Regenerate the kernel initramfs:
sudo update-initramfs -u
Reboot the node VM:
reboot
Verify after the reboot: run the following command; no output means nouveau has been disabled.
lsmod | grep nouveau
Install the driver
In this example, the VM runs Ubuntu 18.04 amd64, the GPU is a Tesla V100, and driver version 440.33.01 is used for the offline install (note that the online command below instead installs the 430 driver packaged in Ubuntu's repository).
Online install:
apt install nvidia-driver-430 nvidia-utils-430 nvidia-settings
Offline install (run the downloaded .run installer; {{ gpu_version }} is a placeholder for the driver version, e.g. 440.33.01):
./NVIDIA-Linux-x86_64-{{ gpu_version }}.run -s
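For example, with the 440.33.01 installer from the download step (a minimal sketch; the -s flag runs the installer silently, without interactive prompts):
chmod +x NVIDIA-Linux-x86_64-440.33.01.run
./NVIDIA-Linux-x86_64-440.33.01.run -s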
Verify the driver installation:
nvidia-smi
With the driver correctly installed, nvidia-smi prints a status table like the one below (representative output; values such as Bus-Id and temperature are illustrative):
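+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-PCIE...  Off  | 00000000:00:07.0 Off |                    0 |
| N/A   34C    P0    26W / 250W |      0MiB / 16160MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+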
This completes the NVIDIA driver installation.
Install NVIDIA-Docker2
Docker 18.06 has no native GPU support, so NVIDIA-Docker2 is required for containers to use NVIDIA GPUs.
Note: Docker and the NVIDIA driver must already be installed before this step, but CUDA is not required on the host.
Online install:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | tee /etc/apt/sources.list.d/nvidia-docker.list
apt-get update
apt-get install -y nvidia-docker2
systemctl restart docker
Offline install:
On a machine with Internet access, run the following commands:
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list
$ sudo apt-get update
Then download the five required packages:
apt download libnvidia-container1
apt download libnvidia-container-tools
apt download nvidia-container-toolkit
apt download nvidia-container-runtime
apt download nvidia-docker2
Copy the downloaded packages to the target node VM, then install them in dependency order:
dpkg -i libnvidia-container1_1.0.7-1_amd64.deb
dpkg -i libnvidia-container-tools_1.0.7-1_amd64.deb
dpkg -i nvidia-container-toolkit_1.0.5-1_amd64.deb
dpkg -i nvidia-container-runtime_3.1.4-1_amd64.deb
dpkg -i nvidia-docker2_2.2.2-1_all.deb
Set the Docker default runtime on the GPU node to nvidia-container-runtime:
vi /etc/docker/daemon.json
Add the following to the configuration file:
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
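If daemon.json already contains other settings (for example registry-mirrors or logging options), merge the default-runtime and runtimes keys into the existing JSON object rather than replacing the whole file.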
Restart Docker:
systemctl restart docker
Verify the installation; the output should list nvidia under Runtimes and show it as the Default Runtime:
docker info
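As a further smoke test, you can run nvidia-smi inside a container; because the default runtime is now nvidia, no extra flags are needed (a minimal sketch; the nvidia/cuda:10.2-base image tag is an assumption, any CUDA base image compatible with the installed driver works):
docker run --rm nvidia/cuda:10.2-base nvidia-smi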
This completes the NVIDIA-Docker2 installation.
Deploy the nvidia-device-plugin DaemonSet
Note: this example uses version 1.0.0-beta6; see the NVIDIA GitHub project for all available versions.
Online install:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta6/nvidia-device-plugin.yml
The official NVIDIA manifest is:
# Copyright (c) 2019, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta6
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins
Note: you can label the GPU nodes and add a nodeSelector or nodeAffinity to nvidia-device-plugin-daemonset.yaml so the plugin only runs on GPU nodes, as sketched below.
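For example (a minimal sketch; the label key/value accelerator=nvidia-tesla-v100 is a hypothetical choice):
kubectl label node {nodeName} accelerator=nvidia-tesla-v100
Then add a matching selector under the DaemonSet's pod template spec:
      nodeSelector:
        accelerator: nvidia-tesla-v100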
After the steps above are complete, verify the GPU node:
kubectl get no {nodeName} -oyaml
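The relevant excerpt of the node detail should resemble the following (assuming a single-GPU node; the count is illustrative):
status:
  allocatable:
    nvidia.com/gpu: "1"
  capacity:
    nvidia.com/gpu: "1"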
The node detail now reports a value for nvidia.com/gpu, which confirms the configuration took effect; the GPU resource is exposed to the Kubernetes cluster as a count of physical cards. Alternatively, GPUs can be exposed by video memory to enable shared GPU scheduling. Pods can then request GPUs through the nvidia.com/gpu resource, as in the sketch below.
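A minimal test pod following the standard pattern from the Kubernetes GPU scheduling docs (the pod name and image are illustrative):
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda-container
    image: nvidia/cuda:10.2-base
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1  # request one whole GPU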
This completes the deployment and verification of a GPU node in the Kubernetes cluster.
References
Installing the NVIDIA driver: https://www.cnblogs.com/youpeng/p/10887346.html
Official NVIDIA driver downloads: https://www.nvidia.cn/Download/index.aspx?lang=cn
Installing NVIDIA-Docker2: https://fanfuhan.github.io/2019/11/22/docker_based_use/
Fixing nvidia-docker2 installation failures on Ubuntu 18: https://blog.csdn.net/wuzhongli/article/details/86539433
Kubernetes documentation (Scheduling GPUs): https://kubernetes.io/zh/docs/tasks/manage-gpus/scheduling-gpus/
NVIDIA device plugin documentation: https://github.com/NVIDIA/k8s-device-plugin