1. Kubernetes GPU support by version
Kubernetes provides experimental support for managing AMD and NVIDIA GPUs across nodes. Support for NVIDIA GPUs was added in v1.6 and has gone through several backwards-incompatible iterations; support for AMD GPUs was added in v1.9 via device plugins.
As of v1.8, the recommended way to consume GPUs is through device plugins. Before v1.10, enabling GPU support via device plugins required setting the DevicePlugins feature gate to true across the whole cluster: --feature-gates="DevicePlugins=true". From v1.10 onward this is no longer necessary.
After that, the corresponding vendor GPU driver must be installed on each node, and the matching device plugin from the GPU vendor (AMD/NVIDIA) must be running.
2. Deploying GPUs in a Kubernetes cluster
Kubernetes cluster version: 1.13.5
Docker version: 18.06.1-ce
OS version: CentOS 7.5
Kernel version: 4.20.13-1.el7.elrepo.x86_64
NVIDIA GPU model: Quadro P4000
2.1 Install the NVIDIA driver
2.1.1 Install gcc
The installer compiles a kernel module, so gcc is required (the kernel-devel/kernel-headers packages matching the running kernel are typically needed as well).
[root@k8s-01 ~]# yum install -y gcc
2.1.2 Download the NVIDIA driver
Download link: NVIDIA DRIVERS Linux x64 (AMD64/EM64T) Display Driver
The version we downloaded here is:
[root@k8s-01 ~]# ls NVIDIA-Linux-x86_64-410.93.run -alh
-rw-r--r-- 1 root root 103M Jul 25 17:22 NVIDIA-Linux-x86_64-410.93.run
2.1.3 Edit /etc/modprobe.d/blacklist.conf to keep the nouveau module from loading
[root@k8s-01 ~]# echo -e "blacklist nouveau\noptions nouveau modeset=0" > /etc/modprobe.d/blacklist.conf
2.1.4 Rebuild the initramfs image
[root@k8s-01 ~]# mv /boot/initramfs-$(uname -r).img /boot/initramfs-$(uname -r).img.bak
[root@k8s-01 ~]# dracut /boot/initramfs-$(uname -r).img $(uname -r)
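A reboot is usually required before the nouveau blacklist takes effect; after rebooting, you can confirm the module is no longer loaded (the command should print nothing):
[root@k8s-01 ~]# lsmod | grep nouveau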
2.1.5 Run the driver installer
[root@k8s-01 ~]# sh NVIDIA-Linux-x86_64-410.93.run -a -q -s
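If the installer finishes successfully, the driver can be verified with nvidia-smi (installed along with the driver); it should print a status table like the one shown in section 2.6:
[root@k8s-01 ~]# nvidia-smi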
2.1.6 Install the toolkit packages
The driver alone is not enough; we also need a few toolkit packages, of which CUDA and cuDNN are the relevant ones. Add the CUDA yum repository first; an install sketch follows the repo file:
[root@k8s-01 ~]# cat /etc/yum.repos.d/cuda.repo
[cuda]
name=cuda
baseurl=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64
enabled=1
gpgcheck=1
gpgkey=http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/7fa2af80.pub
[root@k8s-01 ~]#
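With the repo in place, CUDA can be installed from it. A minimal sketch, assuming the cuda metapackage (which pulls in the latest release from the repo); note that cuDNN is distributed separately through NVIDIA's developer site:
[root@k8s-01 ~]# yum install -y cuda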
2.2 Install nvidia-docker2
nvidia-docker is a GPU-enabled docker wrapper, a layer built on top of docker; it is essentially deprecated by now.
nvidia-docker2 is a runtime, and it integrates much more cleanly with docker.
- Add the nvidia-docker2 yum repository
[root@k8s-01 ~]# distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
[root@k8s-01 ~]# curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
- List the available nvidia-docker2 versions
We need the nvidia-docker2 build that matches docker-18.06.1-ce; other builds are not supported with this docker version.
[root@k8s-01 ~]# yum list nvidia-docker2 --showduplicates
Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
* base: mirrors.aliyun.com
* epel: mirror01.idc.hinet.net
* extras: mirrors.aliyun.com
* updates: mirrors.163.com
Installed Packages
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce @nvidia-docker
Available Packages
nvidia-docker2.noarch 2.0.0-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.0-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.1-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.2-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.12.6 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker1.13.1 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.03.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker17.12.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.03.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.06.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.0.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.1.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2 nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-1.docker18.09.4.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.06.2.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-2.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.06.3.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.5.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.6.ce nvidia-docker
nvidia-docker2.noarch 2.0.3-3.docker18.09.7.ce nvidia-docker
nvidia-docker2.noarch 2.1.0-1 nvidia-docker
nvidia-docker2.noarch 2.1.1-1 nvidia-docker
nvidia-docker2.noarch 2.2.0-1 nvidia-docker
Here, installing version 2.0.3-1.docker18.06.1.ce is sufficient.
- Install nvidia-docker2
[root@k8s-01 ~]# yum install -y nvidia-docker2-2.0.3-1.docker18.06.1.ce
- Set nvidia as the default docker runtime
[root@k8s-01 ~]# cat /etc/docker/daemon.json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
- Restart docker
[root@k8s-01 ~]# systemctl restart docker
- Check docker info
[root@k8s-01 wf-deploy]# docker info
Containers: 63
Running: 0
Paused: 0
Stopped: 63
Images: 51
Server Version: 18.06.1-ce
Storage Driver: overlay2
Backing Filesystem: xfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc nvidia
Default Runtime: nvidia
Init Binary: docker-init
containerd version: 468a545b9edcd5932818eb9de8e72413e616e86e
runc version: 69663f0bd4b60df09991c08812a60108003fa340-dirty (expected: 69663f0bd4b60df09991c08812a60108003fa340)
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 4.20.13-1.el7.elrepo.x86_64
Operating System: CentOS Linux 7 (Core)
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.79GiB
Name: k8s-01
ID: DWPY:P2I4:NWL4:3U3O:UTGC:PLJC:IGTO:7ZXJ:A7CD:SJGT:7WT5:WNGX
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
192.168.50.2
127.0.0.0/8
Live Restore Enabled: false
As shown above, docker's default runtime is now nvidia.
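As a quick smoke test, you can run nvidia-smi inside a CUDA container; since nvidia is now the default runtime, no extra flags are needed (the image tag below is only an example, pick one that matches your driver's CUDA version):
[root@k8s-01 ~]# docker run --rm nvidia/cuda:9.0-base nvidia-smi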
2.3 Install the device plugin
- Apply the plugin's latest yaml manifest
[root@k8s-01 ~]# kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
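The manifest creates a DaemonSet in the kube-system namespace; you can confirm the plugin pods are running on the GPU nodes (the grep pattern assumes the pod names used by that manifest):
[root@k8s-01 ~]# kubectl get pods -n kube-system | grep nvidia-device-plugin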
2.4 Find the nodes that have GPUs
[root@wf-229 ~]# kubectl get node 192.18.1.26 -ojson | jq '.status.allocatable'
{
  "cpu": "48",
  "ephemeral-storage": "258961942919",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "131471388Ki",
  "nvidia.com/gpu": "1",
  "pods": "200"
}
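Alternatively, kubectl describe lists nvidia.com/gpu under the node's Capacity and Allocatable sections:
[root@wf-229 ~]# kubectl describe node 192.18.1.26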
2.5 Create a Pod that requests GPU resources
[root@wf-229 gpu]# cat test.yaml
apiVersion: v1
kind: Pod
metadata:
  labels:
    k8s-app: nginx-pod
  name: nginx-pod
spec:
  containers:
  - image: nginx:latest
    imagePullPolicy: Always
    name: nginx
    ports:
    - containerPort: 80
      name: nginx
      protocol: TCP
    resources:
      limits:
        nvidia.com/gpu: "1"
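Create the Pod from the manifest before exec'ing into it. Note that GPUs can only be requested in limits and cannot be fractional, so this Pod claims the node's single GPU exclusively:
[root@wf-229 gpu]# kubectl apply -f test.yaml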
2.6 Check the GPU resources allocated in the Pod
[root@wf-229 gpu]# kubectl exec -it nginx-pod bash
root@nginx-pod:/# nvidia-smi
Mon Aug 12 11:39:05 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.93 Driver Version: 410.93 CUDA Version: N/A |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro P4000 Off | 00000000:3B:00.0 Off | N/A |
| 46% 34C P8 5W / 105W | 0MiB / 8119MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
3. CLI overview
- nvidia-container-cli
nvidia-container-cli is a command-line utility for configuring Linux containers to use GPU hardware. It supports three subcommands (example invocations follow this list):
1) list: print the NVIDIA driver libraries and their paths
2) info: print all NVIDIA GPU devices
3) configure: enter the given process's namespaces and perform the operations needed so that the specified GPUs and capabilities (i.e. the specified NVIDIA driver libraries) are usable inside the container. configure is the main command we rely on; it maps the NVIDIA driver's .so files and the GPU device nodes into the container via file mounts.
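For example, the first two subcommands can be run directly on a GPU node to inspect the driver and devices (output omitted here):
[root@k8s-01 ~]# nvidia-container-cli info
[root@k8s-01 ~]# nvidia-container-cli list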
- View a node's GPUs
kubectl get node 192.18.1.26 -ojson | jq '.status.allocatable'