In this article I will show how to connect a Jetson Nano developer board to a Kubernetes cluster as a GPU node. I will cover the NVIDIA docker setup required to run containers with GPU access, as well as joining the Jetson to a Kubernetes cluster. After successfully connecting the node to the cluster, I will also show how to run a simple TensorFlow 2 training session on the Jetson Nano using the GPU.
K3s or K8s?
K3s is a lightweight Kubernetes distribution, no larger than 100MB. In my opinion it is the ideal choice for single-board computers because it needs significantly fewer resources. You can check our earlier articles for more K3s tutorials and ecosystem coverage. One open-source tool in the K3s ecosystem worth mentioning is K3sup, developed by Alex Ellis to simplify K3s cluster installation. You can find the tool on GitHub:
https://github.com/alexellis/k3sup
What do we need?
- A K3s cluster (a single correctly configured master node is enough)
- An NVIDIA Jetson Nano board with the developer kit image installed
If you want to learn how to install the developer kit image on the board, see the following documentation:
https://developer.nvidia.com/embedded/learn/get-started-jetson-nano-devkit#write
- K3sup
- 15 minutes of your time
The plan
- Set up NVIDIA docker
- Add the Jetson Nano to the K3s cluster
- Run a simple MNIST example to demonstrate GPU usage inside a Kubernetes pod
Setting up NVIDIA docker
Before we configure Docker to use nvidia-docker as the default runtime, I need to explain why. By default, when a user runs a container on the Jetson Nano, it runs the same way as on any other hardware: you cannot access the GPU from inside the container, at least not without some hacks. If you want to test this yourself, run the following command and you should see similar results:
```
root@jetson:~# echo "python3 -c 'import tensorflow'" | docker run -i icetekio/jetson-nano-tensorflow /bin/bash
2020-05-14 00:10:23.370761: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudart.so.10.2'; dlerror: libcudart.so.10.2: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:23.370859: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2020-05-14 00:10:25.946896: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:25.947219: W tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/cuda-10.2/targets/aarch64-linux/lib:
2020-05-14 00:10:25.947273: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
```
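All of those warnings boil down to `dlopen` failing inside the container. If you want to reproduce the probe that TensorFlow's loader performs, here is a minimal sketch in Python using `ctypes` (the library name is taken from the log above; run it inside a container with and without the NVIDIA runtime to compare):

```python
import ctypes

def can_dlopen(lib: str) -> bool:
    """Try to load a shared library, roughly the way TensorFlow's
    dso_loader does, returning True on success."""
    try:
        ctypes.CDLL(lib)
        return True
    except OSError:
        return False

# Without the nvidia runtime the CUDA libraries are not mounted into
# the container, so this probe fails; with --runtime=nvidia it succeeds.
print(can_dlopen("libcudart.so.10.2"))
```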
If you now run the same command but add the --runtime=nvidia flag to the docker command, you should see something like this:
```
root@jetson:~# echo "python3 -c 'import tensorflow'" | docker run --runtime=nvidia -i icetekio/jetson-nano-tensorflow /bin/bash
2020-05-14 00:12:16.767624: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-05-14 00:12:19.386354: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-05-14 00:12:19.388700: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
```
nvidia-docker is already set up, but it is not enabled by default. To make Docker use the nvidia-docker runtime as the default, add "default-runtime": "nvidia" to the /etc/docker/daemon.json configuration file, as shown below:
```json
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
```
Now you can skip the --runtime=nvidia flag in docker run commands and the GPU will be initialized by default. This way K3s will use Docker with the nvidia-docker runtime, letting pods use the GPU without any special configuration.
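If you prefer to patch the file programmatically rather than hand-edit it, here is a small sketch (assuming the file already contains the runtimes entry shown above; remember to restart the Docker daemon afterwards, e.g. with systemctl restart docker):

```python
import json

# Example content of /etc/docker/daemon.json before our change,
# taken from the snippet above.
daemon_json = """
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
"""

config = json.loads(daemon_json)
config["default-runtime"] = "nvidia"  # make nvidia the default runtime

patched = json.dumps(config, indent=4)
print(patched)  # write this back to /etc/docker/daemon.json
```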
Connecting the Jetson as a K8s node
Joining the Jetson as a Kubernetes node with K3sup takes just one command. However, for the Jetson and the master node to connect successfully, we need to be able to reach both machines over SSH without a password, and either run sudo without a password or connect as the root user.
If you need to generate SSH keys and copy them over, run the following commands:
```
ssh-keygen -t rsa -b 4096 -f ~/.ssh/rpi -P ""
ssh-copy-id -i .ssh/rpi user@host
```
By default, Ubuntu installations require the user to enter a password for sudo, so the easier approach is to use the root account with K3sup. For this to work, copy your ~/.ssh/authorized_keys into the /root/.ssh/ directory.
Before connecting the Jetson, let's take a look at the cluster we want to join:
```
upgrade@ZeroOne:~$ kubectl get node -o wide
NAME      STATUS   ROLES    AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
nexus     Ready    master   32d   v1.17.2+k3s1   192.168.0.12   <none>        Ubuntu 18.04.4 LTS   4.15.0-96-generic   containerd://1.3.3-k3s1
rpi3-32   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.30   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
rpi3-64   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.32   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
```
You may notice that the master node is the nexus host with IP 192.168.0.12, and that it is running containerd. By default, K3s uses containerd as its runtime, but this can be changed. Since we set up nvidia-docker to work with Docker, we need to change it. Don't worry: switching from containerd to Docker only requires passing one extra argument to the k3sup command. So, run the following command to join the Jetson to the cluster:
```
k3sup join --ssh-key ~/.ssh/rpi --server-ip 192.168.0.12 --ip 192.168.0.40 --k3s-extra-args '--docker'
```
The IP 192.168.0.40 is my Jetson Nano. As you can see, we pass the --k3s-extra-args '--docker' flag, which hands the --docker flag to the k3s agent during installation. Thanks to this, the agent uses the Docker daemon with our nvidia-docker setup instead of containerd.
To check that the node connected correctly, run kubectl get node -o wide:
```
upgrade@ZeroOne:~$ kubectl get node -o wide
NAME      STATUS   ROLES    AGE   VERSION        INTERNAL-IP    EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
nexus     Ready    master   32d   v1.17.2+k3s1   192.168.0.12   <none>        Ubuntu 18.04.4 LTS   4.15.0-96-generic   containerd://1.3.3-k3s1
rpi3-32   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.30   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
rpi3-64   Ready    <none>   32d   v1.17.2+k3s1   192.168.0.32   <none>        Ubuntu 18.04.4 LTS   5.3.0-1022-raspi2   containerd://1.3.3-k3s1
jetson    Ready    <none>   11s   v1.17.2+k3s1   192.168.0.40   <none>        Ubuntu 18.04.4 LTS   4.9.140-tegra       docker://19.3.6
```
Simple verification
We can now run a pod using the same docker image and command as before, to check whether we get the same results as when running docker directly on the Jetson Nano at the beginning of this article. To do that, we can apply this pod spec:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  nodeSelector:
    kubernetes.io/hostname: jetson
  containers:
  - image: icetekio/jetson-nano-tensorflow
    name: gpu-test
    command:
    - "/bin/bash"
    - "-c"
    - "echo 'import tensorflow' | python3"
  restartPolicy: Never
```
Wait for the docker image to be pulled, then view the logs by running:
```
upgrade@ZeroOne:~$ kubectl logs gpu-test
2020-05-14 10:01:51.341661: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
2020-05-14 10:01:53.996300: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer.so.7
2020-05-14 10:01:53.998563: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libnvinfer_plugin.so.7
/usr/lib/python3/dist-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  from ._conv import register_converters as _register_converters
```
As you can see, the log messages are similar to those from running Docker directly on the Jetson earlier.
Running MNIST training
We now have a node running with GPU support, so we can try the "Hello world" of machine learning and run a TensorFlow 2 example model using the MNIST dataset.
To run a simple training session that demonstrates GPU usage, apply the manifest below:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: mnist-training
spec:
  nodeSelector:
    kubernetes.io/hostname: jetson
  initContainers:
  - name: git-clone
    image: iceci/utils
    command:
    - "git"
    - "clone"
    - "https://github.com/IceCI/example-mnist-training.git"
    - "/workspace"
    volumeMounts:
    - mountPath: /workspace
      name: workspace
  containers:
  - image: icetekio/jetson-nano-tensorflow
    name: mnist
    command:
    - "python3"
    - "/workspace/mnist.py"
    volumeMounts:
    - mountPath: /workspace
      name: workspace
  restartPolicy: Never
  volumes:
  - name: workspace
    emptyDir: {}
```
As you can see from the logs below, the GPU is being used:
```
...
2020-05-14 11:30:02.846289: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2020-05-14 11:30:02.846434: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.2
....
```
If you are on the node itself, you can check CPU and GPU usage by running the tegrastats command:
```
upgrade@jetson:~$ tegrastats --interval 5000
RAM 2462/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [52%@1479,41%@1479,43%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 9% PLL@23.5C CPU@26C PMIC@100C GPU@24C AO@28.5C thermal@25C POM_5V_IN 3410/3410 POM_5V_GPU 451/451 POM_5V_CPU 1355/1355
RAM 2462/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [53%@1479,42%@1479,45%@1479,35%@1479] EMC_FREQ 0% GR3D_FREQ 9% PLL@23.5C CPU@26C PMIC@100C GPU@24C AO@28.5C thermal@24.75C POM_5V_IN 3410/3410 POM_5V_GPU 451/451 POM_5V_CPU 1353/1354
RAM 2461/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU [52%@1479,38%@1479,43%@1479,33%@1479] EMC_FREQ 0% GR3D_FREQ 10% PLL@24C CPU@26C PMIC@100C GPU@24C AO@29C thermal@25.25C POM_5V_IN 3410/3410 POM_5V_GPU 493/465 POM_5V_CPU 1314/1340
```
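The GR3D_FREQ field is the GPU load. If you want to pull just the GPU and RAM numbers out of that rather dense output, here is a small parsing sketch (the sample line is copied from the output above; field names follow the tegrastats format shown there):

```python
import re

# One line of tegrastats output, copied from the sample above.
line = ("RAM 2462/3964MB (lfb 2x4MB) SWAP 362/1982MB (cached 6MB) CPU "
        "[52%@1479,41%@1479,43%@1479,34%@1479] EMC_FREQ 0% GR3D_FREQ 9% "
        "PLL@23.5C CPU@26C PMIC@100C GPU@24C AO@28.5C thermal@25C "
        "POM_5V_IN 3410/3410 POM_5V_GPU 451/451 POM_5V_CPU 1355/1355")

def parse_tegrastats(line: str) -> dict:
    """Extract RAM usage and GPU load (GR3D_FREQ) from a tegrastats line."""
    ram = re.search(r"RAM (\d+)/(\d+)MB", line)
    gpu = re.search(r"GR3D_FREQ (\d+)%", line)
    return {
        "ram_used_mb": int(ram.group(1)),
        "ram_total_mb": int(ram.group(2)),
        "gpu_load_pct": int(gpu.group(1)),
    }

print(parse_tegrastats(line))
# → {'ram_used_mb': 2462, 'ram_total_mb': 3964, 'gpu_load_pct': 9}
```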
Summary
As you can see, connecting a Jetson Nano to a Kubernetes cluster is a very simple process. In just a few minutes you can use Kubernetes to run machine learning workloads while taking advantage of NVIDIA's pocket-sized GPU. You will be able to run any GPU container designed for the Jetson Nano on Kubernetes, which can simplify your development and testing.
Author: Jakub Czapliński, editor at Icetek
Original article:
https://medium.com/icetek/how-to-connect-jetson-nano-to-kubernetes-using-k3s-and-k3sup-c715cf2bf212