Installing Node Health Monitoring
On a Kubernetes cluster, we usually focus on keeping the cluster itself and the containers running stably, but that stability depends heavily on the stability of the underlying nodes. Node health monitoring reports node information to the apiserver so that pods are not scheduled onto unhealthy nodes; node-problem-detector is the tool built for exactly this job.
Common node problems fall into three categories:
1. Hardware errors
- CPU failure
- Memory failure
- Disk failure
2. Kernel issues
- kernel deadlock
- corrupted file systems
- unresponsive runtime daemons
3. Docker issues
- unresponsive runtime daemons (Docker daemon not responding)
- Docker image errors (Docker filesystem errors)
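To make concrete how node-problem-detector catches a kernel problem such as a deadlock: its kernel monitor tails the kernel log and matches each new line against regex rules from kernel-monitor.json. The sketch below imitates that matching step; the pattern is a simplified stand-in for the stock hung-task rule, and the log line is illustrative, not captured from a real node.

```shell
# Simplified stand-in for node-problem-detector's kernel-monitor rule
# matching: each new kernel log line is tested against a regex, and a
# hit raises the corresponding condition/event (here: KernelDeadlock).
pattern='task .+ blocked for more than [0-9]+ seconds\.'
line='INFO: task nfsd:1234 blocked for more than 120 seconds.'
if printf '%s\n' "$line" | grep -Eq "$pattern"; then
  echo "KernelDeadlock condition would be raised"
fi
```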
Out of the box, Kubernetes cannot sense node health, so pods may still be scheduled onto a broken node. By deploying node-problem-detector as a DaemonSet, node status is reported to the apiserver, making node health visible to upstream management so that pods are no longer scheduled onto problematic nodes.
I hit a pitfall here at first: the official Kubernetes demo is rather minimal and still at v0.1, while the node-problem-detector project's image is at v0.8.1. Neither spells out the RBAC permissions, so after installation the detector was denied access to resources, which you can spot in its logs.
Later I found the npd.yaml file in the Kubernetes addons directory, and it ran perfectly.
File location: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/node-problem-detector
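If you do run into the permission problem with the older manifests, the symptom shows up in the detector's own log as a Kubernetes "forbidden" API error. A quick way to check (the pod name is generated per cluster, so the placeholder below must be replaced with one from the first command's output):

```shell
# List the NPD pods, then grep a pod's logs for RBAC denials.
kubectl -n kube-system get pods -l k8s-app=node-problem-detector
kubectl -n kube-system logs <npd-pod-name> | grep -i forbidden
```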
The YAML file is as follows:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npd-v0.8.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.8.1
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.8.1
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.8.1
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: node-problem-detector
        image: registry.cn-hangzhou.aliyuncs.com/speed_containers/node-problem-detector:v0.8.1
        command:
        - "/bin/sh"
        - "-c"
        - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/systemd-monitor.json --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json --config.system-stats-monitor=/config/system-stats-monitor.json >>/var/log/node-problem-detector.log 2>&1"
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: localtime
        hostPath:
          path: /etc/localtime
          type: "FileOrCreate"
      serviceAccountName: node-problem-detector
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - key: "CriticalAddonsOnly"
        operator: "Exists"
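Assuming the manifest above is saved as npd.yaml, deployment and a quick sanity check look like this (one NPD pod should come up per schedulable node; names in the output will differ per cluster):

```shell
# Create the ServiceAccount, ClusterRoleBinding and DaemonSet.
kubectl apply -f npd.yaml

# Verify the DaemonSet and its pods are running on every node.
kubectl -n kube-system get daemonset npd-v0.8.1
kubectl -n kube-system get pods -l k8s-app=node-problem-detector -o wide
```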
The default image comes from Google's official gcr.io registry, which can be unreachable from some networks, so I pushed a copy to an Aliyun repository; it is public and ready to use.
Some readers may wonder how to see the effect after installation. Reference: https://stackoverflow.com/questions/48134835/how-to-use-k8s-node-problem-detector
It explains that node-problem-detector delivers its findings to the cluster in the form of Events and node Conditions, which we can view with kubectl describe node <node-name> (nodes are cluster-scoped, so no -n flag is needed). The walkthrough below loosely follows that Stack Overflow answer.
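Besides the node conditions shown by describe, the detector's Events can be listed directly; since they are attached to Node objects, a field selector narrows the list down:

```shell
# Show events whose involved object is a Node (includes NPD-generated ones).
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
```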
Before installation:
# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
After installation:
# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
# kubectl rollout status daemonset npd-node-problem-detector  # (wait for it to come up)
# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
DockerDaemon False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 DockerDaemonHealthy Docker daemon is healthy
EBSHealth False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 NoVolumeErrors Volumes are attaching successfully
KernelDeadlock False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
The difference is obvious: several new health conditions (DockerDaemon, EBSHealth, KernelDeadlock, ReadonlyFilesystem) now appear in the node's status.
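If you want the condition list in a script-friendly form, the same data is available as JSON. A small sketch that extracts the condition types without jq; the here-doc variable holds a trimmed, hypothetical sample of `kubectl get node <node-name> -o json` output, and in practice you would pipe the real command instead:

```shell
# Trimmed sample of `kubectl get node <node-name> -o json` (hypothetical
# values); in practice: kubectl get node <node-name> -o json | ...
node_json='{
  "status": {
    "conditions": [
      {"type": "KernelDeadlock", "status": "False"},
      {"type": "ReadonlyFilesystem", "status": "False"},
      {"type": "Ready", "status": "True"}
    ]
  }
}'
# Pull out each condition type name, one per line.
printf '%s\n' "$node_json" | grep -o '"type": "[^"]*"' | cut -d'"' -f4
```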