Installing Node Health Monitoring
On a Kubernetes cluster, we usually focus on keeping the cluster itself and the containers running stably, but that stability depends heavily on the stability of the underlying nodes. Node health monitoring reports node information to the apiserver so that pods are not scheduled onto unhealthy nodes; node-problem-detector is the tool built for exactly this job.
Common node problems fall into three categories:
1. Hardware errors
- CPU failure
- Memory failure
- Disk failure
2. Kernel issues
- kernel deadlock
- corrupted file systems
- unresponsive runtime daemons
3. Docker issues
- unresponsive runtime daemons (Docker daemon not responding)
- Docker image errors (Docker filesystem errors)
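To make concrete how node-problem-detector catches a kernel problem such as a deadlock: its kernel monitor tails the kernel log and matches each new line against regex rules from kernel-monitor.json. The sketch below imitates that matching step; the pattern is a simplified stand-in for the stock hung-task rule, and the log line is illustrative, not captured from a real node.

```shell
# Simplified stand-in for node-problem-detector's kernel-monitor rule
# matching: each new kernel log line is tested against a regex, and a
# hit raises the corresponding condition/event (here: KernelDeadlock).
pattern='task .+ blocked for more than [0-9]+ seconds\.'
line='INFO: task nfsd:1234 blocked for more than 120 seconds.'
if printf '%s\n' "$line" | grep -Eq "$pattern"; then
  echo "KernelDeadlock condition would be raised"
fi
```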
Out of the box, Kubernetes cannot sense node health, so pods may still be scheduled onto a broken node. By deploying node-problem-detector as a DaemonSet, node status is reported to the apiserver, making node health visible to upstream management so that pods are no longer scheduled onto problematic nodes.
I hit a pitfall here at first: the official Kubernetes demo is rather minimal and still at v0.1, while the node-problem-detector project's image is at v0.8.1. Neither spells out the RBAC permissions, so after installation the detector was denied access to resources, which you can spot in its logs.
Later I found the npd.yaml file in the Kubernetes addons directory, and it ran perfectly.
File location: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/node-problem-detector
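If you do run into the permission problem with the older manifests, the symptom shows up in the detector's own log as a Kubernetes "forbidden" API error. A quick way to check (the pod name is generated per cluster, so the placeholder below must be replaced with one from the first command's output):

```shell
# List the NPD pods, then grep a pod's logs for RBAC denials.
kubectl -n kube-system get pods -l k8s-app=node-problem-detector
kubectl -n kube-system logs <npd-pod-name> | grep -i forbidden
```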
The YAML file is as follows:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: node-problem-detector
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: npd-binding
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:node-problem-detector
subjects:
- kind: ServiceAccount
  name: node-problem-detector
  namespace: kube-system
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: npd-v0.8.1
  namespace: kube-system
  labels:
    k8s-app: node-problem-detector
    version: v0.8.1
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: Reconcile
spec:
  selector:
    matchLabels:
      k8s-app: node-problem-detector
      version: v0.8.1
  template:
    metadata:
      labels:
        k8s-app: node-problem-detector
        version: v0.8.1
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
      - name: node-problem-detector
        image: registry.cn-hangzhou.aliyuncs.com/speed_containers/node-problem-detector:v0.8.1
        command:
        - "/bin/sh"
        - "-c"
        - "exec /node-problem-detector --logtostderr --config.system-log-monitor=/config/kernel-monitor.json,/config/docker-monitor.json,/config/systemd-monitor.json --config.custom-plugin-monitor=/config/kernel-monitor-counter.json,/config/systemd-monitor-counter.json --config.system-stats-monitor=/config/system-stats-monitor.json >>/var/log/node-problem-detector.log 2>&1"
        securityContext:
          privileged: true
        resources:
          limits:
            cpu: "200m"
            memory: "100Mi"
          requests:
            cpu: "20m"
            memory: "20Mi"
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        volumeMounts:
        - name: log
          mountPath: /var/log
        - name: localtime
          mountPath: /etc/localtime
          readOnly: true
      volumes:
      - name: log
        hostPath:
          path: /var/log/
      - name: localtime
        hostPath:
          path: /etc/localtime
          type: "FileOrCreate"
      serviceAccountName: node-problem-detector
      tolerations:
      - operator: "Exists"
        effect: "NoExecute"
      - key: "CriticalAddonsOnly"
        operator: "Exists"
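Assuming the manifest above is saved as npd.yaml, deployment and a quick sanity check look like this (one NPD pod should come up per schedulable node; names in the output will differ per cluster):

```shell
# Create the ServiceAccount, ClusterRoleBinding and DaemonSet.
kubectl apply -f npd.yaml

# Verify the DaemonSet and its pods are running on every node.
kubectl -n kube-system get daemonset npd-v0.8.1
kubectl -n kube-system get pods -l k8s-app=node-problem-detector -o wide
```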
The default image comes from Google's official gcr.io registry, which can be unreachable from some networks, so I pushed a copy to an Aliyun repository; it is public and ready to use.
Some readers may wonder how to see the effect after installation. Reference: https://stackoverflow.com/questions/48134835/how-to-use-k8s-node-problem-detector
It explains that node-problem-detector delivers its findings to the cluster in the form of Events and node Conditions, which we can view with kubectl describe node <node-name> (nodes are cluster-scoped, so no -n flag is needed). The walkthrough below loosely follows that Stack Overflow answer.
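Besides the node conditions shown by describe, the detector's Events can be listed directly; since they are attached to Node objects, a field selector narrows the list down:

```shell
# Show events whose involved object is a Node (includes NPD-generated ones).
kubectl get events --all-namespaces --field-selector involvedObject.kind=Node
```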
Before installation:
# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 18:27:39 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
After installation:
# helm upgrade --install npd stable/node-problem-detector -f node-problem-detector.values.yaml
# kubectl rollout status daemonset npd-node-problem-detector  # (wait for it to come up)
# kubectl describe node ip-10-40-22-166.ec2.internal | grep -i condition -A 20 | grep Ready -B 20
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
DockerDaemon False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 DockerDaemonHealthy Docker daemon is healthy
EBSHealth False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 NoVolumeErrors Volumes are attaching successfully
KernelDeadlock False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 KernelHasNoDeadlock kernel has no deadlock
ReadonlyFilesystem False Thu, 20 Jun 2019 22:06:17 -0400 Thu, 20 Jun 2019 22:04:14 -0400 FilesystemIsNotReadOnly Filesystem is not read-only
NetworkUnavailable False Thu, 20 Jun 2019 12:30:05 -0400 Thu, 20 Jun 2019 12:30:05 -0400 WeaveIsUp Weave pod has set this
OutOfDisk False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:29:44 -0400 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Thu, 20 Jun 2019 22:07:10 -0400 Thu, 20 Jun 2019 12:30:14 -0400 KubeletReady kubelet is posting ready status
The difference is obvious: several new health conditions (DockerDaemon, EBSHealth, KernelDeadlock, ReadonlyFilesystem) now appear in the node's status.
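If you want the condition list in a script-friendly form, the same data is available as JSON. A small sketch that extracts the condition types without jq; the here-doc variable holds a trimmed, hypothetical sample of `kubectl get node <node-name> -o json` output, and in practice you would pipe the real command instead:

```shell
# Trimmed sample of `kubectl get node <node-name> -o json` (hypothetical
# values); in practice: kubectl get node <node-name> -o json | ...
node_json='{
  "status": {
    "conditions": [
      {"type": "KernelDeadlock", "status": "False"},
      {"type": "ReadonlyFilesystem", "status": "False"},
      {"type": "Ready", "status": "True"}
    ]
  }
}'
# Pull out each condition type name, one per line.
printf '%s\n' "$node_json" | grep -o '"type": "[^"]*"' | cut -d'"' -f4
```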