[toc]
k8s monitoring solutions
cadvisor+heapster+influxdb+grafana
Drawbacks: it can only monitor container resources, cannot cover business-level metrics, and has poor extensibility.
cadvisor/exporter+prometheus+grafana
Overall flow: data collection --> aggregation --> processing --> storage --> presentation
- Container monitoring: Prometheus collects container metrics via cAdvisor, which is built into the kubelet; Prometheus stores the data and Grafana displays it.
- Node monitoring: node_exporter collects the host's resource metrics; Prometheus stores the data and Grafana displays it.
- Master monitoring: the kube-state-metrics add-on pulls apiserver-related data from k8s; Prometheus stores the data and Grafana displays it.
Kubernetes monitoring metrics
Monitoring Kubernetes itself
- Node resource utilization: CPU, memory, disk, and network connections on each node
- Node count: the ratio of node count to resource utilization and business load, used to evaluate cost and when to expand capacity
- Pod count: once the load reaches a certain level, the node and pod counts help judge which stage the load is at, roughly how many servers are needed, and how much resource each pod consumes, for an overall assessment
- Resource object status: while running, k8s creates many pods, controllers, and jobs, all maintained as k8s resource objects; these objects need to be monitored to obtain their status
Pod monitoring
- Number of pods per project: how many pods are healthy and how many are problematic
- Container resource utilization: the resource usage of each pod and of the containers inside it, evaluated in terms of CPU, network, and memory
- Application metrics: the application's own state, such as concurrency, request/response latency, number of users, number of orders, and so on
Implementation approach

| Metric | Implemented with | Examples |
| --- | --- | --- |
| Pod performance | cAdvisor | container CPU and memory utilization |
| Node performance | node-exporter | node CPU and memory utilization |
| K8s resource objects | kube-state-metrics | pod/deployment/service |
Service discovery
Prometheus discovers its scrape targets from the Kubernetes API and always stays consistent with the cluster state:
targets are picked up dynamically, and the API is consulted in real time to check whether they still exist.
Official documentation
https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config
Roles supported for auto-discovery (a minimal scrape-config sketch follows this list):
- node: automatically discovers the node hosts in the cluster
- pod: automatically discovers running containers and their ports
- service: automatically discovers the ClusterIP and ports of created services
- endpoints: automatically discovers the pod containers backing each service
- ingress: automatically discovers created ingress entry points and rules
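A minimal sketch of such a service-discovery job in prometheus.yml, assuming the in-cluster service account credentials granted by the RBAC manifest used below; the relabeling shown is illustrative rather than the exact configuration shipped in the addon ConfigMap:

scrape_configs:
  - job_name: kubernetes-nodes
    kubernetes_sd_configs:
      - role: node        # other supported roles: pod, service, endpoints, ingress
    relabel_configs:
      - action: labelmap
        regex: __meta_kubernetes_node_label_(.+)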
Monitoring k8s with Prometheus
Deploying Prometheus in k8s
Official deployment docs: https://github.com/kubernetes/kubernetes/tree/master/cluster/addons/prometheus
Create the Prometheus PV/PVC
# Install dependencies
yum -y install nfs-utils rpcbind
# Enable at boot
systemctl enable rpcbind.service
systemctl enable nfs-server.service
systemctl start rpcbind.service      # listens on port 111
systemctl start nfs-server.service   # listens on port 2049
# Create a shared directory /data/pvdata
# mkdir /data/pvdata
# chown nfsnobody:nfsnobody /data/pvdata
# cat /etc/exports
/data/pvdata 172.22.22.0/24(rw,async,all_squash)
# exportfs -rv
exporting 172.22.22.0/24:/data/pvdata
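As an optional sanity check, the export can be verified from a k8s node before creating the PV; showmount ships with nfs-utils:

# run on a k8s node
showmount -e 192.168.1.155
# should list /data/pvdata with the allowed client network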
Download the Prometheus yaml deployment files
mkdir /data/k8s/yaml/kube-system/prometheus
cd /data/k8s/yaml/kube-system/prometheus/
# Download the yaml deployment files from GitHub
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-rbac.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-configmap.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/prometheus-statefulset.yaml
Modify prometheus-statefulset.yaml
# Delete the 10 lines at the bottom of the file:
  volumeClaimTemplates:
    - metadata:
        name: prometheus-data
      spec:
        storageClassName: standard
        accessModes:
          - ReadWriteOnce
        resources:
          requests:
            storage: "16Gi"
# Add these 3 lines instead, under the volumes: section of the pod template:
      - name: prometheus-data
        persistentVolumeClaim:
          claimName: prometheus-data
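After the edit, the volumes section of the pod template should look roughly like this (the config-volume entry comes from the upstream manifest; only the prometheus-data entry is new):

      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-data
          persistentVolumeClaim:
            claimName: prometheus-data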
Create the PV/PVC yaml file
mkdir /data/pvdata/prometheus
chown nfsnobody. /data/pvdata/prometheus
cat > prometheus-pvc-data.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-data
spec:
  storageClassName: prometheus-data
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-data
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: prometheus-data
EFO
Create the prometheus-ingress.yaml file
This is mainly to make Prometheus reachable from outside the cluster (for example, for an external Grafana).
cat > prometheus-ingress.yaml << EFO
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: prometheus-ingress
  namespace: kube-system
spec:
  rules:
    - host: prometheus.baiyongjie.com
      http:
        paths:
          - backend:
              serviceName: prometheus
              servicePort: 9090
EFO
Apply the yaml files
# Deployment order
1. prometheus-rbac.yaml - authorizes Prometheus to access the kube-apiserver
2. prometheus-configmap.yaml - manages the main Prometheus configuration file
3. prometheus-service.yaml - exposes Prometheus so that it can be reached
4. prometheus-pvc-data.yaml - provides data storage for the pod
5. prometheus-statefulset.yaml - deploys Prometheus as a StatefulSet
6. prometheus-ingress.yaml - exposes the service externally
# Apply the yaml files
kubectl apply -f prometheus-rbac.yaml
kubectl apply -f prometheus-configmap.yaml
kubectl apply -f prometheus-ingress.yaml
kubectl apply -f prometheus-pvc-data.yaml
kubectl apply -f prometheus-service.yaml
kubectl apply -f prometheus-statefulset.yaml
# Check the deployment
[root@master prometheus]# kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
prometheus-data 10Gi RWO Recycle Bound kube-system/prometheus-data prometheus-data 32m
[root@master prometheus]# kubectl get pvc -n kube-system
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
prometheus-data Bound prometheus-data 10Gi RWO prometheus-data 33m
[root@master prometheus]# kubectl get service -n kube-system
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kube-dns ClusterIP 10.96.0.10 <none> 53/UDP,53/TCP 12d
prometheus NodePort 10.107.69.131 <none> 9090/TCP 57m
[root@master prometheus]# kubectl get statefulsets.apps -n kube-system
NAME READY AGE
prometheus 1/1 15m
[root@master prometheus]# kubectl get ingresses.extensions -n kube-system
NAME HOSTS ADDRESS PORTS AGE
prometheus-ingress prometheus.baiyongjie.com 80 7m3s
[root@master prometheus]# kubectl get pods -n kube-system -o wide |grep prometheus
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
prometheus-0 2/2 Running 0 42s 10.244.1.6 node01 <none> <none>
Access the ingress
# Edit the hosts file and add an entry for the ingress domain
192.168.1.156 prometheus.baiyongjie.com
Then visit http://prometheus.baiyongjie.com/graph
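If the hosts entry is not in place yet, the ingress can also be smoke-tested with an explicit Host header, and the discovered scrape jobs can be listed through the HTTP API (the IP below is the one added to the hosts file above; adjust to your ingress controller's address):

# hit the ingress without relying on name resolution
curl -s -H 'Host: prometheus.baiyongjie.com' http://192.168.1.156/graph | head -n 5
# list the scrape jobs Prometheus has discovered
curl -s http://prometheus.baiyongjie.com/api/v1/targets | grep -o '"job":"[^"]*"' | sort -u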
Deploy node-exporter
Download the yaml files
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/node-exporter-ds.yml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/node-exporter-service.yaml
The data we want here are host-level metrics, but node-exporter runs inside a container, so the pod needs some extra security settings. We add hostPID: true, hostIPC: true, and hostNetwork: true so that the pod uses the host's PID namespace, IPC namespace, and network. These namespaces are the key technologies behind container isolation; note that this kind of namespace is a completely different concept from a Kubernetes cluster namespace.
We also mount the host's /dev, /proc, and /sys directories into the container, because most node-level data is collected from files under these directories. For example, top reads CPU usage from /proc/stat, and free reads memory usage from /proc/meminfo.
In addition, since this cluster was built with kubeadm, a matching toleration is required if the master node should be monitored as well.
// Modify the node-exporter-ds.yml file
Add the following to the pod template spec:
spec:
  hostPID: true
  hostIPC: true
  hostNetwork: true
  tolerations:
    - key: "node-role.kubernetes.io/master"
      operator: "Exists"
      effect: "NoSchedule"
  volumes:
    - name: proc
      hostPath:
        path: /proc
    - name: dev
      hostPath:
        path: /dev
    - name: sys
      hostPath:
        path: /sys
    - name: rootfs
      hostPath:
        path: /
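The hostPath volumes above need matching volumeMounts in the node-exporter container spec; a sketch of what they might look like, assuming the common /host/... mount convention (match the mount paths to the --path.procfs / --path.sysfs arguments your node-exporter is started with):

        volumeMounts:
          - name: proc
            mountPath: /host/proc
            readOnly: true
          - name: sys
            mountPath: /host/sys
            readOnly: true
          - name: dev
            mountPath: /host/dev
            readOnly: true
          - name: rootfs
            mountPath: /rootfs
            readOnly: true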
Apply the yaml files
kubectl apply -f node-exporter-service.yaml
kubectl apply -f node-exporter-ds.yml
# Check the deployment
[root@master prometheus]# kubectl get pods -n kube-system |grep node-export
node-exporter-lb7gb 1/1 Running 0 4m59s
node-exporter-q22zn 1/1 Running 0 4m59s
[root@master prometheus]# kubectl get service -n kube-system |grep node-export
node-exporter ClusterIP None <none> 9100/TCP 5m49s
Check whether Prometheus is collecting the data
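A quick way to verify is to query a node_exporter metric through the Prometheus HTTP API, using the ClusterIP of the prometheus service shown earlier (node_load1 is a standard node_exporter metric, used here purely as an example); alternatively, open Status -> Targets in the web UI and check that the node-exporter endpoints are UP:

# 10.107.69.131 is the prometheus ClusterIP from the `kubectl get service` output above
curl -s 'http://10.107.69.131:9090/api/v1/query?query=node_load1'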
Deploy kube-state-metrics
Download the yaml files
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-rbac.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/kube-state-metrics-deployment.yaml
Apply the yaml files
kubectl apply -f kube-state-metrics-service.yaml
kubectl apply -f kube-state-metrics-rbac.yaml
kubectl apply -f kube-state-metrics-deployment.yaml
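Once the kube-state-metrics pod is running, object-level metrics become queryable from the Prometheus UI. Two illustrative PromQL examples using standard kube-state-metrics metric names:

# number of pods per namespace and phase
sum(kube_pod_status_phase) by (namespace, phase)
# deployments whose available replicas do not match the desired count
kube_deployment_spec_replicas != kube_deployment_status_replicas_available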
Deploy Grafana
Generate the yaml files
grafana-pvc.yaml
mkdir /data/pvdata/prometheus-grafana
chown nfsnobody. /data/pvdata/prometheus-grafana
cat > grafana-pvc.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-grafana
spec:
  storageClassName: prometheus-grafana
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus-grafana
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-grafana
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: prometheus-grafana
EFO
grafana-ingress.yaml
cat > grafana-ingress.yaml << EFO
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: grafana
  namespace: kube-system
spec:
  rules:
    - host: grafana.baiyongjie.com
      http:
        paths:
          - path: /
            backend:
              serviceName: grafana
              servicePort: 3000
EFO
grafana-deployment.yaml
# cat > grafana-deployment.yaml << EFO
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: grafana
  namespace: kube-system
  labels:
    app: grafana
spec:
  revisionHistoryLimit: 10
  template:
    metadata:
      labels:
        app: grafana
        component: prometheus
    spec:
      containers:
        - name: grafana
          env:
            - name: GF_SECURITY_ADMIN_USER
              value: admin
            - name: GF_SECURITY_ADMIN_PASSWORD
              value: admin
          image: grafana/grafana:5.3.0
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 3000
              name: grafana
          readinessProbe:
            failureThreshold: 10
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            initialDelaySeconds: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 30
          livenessProbe:
            failureThreshold: 3
            httpGet:
              path: /api/health
              port: 3000
              scheme: HTTP
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          resources:
            limits:
              cpu: 100m
              memory: 256Mi
            requests:
              cpu: 100m
              memory: 256Mi
          volumeMounts:
            - mountPath: /var/lib/grafana
              subPath: grafana
              name: grafana-volumes
      volumes:
        - name: grafana-volumes
          persistentVolumeClaim:
            claimName: prometheus-grafana
EFO
Apply the yaml files
kubectl apply -f grafana-pvc.yaml
kubectl apply -f grafana-ingress.yaml
kubectl apply -f grafana-deployment.yaml
# Check the deployment
[root@master prometheus]# kubectl get service -n kube-system |grep grafana
grafana ClusterIP 10.105.159.132 <none> 3000/TCP 150m
[root@master prometheus]# kubectl get ingresses.extensions -n kube-system |grep grafana
grafana grafana.baiyongjie.com 80 150m
[root@master prometheus]# kubectl get pods -n kube-system |grep grafana
grafana-6f6d77d98d-wwmbd 1/1 Running 0 53m
Configure Grafana
Edit the local hosts file to add an entry for the ingress domain, then visit http://grafana.baiyongjie.com
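Before importing dashboards, add a Prometheus data source. Since Grafana and the prometheus service both run in kube-system, the in-cluster service name is a reasonable default URL (the NodePort or ingress address would work as well):

# Grafana -> Configuration -> Data Sources -> Add data source -> Prometheus
# URL:
http://prometheus:9090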
- Recommended dashboards to import (by grafana.com dashboard ID):
  - 3131 Kubernetes All Nodes
  - 3146 Kubernetes Pods
  - 8685 K8s Cluster Summary
  - 10000 Cluster Monitoring for Kubernetes
Deploy alertmanager
Download the yaml files
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-pvc.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-service.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-deployment.yaml
curl -O https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/prometheus/alertmanager-configmap.yaml
Modify the yaml files
alertmanager-pvc.yaml
mkdir /data/pvdata/prometheus-alertmanager
chown nfsnobody. /data/pvdata/prometheus-alertmanager
cat > alertmanager-pvc.yaml << EFO
apiVersion: v1
kind: PersistentVolume
metadata:
  name: prometheus-alertmanager
spec:
  storageClassName: prometheus-alertmanager
  capacity:
    storage: 2Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    path: /data/pvdata/prometheus-alertmanager
    server: 192.168.1.155
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prometheus-alertmanager
  namespace: kube-system
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Gi
  storageClassName: prometheus-alertmanager
EFO
alertmanager-deployment.yaml
# Change the claimName on the last line to the PVC created above
        - name: storage-volume
          persistentVolumeClaim:
            claimName: prometheus-alertmanager
Apply the yaml files
kubectl apply -f alertmanager-pvc.yaml
kubectl apply -f alertmanager-configmap.yaml
kubectl apply -f alertmanager-service.yaml
kubectl apply -f alertmanager-deployment.yaml
# Check the deployment
[root@master prometheus-ink8s]# kubectl get all -n kube-system |grep alertmanager
pod/alertmanager-c564cb9fc-bfrvb 2/2 Running 0 71s
service/alertmanager ClusterIP 10.102.208.66 <none> 80/TCP 5m44s
deployment.apps/alertmanager 1/1 1 1 71s
replicaset.apps/alertmanager-c564cb9fc 1 1 1 71s
Create alerting rules
// Modify the prometheus-configmap.yaml file
kubectl edit configmaps prometheus-config -n kube-system
// Add the following under prometheus.yml: |
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:80
rule_files:
  - "/etc/config/rules.yml"
// Create the alert rules: add the following at the very bottom, as a sibling of prometheus.yml under data: (match its indentation)
rules.yml: |
  groups:
    - name: example
      rules:
        - alert: InstanceDown
          expr: up == 0
          for: 1m
          labels:
            severity: page
          annotations:
            summary: "Instance {{ $labels.instance }} down"
            description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 1 minute."
        - alert: NodeMemoryUsage
          expr: (sum(node_memory_MemTotal) - sum(node_memory_MemFree+node_memory_Buffers+node_memory_Cached)) / sum(node_memory_MemTotal) * 100 > 20
          for: 2m
          labels:
            team: node
          annotations:
            summary: "{{ $labels.instance }}: High memory usage detected"
            description: "{{ $labels.instance }}: Memory usage is above 20% (current value: {{ $value }})"
// Reload the configuration
# kubectl apply -f prometheus-configmap.yaml
# kubectl get service -n kube-system |grep prometheus
prometheus ClusterIP 10.111.97.89 <none> 9090/TCP 4h42m
# curl -X POST http://10.111.97.89:9090/-/reload
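After the reload, the new rule group should show up on the Rules page of the web UI; it can also be checked through the HTTP API (assuming a Prometheus 2.x release, which exposes /api/v1/rules):

# curl -s http://10.111.97.89:9090/api/v1/rules | grep -o '"name":"InstanceDown"'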
Configure email alerting
# Modify the alertmanager-configmap.yaml file
cat > alertmanager-configmap.yaml << EFO
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 3m   # how long to wait before marking an alert as resolved
      smtp_smarthost: 'smtp.163.com:25'
      smtp_from: 'USERNAME@163.com'
      smtp_auth_username: 'USERNAME@163.com'
      smtp_auth_password: 'PASSWORD'
      smtp_require_tls: false
    route:
      group_by: ['example']
      group_wait: 60s
      group_interval: 60s
      repeat_interval: 12h
      receiver: 'mail'
    receivers:
      - name: 'mail'
        email_configs:
          - to: 'misterbyj@163.com'
            send_resolved: true
EFO
kubectl delete configmaps -n kube-system alertmanager-config
kubectl apply -f alertmanager-configmap.yaml
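To confirm that Alertmanager actually picked up the new configuration, its status endpoint can be queried through the ClusterIP shown earlier (Alertmanager v0.x exposes /api/v1/status); if the old config is still active, deleting the pod forces a restart with the updated ConfigMap:

curl -s http://10.102.208.66/api/v1/status
kubectl delete pod -n kube-system alertmanager-c564cb9fc-bfrvb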
View the alerts
**Visit Prometheus and check whether the alerting rules show up on the Alerts page**