I. Configure alerting rules
1. Set the path where alerting rule files are stored
$ vim prometheus-configmap.yaml
Add the following configuration:
rule_files:
  - /etc/config/rules/*.rules
As shown in the figure below:
2. Apply prometheus-configmap.yaml again so that the change takes effect.
$ kubectl apply -f prometheus-configmap.yaml
configmap/prometheus-config configured
3. Write the alerting rules
Here we write a few common alerting rules directly for testing (prometheus-rules.yaml):
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-rules
  namespace: kube-system
data:
  general.rules: |
    groups:
    - name: general.rules
      rules:
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: error
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }}: job {{ $labels.job }} has been down for more than 2 minutes."
  node.rules: |
    groups:
    - name: node.rules
      rules:
      - alert: NodeFilesystemUsage
        # The 1% threshold is deliberately low so that the alert fires during testing.
        expr: 100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs"} * 100) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: high usage on partition {{ $labels.mountpoint }}"
          description: "{{ $labels.instance }}: usage of partition {{ $labels.mountpoint }} is above 1% (current value: {{ $value }})"
      - alert: NodeMemoryUsage
        expr: 100 - (node_memory_MemFree_bytes+node_memory_Cached_bytes+node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100 > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: high memory usage"
          description: "{{ $labels.instance }}: memory usage is above 80% (current value: {{ $value }})"
      - alert: NodeCPUUsage
        expr: 100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) > 80
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }}: high CPU usage"
          description: "{{ $labels.instance }}: CPU usage is above 80% (current value: {{ $value }})"
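The filesystem and memory rules above share the same arithmetic: usage percent = 100 - (free / total * 100). The following is a minimal shell sketch of that calculation, not part of the deployment. The function name usage_pct and the sample byte counts are made up for illustration.

```shell
# usage_pct FREE TOTAL
# Prints the integer usage percentage, mirroring 100 - (free / total * 100).
# Integer arithmetic is used, so the result can differ from the PromQL
# floating-point value by less than one percentage point.
usage_pct() {
  echo $(( 100 - $1 * 100 / $2 ))
}

# Hypothetical sample: 25 GiB free on a 100 GiB filesystem.
usage_pct 26843545600 107374182400
```

With these sample values the usage is 75%, which would exceed both the 1% test threshold used by NodeFilesystemUsage and, had it been a memory figure, fall just under the 80% threshold used by NodeMemoryUsage.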
4. Apply prometheus-rules.yaml
$ kubectl apply -f prometheus-rules.yaml
configmap/prometheus-rules created
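5. Mount the ConfigMap into the container's rules directory: edit prometheus-statefulset.yaml and add the prometheus-rules entries under volumeMounts and volumes shown below (marked with a red box in the original figure).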
$ vim prometheus-statefulset.yaml
          volumeMounts:
            - name: config-volume
              mountPath: /etc/config
            - name: prometheus-data
              mountPath: /data
              subPath: ""
            - name: prometheus-rules
              mountPath: /etc/config/rules
      terminationGracePeriodSeconds: 300
      volumes:
        - name: config-volume
          configMap:
            name: prometheus-config
        - name: prometheus-rules
          configMap:
            name: prometheus-rules
Note: the configMap name here must match the name of the ConfigMap created from prometheus-rules.yaml above (prometheus-rules).
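When a ConfigMap is mounted as a volume, each key under data becomes a file in the mount directory, so general.rules and node.rules appear under /etc/config/rules. Only file names that match the rule_files glob (*.rules) are loaded. The sketch below checks key names against that pattern. The function name matches_rule_glob and the key node.yaml are hypothetical, used only for illustration.

```shell
# matches_rule_glob NAME
# Succeeds when NAME (a ConfigMap key, i.e. a file name under
# /etc/config/rules) would match the glob /etc/config/rules/*.rules.
matches_rule_glob() {
  case "$1" in
    */*) return 1 ;;       # a glob "*" does not cross directory separators
    *.rules) return 0 ;;
    *) return 1 ;;
  esac
}

for key in general.rules node.rules node.yaml; do
  if matches_rule_glob "$key"; then
    echo "$key: loaded"
  else
    echo "$key: ignored"
  fi
done
```

A key named node.yaml would be mounted but silently skipped by Prometheus, which is a common reason for rules not showing up after the StatefulSet is updated.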
6. Re-apply prometheus-statefulset.yaml and confirm that the pods are running
$ kubectl apply -f prometheus-statefulset.yaml
$ kubectl get pods -n kube-system
NAME                                   READY   STATUS    RESTARTS   AGE
alertmanager-6b5bbd5bd4-g9mpd          2/2     Running   0          66m
coredns-55f46dd959-9kspv               1/1     Running   3          35d
coredns-55f46dd959-l5vww               1/1     Running   0          35d
grafana-0                              1/1     Running   0          2d
kube-state-metrics-6cf969f79b-29f2r    1/1     Running   0          5d23h
kubernetes-dashboard-ccd98cd4c-jzlbs   1/1     Running   0          34d
node-exporter-7x9zl                    1/1     Running   0          18h
node-exporter-ksslf                    1/1     Running   0          18h
prometheus-0                           2/2     Running   0          30m
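7. Open the Prometheus Rules page and confirm that the rules are loaded and in effect
II. Configure DingTalk alerts
1. Register a DingTalk account -> Robot management -> Custom (connect a custom service via webhook) -> Add -> copy the webhook
Once the group robot is configured, record its Webhook address. It is needed later when configuring the DingTalk alert plugin. The format is: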
https://oapi.dingtalk.com/robot/send?access_token=xxxxxxxx
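Before wiring the robot into Alertmanager, it can help to confirm that the webhook address works on its own. The sketch below builds the JSON body of a DingTalk "text" message and defines a helper that posts it with curl. The function names are hypothetical, and the message text is only an example. Depending on the robot's security settings, the message may need to contain a configured keyword or the request may need additional signing, which this sketch does not handle.

```shell
# dingtalk_text_payload MESSAGE
# Prints the JSON body for a DingTalk "text" message.
# MESSAGE must not contain double quotes or backslashes (no escaping is done).
dingtalk_text_payload() {
  printf '{"msgtype":"text","text":{"content":"%s"}}' "$1"
}

# send_test_message WEBHOOK_URL MESSAGE
# Posts the payload to the robot; this only runs when called explicitly.
send_test_message() {
  curl -s -H 'Content-Type: application/json' \
    -d "$(dingtalk_text_payload "$2")" "$1"
}

# Show the payload that would be sent.
dingtalk_text_payload "test alert"
```

To actually send the message, call send_test_message with the full webhook address (including the real access_token) as the first argument.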
2. Create the DingTalk alert plugin (dingtalk-webhook.yaml), replacing access_token=xxxxxx in the file with the robot token obtained in the previous step
$ vim dingtalk-webhook.yaml
---
# extensions/v1beta1 Deployments were removed in Kubernetes 1.16; on newer
# clusters use apps/v1 and add spec.selector.matchLabels (run: dingtalk).
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    run: dingtalk
  name: webhook-dingtalk
  namespace: monitoring
spec:
  replicas: 1
  template:
    metadata:
      labels:
        run: dingtalk
    spec:
      containers:
      - name: dingtalk
        image: timonwong/prometheus-webhook-dingtalk:v0.3.0
        imagePullPolicy: IfNotPresent
        # After creating the custom robot in the DingTalk group chat, replace xxxxxx below with the actual access_token
        args:
        - --ding.profile=webhook1=https://oapi.dingtalk.com/robot/send?access_token=xxxxxx
        ports:
        - containerPort: 8060
          protocol: TCP
---
apiVersion: v1
kind: Service
metadata:
  labels:
    run: dingtalk
  name: webhook-dingtalk
  namespace: monitoring
spec:
  ports:
  - port: 8060
    protocol: TCP
    targetPort: 8060
  selector:
    run: dingtalk
  sessionAffinity: None
3. Apply dingtalk-webhook.yaml
$ kubectl apply -f dingtalk-webhook.yaml
4. Modify the Alertmanager configuration, apply the updated alertmanager-configmap.yaml, and then test alert delivery
$ vim alertmanager-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: kube-system
  labels:
    kubernetes.io/cluster-service: "true"
    addonmanager.kubernetes.io/mode: EnsureExists
data:
  alertmanager.yml: |
    global: null
    route:
      group_interval: 5m
      group_wait: 10s
      receiver: dingtalk
      repeat_interval: 10m
    receivers:
    - name: default-receiver
    - name: dingtalk
      webhook_configs:
      - send_resolved: true
        url: http://webhook-dingtalk.monitoring.svc.cluster.local:8060/dingtalk/webhook1/send
Note: the url can point directly at the Service address, which has the form servicename.namespace.svc.cluster.local
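The webhook url is assembled from values defined earlier: the Service name and namespace from dingtalk-webhook.yaml, the Service port, and the profile name given in the --ding.profile argument (webhook1). A small sketch of how the pieces fit together follows. The function name webhook_url is hypothetical, and the path layout is taken from the url used in this configuration.

```shell
# webhook_url SERVICE NAMESPACE PORT PROFILE
# Prints the in-cluster URL that Alertmanager posts alerts to.
webhook_url() {
  printf 'http://%s.%s.svc.cluster.local:%s/dingtalk/%s/send' "$1" "$2" "$3" "$4"
}

# Values from dingtalk-webhook.yaml in this walkthrough.
webhook_url webhook-dingtalk monitoring 8060 webhook1
```

If the Deployment is placed in a different namespace or the profile is renamed, the url in alertmanager.yml has to change accordingly.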
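5. Test that DingTalk receives alerts
① Modify the rules in prometheus-rules.yaml (for example, lower a threshold so that an alert triggers)
② Check the alert state on the Prometheus Alerts page (PENDING or FIRING)
PENDING: the alert expression is true but has not yet held for the rule's "for" duration, so no alert has been sent.
FIRING: the condition has held for the full "for" duration and the alert has been sent. (For details, check the logs of the webhook-dingtalk pod.)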
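The relationship between the two states and the "for" setting can be sketched as a small decision function. This is an illustration of the state logic only, not how Prometheus is implemented. The function name alert_state and its arguments are hypothetical.

```shell
# alert_state EXPR_TRUE ACTIVE_SECONDS FOR_SECONDS
# EXPR_TRUE is "true" when the alert expression currently matches.
# ACTIVE_SECONDS is how long it has matched without interruption.
# FOR_SECONDS is the rule's "for" duration in seconds.
alert_state() {
  if [ "$1" != "true" ]; then
    echo inactive
  elif [ "$2" -lt "$3" ]; then
    echo pending
  else
    echo firing
  fi
}

alert_state true 60 120     # matched for 1m with for: 2m -> pending
alert_state true 120 120    # matched for the full 2m -> firing
alert_state false 0 120     # expression no longer matches -> inactive
```

With for: 2m as used in the rules above, an alert stays pending for two minutes before DingTalk can receive anything.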