Preface
Why do we need monitoring?
Take the surveillance cameras along a road: they monitor traffic flow and catch incidents, so when something goes wrong the location can be pinned down immediately. Monitoring plays exactly the same role in our field. It watches business traffic, service health, server status, and more, and a mature monitoring system brings enormous convenience to both operations and development teams.
This article shows how to build a persistent monitoring stack that is popular in enterprises today, using InfluxDB as Prometheus's long-term storage.
Main stack: Kubernetes + Prometheus + InfluxDB + Grafana + PrometheusAlert
The architecture diagram is as follows:
Environment
- OS: CentOS 7.6
- Kubernetes: v1.20.5
- Helm: v3.5.4
- Prometheus: v2.19.0
- InfluxDB: v1.8.0
- Grafana: v6.7.1
Prometheus and Grafana are deployed on Kubernetes; InfluxDB runs on a separate machine.
I. Install InfluxDB
InfluxDB is installed on the server 192.168.241.143.
1. Download the package
wget https://dl.influxdata.com/influxdb/releases/influxdb-1.8.0.x86_64.rpm
2. Install
yum -y localinstall influxdb-1.8.0.x86_64.rpm
3. Start InfluxDB
systemctl start influxdb
systemctl enable influxdb
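A couple of quick checks that the service came up and that the HTTP API is listening on its default port 8086 (the /ping endpoint returns 204 No Content when healthy):
systemctl status influxdb
curl -sI http://localhost:8086/ping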
After installation, the following binaries are present in /usr/bin:
# influx (press Tab twice)
influx influxd influx_inspect influx_stress influx_tsm
-------
influxd          the InfluxDB server daemon
influx           the InfluxDB command-line client
influx_inspect   inspection utility
influx_stress    stress-testing tool
influx_tsm       database conversion tool (converts databases from the b1 or bz1 format to tsm1)
The following directories appear under /var/lib/influxdb/:
# ls /var/lib/influxdb/
data meta
4. Create a database for Prometheus through the HTTP API
192.168.241.143 is my InfluxDB address:
curl -XPOST http://192.168.241.143:8086/query --data-urlencode "q=CREATE DATABASE prometheus"
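To confirm the database exists, list databases through the same API:
curl -G http://192.168.241.143:8086/query --data-urlencode "q=SHOW DATABASES"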
II. Prepare the Prometheus chart
Prometheus is deployed to Kubernetes with Helm.
1. Clone the chart repository to your local machine:
git clone https://github.com/zyiqian/charts.git
ls charts
prometheus prometheus-k8s-values.yaml README.md
2. Create a namespace for Prometheus
kubectl create namespace monitoring
III. Prepare remote_storage_adapter
For environments with heavy traffic, local storage is simply not enough; remote storage is required.
Prometheus's remote storage works through an adapter: the adapter exposes a write URL and a read URL, and Prometheus, after collecting data, writes it locally first and then calls the write URL to ship it to the remote store.
1. Download the remote_storage_adapter binary.
Building it yourself requires a Go environment. I have already built it; the binary at the address below can be used directly.
wget https://github.com/zyiqian/charts/raw/main/prometheus/remote_storage_adapter
chmod +x remote_storage_adapter
2. Start a remote_storage_adapter to bridge InfluxDB and Prometheus:
./remote_storage_adapter --influxdb-url=http://192.168.241.143:8086/ --influxdb.database="prometheus" --influxdb.retention-policy=autogen
Run ./remote_storage_adapter -h for the full list of options (changing the bind port, InfluxDB settings, and so on).
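For reference: if you want Prometheus to write through the adapter rather than InfluxDB's built-in endpoints, the upstream example adapter listens on :9201 by default and serves /write and /read, so the remote URLs would look like the sketch below (the port and paths are upstream defaults — confirm them against your build's -h output):
remoteWrite:
  - url: "http://192.168.241.143:9201/write"
remoteRead:
  - url: "http://192.168.241.143:9201/read"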
Register remote_storage_adapter as a systemd service:
cat >/lib/systemd/system/remote_storage_adapter.service <<EOF
[Unit]
Description=Prometheus remote storage adapter for InfluxDB
After=network-online.target

[Service]
Restart=on-failure
WorkingDirectory=/root/
ExecStart=/root/remote_storage_adapter --influxdb-url=http://192.168.241.143:8086/ --influxdb.database="prometheus" --influxdb.retention-policy=autogen

[Install]
WantedBy=multi-user.target
EOF
chmod 644 /lib/systemd/system/remote_storage_adapter.service
systemctl daemon-reload
systemctl enable remote_storage_adapter
systemctl start remote_storage_adapter
systemctl status remote_storage_adapter
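If the service fails to start, the adapter's output goes to the journal:
journalctl -u remote_storage_adapter -f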
3. Edit the values file in the Prometheus chart
Find remoteWrite and remoteRead in prometheus-k8s-values.yaml.
Replace the addresses with your own InfluxDB address.
Change the service type to NodePort.
cd prometheus
vim prometheus-k8s-values.yaml
........
remoteWrite:
  - url: "http://192.168.241.143:8086/api/v1/prom/write?db=prometheus"
## https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_read
##
remoteRead:
  - url: "http://192.168.241.143:8086/api/v1/prom/read?db=prometheus"
..........
  servicePort: 9090
  sessionAffinity: None
  type: NodePort
4. Deploy Prometheus
Go into the charts directory cloned earlier:
cd /charts
helm upgrade --install -f prometheus-k8s-values.yaml --namespace monitoring prometheus ./prometheus
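You can confirm the release was created with:
helm list -n monitoring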
Check the pods:
# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-alertmanager-6755f85f9b-sh4n8 2/2 Running 0 122m
prometheus-kube-state-metrics-95d956569-rpp64 1/1 Running 0 122m
prometheus-node-exporter-fdjvr 1/1 Running 0 122m
prometheus-node-exporter-s7bjr 1/1 Running 0 122m
prometheus-node-exporter-tctzx 1/1 Running 0 122m
prometheus-server-7dffdd7f6b-sghlv 2/2 Running 0 122m
5. Forward the Prometheus service port to your local machine
The commands below are run from a Windows CMD prompt:
>kubectl.exe get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-alertmanager ClusterIP 10.233.42.87 <none> 80/TCP 2m56s
prometheus-kube-state-metrics ClusterIP 10.233.56.122 <none> 8080/TCP 2m56s
prometheus-node-exporter ClusterIP None <none> 9100/TCP 2m56s
prometheus-server ClusterIP 10.233.22.16 <none> 80/TCP 2m56s
>kubectl.exe port-forward svc/prometheus-server -n monitoring 80
Forwarding from 127.0.0.1:80 -> 9090
Forwarding from [::1]:80 -> 9090
Check the services:
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-alertmanager ClusterIP 10.233.42.87 <none> 80/TCP 4h1m
prometheus-kube-state-metrics ClusterIP 10.233.56.122 <none> 8080/TCP 4h1m
prometheus-node-exporter ClusterIP None <none> 9100/TCP 4h1m
prometheus-server NodePort 10.233.22.16 <none> 9090:30056/TCP 4h1m
6. Access it in a browser
node IP + NodePort
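With the service listing above that is http://<node IP>:30056. Running a trivial query such as up in the expression browser is a quick way to confirm that targets are being scraped.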
7. Check InfluxDB
# influx
Connected to http://localhost:8086 version 1.8.0
influxdb shell version: 1.8.0
> show databases;
name: databases
name
----
prometheus
_internal
> use prometheus
Using database prometheus
> SHOW MEASUREMENTS;
name: measurements
name
----
ALERTS
ALERTS_FOR_STATE
aggregator_openapi_v2_regeneration_count
........
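Each Prometheus metric becomes an InfluxDB measurement, so you can query one directly; a quick sanity check against the up metric (assuming it has already been scraped):
> SELECT * FROM "up" LIMIT 3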
IV. Install Grafana
Reference: https://blog.z0ukun.com/?p=2358
Above, we used Prometheus to collect monitoring metrics from the Kubernetes cluster. Prometheus's own graphing is fairly weak, so the usual approach is to visualize the data with a third-party tool, Grafana.
1. Install an NFS server
The NFS server is used to persist Grafana's data. Here it shares the same machine as InfluxDB; note that this is not recommended for production.
yum install -y nfs-utils rpcbind
Start the services.
Note: start rpcbind before the NFS service.
systemctl start rpcbind      # start the rpc service first
systemctl enable rpcbind     # enable it on boot
systemctl start nfs-server   # start the nfs service
systemctl enable nfs-server
Create the shared directory and edit the exports file:
mkdir -p /data/public
vi /etc/exports
/data/public *(rw,sync,no_root_squash)
systemctl reload nfs-server
Use showmount to view the NFS server's exports:
showmount -e 192.168.241.143
Export list for 192.168.241.143:
/data/public *
nfs-utils must be installed on every master and worker node:
yum install nfs-utils -y
Otherwise pods will fail to mount the volume with errors like the following:
# kubectl get pods -n grafana
NAME READY STATUS RESTARTS AGE
grafana-59f68b896-rsgp2 0/1 ContainerCreating 0 84s
grafana-chown-fkgst 0/1 ContainerCreating 0 84s
# kubectl describe pod grafana-59f68b896-rsgp2 -n grafana
.......
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 95s default-scheduler Successfully assigned grafana/grafana-59f68b896-rsgp2 to node2
Warning FailedMount <invalid> (x8 over <invalid>) kubelet MountVolume.SetUp failed for volume "grafana" : mount failed: exit status 32
Mounting command: mount
Mounting arguments: -t nfs 192.168.241.143:/data/public /var/lib/kubelet/pods/2cf96c6d-49cb-405b-8861-57147969acf3/volumes/kubernetes.io~nfs/grafana
Output: mount: wrong fs type, bad option, bad superblock on 192.168.241.143:/data/public,
missing codepage or helper program, or other error
(for several filesystems (e.g. nfs, cifs) you might
need a /sbin/mount.<type> helper program)
In some cases useful info is found in syslog - try
dmesg | tail or so.
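A quick way to rule out NFS problems from a node before blaming Kubernetes (using a throwaway mount point):
mkdir -p /mnt/nfstest
mount -t nfs 192.168.241.143:/data/public /mnt/nfstest
umount /mnt/nfstest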
2. Clone the Grafana manifests
Adjust the config files to fit your environment.
git clone https://github.com/zyiqian/grafana.git
ls
grafana-chown-job.yaml grafana-deploy.yaml grafana-svc.yaml grafana-volume.yaml
- grafana-deploy.yaml: contains two important environment variables, GF_SECURITY_ADMIN_USER and GF_SECURITY_ADMIN_PASSWORD, which set Grafana's admin user and password.
- grafana-volume.yaml: handles data persistence via the NFS mount; change the NFS server address to your own (see the sketch after this list).
- grafana-svc.yaml: exposes the Grafana service.
- grafana-chown-job.yaml: fixes the ownership of the /var/lib/grafana directory.
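For orientation, a minimal sketch of the NFS-backed PersistentVolume that grafana-volume.yaml typically defines (the name and size here are illustrative — check the actual file in the repo):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: grafana
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  nfs:
    server: 192.168.241.143   # your NFS server
    path: /data/public        # the exported directory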
3. Create the grafana namespace
kubectl create namespace grafana
4. Deploy
ls
grafana-chown-job.yaml grafana-deploy.yaml grafana-svc.yaml grafana-volume.yaml
kubectl apply -f .
5. Check the pods
# kubectl get pods -n grafana
NAME READY STATUS RESTARTS AGE
grafana-59f68b896-ch5q9 1/1 Running 0 63m
grafana-chown-qsrh4 0/1 Completed 0 63m
If a pod is not in the Running state, use describe and logs to find out why.
6. Check the service
# kubectl get svc -n grafana
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
grafana NodePort 10.233.18.36 <none> 3000:32214/TCP 64m
7. Access in a browser
node IP + NodePort (32214 in the listing above)
Log in with admin / admin321, the credentials set in grafana-deploy.yaml.
8. Create a data source
On first login you are prompted to create a data source.
Choose Prometheus.
The URL is the Prometheus service URL from above.
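If Grafana runs in the same cluster, the in-cluster DNS name also works and, unlike a NodePort URL, survives redeployments (9090 matches the servicePort set earlier):
http://prometheus-server.monitoring.svc.cluster.local:9090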
9. Import a dashboard
Dashboard templates: https://grafana.com/grafana/dashboards
You can import by ID (13105, for example) or from a JSON file.
After importing you can see status information for every node.
The time range can be selected at the top.
Testing
Verify that the data really is persisted.
Uninstall Prometheus:
# helm uninstall prometheus -n monitoring
Confirm all the pods are gone:
# kubectl get pods -n monitoring
No resources found in monitoring namespace.
Redeploy Prometheus:
# cd /data/prometheus/
# helm upgrade --install -f prometheus-k8s-values.yaml --namespace monitoring prometheus ./prometheus
Check the pods:
# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
prometheus-alertmanager-6755f85f9b-nstxf 2/2 Running 0 43s
prometheus-kube-state-metrics-95d956569-jd9bm 1/1 Running 0 43s
prometheus-node-exporter-5ldwc 1/1 Running 0 43s
prometheus-node-exporter-l5qlx 1/1 Running 0 43s
prometheus-node-exporter-pwfth 1/1 Running 0 43s
prometheus-server-7dffdd7f6b-hnvdk 2/2 Running 0 43s
Check the services:
# kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
prometheus-alertmanager ClusterIP 10.233.18.217 <none> 80/TCP 117s
prometheus-kube-state-metrics ClusterIP 10.233.20.130 <none> 8080/TCP 117s
prometheus-node-exporter ClusterIP None <none> 9100/TCP 117s
prometheus-server NodePort 10.233.11.102 <none> 9090:30462/TCP 117s
Then open Prometheus in the browser and run a query.
There is a gap in the graph: that is the window during which the pods were uninstalled. The data from before it is still displayed.
Update the port in the Grafana data source URL (the NodePort changed from 30056 to 30462 after redeploying).
Save & Test.
Back on the dashboard, widen the time range and the historical monitoring data appears, which confirms that persistence is working.
The large gaps below are because I was testing on a local VM.
V. Deploy PrometheusAlert
Project: https://github.com/feiyu563/PrometheusAlert
PrometheusAlert is an open-source message-forwarding hub for operations alerts; it can relay the alerts it receives to DingTalk, WeChat, and email.
1. Clone the chart
git clone https://github.com/zyiqian/charts.git
cd charts/prometheusalert
ls
Chart.yaml config README.md templates values.yaml
2. Edit values.yaml
# Default values for prometheusalert.
# This is a YAML-formatted file.
# Declare variables to be passed into your templates.
replicaCount: 1
image:
  repository: feiyu563/prometheus-alert
  tag: "latest"
  pullPolicy: IfNotPresent
nameOverride: ""
fullnameOverride: ""
service:
  type: ClusterIP
  port: 80
ingress:
  enabled: false
  annotations: {}
    # kubernetes.io/ingress.class: nginx
    # kubernetes.io/tls-acme: "true"
  hosts:
    - host: prometheus-alert.local
      paths: ["/"]
  tls: []
resources:
  limits:
    cpu: 1000m
    memory: 1024Mi
  requests:
    cpu: 100m
    memory: 128Mi
nodeSelector: {}
tolerations: []
affinity: {}
3. Edit config/app.conf
# Enable the DingTalk alert channel (multiple channels can be enabled at once; 0 = off, 1 = on)
open-dingding=1
# Default DingTalk robot webhook
ddurl=https://oapi.dingtalk.com/robot/send?access_token=1111111sssssss1
# Whether to @ everyone (0 = off, 1 = on)
dd_isatall=0
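Before wiring anything up, the robot webhook can be tested directly; DingTalk robots accept a simple text message (note that if the robot has a keyword filter, the content must contain that keyword):
curl -H "Content-Type: application/json" \
  -d '{"msgtype":"text","text":{"content":"monitor: webhook test"}}' \
  "https://oapi.dingtalk.com/robot/send?access_token=1111111sssssss1"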
4. Deploy
cd charts/prometheusalert
helm install prometheus-alert -n monitoring .
(Helm 3 requires a release name; prometheus-alert matches the resource names shown below.)
Check the pods:
# kubectl get pods -n monitoring
NAME READY STATUS RESTARTS AGE
grafana-94c76556b-frksh 1/1 Running 0 1h
prometheus-alert-7f64cbb99c-rssw2 1/1 Running 0 18s
prometheus-alertmanager-6755f85f9b-nstxf 2/2 Running 0 1h
prometheus-kube-state-metrics-95d956569-jd9bm 1/1 Running 0 1h
prometheus-node-exporter-5ldwc 1/1 Running 0 1h
prometheus-node-exporter-l5qlx 1/1 Running 0 1h
prometheus-node-exporter-pwfth 1/1 Running 0 1h
prometheus-server-7dffdd7f6b-hnvdk 2/2 Running 0 1h
$ kubectl get svc -n monitoring
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
k8s-grafana ClusterIP 12.80.41.192 <none> 80/TCP 1h
prometheus-alert ClusterIP 12.80.128.117 <none> 80/TCP 30s
5. Hook Prometheus up
Enable the webhook in the Prometheus Alertmanager configuration.
Edit prometheus-k8s-values.yaml and find the following section:
........
alertmanagerFiles:
  alertmanager.yml:
    global: {}
      # slack_api_url: ''
    route:
      receiver: "web.hook.prometheusalert"
      group_by: ["alertname"]
      group_wait: 10s
      group_interval: 5m
      repeat_interval: 2h
      routes:
        - receiver: "page"
          group_wait: 10s
          group_by: ['page']
          repeat_interval: 8h
          group_interval: 5m
          match_re:
            severity: "page"
    receivers:
      - name: 'web.hook.prometheusalert'
        webhook_configs:
          - url: 'http://[prometheusalert_url]/prometheus/alert'
            # with a ClusterIP service, [prometheusalert_url] is prometheus-alert.monitoring.svc.cluster.local
      - name: 'page'
        webhook_configs:
          - url: 'http://prometheus-alert.monitoring.svc.cluster.local/prometheusalert?type=dd&tpl=prometheus-dd&ddurl=https://oapi.dingtalk.com/robot/send?access_token=031db3xxxxxxxxxxxx'
.....
  prometheus.yml:
    rule_files:
      - /etc/config/recording_rules.yml
      - /etc/config/alerting_rules.yml
      ## Below two files are DEPRECATED and will be removed from this default values file
      - /etc/config/rules
      - /etc/config/alerts
    scrape_configs:
      - job_name: 'node-exporter'
        scrape_interval: 5s
        scrape_timeout: 5s
        metrics_path: /metrics
        scheme: http
        static_configs:
          - targets:
              # jenkins-agent
              - "192.168.241.114:9100"
            labels:
              env: infrastructure
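The route above pages on severity: "page", so at least one alerting rule must attach that label. A minimal sketch of such a rule — in the upstream prometheus chart it would sit under serverFiles, and the rule name and threshold here are illustrative:
serverFiles:
  alerting_rules.yml:
    groups:
      - name: node-health
        rules:
          - alert: NodeExporterDown
            expr: up{job="node-exporter"} == 0
            for: 1m
            labels:
              severity: page
            annotations:
              summary: "node-exporter on {{ $labels.instance }} is down"
              description: "No metrics received from {{ $labels.instance }} for 1 minute."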
6. Redeploy Prometheus after the changes
cd charts
helm upgrade --install -f prometheus-k8s-values.yaml --namespace monitoring prometheus ./prometheus
7. Access the PrometheusAlert web UI
$ kubectl port-forward svc/prometheus-alert -n monitoring 80
Forwarding from 127.0.0.1:80 -> 8080
Forwarding from [::1]:80 -> 8080
The default username and password are both admin.
Once DingTalk is configured, try a test alert.
The DingTalk group will receive a message like the following.
The alert template can be customized here:
{{ $var := .externalURL}}{{ range $k,$v:=.alerts }}
{{if eq $v.status "resolved"}}
## [Prometheus recovery]
##### Alert level: {{$v.labels.severity}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### End time: {{GetCSTtime $v.endsAt}}
###### Affected host IP: {{$v.labels.instance}}
##### [Summary]
##### {{$v.annotations.summary}}
##### [Description]
##### {{$v.annotations.description}}
{{else}}
## [Prometheus alert]
#### [{{$v.labels.alertname}}]
##### Alert level: {{$v.labels.severity}}
###### Start time: {{GetCSTtime $v.startsAt}}
###### Affected host IP: {{$v.labels.instance}}
##### [Summary]
##### {{$v.annotations.summary}}
##### [Description]
##### {{$v.annotations.description}}
{{end}}
{{ end }}
The JSON below can be found in the logs; it is printed whenever an alert fires.
Official docs: https://github.com/feiyu563/PrometheusAlert/blob/master/doc/readme/customtpl.md
kubectl logs prometheus-alert-7f64cbb99c-rssw2 -n monitoring
Find the JSON block in the output, pretty-print it with a JSON formatter, and paste it into the template test box.
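If you do not want to wait for a real alert, the webhook payload Alertmanager sends has a well-known shape, so a minimal hand-written example works in the test box too (every value below is made up):
{
  "receiver": "web.hook.prometheusalert",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "NodeExporterDown",
        "severity": "page",
        "instance": "192.168.241.114:9100"
      },
      "annotations": {
        "summary": "node-exporter on 192.168.241.114:9100 is down",
        "description": "No metrics received for 1 minute."
      },
      "startsAt": "2021-08-16T02:00:00Z",
      "endsAt": "0001-01-01T00:00:00Z"
    }
  ],
  "externalURL": "http://prometheus-alertmanager:9093"
}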
Click "Template test".
With that, the whole monitoring stack is in place.