Following on from the previous part, "Prometheus + Grafana (1) Monitoring", we continue with more advanced uses of Prometheus + Grafana.
Goals
In this part our goal is to build a visualization platform that monitors microservices along several dimensions, including Docker container monitoring, MySQL monitoring, Redis monitoring and microservice JVM monitoring, and that can send alert emails when necessary.
The main components are Prometheus, Grafana, alertmanager, node_exporter, mysql_exporter, redis_exporter and cadvisor. Their roles are as follows:
- Prometheus: scrapes and stores monitoring data and makes it available for third parties to query;
- Grafana: provides the web UI and visualizes the monitoring data it reads from Prometheus;
- alertmanager: receives the alerts fired by Prometheus rules and sends the alert notifications;
- node_exporter: collects host-level (hardware and operating system) monitoring data; it is an official exporter from the Prometheus project;
- mysql_exporter: collects MySQL database monitoring data;
- redis_exporter: collects Redis monitoring data;
- cadvisor: collects Docker container monitoring data.
Installing Grafana, Prometheus and the monitoring services with docker
In the previous part we installed Grafana and Prometheus directly with their Windows installers, but day-to-day production environments mostly run Linux, so here we use docker for a convenient installation and deployment.
- Create prometheus.yml in your own mount directory
# Create the Prometheus mount directory
mkdir -p /dimples/volumes/prometheus
# Create the Prometheus configuration file in that directory
vim /dimples/volumes/prometheus/prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
- Create alertmanager.yml in your own mount directory (the docker-compose.yml below mounts it from /dimples/volumes/alertmanager/)
global:
  smtp_smarthost: 'smtp.qq.com:465'
  smtp_from: '1126834403@qq.com'
  smtp_auth_username: '1126834403@qq.com'
  # Authorization code obtained from QQ Mail
  smtp_auth_password: 'xxxxxxxxxxxxxxxxx'
  smtp_require_tls: false
#templates:
#  - '/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 5m
  repeat_interval: 5m
  receiver: 'default-receiver'
receivers:
  - name: 'default-receiver'
    email_configs:
      - to: '2119713895@qq.com'
        send_resolved: true
- Create the docker-compose.yml file
version: '3'
services:
  prometheus:
    image: prom/prometheus
    container_name: prometheus
    volumes:
      - /dimples/volumes/prometheus/:/etc/prometheus/
    ports:
      - 9090:9090
    restart: on-failure
    command:
      # "command" replaces the image's default arguments, so the config file is passed explicitly
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--web.enable-lifecycle'
  grafana:
    image: grafana/grafana
    container_name: grafana
    ports:
      - 3000:3000
  node_exporter:
    image: prom/node-exporter
    container_name: node_exporter
    ports:
      - 9100:9100
  redis_exporter:
    image: oliver006/redis_exporter
    container_name: redis_exporter
    command:
      - "--redis.addr=redis://127.0.0.1:6379"
      - "--redis.password=ZHONG9602.class" # Authentication password; omit this argument if Redis has no password
    ports:
      - 9101:9121
    restart: on-failure
  mysql_exporter:
    image: prom/mysqld-exporter
    container_name: mysql_exporter
    environment:
      - DATA_SOURCE_NAME=root:123456@(127.0.0.1:3306)/
    ports:
      - 9102:9104
  cadvisor:
    image: google/cadvisor
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    ports:
      - 9103:8080
  alertmanager:
    image: prom/alertmanager
    container_name: alertmanager
    volumes:
      - /dimples/volumes/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    ports:
      - 9104:9093
Start the services with docker-compose up -d
# Installing without docker-compose
docker run -d --name prometheus -p 9090:9090 -v /dimples/volumes/prometheus/:/etc/prometheus/ prom/prometheus --config.file=/etc/prometheus/prometheus.yml --web.enable-lifecycle
docker run -d --name redis_exporter -p 9101:9121 oliver006/redis_exporter --redis.addr redis://127.0.0.1:6379 --redis.password 'ZHONG9602.class'
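Because Prometheus is started with --web.enable-lifecycle, its configuration can be reloaded over HTTP instead of restarting the container. The sketch below assumes Prometheus is reachable at http://127.0.0.1:9090 as mapped above and builds a POST request to the /-/reload endpoint. The function names are illustrative and not part of any Prometheus client library.

```python
import urllib.request


def build_reload_request(base_url: str = "http://127.0.0.1:9090") -> urllib.request.Request:
    """Build the POST request for Prometheus's lifecycle reload endpoint."""
    # /-/reload only responds when Prometheus runs with --web.enable-lifecycle.
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url: str = "http://127.0.0.1:9090") -> int:
    """Send the reload request and return the HTTP status code."""
    with urllib.request.urlopen(build_reload_request(base_url), timeout=5) as resp:
        return resp.status
```

Calling reload_prometheus() after editing prometheus.yml or a rule file should make Prometheus pick up the change, provided the new configuration is valid.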
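- Check that monitoring data is being collected
As shown in the figure above, the two alert rules we just defined have been loaded successfully.
Next, visit http://127.0.0.1:9090/targets to check the state of each job defined in the Prometheus configuration file:
You can see that every target is in the UP state.
You can also click the link of each Endpoint on that page; if the page shows the collected data, that Endpoint is collecting data successfully. Taking mysql_exporter as an example, visit
http://127.0.0.1:9102/metrics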
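The page returned by an exporter is plain text in the Prometheus exposition format: comment lines starting with # followed by "name{labels} value" samples. As a rough illustration, the following sketch parses such text into a dictionary. The sample text is made up for the example (the metric names are ones mysqld_exporter exposes, the values are hypothetical), and the parser deliberately ignores edge cases such as timestamps or spaces inside label values.

```python
def parse_metrics(text: str) -> dict:
    """Parse simple 'name{labels} value' lines, skipping comments and blanks."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines carry no sample
        # The value is the last whitespace-separated field; the rest is the series.
        series, value = line.rsplit(None, 1)
        samples[series] = float(value)
    return samples


# Hypothetical excerpt of what http://127.0.0.1:9102/metrics might return.
SAMPLE = """\
# HELP mysql_up Whether the MySQL server is up.
# TYPE mysql_up gauge
mysql_up 1
mysql_global_status_threads_connected 3
"""

metrics = parse_metrics(SAMPLE)
print(metrics["mysql_up"])  # 1.0 means the exporter can reach MySQL
```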
Visit http://127.0.0.1:9104/#/status to check whether the settings we configured in alertmanager.yml have taken effect:
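To confirm that the email route actually delivers, one option is to push a hand-made alert straight to Alertmanager. The sketch below builds a payload for Alertmanager's v2 API (POST /api/v2/alerts) and assumes Alertmanager is reachable on host port 9104 as mapped in docker-compose.yml. The alert name ManualTestAlert, the severity value and the helper names are invented for this example.

```python
import json
import urllib.request
from datetime import datetime, timezone


def build_test_alert(name: str = "ManualTestAlert") -> list:
    """Return a one-element alert list in the shape /api/v2/alerts expects."""
    return [{
        "labels": {"alertname": name, "severity": "test"},
        "annotations": {"summary": "Manual test alert, safe to ignore"},
        # RFC 3339 timestamp marking when the alert started
        "startsAt": datetime.now(timezone.utc).isoformat(),
    }]


def send_test_alert(base_url: str = "http://127.0.0.1:9104") -> int:
    """POST the test alert to Alertmanager and return the HTTP status code."""
    req = urllib.request.Request(
        base_url.rstrip("/") + "/api/v2/alerts",
        data=json.dumps(build_test_alert()).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status
```

If the SMTP settings are valid, calling send_test_alert() should result in an email to the configured receiver once the 10s group_wait has elapsed.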
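Configuring Java application monitoring
In the configuration above we simply used Grafana to display the data Prometheus collects about itself, but our real focus is having Prometheus collect data from Java applications. For that we rely on the pull model mentioned earlier: Prometheus periodically scrapes the metrics that Micrometer collects and that Spring Boot exposes through Actuator.
- First, integrate Micrometer into the Java application so that requesting actuator/prometheus or /prometheus returns the metrics collected by Micrometer (this step was covered in detail in the previous part, so it is not repeated here)
- Go to the directory where Prometheus is deployed, open prometheus.yml, and add a scrape configuration for the Java application under the scrape_configs node
Modifying prometheus.yml to configure all monitored services
The prometheus instance started above has no monitoring targets configured apart from itself, so we need to modify prometheus.yml so that it scrapes the data sources we want to monitor. The changes are as follows: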
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['127.0.0.1:9090']
  - job_name: 'node_exporter'
    static_configs:
      - targets: ['127.0.0.1:9100']
        labels:
          instance: 'node_exporter'
  - job_name: 'redis_exporter'
    static_configs:
      - targets: ['127.0.0.1:9101']
        labels:
          instance: 'redis_exporter'
  - job_name: 'mysql_exporter'
    static_configs:
      - targets: ['127.0.0.1:9102']
        labels:
          instance: 'mysql_exporter'
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['127.0.0.1:9103']
        labels:
          instance: 'cadvisor'
  - job_name: 'server-demo-actuator'
    metrics_path: '/actuator/prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['127.0.0.1:8001']
        labels:
          instance: 'server-demo'
rule_files:
  - 'memory_over.yml'
  - 'server_down.yml'
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["127.0.0.1:9104"]
PS: each job's targets field is an array, so one job can collect the monitoring data exposed by exporters on several servers.
Next, create the two alert rule files mentioned above, memory_over.yml and server_down.yml
# Create memory_over.yml
vim /dimples/volumes/prometheus/memory_over.yml
With the following content:
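groups:
  - name: memory_over
    rules:
      - alert: NodeMemoryUsage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 20s
        labels:
          user: Dimples
        annotations:
          summary: "{{ $labels.instance }}: high memory usage detected"
          description: "{{ $labels.instance }}: memory usage has been above 80% for more than 20 s (current value: {{ $value }})."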
When a node's memory usage stays above 80% for more than 20 seconds, the alert is triggered.
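To make the 80% threshold concrete, here is the same arithmetic as the node_exporter-based memory expression used in this article (total minus free, buffers and cached, divided by total), worked through in plain Python. The byte counts are invented for the example.

```python
GIB = 1024 ** 3


def memory_usage_percent(total: int, free: int, buffers: int, cached: int) -> float:
    """Mirror (MemTotal - (MemFree + Buffers + Cached)) / MemTotal * 100."""
    return (total - (free + buffers + cached)) / total * 100


# Hypothetical host: 8 GiB total, 1 GiB free, 0.25 GiB buffers, 0.75 GiB cached.
usage = memory_usage_percent(8 * GIB, 1 * GIB, GIB // 4, 3 * GIB // 4)
print(usage)        # 75.0
print(usage > 80)   # False, so the rule would not fire for this host
```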
Next, create server_down.yml:
# server_down.yml
vim /dimples/volumes/prometheus/server_down.yml
With the following content:
groups:
  - name: server_down
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 20s
        labels:
          user: Dimples
        annotations:
          summary: "Instance {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 20 s."
When a node has been down (up == 0 means down, 1 means running normally) for more than 20 seconds, the alert is triggered.
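The for: 20s clause means the condition must hold continuously before the alert fires; until then the alert is only pending. The following simplified model (not Prometheus's actual code) walks through rule evaluations at the 15-second evaluation_interval configured above. The sample timeline is invented.

```python
def alert_states(samples, for_seconds):
    """Return 'inactive', 'pending' or 'firing' for each (timestamp, up) sample."""
    states = []
    active_since = None  # time of the first evaluation where up == 0 held
    for ts, up in samples:
        if up != 0:
            active_since = None          # condition broken: reset the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Evaluations every 15 s; the target goes down at t=15 and recovers at t=60.
timeline = [(0, 1), (15, 0), (30, 0), (45, 0), (60, 1)]
print(alert_states(timeline, for_seconds=20))
# ['inactive', 'pending', 'pending', 'firing', 'inactive']
```

With a 15 s evaluation interval, a 20 s hold time is only satisfied on the second evaluation after the failure is first seen, so in this model the alert fires 30 s after it becomes pending.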
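Using it in Grafana
Open http://127.0.0.1:3000 in a browser (the port mapped for the grafana service). The username and password are admin/admin, and you are required to change the password on first login.
Step 1: add the data source. This was covered in detail in the previous part, so it is not repeated here; the result is shown in the figure:
Once the data source has been added we can add dashboards. As before, we can pick templates that other people have already built from the official Grafana marketplace: https://grafana.com/grafana/dashboards
I have collected a few useful dashboard templates and uploaded them to Weiyun; simply download and import them (link: https://share.weiyun.com/XDzICKtf)
The following uses MySQL monitoring as an example to demonstrate importing a template:
After clicking Upload JSON file, select the corresponding file. Once it has loaded, the following screen appears automatically; then click Import
Additional notes
Richer alerting configuration for alertmanager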
groups:
  - name: example # rule group name
    rules:
      - alert: InstanceDown # alert name
        expr: up == 0 # PromQL expression that triggers the rule
        for: 1m # one minute
        labels: # labels define the alert severity and host
          name: instance
          severity: Critical
        annotations: # annotations
          summary: "{{ $labels.appname }}" # alert summary, taken from the appname label of the alert
          description: "Service has stopped running" # alert message
          value: "{{ $value }}%" # current value of the alert expression
  - name: Host
    rules:
      - alert: HostMemory Usage
        expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host memory usage is above 80%."
          value: "{{ $value }}"
      - alert: HostCPU Usage
        expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance,appname) > 0.65
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host CPU usage is above 65%."
          value: "{{ $value }}"
      - alert: HostLoad
        expr: node_load5 > 4
        for: 1m
        labels:
          name: Load
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host 5-minute load average is above 4."
          value: "{{ $value }}"
      - alert: HostFilesystem Usage
        expr: 1-(node_filesystem_free_bytes / node_filesystem_size_bytes) > 0.8
        for: 1m
        labels:
          name: Disk
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host partition [ {{ $labels.mountpoint }} ] usage is above 80%."
          value: "{{ $value }}%"
      - alert: HostDiskio
        expr: irate(node_disk_writes_completed_total{job=~"Host"}[1m]) > 10
        for: 1m
        labels:
          name: Diskio
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host disk [{{ $labels.device }}] has a high average write IO load over 1 minute."
          value: "{{ $value }}iops"
      - alert: Network_receive
        expr: irate(node_network_receive_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 3
        for: 1m
        labels:
          name: Network_receive
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host NIC [{{ $labels.device }}] average receive traffic over 5 minutes is above 3 MB/s."
          value: "{{ $value }}MB/s"
      - alert: Network_transmit
        expr: irate(node_network_transmit_bytes_total{device!~"lo|bond[0-9]|cbr[0-9]|veth.*|virbr.*|ovs-system"}[5m]) / 1048576 > 3
        for: 1m
        labels:
          name: Network_transmit
          severity: Warning
        annotations:
          summary: "{{ $labels.appname }}"
          description: "Host NIC [{{ $labels.device }}] average transmit traffic over 5 minutes is above 3 MB/s."
          value: "{{ $value }}MB/s"
  - name: Container
    rules:
      - alert: ContainerCPU Usage
        expr: (sum by(name,instance) (rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 60
        for: 1m
        labels:
          name: CPU
          severity: Warning
        annotations:
          summary: "{{ $labels.name }}"
          description: "Container CPU usage is above 60%."
          value: "{{ $value }}%"
      - alert: ContainerMem Usage
        # expr: (container_memory_usage_bytes - container_memory_cache) / container_spec_memory_limit_bytes * 100 > 10
        expr: container_memory_usage_bytes{name=~".+"} / 1048576 > 1024
        for: 1m
        labels:
          name: Memory
          severity: Warning
        annotations:
          summary: "{{ $labels.name }}"
          description: "Container memory usage is above 1GB."
          value: "{{ $value }}MB"
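The two network rules divide a bytes-per-second rate by 1048576 and compare the result with 3, so the threshold is 3 MiB/s rather than 3 Mbit/s. A small conversion sketch, with helper names made up for illustration, shows the difference:

```python
def bytes_per_sec_to_mib(rate: float) -> float:
    """Same scaling as the rule expression: bytes/s divided by 1048576."""
    return rate / 1048576


def bytes_per_sec_to_mbit(rate: float) -> float:
    """Convert bytes/s to megabits per second (1 Mbit = 1,000,000 bits)."""
    return rate * 8 / 1_000_000


threshold_bytes = 3 * 1048576  # the rate at which the expression equals exactly 3
print(bytes_per_sec_to_mib(threshold_bytes))             # 3.0
print(round(bytes_per_sec_to_mbit(threshold_bytes), 2))  # 25.17
```

If the intent is to alert at 3 Mbit/s instead, the expression would need a different scaling factor; which one is wanted is a choice the rules above leave open.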
Besides email, alerts can also be received through WeCom (Enterprise WeChat); see: https://songjiayang.gitbooks.io/prometheus/content/alertmanager/wechat.html