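Prometheus Monitoring: Techniques and Practice
Monitoring categories
- Logging records discrete events, such as an application's debug or error messages. Logs are the evidence we rely on when diagnosing problems. The ELK stack, for example, is built around logging.
- Metrics record aggregatable data. For example: (1) the current depth of a queue can be defined as a gauge that is updated whenever an element is enqueued or dequeued, and the number of HTTP requests can be defined as a counter that is incremented as each new request arrives; (2) the current CPU or memory usage can be sampled as a value. Prometheus focuses on the metrics domain.
- Tracing records information scoped to a single request, such as the execution path and latency of a remote method call. It is a powerful tool for tracking down performance problems. Commonly used tracers include SkyWalking, Pinpoint, and Zipkin.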
Introduction to Prometheus
Overview
Prometheus is an open-source monitoring and alerting system with a built-in time series database (TSDB), originally developed at SoundCloud. It is written in Go and was inspired by Google's Borgmon monitoring system.
Prometheus works by periodically scraping the state of monitored components over HTTP. Any component can be monitored as long as it exposes a suitable HTTP endpoint; no SDK or other integration work is required. A program that exposes a component's state over such an endpoint is called an exporter. Ready-made exporters exist for most commonly used components, such as Nginx, MySQL, Linux system information, MongoDB, and Elasticsearch.
Official site: https://prometheus.io/
Ecosystem
Prometheus finds its targets through static configuration or service discovery, calls each application's metrics endpoint to collect data, and stores the data on disk. For infrastructure such as databases and load balancers, an exporter can be installed alongside the service to provide a metrics endpoint for Prometheus to pull from.
The collected data goes to two places: alerting and visualization.
Prometheus stores time series data very efficiently. Each sample takes only about 3.5 bytes; with over a million time series, a 30-second interval, and 60 days of retention, storage came to a little over 200 GB (figures quoted from an official slide deck).
Internally, Prometheus has three main parts: Retrieval scrapes sample data from the exposed target pages on a schedule, Storage writes the samples to disk, and PromQL is the query language module.
Metrics
Format
<metric name>{<label name>=<label value>, ...}
- metric name: [a-zA-Z_:][a-zA-Z0-9_:]*
- label name: [a-zA-Z_][a-zA-Z0-9_]*
- label value: .* (any characters are allowed)
For example, Node Exporter exposes the following content at its metrics endpoint:
node_cpu_seconds_total{cpu="0",mode="idle"} 927490.95
node_cpu_seconds_total{cpu="0",mode="iowait"} 27.74
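The naming rules and the sample lines above can be checked with a small parser. This is a minimal sketch, not a full implementation of the exposition format: the class `ExpositionLine` and its fields are hypothetical names, and it assumes label values contain no escaped quotes, the line carries no timestamp, and the value is a plain decimal number.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of any Prometheus library): parses one sample line. */
final class ExpositionLine {
    // Naming rules for metric names and label names.
    static final Pattern METRIC_NAME = Pattern.compile("[a-zA-Z_:][a-zA-Z0-9_:]*");
    static final Pattern LABEL_NAME = Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*");
    // <name>, optional {<labels>}, whitespace, <value>. Simplified: no escapes, no timestamp.
    private static final Pattern LINE =
            Pattern.compile("^([^{\\s]+)(?:\\{(.*)\\})?\\s+(\\S+)$");
    private static final Pattern LABEL_PAIR =
            Pattern.compile("\\s*([^=,\\s]+)\\s*=\\s*\"([^\"]*)\"\\s*,?");

    final String name;
    final Map<String, String> labels;
    final double value;

    private ExpositionLine(String name, Map<String, String> labels, double value) {
        this.name = name;
        this.labels = labels;
        this.value = value;
    }

    /** Parses a line such as: node_cpu_seconds_total{cpu="0",mode="idle"} 927490.95 */
    static ExpositionLine parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not a sample line: " + line);
        }
        String name = m.group(1);
        if (!METRIC_NAME.matcher(name).matches()) {
            throw new IllegalArgumentException("invalid metric name: " + name);
        }
        Map<String, String> labels = new LinkedHashMap<>();
        if (m.group(2) != null) {
            Matcher pair = LABEL_PAIR.matcher(m.group(2));
            while (pair.find()) {
                if (!LABEL_NAME.matcher(pair.group(1)).matches()) {
                    throw new IllegalArgumentException("invalid label name: " + pair.group(1));
                }
                labels.put(pair.group(1), pair.group(2));
            }
        }
        return new ExpositionLine(name, labels, Double.parseDouble(m.group(3)));
    }
}
```

Parsing the first Node Exporter line above yields the name `node_cpu_seconds_total`, the labels `cpu` and `mode`, and the value as a double.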
Samples
Each point in a time series is called a sample. A sample consists of three parts:
- Metric: the metric name plus the label set that describes the sample's characteristics;
- Timestamp: a timestamp with millisecond precision;
- Value: a float64 value representing the sample's value.
Metric types
Prometheus defines 4 metric types: Counter, Gauge, Histogram, and Summary.
The sample data returned by Node Exporter also carries each metric's type in its comments. For example:
# HELP node_cpu_seconds_total Seconds the cpus spent in each mode.
# TYPE node_cpu_seconds_total counter
node_cpu_seconds_total{cpu="0",mode="idle"} 927490.95
node_cpu_seconds_total{cpu="0",mode="iowait"} 27.74
- Counter
A counter holds a value that only ever increases (apart from resetting to zero when the process restarts), so it must not be used for anything that can go down. Typical uses are cumulative totals such as the number of HTTP requests, the number of errors, total processing time, or the number of API calls.
For example:
Scrape 1: http_response_total{method="GET",endpoint="/api/tracks"} 100
Scrape 2: http_response_total{method="GET",endpoint="/api/tracks"} 150
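Because a counter only accumulates, what is usually of interest is how much it grew between two scrapes. The following is a minimal sketch of that calculation, assuming a hypothetical helper class `CounterMath` and a 15-second scrape interval; the real PromQL `rate()` and `increase()` functions also extrapolate to the edges of the range window, which is left out here.

```java
/** Hypothetical helper: growth of one counter series between two successive scrapes. */
final class CounterMath {
    /** Increase from the previous scraped value to the current one. */
    static double increase(double previous, double current) {
        // A counter never decreases on its own. A drop means the process restarted
        // and the counter began again from zero, so the whole current value is new.
        return current >= previous ? current - previous : current;
    }

    /** Average per-second rate over the interval between the two scrapes. */
    static double ratePerSecond(double previous, double current, double intervalSeconds) {
        return increase(previous, current) / intervalSeconds;
    }
}
```

For the two scrapes above, `CounterMath.increase(100, 150)` gives 50 requests, which at a 15-second interval is roughly 3.3 requests per second.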
- Gauge
A gauge is a single numeric value that can go up and down arbitrarily. It is used for metrics such as CPU usage, memory usage, disk usage, and temperature, and also for counts that rise and fall, such as the number of concurrent requests.
For example:
Scrape 1: memory_usage_bytes{host="master-01"} 100
Scrape 2: memory_usage_bytes{host="master-01"} 30
Scrape 3: memory_usage_bytes{host="master-01"} 50
Scrape 4: memory_usage_bytes{host="master-01"} 80
- Histogram
A histogram samples observations, such as request durations or response sizes, counts them in configurable buckets to show how the values are distributed, and also provides the sum of all observed values. A histogram consists of _bucket{le="<upper bound>"}, _bucket{le="+Inf"}, _sum, and _count series.
For example:
apiserver_request_latencies_sum
apiserver_request_latencies_count
apiserver_request_latencies_bucket
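Since each bucket stores a cumulative count of observations at or below its `le` bound, a quantile can only be estimated, typically by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates the idea behind PromQL's `histogram_quantile()`; the class `HistogramQuantile` and the sample bucket values in the usage note are assumptions for illustration, and the first bucket is assumed to start at 0.

```java
/** Hypothetical helper: estimates a quantile from cumulative histogram buckets. */
final class HistogramQuantile {
    /**
     * @param q                quantile in [0, 1]
     * @param upperBounds      ascending "le" bounds; the last one must be +Inf
     * @param cumulativeCounts number of observations <= the matching bound
     */
    static double estimate(double q, double[] upperBounds, double[] cumulativeCounts) {
        int n = upperBounds.length;
        if (n < 2 || n != cumulativeCounts.length || q < 0 || q > 1) {
            throw new IllegalArgumentException("need >= 2 buckets and q in [0, 1]");
        }
        double total = cumulativeCounts[n - 1];
        if (total == 0) {
            return Double.NaN; // no observations at all
        }
        double rank = q * total;
        // Find the first bucket whose cumulative count reaches the rank.
        int i = 0;
        while (i < n - 1 && cumulativeCounts[i] < rank) {
            i++;
        }
        if (i == n - 1) {
            // The rank falls in the +Inf bucket: report the highest finite bound.
            return upperBounds[n - 2];
        }
        double lower = (i == 0) ? 0.0 : upperBounds[i - 1];
        double below = (i == 0) ? 0.0 : cumulativeCounts[i - 1];
        double inBucket = cumulativeCounts[i] - below;
        if (inBucket == 0) {
            return lower;
        }
        // Assume observations are spread evenly inside the bucket.
        return lower + (upperBounds[i] - lower) * (rank - below) / inBucket;
    }
}
```

With bounds 0.1, 0.5, 1.0, +Inf and cumulative counts 10, 60, 90, 100, the median lands in the (0.1, 0.5] bucket and is estimated at about 0.42. The precision of the estimate depends entirely on how the bucket bounds were chosen.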
- Summary
Similar to a histogram, a summary samples observations over a period of time (usually things like request durations or response sizes). It also provides the sum and count of the observations and calculates quantiles over a sliding time window. A summary consists of {quantile="<φ>"}, _sum, and _count series.
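PromQL
- Operators
Multiply: *
Divide: /
Add: +
Subtract: -
- Functions
sum(): sums the values of all matching series
irate(): computes the per-second instant rate from the last two samples in the range (use rate() for the average rate over the whole range)
by (label names): groups the aggregation result, similar to a GROUP BY clause in a relational database
min: minimum value
count: number of elements
avg: average value
- Range selector
The last 5 minutes:
[5m]
- Examples
1. CPU usage (%): 100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance) *100)
2. Memory usage (%): 100-(node_memory_MemFree_bytes+node_memory_Buffers_bytes+node_memory_Cached_bytes)/node_memory_MemTotal_bytes*100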
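To make the CPU usage query concrete, the arithmetic it performs can be written out by hand. This is only a sketch of the calculation, assuming a hypothetical class `CpuUsage` and made-up sample values; it ignores counter resets, which the real `irate()` handles.

```java
/** Hypothetical helper: reproduces 100 - avg(irate(idle)) * 100 for one instance. */
final class CpuUsage {
    /** Per-second rate between the last two samples of a counter. */
    static double irate(double prevValue, double prevTs, double lastValue, double lastTs) {
        return (lastValue - prevValue) / (lastTs - prevTs);
    }

    /**
     * @param idleSamplesPerCpu one row per CPU core:
     *                          {previous idle seconds, previous timestamp,
     *                           last idle seconds, last timestamp}
     * @return CPU usage of the instance in percent
     */
    static double usagePercent(double[][] idleSamplesPerCpu) {
        double sum = 0;
        for (double[] s : idleSamplesPerCpu) {
            // Fraction of each wall-clock second this core spent idle.
            sum += irate(s[0], s[1], s[2], s[3]);
        }
        double avgIdle = sum / idleSamplesPerCpu.length; // avg(...) by (instance)
        return 100 - avgIdle * 100;
    }
}
```

If one core was idle for 12 of the last 15 seconds and another for 9 of them, the average idle fraction is 0.7 and the reported usage is about 30%.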
Download and install
- Official binary install
# macOS download page: https://prometheus.io/download/
# Extract
tar -zxvf prometheus-2.26.0.darwin-amd64.tar.gz
# Start
./prometheus --config.file=/usr/local/prometheus/prometheus.yml
- Quick install with Docker
docker run -d --restart=always \
  -p 9090:9090 \
  -v ~/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
Configuration file: prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'erp-monitord'
    metrics_path: '/prometheus'
    # scheme defaults to 'http'.
    static_configs:
      - targets: ['localhost:50258']
- Open http://localhost:9090 in a browser
Target
Graph
Spring Boot integration
Spring Boot 1.5 integration
- Maven dependencies
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
    <version>1.6.1</version>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-spring-legacy</artifactId>
    <version>1.3.15</version>
</dependency>
- Configure the endpoint path
#endpoints.prometheus.path=${spring.application.name}/prometheus
management.metrics.tags.application=${spring.application.name}
- Start the Spring Boot service and view the metrics at http://xxxx:port/prometheus
Spring Boot 2.x integration
- Maven dependencies
<dependency>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-core</artifactId>
</dependency>
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
Default endpoint: http://localhost:8990/actuator/prometheus
- Configuration
# Prometheus monitoring configuration for Spring Boot
management:
  endpoints:
    web:
      exposure:
        include: 'prometheus' # expose /actuator/prometheus
  metrics:
    tags:
      application: ${spring.application.name} # add an application label to the exposed data
Custom business metrics example
- Code configuration
import com.slightech.marvin.api.visitor.app.metrics.VisitorMetrics;
import com.slightech.marvin.api.visitor.app.service.VisitorStatisticsService;
import io.micrometer.core.instrument.MeterRegistry;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.boot.actuate.autoconfigure.metrics.MeterRegistryCustomizer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

/**
 * @author wanglu
 */
@Configuration
public class MeterConfig {
    @Value("${spring.application.name}")
    private String applicationName;

    @Autowired
    private VisitorStatisticsService visitorStatisticsService;

    // Adds an "application" tag to every metric in the registry.
    @Bean
    public MeterRegistryCustomizer<MeterRegistry> configurer() {
        return (registry) -> registry.config().commonTags("application", applicationName);
    }

    // Registers the custom business metrics as a MeterBinder bean.
    @Bean
    public VisitorMetrics visitorMetrics() {
        return new VisitorMetrics(visitorStatisticsService);
    }
}
- Custom metrics class implementing the MeterBinder interface
import com.slightech.marvin.api.visitor.app.service.VisitorStatisticsService;
import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.binder.MeterBinder;

/**
 * @author wanglu
 * @date 2019/11/08
 */
public class VisitorMetrics implements MeterBinder {
    private static final String METRICS_NAME_SPACE = "visitor";

    private VisitorStatisticsService visitorStatisticsService;

    public VisitorMetrics(VisitorStatisticsService visitorStatisticsService) {
        this.visitorStatisticsService = visitorStatisticsService;
    }

    @Override
    public void bindTo(MeterRegistry meterRegistry) {
        // Number of times apartments fetched a QR code today
        // visitor_today_apartment_get_qr_code_count{application="xxx", option="apartment get qr code"}
        Gauge.builder(METRICS_NAME_SPACE + "_today_apartment_get_qr_code_count", visitorStatisticsService, VisitorStatisticsService::getTodayApartmentGetQrCodeCount)
                .description("today apartment get qr code count")
                .tag("option", "apartment get qr code")
                .register(meterRegistry);
        // Number of times employees fetched a QR code today
        // visitor_today_employee_get_qr_code_count{application="xxx", option="employee get qr code"}
        Gauge.builder(METRICS_NAME_SPACE + "_today_employee_get_qr_code_count", visitorStatisticsService, VisitorStatisticsService::getTodayEmployeeGetQrCodeCount)
                .description("today employee get qr code count")
                .tag("option", "employee get qr code")
                .register(meterRegistry);
    }
}
- Instrumentation service that does the counting
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;

import java.util.Calendar;
import java.util.Date;
import java.util.concurrent.TimeUnit;

/**
 * Usage statistics
 *
 * @author laijianzhen
 * @date 2019/11/08
 */
@Service
public class VisitorStatisticsService {
    /**
     * Number of times apartments fetched a QR code today
     */
    private static final String COUNT_TODAY_APARTMENT_GET_QR_CODE_REDIS_KEY = "Visitor:Statistics:CountTodayApartmentGetQrCode";
    /**
     * Number of times employees fetched a QR code today
     */
    private static final String COUNT_TODAY_EMPLOYEE_GET_QR_CODE_REDIS_KEY = "Visitor:Statistics:CountTodayEmployeeGetQrCode";

    @Autowired
    private RedisTemplate<String, Object> redisTemplate;

    public int getTodayApartmentGetQrCodeCount() {
        return getCountFromRedis(COUNT_TODAY_APARTMENT_GET_QR_CODE_REDIS_KEY);
    }

    public int getTodayEmployeeGetQrCodeCount() {
        return getCountFromRedis(COUNT_TODAY_EMPLOYEE_GET_QR_CODE_REDIS_KEY);
    }

    @Async
    public void countTodayApartmentGetQrCode() {
        increaseCount(COUNT_TODAY_APARTMENT_GET_QR_CODE_REDIS_KEY);
    }

    @Async
    public void countTodayEmployeeGetQrCode() {
        increaseCount(COUNT_TODAY_EMPLOYEE_GET_QR_CODE_REDIS_KEY);
    }

    private int getCountFromRedis(String key) {
        Object object = redisTemplate.opsForValue().get(key);
        if (object == null) {
            return 0;
        }
        return Integer.parseInt(String.valueOf(object));
    }

    private void increaseCount(String redisKey) {
        // Note: this get-then-set sequence is not atomic, so two concurrent
        // first calls of the day may both take the "set to 1" branch.
        Object object = redisTemplate.opsForValue().get(redisKey);
        if (object == null) {
            // First hit today: create the key and let it expire at midnight.
            redisTemplate.opsForValue().set(redisKey, String.valueOf(1), getTodayLeftSeconds(), TimeUnit.SECONDS);
            return;
        }
        redisTemplate.opsForValue().increment(redisKey, 1);
    }

    // Seconds remaining until the next local midnight.
    private long getTodayLeftSeconds() {
        Date nowDate = new Date();
        Calendar midnight = Calendar.getInstance();
        midnight.setTime(nowDate);
        midnight.add(Calendar.DAY_OF_MONTH, 1);
        midnight.set(Calendar.HOUR_OF_DAY, 0);
        midnight.set(Calendar.MINUTE, 0);
        midnight.set(Calendar.SECOND, 0);
        midnight.set(Calendar.MILLISECOND, 0);
        return (midnight.getTime().getTime() - nowDate.getTime()) / 1000;
    }
}
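The daily counters above expire at midnight, which is what makes them "today" counts. As an alternative sketch of the same time-to-live calculation, the `java.time` API can replace `Calendar`; the class name `TodayTtl` is hypothetical, and passing the current time in as a parameter is an assumption made here so the method can be tested with fixed inputs.

```java
import java.time.Duration;
import java.time.ZonedDateTime;

/** Hypothetical helper: time-to-live for a key that should expire at the next midnight. */
final class TodayTtl {
    /** Whole seconds from {@code now} until the start of the next day in the same zone. */
    static long secondsUntilMidnight(ZonedDateTime now) {
        ZonedDateTime nextMidnight =
                now.toLocalDate().plusDays(1).atStartOfDay(now.getZone());
        return Duration.between(now, nextMidnight).getSeconds();
    }
}
```

In the service this would be called as `TodayTtl.secondsUntilMidnight(ZonedDateTime.now())`. Using the zone of `now` keeps the result correct on days when a daylight saving change makes the day shorter or longer than 24 hours.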
Grafana integration
Download and install
https://grafana.com/grafana/download?pg=get&plcmt=selfmanaged-box1-cta1&platform=mac
# Download and extract
curl -O https://dl.grafana.com/oss/release/grafana-7.5.4.darwin-amd64.tar.gz
tar -zxvf grafana-7.5.4.darwin-amd64.tar.gz
Start
# Start
./bin/grafana-server web
# Open in a browser
http://localhost:3000
Default username and password: admin / admin
Data source configuration
Monitoring dashboards
- Import a dashboard that someone else has already built. For example, Spring Boot uses the Micrometer library, and the Grafana website hosts a ready-made JVM monitoring dashboard for it.
Find the dashboard ID and import it directly. Dashboard directory: https://grafana.com/grafana/dashboards
Search
View details
Import
JVM monitoring dashboard:
- Build your own
Alerting
Alerting with Alertmanager
Installation
- Binary install
# macOS download page: https://prometheus.io/download/
# Extract (substitute the Alertmanager version you downloaded)
tar -zxvf alertmanager-<version>.darwin-amd64.tar.gz
# Start; the default configuration file is alertmanager.yml
./alertmanager --config.file=alertmanager.yml
- Docker install
docker run --name alertmanager -d -p 127.0.0.1:9093:9093 quay.io/prometheus/alertmanager
Configuration
- alertmanager.yml configuration
# Global settings
global:
  resolve_timeout: 5m # resolve timeout; defaults to 5m
  smtp_smarthost: 'smtp.sina.com:25' # SMTP server used to send email
  smtp_from: '******@sina.com' # sender address
  smtp_auth_username: '******@sina.com' # SMTP account name
  smtp_auth_password: '******' # SMTP password or authorization code
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeCom (WeChat Work) API URL
# Notification templates
templates:
  - 'template/*.tmpl'
# Routing tree
route:
  group_by: ['alertname'] # labels used to group alerts
  group_wait: 10s # how long to wait before sending the first notification for a new group
  group_interval: 10s # how long to wait before sending a notification about new alerts in a group
  repeat_interval: 1m # how often to resend a notification; for email, do not set this too low, or the SMTP server may reject the flood of messages
  receiver: 'email' # name of the receiver; must match a name under receivers below
# Receivers
receivers:
  - name: 'email' # receiver name
    email_configs: # email settings
      - to: '******@163.com' # recipient address
        html: '{{ template "test.html" . }}' # template for the email body
        headers: { Subject: "[WARN] Alert email" } # email subject
    webhook_configs: # webhook settings
      - url: 'http://127.0.0.1:5001'
        send_resolved: true
    wechat_configs: # WeCom alert settings
      - send_resolved: true
        to_party: '1' # ID of the receiving group
        agent_id: '1000002' # WeCom --> custom app --> AgentId
        corp_id: '******' # company info (My Company --> CorpId, at the bottom)
        api_secret: '******' # WeCom --> custom app --> Secret
        message: '{{ template "test_wechat.html" . }}' # message template
# An inhibition rule mutes alerts matching one set of matchers (the target) while an alert
# matching another set of matchers (the source) exists. Both alerts must have the same
# values for the labels listed in equal.
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
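The inhibition rule at the end of this configuration is the least obvious part, so here is a minimal sketch of the decision it describes. The class `Inhibition` and its method names are hypothetical, alerts are modeled as plain label maps, and a label that is missing is treated the same as an empty value; this illustrates the rule's logic and is not Alertmanager's actual code.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical helper: decides whether a target alert is muted by an inhibit rule. */
final class Inhibition {
    /** True if the alert carries every label value required by the matchers. */
    static boolean matches(Map<String, String> alert, Map<String, String> matchers) {
        for (Map.Entry<String, String> m : matchers.entrySet()) {
            if (!m.getValue().equals(alert.getOrDefault(m.getKey(), ""))) {
                return false;
            }
        }
        return true;
    }

    /**
     * @param target      the alert that may be muted
     * @param firing      all currently firing alerts
     * @param sourceMatch labels of the inhibiting alert (source_match)
     * @param targetMatch labels of the alert to mute (target_match)
     * @param equal       labels that must have equal values in both alerts
     */
    static boolean isInhibited(Map<String, String> target,
                               List<Map<String, String>> firing,
                               Map<String, String> sourceMatch,
                               Map<String, String> targetMatch,
                               List<String> equal) {
        if (!matches(target, targetMatch)) {
            return false; // the rule does not apply to this alert
        }
        for (Map<String, String> source : firing) {
            if (source == target || !matches(source, sourceMatch)) {
                continue;
            }
            boolean sameLabels = true;
            for (String label : equal) {
                // Assumption: a missing label counts as an empty value.
                if (!source.getOrDefault(label, "").equals(target.getOrDefault(label, ""))) {
                    sameLabels = false;
                    break;
                }
            }
            if (sameLabels) {
                return true; // a matching source alert exists for the same labels
            }
        }
        return false;
    }
}
```

With the rule above, a `warning` alert for an instance is muted while a `critical` alert with the same `alertname` and `instance` is firing, whereas a `warning` alert for a different instance is still delivered.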
Enable Alertmanager alerting in prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ["localhost:9093"]
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "first_rules.yml"
  - "second_rules.yml"
- Rule definition
groups:
  - name: <string>
    rules:
      - alert: <string>
        expr: <string>
        for: [ <duration> | default 0 ]
        labels:
          [ <label_name>: <label_value> ]
        annotations:
          [ <label_name>: <tmpl_string> ]
Parameter | Description
- name: <string> | Name of the alert rule group
- alert: <string> | Name of the alert rule
expr: <string> | The alert condition, written as a PromQL expression; it is evaluated to determine whether any series satisfies the condition
<label_name>: <label_value> | Custom labels attached to the alert, for example a severity such as high or warning
annotations: <label_name>: <tmpl_string> | A set of descriptive fields for the alert, which can include custom label values and the value computed by expr
- Example
groups:
  - name: operations
    rules:
      - alert: node-down
        expr: up{env="operations"} != 1
        for: 5m
        labels:
          status: High
          team: operations
        annotations:
          description: "Environment: {{ $labels.env }} Instance: {{ $labels.instance }} is Down ! ! !"
          value: '{{ $value }}'
          summary: "The host node has been down for more than 5 minutes"
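The `for` field in this rule means the expression must stay true for the whole duration before the alert fires; until then the alert is only pending. The sketch below models that behavior with a hypothetical class `AlertState`, evaluation times given in seconds, and a single series; it is meant to illustrate the state transitions, not to mirror the rule evaluator in Prometheus.

```java
/** Hypothetical model of one alert's lifecycle under a `for` duration. */
final class AlertState {
    enum State { INACTIVE, PENDING, FIRING }

    private final long forSeconds;
    private long activeSince = -1; // -1 means the expression is currently false

    AlertState(long forSeconds) {
        this.forSeconds = forSeconds;
    }

    /**
     * Called once per rule evaluation.
     *
     * @param exprTrue   whether the alert expression returned a result
     * @param nowSeconds evaluation time in seconds
     */
    State evaluate(boolean exprTrue, long nowSeconds) {
        if (!exprTrue) {
            activeSince = -1; // any false evaluation resets the timer
            return State.INACTIVE;
        }
        if (activeSince < 0) {
            activeSince = nowSeconds; // first evaluation at which the condition held
        }
        return nowSeconds - activeSince >= forSeconds ? State.FIRING : State.PENDING;
    }
}
```

With `for: 5m` (300 seconds), a node that goes down at time 0 is pending until the evaluation at 300 seconds, when the alert fires. If the node briefly recovers in between, the timer starts over.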
View the alert configuration
- Alert display
Alerting with Grafana
- Create a notification channel
- Create a rule
- Example alert email