AlertManager
Architecture
- Alertmanager is a standalone component that receives and processes alerts from Prometheus Server (or from other client programs). It can process these alerts further: it removes duplicates when a large number of repeated alerts arrive, groups alerts, and routes them to the correct receiver. Alertmanager has built-in support for email, Slack, and other notification channels, and it integrates with webhooks for more customized scenarios. It also provides silencing and inhibition mechanisms to tune notification behavior.
  - Clients push alerts to AlertManager with POST requests.
  - The labels of each alert uniquely identify it and are used for deduplication.
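- AlertManager has two main parts: the router and the receiver. An alert first passes through the routing tree and is then assigned to the matching receiver. The routing tree is built from predefined routing rules. The high-availability architecture is shown in the figure above; the flow is:
  - Prometheus sends the raw alerts to AlertManager by calling the alert API that AlertManager exposes.
  - Besides alerts, the AlertManager API also accepts silence requests; alerts and silences are each saved in their own provider.
  - Each provider exposes a subscribe interface, so the Dispatcher component can fetch the alert data, group it, and pass it through the inhibition or silencing stage according to the rules the user has configured.
  - An alert that passes the silencing stage enters the routing and dispatch stage, and a notification is finally sent.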
Alert payload format
[
  {
    "labels": {
      "alertname": "<requiredAlertName>",
      "<labelname>": "<labelvalue>",
      ...
    },
    "annotations": {
      "<labelname>": "<labelvalue>"
    },
    "startsAt": "<rfc3339>",
    "endsAt": "<rfc3339>",
    "generatorURL": "<generator_url>"
  },
  ...
]
- Where:
  - An alert's Fingerprint is computed from its labels.
  - startsAt and endsAt are used when the provider merges alerts.
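To make the link between labels and the fingerprint concrete, here is a minimal sketch that hashes a label set in sorted order. It is an illustration only: the function name `fingerprint`, the choice of FNV-1a, and the `0xff` separator are assumptions of this sketch, not a copy of the library code Alertmanager uses.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// fingerprint hashes a label set in sorted-name order, so two alerts with
// the same labels always map to the same value regardless of map order.
func fingerprint(labels map[string]string) uint64 {
	names := make([]string, 0, len(labels))
	for name := range labels {
		names = append(names, name)
	}
	sort.Strings(names)

	h := fnv.New64a()
	for _, name := range names {
		// The 0xff byte never occurs in valid UTF-8, so it separates
		// names and values unambiguously.
		h.Write([]byte(name))
		h.Write([]byte{0xff})
		h.Write([]byte(labels[name]))
		h.Write([]byte{0xff})
	}
	return h.Sum64()
}

func main() {
	a := map[string]string{"alertname": "httpd_down", "severity": "critical"}
	b := map[string]string{"severity": "critical", "alertname": "httpd_down"}
	// Both maps hold the same labels, so the fingerprints are equal.
	fmt.Println(fingerprint(a) == fingerprint(b))
}
```

Because the fingerprint depends only on labels, an alert that is re-sent with new annotations or timestamps still resolves to the same entry, which is what makes deduplication possible.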
Alert Provider
- Prometheus or another alert-sending system delivers alerts to Alertmanager through the API. Received alerts are stored in the Alert Provider (the current implementation keeps them in memory; other storage backends such as MySQL or ES can be implemented against the interface).
- Resolved alerts are periodically removed from the provider. Users can register a callback that is invoked when alerts are deleted.
- After receiving alerts, the Alert Provider merges alerts that carry the same metric data.
// Alerts gives access to a set of alerts. All methods are goroutine-safe.
type Alerts interface {
	// Subscribe returns an iterator over active alerts that have not been
	// resolved and successfully notified about.
	// They are not guaranteed to be in chronological order.
	Subscribe() AlertIterator
	// GetPending returns an iterator over all alerts that have
	// pending notifications.
	GetPending() AlertIterator
	// Get returns the alert for a given fingerprint.
	Get(model.Fingerprint) (*types.Alert, error)
	// Put adds the given alert to the set.
	Put(...*types.Alert) error
}
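The storing, merging, and cleanup behavior described for the provider can be sketched as a small in-memory store. This is a simplified sketch under stated assumptions: the `alert` struct, `memStore`, `put`, and `gc` are hypothetical names, timestamps are plain Unix seconds, and the merge rule (earliest start, latest end for overlapping ranges) is an illustration rather than the exact upstream logic.

```go
package main

import "fmt"

// alert is a hypothetical, trimmed-down alert record.
type alert struct {
	Fingerprint uint64
	StartsAt    int64 // Unix seconds
	EndsAt      int64 // Unix seconds
}

// memStore keeps one alert per fingerprint, like an in-memory provider.
type memStore struct {
	alerts map[uint64]alert
}

// put stores a, merging it with an existing alert of the same fingerprint
// when their time ranges overlap: earliest start and latest end win.
func (s *memStore) put(a alert) {
	old, ok := s.alerts[a.Fingerprint]
	if ok && a.StartsAt <= old.EndsAt && a.EndsAt >= old.StartsAt {
		if old.StartsAt < a.StartsAt {
			a.StartsAt = old.StartsAt
		}
		if old.EndsAt > a.EndsAt {
			a.EndsAt = old.EndsAt
		}
	}
	s.alerts[a.Fingerprint] = a
}

// gc removes alerts that ended before now and passes each to onDelete.
func (s *memStore) gc(now int64, onDelete func(alert)) {
	for fp, a := range s.alerts {
		if a.EndsAt < now {
			delete(s.alerts, fp)
			onDelete(a)
		}
	}
}

func main() {
	s := &memStore{alerts: map[uint64]alert{}}
	s.put(alert{Fingerprint: 1, StartsAt: 100, EndsAt: 400})
	s.put(alert{Fingerprint: 1, StartsAt: 200, EndsAt: 700}) // overlaps, so it is merged
	fmt.Printf("%+v\n", s.alerts[1])
	s.gc(1000, func(a alert) { fmt.Println("removed", a.Fingerprint) })
}
```

A database-backed provider would keep the same two operations and only change where the map lives.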
Silence Provider
Dispatcher
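- Dispatcher sorts incoming alerts into aggregation groups and assigns the correct notifiers to each.
- Inside AlertManager, the Dispatcher learns about alert updates by subscription: it obtains an iterator over the Alerts, reads the Alerts delivered on the channel in a for loop, and uses the route's match function to find the matching route objects (for example, label-based regular expressions that route to different email or Slack receivers). At a fixed interval it also runs a cleanup that deletes aggregate groups that no longer contain any alerts. Each received alert is placed into an aggregation group by label matching and waits there for the Pipeline to process it.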
Aggregate group
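- aggrGroup aggregates alert fingerprints into groups to which a common set of routing options applies. It emits notifications in the specified intervals.
- An aggregate group manages alerts that share the same attributes. Grouping alerts of the same kind lets them be handled together, because alerts often arrive in bursts (for example, the failure of one data center can produce hundreds or thousands of alerts; grouping collapses alerts with the same labels into a single email or receiver). Group creation depends on the matched route and on the alert's labels, so different label values produce different aggregation groups. For every received alert the Fingerprint of its aggregation group is computed first. If a group with that fingerprint exists, the alert is inserted into it; otherwise a new group is created. Every newly created group starts a goroutine that performs the actual pipeline work.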
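The lookup-or-create step for aggregation groups can be sketched as follows. This is a minimal sketch, assuming groups are keyed by the values of the group-by labels; `labelSet`, `groupKey`, `dispatcher`, and `process` are hypothetical names, and the real dispatcher additionally keys groups by route and starts a goroutine per group.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

type labelSet map[string]string

// groupKey builds a key from the group-by labels only, so alerts that
// agree on those labels land in the same aggregation group.
func groupKey(alert labelSet, groupBy []string) string {
	names := append([]string(nil), groupBy...)
	sort.Strings(names)
	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", n, alert[n]))
	}
	return strings.Join(parts, ",")
}

// dispatcher maps a group key to the alerts collected for that group.
type dispatcher struct {
	groupBy []string
	groups  map[string][]labelSet
}

// process inserts the alert into its group and reports whether the group
// was newly created (where a real dispatcher would start a goroutine).
func (d *dispatcher) process(a labelSet) (created bool) {
	key := groupKey(a, d.groupBy)
	_, exists := d.groups[key]
	d.groups[key] = append(d.groups[key], a)
	return !exists
}

func main() {
	d := &dispatcher{groupBy: []string{"alertname"}, groups: map[string][]labelSet{}}
	for _, a := range []labelSet{
		{"alertname": "httpd_down", "instance": "a"},
		{"alertname": "httpd_down", "instance": "b"},
		{"alertname": "nginx_down", "instance": "c"},
	} {
		fmt.Println(groupKey(a, d.groupBy), "new group:", d.process(a))
	}
}
```

Labels that are not part of the group-by list, such as `instance` here, do not affect the key, which is how hundreds of per-instance alerts end up in one notification.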
Inhibitor
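- An Inhibitor determines whether a given label set is muted based on the currently active alerts and a set of inhibition rules. It implements the Muter interface.
- The Inhibitor manages alerts that are related to each other. For example, a rule can state that when the alertname is the same and a critical alert exists, alerts of ordinary severity are filtered out.
- During lookup the labels of the alert are checked. An alert that matches the target matchers but not the source matchers is marked as Inhibited, provided a matching source alert is currently active.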
Silencer
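- The Silencer suppresses alerts. For example, an alert can be configured to trigger no notifications during a given time window. Matching can be based on regular expressions, and silences can be configured through the Alertmanager web UI or the API.
- When processing reaches the Silence step, the Silence module loops over every alert and checks whether it matches, for example silencing an alert once a certain label appears. When the query finishes it returns sils (Silence structs, each describing which class of alerts is silenced for a period of time). One alert may be covered by several silences at the same time.
- To support clustering, Silence state must also be shared between instances (alerts are sent to multiple AMs). The design therefore adds a SilenceProvider that manages silences across the cluster, and the instances synchronize state with each other using protobuf. After receiving an alert, the cluster members also notify the other nodes of its processing state, so that the same notification is not sent several times.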
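The per-alert silence check can be sketched like this. It is a minimal sketch under stated assumptions: `matcher`, `silence`, `newMatcher`, and `silencedBy` are hypothetical names, times are Unix seconds, and every matcher is treated as an anchored regular expression.

```go
package main

import (
	"fmt"
	"regexp"
)

// matcher tests one label against an anchored regular expression.
type matcher struct {
	name string
	re   *regexp.Regexp
}

// silence is a hypothetical record: matchers plus an active time window.
type silence struct {
	id       string
	matchers []matcher
	startsAt int64 // Unix seconds, inclusive
	endsAt   int64 // Unix seconds, exclusive
}

// newMatcher anchors the pattern so it must match the whole label value.
func newMatcher(name, pattern string) matcher {
	return matcher{name: name, re: regexp.MustCompile("^(?:" + pattern + ")$")}
}

// silencedBy returns the IDs of all silences that are active at now and
// whose matchers all match the alert's labels.
func silencedBy(labels map[string]string, sils []silence, now int64) []string {
	var ids []string
	for _, s := range sils {
		if now < s.startsAt || now >= s.endsAt {
			continue
		}
		all := true
		for _, m := range s.matchers {
			if !m.re.MatchString(labels[m.name]) {
				all = false
				break
			}
		}
		if all {
			ids = append(ids, s.id)
		}
	}
	return ids
}

func main() {
	sils := []silence{
		{id: "s1", matchers: []matcher{newMatcher("instance", `10\.255\..*`)}, startsAt: 100, endsAt: 200},
		{id: "s2", matchers: []matcher{newMatcher("job", "default")}, startsAt: 100, endsAt: 200},
	}
	labels := map[string]string{"instance": "10.255.101.74:8051", "job": "default"}
	fmt.Println(silencedBy(labels, sils, 150)) // one alert, several silences
}
```

Returning a list of IDs rather than a boolean reflects the point above that one alert may be covered by several silences at once.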
Notify Provider
Router
Receiver Stage
Wait
- The wait interval sets how long to wait before a notification is sent. In a cluster, each peer needs a different timeout; with only a single server, the wait interval is 0.
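One way to give each peer a different wait is to scale a base timeout by the peer's position in the cluster. The sketch below assumes exactly that; `waitFor` and the 15-second base value are illustrative choices, not configuration taken from this document.

```go
package main

import (
	"fmt"
	"time"
)

// waitFor returns how long the peer at the given position holds back
// before sending. Position 0 (or a single server) sends immediately.
func waitFor(position int, peerTimeout time.Duration) time.Duration {
	return time.Duration(position) * peerTimeout
}

func main() {
	for pos := 0; pos < 3; pos++ {
		fmt.Println("peer", pos, "waits", waitFor(pos, 15*time.Second))
	}
}
```

The staggered wait gives earlier peers time to send and gossip their notification log before later peers reach the Dedup stage.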
Dedup
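- The Dedup Stage deduplicates notifications. One of its parameters is a NotificationLog, which stores the record of notifications that have been sent. When several machines form a cluster, their NotificationLogs communicate over a protocol and exchange their records. If node A in the cluster has sent a notification, the record is passed to node B and merged there, so when B is about to send and finds that the notification was already sent, it does not send it again.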
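The merge-then-check behavior of the notification log can be sketched as below. This is a deliberately reduced model: `nflog`, `merge`, and `needsSend` are hypothetical names, the log only stores the time of the last send per key, and the real stage also considers which alerts in the group have changed.

```go
package main

import "fmt"

// nflog maps "receiver/groupKey" to the Unix time of the last successful send.
type nflog map[string]int64

// merge folds a peer's log into ours, keeping the newest entry per key.
func (l nflog) merge(peer nflog) {
	for key, ts := range peer {
		if ts > l[key] {
			l[key] = ts
		}
	}
}

// needsSend reports whether key was not notified within repeatInterval seconds.
func (l nflog) needsSend(key string, now, repeatInterval int64) bool {
	last, ok := l[key]
	return !ok || now-last >= repeatInterval
}

func main() {
	a, b := nflog{}, nflog{}
	key := "email-1/httpd_down"

	a[key] = 1000 // node A sent the notification at t=1000
	fmt.Println("B before merge:", b.needsSend(key, 1010, 7200))

	b.merge(a) // A's record reaches B
	fmt.Println("B after merge:", b.needsSend(key, 1010, 7200))
}
```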
Retry
- The Retry Stage uses a backoff strategy to manage resending: a notification that was not sent successfully is retried repeatedly until the timeout is reached.
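A retry loop with exponential backoff might look like the sketch below. It uses only the standard library; the function `retry`, the doubling factor, and the cap are assumptions of this sketch rather than the backoff parameters Alertmanager actually uses.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

var errTemporary = errors.New("temporary failure")

// retry calls send until it succeeds or the timeout would be exceeded,
// doubling the wait between attempts up to maxWait.
// It returns the number of attempts that were made.
func retry(send func() error, initial, maxWait, timeout time.Duration) (int, error) {
	deadline := time.Now().Add(timeout)
	wait := initial
	for attempt := 1; ; attempt++ {
		err := send()
		if err == nil {
			return attempt, nil
		}
		// Stop if sleeping again would take us past the deadline.
		if time.Now().Add(wait).After(deadline) {
			return attempt, fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		time.Sleep(wait)
		wait *= 2
		if wait > maxWait {
			wait = maxWait
		}
	}
}

func main() {
	calls := 0
	attempts, err := retry(func() error {
		calls++
		if calls < 3 {
			return errTemporary // fail twice, then succeed
		}
		return nil
	}, 10*time.Millisecond, 80*time.Millisecond, time.Second)
	fmt.Println(attempts, err)
}
```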
Set Notify
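- The Set Notifies Stage writes the information about sent notifications to the nfLog. It only records notifications sent by this Alert Manager instance (the alerts passed on by the Retry component and the notifications sent out after the Dedup component).
- Integration defines an integration component for a notification route. It contains the user's configuration, a name, and the implementation that sends the notification. A custom notify route must satisfy the Notifier interface by implementing the Notify method.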
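A custom notifier that satisfies a Notify-style interface could be sketched as follows. The types here (`alert`, `notifier`, `integration`, `logNotifier`, `send`) are local stand-ins written for this example; the method shape, a context plus alerts in and a retry flag plus error out, is modeled on the Notifier interface mentioned above, but check the Alertmanager source for the exact signature before relying on it.

```go
package main

import (
	"context"
	"fmt"
)

// alert is a hypothetical stand-in for the real alert type.
type alert struct{ Name string }

// notifier has a Notify method that reports whether a failed send is
// worth retrying, plus the error itself.
type notifier interface {
	Notify(ctx context.Context, alerts ...*alert) (retry bool, err error)
}

// integration bundles a notifier with the name of the receiver it serves.
type integration struct {
	notifier
	name string
}

// logNotifier is a hypothetical notifier that only records what it would send.
type logNotifier struct{ sent []string }

func (l *logNotifier) Notify(ctx context.Context, alerts ...*alert) (bool, error) {
	if err := ctx.Err(); err != nil {
		return false, err // cancelled or timed out: retrying will not help
	}
	for _, a := range alerts {
		l.sent = append(l.sent, a.Name)
	}
	return false, nil
}

// send builds alerts from names and pushes them through the integration.
func send(in integration, names ...string) error {
	alerts := make([]*alert, 0, len(names))
	for _, n := range names {
		alerts = append(alerts, &alert{Name: n})
	}
	_, err := in.Notify(context.Background(), alerts...)
	return err
}

func main() {
	ln := &logNotifier{}
	in := integration{notifier: ln, name: "email-1"}
	if err := send(in, "httpd_down", "nginx_down"); err != nil {
		fmt.Println("send failed:", err)
	}
	fmt.Println(in.name, "sent", ln.sent)
}
```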
Examples
groups:
- name: httpd
  rules:
  - alert: httpd_down
    expr: probe_success{instance="http://httpd:80",job="httpd"} == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "httpd is down"
- Prometheus Server evaluates the expression, and once the condition has held for 1m it fires the alert and forwards it to AlertManager. At this point Prometheus knows how to connect to AlertManager and when to fire and forward alerts to it.
route:
  repeat_interval: 2h
  receiver: email-1
  routes:
  - match:
      alertname: httpd_down
    receiver: email-1
  - match:
      alertname: nginx_down
    receiver: email-2
- These routes and receivers are defined in the AlertManager configuration file under the top-level route element. The parent route has child routes, and an alert follows the child whose match labels it satisfies to reach its receiver; an alert that matches no child goes to the parent's receiver.
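How an alert walks the routing tree from the configuration above can be sketched like this. It is a minimal sketch: the `route` type and its `matches` and `resolve` methods are hypothetical, only equality matching is modeled, and the first matching child wins (options such as `continue` are ignored).

```go
package main

import "fmt"

// route is a hypothetical node of the routing tree.
type route struct {
	match    map[string]string
	receiver string
	routes   []route
}

// matches reports whether the alert carries every label in r.match.
func (r route) matches(labels map[string]string) bool {
	for k, v := range r.match {
		if labels[k] != v {
			return false
		}
	}
	return true
}

// resolve descends into the first matching child and falls back to the
// receiver of the current node when no child matches.
func (r route) resolve(labels map[string]string) string {
	for _, child := range r.routes {
		if child.matches(labels) {
			return child.resolve(labels)
		}
	}
	return r.receiver
}

func main() {
	tree := route{
		receiver: "email-1",
		routes: []route{
			{match: map[string]string{"alertname": "httpd_down"}, receiver: "email-1"},
			{match: map[string]string{"alertname": "nginx_down"}, receiver: "email-2"},
		},
	}
	for _, name := range []string{"httpd_down", "nginx_down", "disk_full"} {
		fmt.Println(name, "->", tree.resolve(map[string]string{"alertname": name}))
	}
}
```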
http://localhost:9090/api/v1/alertmanagers
{
  "status": "success",
  "data": {
    "activeAlertmanagers": [
      {
        "url": "http://127.0.0.1:9093/api/v1/alerts"
      }
    ],
    "droppedAlertmanagers": []
  }
}
curl -X GET http://10.255.101.73:9090/api/v1/alerts
{
  "status": "success",
  "data": {
    "alerts": [
      {
        "labels": {
          "alertname": "memory usage too high",
          "instance": "127.0.0.1:9100",
          "job": "node",
          "severity": "warning"
        },
        "annotations": {
          "description": "127.0.0.1:9100 of job node memory usage exceeds 80%, current usage [59.74335527485338].",
          "summary": "Instance 127.0.0.1:9100 memory usage too high"
        },
        "state": "firing",
        "activeAt": "2019-08-23T11:27:34.027571952Z",
        "value": 59.74335527485338
      }
    ]
  }
}
http://10.255.101.73:9093/api/v2/alerts
[
  {
    "annotations": {
      "description": "10.255.101.74:8051 of job default go_goroutines > 100, current go_goroutines: [137].",
      "summary": "Instance 10.255.101.74:8051 go_goroutines > 100"
    },
    "endsAt": "2019-08-31T06:19:27.344Z",
    "fingerprint": "0d38358ac4713623",
    "receivers": [
      {
        "name": "default-receiver"
      }
    ],
    "startsAt": "2019-08-23T11:28:27.344Z",
    "status": {
      "inhibitedBy": [],
      "silencedBy": [],
      "state": "active"
    },
    "updatedAt": "2019-08-31T14:16:27.348+08:00",
    "generatorURL": "http://ceph-1:9090/graph?g0.expr=go_goroutines+%3E+100&g0.tab=1",
    "labels": {
      "alertname": "go_goroutines greater than 100",
      "instance": "10.255.101.74:8051",
      "job": "default",
      "severity": "warning"
    }
  }
]
http://10.255.101.73:9093/api/v1/alerts
{
  "status": "success",
  "data": [
    {
      "labels": {
        "alertname": "go_goroutines greater than 100",
        "instance": "10.255.101.74:8051",
        "job": "default",
        "severity": "warning"
      },
      "annotations": {
        "description": "10.255.101.74:8051 of job default go_goroutines > 100, current go_goroutines: [135].",
        "summary": "Instance 10.255.101.74:8051 go_goroutines > 100"
      },
      "startsAt": "2019-08-23T11:28:27.344378042Z",
      "endsAt": "2019-08-31T06:21:27.344378042Z",
      "generatorURL": "http://ceph-1:9090/graph?g0.expr=go_goroutines+%3E+100&g0.tab=1",
      "status": {
        "state": "active",
        "silencedBy": [],
        "inhibitedBy": []
      },
      "receivers": [
        "default-receiver"
      ],
      "fingerprint": "0d38358ac4713623"
    }
  ]
}