Prometheus

Prometheus official site: https://prometheus.io/

Prometheus Chinese documentation: https://prometheus.fuckcloudnative.io/

Grafana: https://grafana.com

onealert: https://caweb.aiops.com/#/integrate

Environment planning

| Hostname | Host IP | Role |
| --- | --- | --- |
| prometheus | 192.168.6.109 | prometheus |
| node_exporter | 192.168.6.110 | node_exporter |

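Initialize the servers

Set the IP address and hostname, add entries to /etc/hosts, and synchronize the time.

Edit hosts

[root@localhost ~]# vim /etc/hosts

Add the following:

192.168.6.110 node1

If the virtual machine was cloned, change the last three characters of its UUID and check that no two machines share the same UUID.

hostnamectl set-hostname prometheus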

Time synchronization

1. Install ntpdate

Note: some releases do not ship ntpdate, so it has to be installed.

yum install -y ntpdate

2. Set the time zone to Shanghai, that is Beijing time (UTC+8)

Note: other time zones are available under /usr/share/zoneinfo.

cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime

3. Synchronize the time with NTP

ntpdate ntp6.aliyun.com

4. Synchronize automatically

(1) With a boot script

vim /etc/rc.local

Add a time synchronization command:

/usr/sbin/ntpdate ntp6.aliyun.com

(2) With a periodic job (crontab)
Run crontab -e to open a vi editing view where jobs can be added or changed.
Format:

*/5 * * * * /usr/sbin/ntpdate ntp5.aliyun.com ntp6.aliyun.com ntp7.aliyun.com >/dev/null 2>&1

Run crontab -l to check that the job was added.

Install Prometheus

The tarball prometheus-2.34.0.linux-amd64.tar.gz must be placed in /opt/soft before the tar step.

cd /opt
mkdir soft
cd soft
tar -zxvf prometheus-2.34.0.linux-amd64.tar.gz
rm -rf prometheus-2.34.0.linux-amd64.tar.gz
cd prometheus-2.34.0.linux-amd64/
pwd
ln -sv /opt/soft/prometheus-2.34.0.linux-amd64 /usr/local/prometheus

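Manage Prometheus with systemd

# Edit the unit file
vim /etc/systemd/system/prometheus.service
# Paste the following (adjust as needed)

[Unit]
Description=Prometheus
Documentation=https://prometheus.io/
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/prometheus/prometheus --config.file=/usr/local/prometheus/prometheus.yml --storage.tsdb.path=/usr/local/prometheus/data/ --web.listen-address=:9290 --web.enable-lifecycle
ExecStop=/usr/bin/pkill -f prometheus

[Install]
WantedBy=multi-user.target
# Save and exit
:wq

Note: pkill -f prometheus matches every process whose command line contains "prometheus", including prometheus-webhook-dingtalk installed later. systemd already sends SIGTERM to the service on stop, so the ExecStop line can be removed.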

Start Prometheus

# Reload the systemd configuration; changes to unit files only take effect after a reload.
systemctl daemon-reload
# Enable the service at boot
systemctl enable prometheus
# Start the service
systemctl start prometheus
# Check the service status
systemctl status prometheus

The UI is now reachable at http://ip:9290/
If the page loads, Prometheus started successfully.
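The same check can be done from the shell without a browser. Below is a minimal sketch, assuming Prometheus listens on port 9290 as configured above. The helper names `prom_url` and `prom_healthy` are made up for illustration; `/-/healthy` is Prometheus's own health endpoint.

```shell
# Build the base URL of a Prometheus instance (defaults: localhost, port 9290).
prom_url() {
  local host="${1:-localhost}" port="${2:-9290}"
  printf 'http://%s:%s' "$host" "$port"
}

# Return 0 if the health endpoint answers successfully within 5 seconds.
prom_healthy() {
  curl --silent --fail --max-time 5 --output /dev/null "$(prom_url "$@")/-/healthy"
}

# Example (requires a running instance):
#   prom_healthy 192.168.6.109 9290 && echo "Prometheus is up"
```

The functions only define behavior; nothing is contacted until `prom_healthy` is called.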

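4 Page overview (optional)

Alerts: shows alert information. Each alert is in one of three states:
Inactive (normal; the alert condition is not met)
Pending (the condition is met but has not yet held for the configured duration; no notification has been sent)
Firing (the condition has held for the configured duration; the alert has been sent)
Graph
Query the metrics stored in Prometheus with PromQL; results can be shown as a table or as a graph.
Status
1. TSDB Status: state of the time series database
2. Configuration: the loaded configuration
3. Rules: details of the alerting rules
4. Targets: all targets that Prometheus scrapes
Help
Classic UI

Explore the remaining items on your own.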

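5 Common commands (optional)

1 Delete the data of a job

When a job is no longer used and its data should be removed, use the delete API.

Note: the admin API must be enabled before the delete call works (start Prometheus with --web.enable-admin-api), and deleted data cannot be recovered.

Delete the data of the job named jobname:

curl -X POST -g 'http://host:9290/api/v1/admin/tsdb/delete_series?match[]={job="jobname"}'

Every match[] selector is applied on its own, so adding a selector such as {__name__=~".+"} would delete all series, not only those of the job.

host is the IP or hostname of Prometheus.
jobname is the name of the job to delete, for example node_exporter_local.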
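Since deleted data cannot be restored, it helps to print the commands and review the selector before running anything. Below is a minimal sketch, assuming the admin API is enabled. `delete_job_cmds` is a hypothetical helper; the second call uses the `clean_tombstones` admin endpoint, which removes the deleted data from disk instead of waiting for a later compaction.

```shell
# Print (not run) the two admin API calls for removing one job's data.
delete_job_cmds() {
  local host="$1" job="$2"
  # Step 1: mark all series of the job as deleted.
  printf "curl -X POST -g 'http://%s/api/v1/admin/tsdb/delete_series?match[]={job=\"%s\"}'\n" "$host" "$job"
  # Step 2: remove the deleted data from disk.
  printf "curl -X POST 'http://%s/api/v1/admin/tsdb/clean_tombstones'\n" "$host"
}

# Example:
#   delete_job_cmds localhost:9290 node_exporter_local
```

Copy the printed lines into the shell once the job name looks right.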

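2 Hot-reload the configuration and rules

curl -XPOST http://host:9290/-/reload

host is the IP or hostname of Prometheus. The endpoint is available because Prometheus was started with --web.enable-lifecycle.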

II. Install Grafana

Installing Grafana on the same node as Prometheus is recommended.

1 Add the repo

vim /etc/yum.repos.d/grafana.repo
# Paste the following

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1
sslcacert=/etc/pki/tls/certs/ca-bundle.crt

2 Install

yum install grafana -y

3 Start

# Enable Grafana at boot
systemctl enable grafana-server
# Start the Grafana service
systemctl start grafana-server

Open the Grafana page at http://ip:3000. The default username and password are admin / admin.

If the Grafana page loads, startup succeeded.

III. Monitor Linux servers

Monitor the CPU, memory, disk, and other metrics of Linux servers.
Flow:

node_exporter collects the metrics
Prometheus pulls the metrics from the exporter and stores them
Grafana queries Prometheus and visualizes the data

1 Install node_exporter

node_exporter exposes the metrics of a single server, such as memory, disk, and CPU, for Prometheus to scrape.

Every node to be monitored needs node_exporter installed as follows.
1 Download the package

# 1 Enter the install directory
cd /opt/soft/
# 2 Download the package
wget https://github.com/prometheus/node_exporter/releases/download/v1.3.1/node_exporter-1.3.1.linux-amd64.tar.gz
# 3 Extract
tar -zxvf node_exporter-1.3.1.linux-amd64.tar.gz
# 4 Create a symlink so the directories are managed uniformly
ln -sv /opt/soft/node_exporter-1.3.1.linux-amd64 /usr/local/node_exporter

2 Manage node_exporter with systemd

# Edit the unit file
vim /etc/systemd/system/node_exporter.service
# Paste the following (adjust as needed)

[Unit]
Description=node_exporter
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/node_exporter/node_exporter --web.listen-address=:9120
ExecStop=/usr/bin/pkill -f node_exporter

[Install]
WantedBy=multi-user.target
# Save and exit
:wq

The default port 9100 was already in use, so the port is changed to 9120 with a startup flag:
--web.listen-address=:9120

3 Start node_exporter

# Reload the systemd configuration; changes to unit files only take effect after a reload.
systemctl daemon-reload
# Enable the service at boot
systemctl enable node_exporter
# Start the service
systemctl start node_exporter
# Check the service status
systemctl status node_exporter

Each node_exporter serves a simple page at http://ip:9120/; if it loads, startup succeeded.

2 Modify the Prometheus configuration

Default configuration

# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9090"]

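Modify the configuration file

Note: the indentation of a YAML file must be kept exact, otherwise the file cannot be parsed.

# Enter the Prometheus install directory
cd /usr/local/prometheus
# Edit the configuration file
vim prometheus.yml

# Paste the following at the end of the file
  - job_name: 'all_node'
    static_configs:
    - targets: ['node01:9120']
      labels:
        instance: node01
    - targets: ['node02:9120']
      labels:
        instance: node02
    - targets: ['node03:9120']
      labels:
        instance: node03

In - targets: ['node01:9120'], node01 is the hostname of one server running node_exporter (it must be resolvable, for example through /etc/hosts); an IP address works as well. 9120 is the port node_exporter listens on. The other two servers are configured the same way.

This section tells Prometheus which server and port to pull data from.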
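With more than a few servers, typing the target entries by hand invites indentation mistakes. Below is a minimal sketch that prints the same `all_node` job for any list of hostnames, assuming every node_exporter listens on port 9120. `gen_node_job` is a made-up helper name.

```shell
# Print an 'all_node' scrape job for the given hostnames (port 9120 assumed).
gen_node_job() {
  local host
  printf "  - job_name: 'all_node'\n"
  printf "    static_configs:\n"
  for host in "$@"; do
    printf "    - targets: ['%s:9120']\n" "$host"
    printf "      labels:\n"
    printf "        instance: %s\n" "$host"
  done
}

# Example:
#   gen_node_job node01 node02 node03 >> /usr/local/prometheus/prometheus.yml
```

After appending, the file can be validated with the promtool binary shipped in the Prometheus tarball: `/usr/local/prometheus/promtool check config /usr/local/prometheus/prometheus.yml`.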

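The complete modified file is shown below. The target of the prometheus job is also changed to localhost:9290, the port Prometheus now listens on.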

global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["localhost:9290"]

  - job_name: 'all_node'
    static_configs:
    - targets: ['node01:9120']
      labels:
        instance: node01
    - targets: ['node02:9120']
      labels:
        instance: node02
    - targets: ['node03:9120']
      labels:
        instance: node03

After changing the configuration, restart Prometheus for it to take effect

systemctl stop prometheus
systemctl start prometheus

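3 Verify the configuration

Open the Prometheus address http://ip:9290/
Steps: click Status, then click Targets

The all_node (N/N up) entry reflects the configuration above. With three servers configured it should read all_node (3/3 up).

If the number of targets shown equals the number configured, the configuration is correct. Otherwise, check the hostname and port, and check that node_exporter started successfully.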
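The same number can be read from the HTTP API instead of the Targets page. Below is a minimal sketch, assuming the compact JSON returned by `/api/v1/targets`, where every active target carries a `"health"` field. `count_up_targets` is a made-up helper that works on stdin, so it can be tried without a running server.

```shell
# Count the targets reported as healthy in the JSON read from stdin.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Example (requires a running instance):
#   curl -s http://localhost:9290/api/v1/targets | count_up_targets
```

The count includes every healthy target of every job, so the prometheus job itself is counted too.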

4 Configure Grafana

4.1 Create a data source

1 Open the Grafana address http://ip:3000 in a browser and log in with username admin and password admin.

After logging in, the home page appears.

2 Click the gear (settings) icon on the left, then click Data sources

3 Click Add data source

4 Type prometheus in the search box and click Select

5 Enter a name for the data source (the default is fine) and the Prometheus address http://localhost:9290/. Here Prometheus and Grafana run on the same machine, so the host is localhost; change it if they run on different machines.

6 Click Save & test

The Prometheus data source is now created.

7 Click the icon in the top-left corner to return to the home page

4.2 Add a dashboard

1 Click +, then click Import

2 Enter 8919 and click Load

8919 is the ID of a dashboard published by another user, found on Grafana's official dashboard site https://grafana.com/grafana/dashboards/. Dashboards for other systems, such as Flink or MySQL, can be found on the same site.

3 Enter a name, select the data source, and click Import

When the dashboard appears, the configuration succeeded.

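IV. Monitor Flink

Flink ships a reporter class that pushes its metrics to PushGateway; Prometheus then pulls the metrics from PushGateway and stores them; Grafana queries Prometheus and displays the data.

Note: with the Flink-yarn service integrated in CDH, jobs must be submitted to the session that is started together with the Flink-yarn service; otherwise the job metrics cannot be collected.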
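On the Flink side the reporter is switched on in flink-conf.yaml. Below is a minimal sketch that prints the settings, assuming a Flink release that configures reporters through the `class` key (newer releases use `factory.class`, so check the documentation of your version) and that the flink-metrics-prometheus jar is available to Flink. `flink_pushgateway_conf` and its arguments are made up for illustration.

```shell
# Print flink-conf.yaml settings for the PushGateway reporter.
# Arguments: PushGateway host, PushGateway port, job name prefix.
flink_pushgateway_conf() {
  local host="$1" port="$2" job="$3"
  cat <<EOF
metrics.reporter.promgateway.class: org.apache.flink.metrics.prometheus.PrometheusPushGatewayReporter
metrics.reporter.promgateway.host: ${host}
metrics.reporter.promgateway.port: ${port}
metrics.reporter.promgateway.jobName: ${job}
metrics.reporter.promgateway.randomJobNameSuffix: true
metrics.reporter.promgateway.deleteOnShutdown: false
EOF
}

# Example:
#   flink_pushgateway_conf node01 9291 flink_pushgateway
```

With `randomJobNameSuffix: true` the pushed job label becomes the prefix plus a random suffix, which is why a selector such as `job=~"flink_pushgateway.*"` uses a regex match.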

1 Install PushGateway

Installing PushGateway on the same node as Prometheus is recommended.

1.1 Download PushGateway

# Enter the install directory
cd /opt/soft/
# Download the package
wget -c https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
# Extract
tar -zxvf pushgateway-1.4.2.linux-amd64.tar.gz
# Create a symlink for easier management
ln -sv /opt/soft/pushgateway-1.4.2.linux-amd64 /usr/local/pushgateway

1.2 Manage PushGateway with systemd
vim /etc/systemd/system/pushgateway.service

Paste the following

[Unit]
Description=pushgateway
After=network.target

[Service]
Type=simple
ExecStart=/opt/soft/pushgateway-1.4.2.linux-amd64/pushgateway --web.listen-address=:9291
ExecStop=/usr/bin/pkill -f pushgateway

[Install]
WantedBy=multi-user.target

Save and exit

:wq

--web.listen-address=:9291 The default port is 9091; it is changed to 9291 to avoid a conflict.

1.3 Start PushGateway

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable pushgateway

Start the service

systemctl start pushgateway

Check the service status

systemctl status pushgateway


Edit the Prometheus configuration file

vim /usr/local/prometheus/prometheus.yml

Append the following at the end of the file

  - job_name: 'pushgateway'
    honor_labels: true
    static_configs:
    - targets: ['node01:9291']
      labels:
        instance: 'pushgateway'

After changing the configuration, restart Prometheus for it to take effect

systemctl restart prometheus


1.4 Verify the configuration
Open the Prometheus address http://ip:9290/targets; if it shows what is pictured below, the setup succeeded.
![1.png](https://upload-images.jianshu.io/upload_images/26919493-4622859824f1a226.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)


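# V. Configure alerting

Alerting flow

1.  The exporter collects metric data
2.  Prometheus pulls the metric data from the exporter and stores it
3.  Prometheus pushes triggered alerts to Alertmanager
4.  Alertmanager sends the alert through email, DingTalk, or other channels

## 1 Create a DingTalk robot

1 Open the chat window of a group and click the gear icon in the top-right corner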
![1.png](https://upload-images.jianshu.io/upload_images/26919493-ace28981e16971a6.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
2 Click the smart group assistant
![2.png](https://upload-images.jianshu.io/upload_images/26919493-a0027a92cac52099.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
3 Click +
![3.png](https://upload-images.jianshu.io/upload_images/26919493-5fb0258209bc5d55.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
4 Click +
![4.png](https://upload-images.jianshu.io/upload_images/26919493-ed2aba5990bfd67b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
5 Scroll down and click Custom
![5.png](https://upload-images.jianshu.io/upload_images/26919493-a60b5ad0a2cf5d61.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
6 Click Add
![6.png](https://upload-images.jianshu.io/upload_images/26919493-e1cf077672722119.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
7 Enter a robot name, select the signing option, tick "I have read and agree", and click Finish
![7.png](https://upload-images.jianshu.io/upload_images/26919493-a061df3c2aa6c610.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
The robot is now added
![8.png](https://upload-images.jianshu.io/upload_images/26919493-630e1ca1f875180b.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)

Click Copy here and save the Webhook URL; it is needed later.

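## 2 Install dingtalk
prometheus-webhook-dingtalk is a service that forwards Prometheus alerts to DingTalk.
### 2.1 Download

cd /opt/soft

Download the package

wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v2.0.0/prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz

Extract

tar -zxvf prometheus-webhook-dingtalk-2.0.0.linux-amd64.tar.gz

Create a symlink

ln -sv /opt/soft/prometheus-webhook-dingtalk-2.0.0.linux-amd64/ /usr/local/prometheus-webhook-dingtalk

### 2.2 Modify the dingtalk configuration file

Enter the install directory and copy the example configuration to config.yml

cd /usr/local/prometheus-webhook-dingtalk
cp config.example.yml config.yml

Edit the configuration file

vim config.yml

Change every url in the configuration to the Webhook URL saved in the previous step. Because signing was selected when the robot was created, set secret to the robot's signing secret.

Assuming the Webhook URL is https://oapi.dingtalk.com/robot/send?access_token=abc, the configuration looks like this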

targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    # secret for signature
    secret: SEC000000000000000000000
  webhook2:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
  webhook_legacy:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    # Customize template content
    message:
      # Use legacy template
      title: '{{ template "legacy.title" . }}'
      text: '{{ template "legacy.content" . }}'
  webhook_mention_all:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    mention:
      all: true
  webhook_mention_users:
    url: https://oapi.dingtalk.com/robot/send?access_token=abc
    mention:
      mobiles: ['156xxxx8827', '189xxxx8325']


### 2.3 Manage dingtalk with systemd

Edit the unit file

vim /etc/systemd/system/prometheus-webhook-dingtalk.service

Paste the following

[Unit]
Description=prometheus-webhook-dingtalk
After=network-online.target

[Service]
Restart=on-failure
ExecStart=/usr/local/prometheus-webhook-dingtalk/prometheus-webhook-dingtalk --config.file=/usr/local/prometheus-webhook-dingtalk/config.yml --web.listen-address=:8260
ExecStop=/usr/bin/pkill -f prometheus-webhook-dingtalk

[Install]
WantedBy=multi-user.target

Save and exit

:wq


--web.listen-address=:8260 The default port is 8060; it is changed to 8260 to avoid a conflict.

### 2.4 Start dingtalk

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable prometheus-webhook-dingtalk

Start the service

systemctl start prometheus-webhook-dingtalk

Check the service status

systemctl status prometheus-webhook-dingtalk

Check the port

lsof -i :8260
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
prometheu 7893 root 3u IPv6 110639539 0t0 TCP *:8260 (LISTEN)


## 3 Install alertmanager

Installing alertmanager on the same node as Prometheus is recommended.

### 3.1 Download alertmanager

cd /opt/soft

Download the package

wget https://github.com/prometheus/alertmanager/releases/download/v0.24.0/alertmanager-0.24.0.linux-amd64.tar.gz

Extract

tar -zxvf alertmanager-0.24.0.linux-amd64.tar.gz

Create a symlink

ln -sv /opt/soft/alertmanager-0.24.0.linux-amd64/ /usr/local/alertmanager


### 3.2 Manage alertmanager with systemd

vim /etc/systemd/system/alertmanager.service

Paste the following

[Unit]
Description=alertmanager
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/alertmanager/alertmanager --web.listen-address=:9293 --config.file=/usr/local/alertmanager/alertmanager.yml --storage.path=/usr/local/alertmanager/data/
ExecStop=/usr/bin/pkill -f alertmanager

[Install]
WantedBy=multi-user.target

Save and exit

:wq


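### 3.3 Start alertmanager

Reload the systemd configuration; changes to unit files only take effect after a reload.

systemctl daemon-reload

Enable the service at boot

systemctl enable alertmanager

Start the service

systemctl start alertmanager

Check the service status

systemctl status alertmanager


Open the alertmanager address http://localhost:9293/. If the page below appears, startup succeeded; if the service does not start, check the configuration.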
![1.png](https://upload-images.jianshu.io/upload_images/26919493-d9981143ed2c3bf9.png?imageMogr2/auto-orient/strip%7CimageView2/2/w/1240)
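For alerts to reach DingTalk, alertmanager.yml needs a webhook receiver that points at the dingtalk service. Below is a minimal sketch that prints such a file, assuming the dingtalk service listens on localhost:8260 and the target `webhook1` from its config.yml is used. The function name, receiver name, and timing values are illustrative and not taken from the steps above.

```shell
# Print a minimal alertmanager.yml that forwards every alert to prometheus-webhook-dingtalk.
# Arguments: host:port of the dingtalk service, target name from its config.yml.
alertmanager_conf() {
  local addr="$1" target="$2"
  cat <<EOF
route:
  receiver: dingtalk
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 1h
receivers:
  - name: dingtalk
    webhook_configs:
      - url: http://${addr}/dingtalk/${target}/send
        send_resolved: true
EOF
}

# Example:
#   alertmanager_conf localhost:8260 webhook1 > /usr/local/alertmanager/alertmanager.yml
```

The amtool binary shipped with Alertmanager can validate the result (`/usr/local/alertmanager/amtool check-config /usr/local/alertmanager/alertmanager.yml`); restart alertmanager afterwards.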

## 4 Modify the Prometheus configuration

### 1 Write the rule files

Create the directory for alerting rule files

mkdir /usr/local/prometheus/rules

Enter the directory

cd /usr/local/prometheus/rules

Save each rule file below in this directory with a .yml extension.

#### 1 Create the CPU alert file with the following content

groups:
- name: CPU alert rules
  rules:
  - alert: Server-CPU usage alert
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 85
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage is climbing."
      description: "CPU usage is above 85% (current value: {{ $value }}%)"


#### 2 Create the disk alert file with the following content

groups:
- name: Disk usage alert rules
  rules:
  - alert: Server-disk usage alert
    expr: 100 - node_filesystem_free_bytes{fstype=~"xfs|ext4"} / node_filesystem_size_bytes{fstype=~"xfs|ext4"} * 100 > 85
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Disk partition usage is too high"
      description: "Partition usage is above 85% (current value: {{ $value }}%)"
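To get a feel for the threshold, the same arithmetic can be run by hand. Below is a minimal sketch using awk; `usage_pct` is a made-up helper and the numbers are examples, not measurements.

```shell
# Compute usage percent the same way the rule does: 100 - free / size * 100.
# Arguments: free space and total size, in the same unit.
usage_pct() {
  awk -v free="$1" -v size="$2" 'BEGIN { printf "%.1f\n", 100 - free / size * 100 }'
}

# Example: 10 GiB free on a 100 GiB partition gives 90.0, which is above the 85 threshold.
#   usage_pct 10 100
```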


#### 3 Create the memory alert file with the following content

groups:
- name: Memory alert rules
  rules:
  - alert: Server-memory usage alert
    expr: (1 - (node_memory_MemAvailable_bytes{job="all_node"} / (node_memory_MemTotal_bytes{job="all_node"}))) * 100 > 85
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "The server is running low on available memory."
      description: "Memory usage is above 85% (current value: {{ $value }}%)"


#### 4 Create the alert file for the number of running Flink jobs

groups:
- name: prod-realtime-flink-job-failure
  rules:
  - alert: prod-realtime-flink job failure alert
    expr: flink_jobmanager_numRunningJobs{job=~"flink_pushgateway.*"} < 3
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "A production Flink job failed"
      description: "Production realtime Flink job failed: expected 3 running jobs, currently running: {{ $value }}"


*   expr: flink_jobmanager_numRunningJobs{job=~"flink_pushgateway.*"} < 3
    Three jobs are expected to be running. The expression checks whether fewer than 3 are running; if so, a job has died.
*   for: 1m
    The alert is sent once the expr condition has held for 1 minute.

#### 5 Check whether a specific Flink job is alive

groups:
- name: MainClassName failed
  rules:
  - alert: prod-realtime-flink job failure alert
    expr: ((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000 == 0
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "A production realtime job failed"
      description: "MainClassName (xxx job) failed (current value: {{ $value }})"


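**Note: a specific job can only be monitored by the main class name (MainClassName) it was started with.**

The check relies on uptime, a metric that keeps increasing while the job runs.

While the job is alive, the current uptime minus the uptime 10 seconds earlier is about 10 seconds, so the expression `((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000` evaluates to roughly 10.

Once the job has failed, uptime stops increasing and stays at a fixed value, so the difference becomes 0. The alert fires when `((flink_jobmanager_job_uptime{ job_name="MainClassName"})-(flink_jobmanager_job_uptime{ job_name="MainClassName"} offset 10s))/1000 == 0` holds. The offset should be longer than the interval at which new samples arrive; otherwise both sides may read the same sample and the difference is 0 even for a healthy job.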
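The two cases can be worked through with plain numbers. Below is a minimal sketch using awk, assuming uptime is reported in milliseconds, which is why the rule divides by 1000. `uptime_delta_seconds` is a made-up helper and the sample values are invented.

```shell
# Evaluate the rule's expression for two uptime samples given in milliseconds.
# Arguments: current uptime, uptime 10 seconds earlier.
uptime_delta_seconds() {
  awk -v now="$1" -v before="$2" 'BEGIN { printf "%d\n", (now - before) / 1000 }'
}

# Running job: uptime grew from 60000 ms to 70000 ms, so the delta is 10 and no alert fires.
#   uptime_delta_seconds 70000 60000
# Failed job: uptime is stuck at 60000 ms, so the delta is 0 and the alert condition holds.
#   uptime_delta_seconds 60000 60000
```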

### 2 Edit the configuration file

vim /usr/local/prometheus/prometheus.yml

Change the following sections

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - localhost:9293

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "rules/*.yml"
  # - "first_rules.yml"
  # - "second_rules.yml"


Restart Prometheus

systemctl stop prometheus
systemctl start prometheus


## 5 Verify the configuration

Open the Prometheus alerts page at http://node01:9290/alerts
The configured alert rules are listed there.

Last edited: 2022.04.01 16:36:09
