1. kube-state-metrics
1.1 Introduction to kube-state-metrics
GitHub: https://github.com/kubernetes/kube-state-metrics
Image: https://hub.docker.com/r/bitnami/kube-state-metrics
Blog introduction: https://xie.infoq.cn/article/9e1fff6306649e65480a96bb1
kube-state-metrics watches the API Server and generates state metrics for resource objects such as Deployments, Nodes, and Pods. Note that kube-state-metrics only exposes the metrics; it does not store them, so we use Prometheus to scrape and store the data. The focus is on workload-related metadata such as Deployment, Pod, and replica state: how many replicas were scheduled? How many are currently available? How many Pods are running/stopped/terminated? How many times has a Pod restarted? How many Jobs are currently running?
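The questions above map directly onto metric samples. A minimal sketch, assuming the Prometheus text exposition format returned by the /metrics endpoint; the two sample lines, the namespace, and the deployment name are invented for illustration, and the helper names are hypothetical:

```python
import re

# Hypothetical sample of kube-state-metrics output; the values are made up.
SAMPLE = '''\
# TYPE kube_deployment_spec_replicas gauge
kube_deployment_spec_replicas{namespace="default",deployment="web"} 3
kube_deployment_status_replicas_available{namespace="default",deployment="web"} 2
'''

LINE_RE = re.compile(r'^(\w+)\{(.*)\}\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_samples(text):
    """Parse labelled samples from the text format into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        if not line or line.startswith('#'):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match:
            name, raw_labels, value = match.groups()
            samples.append((name, dict(LABEL_RE.findall(raw_labels)), float(value)))
    return samples

def unavailable_replicas(samples, deployment):
    """Desired replicas minus available replicas for one deployment."""
    values = {name: value for name, labels, value in samples
              if labels.get('deployment') == deployment}
    return (values['kube_deployment_spec_replicas']
            - values['kube_deployment_status_replicas_available'])

print(unavailable_replicas(parse_samples(SAMPLE), 'web'))
```

In a real setup Prometheus does this parsing itself; the sketch only shows what kind of data the endpoint exposes.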
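1.2 Deploying kube-state-metrics
- Write a YAML manifest based on the Deployment controller
- Write a YAML manifest for the Service, exposing the port as a NodePort
- Deploy
1.3 Verifying the data
1.4 Prometheus scrape configuration
  - job_name: 'kube-state-metrics'
    static_configs:
    - targets: ["IP:PORT"]
Follow the indentation of the ConfigMap that holds the Prometheus configuration in Kubernetes
1.5 Verifying the Prometheus status
1.6 Importing Grafana dashboards
- 13824
- 14518
Dashboards differ between versions; choose the one that matches your version
Dashboard catalog: https://grafana.com/grafana/dashboards/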
2. Monitoring examples
Monitor target services with third-party exporters
2.1 tomcat
- Build the image
GitHub: https://github.com/nlighten/tomcat_exporter
Add the jar packages on top of the official tomcat image
ADD metrics.war /data/tomcat/webapps
ADD simpleclient-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_common-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_hotspot-0.8.0.jar /usr/local/tomcat/lib/
ADD simpleclient_servlet-0.8.0.jar /usr/local/tomcat/lib/
ADD tomcat_exporter_client-0.0.12.jar /usr/local/tomcat/lib/
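- Prometheus scrape configuration
  - job_name: 'tomcat-metrics'
    static_configs:
    - targets: ["IP:PORT"]
Follow the indentation of the ConfigMap that holds the Prometheus configuration in Kubernetes
- Verify in Prometheus
- Import the Grafana dashboard
GitHub: https://github.com/nlighten/tomcat_exporter/blob/master/dashboard/example.json
Download this JSON file and import it into Grafana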
2.2 redis
Monitor the state of the redis service with redis_exporter
GitHub: https://github.com/oliver006/redis_exporter
- Deploy redis
One Pod with two containers: redis and redis-exporter
- Prometheus scrape configuration
  - job_name: 'redis-metrics'
    static_configs:
    - targets: ["IP:PORT"]
Follow the indentation of the ConfigMap that holds the Prometheus configuration in Kubernetes
- Import Grafana dashboards
- 14615
- 11692
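2.3 mysql
Monitor the running state of the MySQL service with mysqld_exporter
GitHub: https://github.com/prometheus/mysqld_exporter
- Install mariadb-server
apt install -y mariadb-server
- In /etc/mysql/mariadb.conf.d/50-server.cnf, change the listen address to 0.0.0.0
bind-address = 0.0.0.0
Restart mariadb
- Create the mysql_exporter user and grant it the privileges mysqld_exporter needs
create user 'mysql_exporter'@'localhost' identified by 'password';
grant process, replication client, select on *.* to 'mysql_exporter'@'localhost';
- Test the connection with the username and password
mysql -umysql_exporter -hlocalhost -ppassword
- Download mysqld_exporter
# Download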
wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.13.0/mysqld_exporter-0.13.0.linux-amd64.tar.gz
# Extract
tar xvf mysqld_exporter-0.13.0.linux-amd64.tar.gz
# Show the startup options
./mysqld_exporter --help
- Create the credentials file /root/.my.cnf for password-less login (the password must match the one set for the mysql_exporter user)
cat >> /root/.my.cnf <<EOF
[client]
user=mysql_exporter
password=password
EOF
- Create the mysqld_exporter service file and start it
# Create a symlink
ln -sv /apps/mysqld_exporter-0.13.0.linux-amd64 /apps/mysqld_exporter
# Create the service file
cat > /etc/systemd/system/mysqld_exporter.service <<EOF
[Unit]
Description=Prometheus Mysql Exporter
After=network.target
[Service]
ExecStart=/apps/mysqld_exporter/mysqld_exporter --config.my-cnf=/root/.my.cnf
[Install]
WantedBy=multi-user.target
EOF
# Reload the systemd configuration
systemctl daemon-reload
# Start mysqld_exporter
systemctl start mysqld_exporter.service
- Verify the metrics
- Verify in Prometheus
- Import Grafana dashboards
- 13106
- 11323
2.4 haproxy
Monitor haproxy with haproxy_exporter
GitHub: https://github.com/prometheus/haproxy_exporter
- Install haproxy
apt install -y haproxy
- Edit the configuration file to listen for a service
listen SERVICE
bind BIND_IP:PORT
mode tcp
server SERVER_NAME LISTEN_IP:PORT check inter 3s fall 3 rise 3
Make sure the stats socket is opened at the admin level (the admin after level)
stats socket /run/haproxy/admin.sock mode 660 level admin expose-fd listeners
- Restart haproxy
systemctl restart haproxy
Check that the listening port is up
- Download haproxy_exporter
# Download
wget https://github.com/prometheus/haproxy_exporter/releases/download/v0.13.0/haproxy_exporter-0.13.0.linux-amd64.tar.gz
# Extract
tar xvf haproxy_exporter-0.13.0.linux-amd64.tar.gz
# Create a symlink
ln -sv /apps/haproxy_exporter-0.13.0.linux-amd64 /apps/haproxy_exporter
- Start haproxy_exporter against the stats socket
./haproxy_exporter --haproxy.scrape-uri=unix:/run/haproxy/admin.sock
Listens on port 9101 by default
- Configure the haproxy stats page
listen stats
bind :PORT
stats enable
stats uri /haproxy-status
stats realm HAProxy\ Stats\ Page
stats auth haadmin:123456
stats auth admin:123456
Add the configuration above to /etc/haproxy/haproxy.cfg
- Start haproxy_exporter against the stats page
./haproxy_exporter --haproxy.scrape-uri="http://admin:123456@127.0.0.1:PORT/haproxy-status;csv"
The username and password are required; csv requests the statistics in CSV form
- Verify the exporter
- Prometheus scrape configuration
  - job_name: "haproxy-exporter"
    static_configs:
    - targets: ["127.0.0.1:9101"]
Follow the indentation of prometheus.yml on the virtual machine
./promtool check config prometheus.yml # check the configuration file after editing it
- Restart prometheus
systemctl restart prometheus.service
- Verify in Prometheus
- Import Grafana dashboards
- 367
- 2428
2.5 nginx
Monitor nginx with nginx-vts-exporter
GitHub (required module): https://github.com/vozlt/nginx-module-vts
- Install nginx
# Clone the required module (run in /usr/local/src so that the --add-module path below matches)
git clone https://github.com/vozlt/nginx-module-vts.git
# Download the nginx source
wget http://nginx.org/download/nginx-1.20.2.tar.gz
# Extract
tar xvf nginx-1.20.2.tar.gz
# Install the nginx build dependencies
apt install -y libgd-dev libgeoip-dev libpcre3 libpcre3-dev libssl-dev gcc make
# Build nginx
cd nginx-1.20.2
./configure --prefix=/apps/nginx \
--with-http_ssl_module \
--with-http_v2_module \
--with-http_realip_module \
--with-http_stub_status_module \
--with-http_gzip_static_module \
--with-pcre \
--with-file-aio \
--with-stream \
--with-stream_ssl_module \
--with-stream_realip_module \
--add-module=/usr/local/src/nginx-module-vts/
# make
make
# make install
make install
- Edit the configuration file
http {
vhost_traffic_status_zone;
...
server {
...
location /status {
vhost_traffic_status_display;
vhost_traffic_status_display_format html;
}
}
}
- Start nginx
# Check the configuration file
/apps/nginx/sbin/nginx -t
# Start nginx
/apps/nginx/sbin/nginx
- Configure an nginx upstream
# Inside the http block, at the same level as the server block
upstream SERVICE {
server IP:PORT;
}
# Inside the server block, proxy the home page
location / {
#root html;
#index index.html index.htm;
proxy_pass http://SERVICE;
}
- Check the status page
The data can also be displayed in JSON form
- Install nginx-vts-exporter
# Download
wget https://github.com/hnlq715/nginx-vts-exporter/releases/download/v0.10.3/nginx-vts-exporter-0.10.3.linux-amd64.tar.gz
# Extract
tar xvf nginx-vts-exporter-0.10.3.linux-amd64.tar.gz
# Create a symlink
ln -sv /apps/nginx-vts-exporter-0.10.3.linux-amd64 nginx-vts-exporter
# Start nginx-vts-exporter
./nginx-vts-exporter -nginx.scrape_uri http://IP/status/format/json
Listens on port 9913 by default
- Verify the data
- Prometheus scrape configuration
  - job_name: "nginx-exporter"
    static_configs:
    - targets: ["127.0.0.1:9913"]
Follow the indentation of prometheus.yml on the virtual machine
- Verify the data in Prometheus
- Import the Grafana dashboard
- 2949
2.6 Monitoring URLs with blackbox
Official site: https://prometheus.io/download/#blackbox_exporter
blackbox_exporter is an exporter provided by the Prometheus project. It probes monitored endpoints over HTTP, HTTPS, DNS, TCP, and ICMP and collects the results
HTTP/HTTPS: URL/API availability checks
TCP: port listening checks
ICMP: host liveness checks
DNS: domain name resolution
- Deploy blackbox_exporter
# Download
wget https://github.com/prometheus/blackbox_exporter/releases/download/v0.19.0/blackbox_exporter-0.19.0.linux-amd64.tar.gz
# Extract
tar xvf blackbox_exporter-0.19.0.linux-amd64.tar.gz
# Create a symlink
ln -sv /apps/blackbox_exporter-0.19.0.linux-amd64 blackbox_exporter
- Create the blackbox-exporter.service file
cat > /etc/systemd/system/blackbox-exporter.service <<EOF
[Unit]
Description=Prometheus Blackbox Exporter
Documentation=https://prometheus.io/download/#blackbox_exporter
After=network.target
[Service]
Type=simple
User=root
Group=root
Restart=on-failure
ExecStart=/apps/blackbox_exporter/blackbox_exporter --config.file=/apps/blackbox_exporter/blackbox.yml --web.listen-address=:9115
[Install]
WantedBy=multi-user.target
EOF
- Start blackbox_exporter
systemctl daemon-reload
systemctl restart blackbox-exporter
- Verify the data
Listens on port 9115 by default
- Monitoring URLs with blackbox_exporter
Prometheus scrape configuration
  - job_name: "http_status"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
    - targets: ["domainname1","domainname2"]
      labels:
        instance: http_status
        group: web
    relabel_configs:
    - source_labels: [__address__] # relabel: copy __address__ (the current target address) into the __param_target label
      target_label: __param_target # the monitored domain name becomes the value of the target URL parameter
    - source_labels: [__param_target] # the monitored target
      target_label: url # expose the monitored target as the url label
    - target_label: __address__
      replacement: BLACKBOX_EXPORTER:PORT
Follow the indentation of prometheus.yml on the virtual machine
- Verify the Prometheus status
- Check the blackbox page
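The three relabel steps are easier to follow when traced for a single target. A minimal sketch that mimics them with a plain dictionary; the function name and the example addresses are hypothetical, and real relabeling happens inside Prometheus, not in user code:

```python
from urllib.parse import urlencode

def blackbox_probe_url(target, exporter, module):
    """Mimic the three relabel steps for one static target."""
    labels = {'__address__': target, '__param_module': module}
    # Step 1: copy the original target into the ?target= URL parameter.
    labels['__param_target'] = labels['__address__']
    # Step 2: keep the probed target visible as a regular label.
    labels['url'] = labels['__param_target']
    # Step 3: point the actual scrape at the blackbox exporter.
    labels['__address__'] = exporter
    query = urlencode({'module': labels['__param_module'],
                       'target': labels['__param_target']})
    return 'http://{}/probe?{}'.format(labels['__address__'], query), labels['url']

print(blackbox_probe_url('example.com', '127.0.0.1:9115', 'http_2xx'))
```

The resulting URL is what Prometheus requests from the exporter, while the url label keeps the probed domain visible on the stored series.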
- Monitoring ICMP with blackbox_exporter
Prometheus scrape configuration
  - job_name: "ping_status"
    metrics_path: /probe
    params:
      module: [icmp]
    static_configs:
    - targets: ["IP1","IP2"]
      labels:
        instance: 'ping_status'
        group: 'icmp'
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: ip # expose the probed IP (__param_target) as the ip label
    - target_label: __address__
      replacement: BLACKBOX_EXPORTER:PORT
Follow the indentation of prometheus.yml on the virtual machine
- Verify the Prometheus status
- Monitoring ports with blackbox_exporter
Prometheus scrape configuration
  # Port monitoring
  - job_name: "port_status"
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
    - targets: ["IP:PORT","IP:PORT"]
      labels:
        instance: 'port_status'
        group: 'port'
    relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: ip
    - target_label: __address__
      replacement: BLACKBOX_EXPORTER:PORT
- Verify the Prometheus status
- Import Grafana dashboards
- 9965
- 13587
3. Alerting
3.1 Alertmanager
prometheus-->threshold triggered-->duration exceeded-->alertmanager-->grouping|inhibition|silencing-->media type-->email|DingTalk|WeChat, etc.
The Prometheus server evaluates the configured alerting rules and pushes the resulting alerts to Alertmanager, which matches them against its configured routes and delivers them to the corresponding receiver via WeChat, email, or webhook
Grouping (group): merge alerts of a similar nature into a single notification, for example network, host, or service notifications
Silences (silences): a simple mechanism to mute alerts for a given time, for example setting a silence for the window in which a server is upgraded or maintained
Inhibition (inhibition): once an alert has fired, stop sending the other alerts that it causes; in other words, merge the multiple alert events caused by one failure to eliminate redundant alerts
- Install alertmanager
# Download
wget https://github.com/prometheus/alertmanager/releases/download/v0.23.0/alertmanager-0.23.0.linux-amd64.tar.gz
# Extract
tar xvf alertmanager-0.23.0.linux-amd64.tar.gz
# Create a symlink
ln -sv /apps/alertmanager-0.23.0.linux-amd64 /apps/alertmanager
- Create the alertmanager.service file
cat > /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus alertmanager
After=network.target
[Service]
ExecStart=/apps/alertmanager/alertmanager --config.file=/apps/alertmanager/alertmanager.yml
[Install]
WantedBy=multi-user.target
EOF
- Start alertmanager
systemctl start alertmanager.service
Listens on ports 9093 (HTTP) and 9094 (cluster) by default
Configuration reference: https://prometheus.io/docs/alerting/latest/configuration/
- Verify the alertmanager status
3.2 Email
Official documentation: https://prometheus.io/docs/alerting/latest/configuration/#email_config
- Configuration file overview
alertmanager.yml configuration file
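global:
  resolve_timeout: 5m # how long Alertmanager waits without an update before marking an alert as resolved
  smtp_from: # sender email address
  smtp_smarthost: # SMTP server address
  smtp_auth_username: # sender login name, by default the same as the sender address
  smtp_auth_password: # sender login password, sometimes an authorization code
  smtp_hello:
  smtp_require_tls: # whether TLS is required, default true
route:
  group_by: [alertname] # group alerts by the value of alertname
  group_wait: 10s # delay before the first notification for a group, so that alerts arriving within 10s are sent together; usually 0s to a few minutes (default 30s)
  group_interval: 2m # delay before notifying about new alerts added to a group that has already been notified; usually 5m or more (default 5m)
  repeat_interval: 5m # interval for re-sending a notification that has already been sent; usually 3h or more (default 4h)
  receiver: 'default-receiver' # the alert receiver
receivers:
- name: 'default-receiver'
  email_configs:
  - to: 'EMAIL@DOMAIN.com'
    send_resolved: true # also send a notification when the alert is resolved
inhibit_rules: # inhibition rules
- source_match: # source matcher; while an alert matching it is firing, matching target alerts are inhibited
    severity: 'critical' # alert severity (severities are defined by your own rules)
  target_match:
    severity: 'warning' # alerts with severity 'warning' that match are inhibited
  equal: ['alertname','dev','instance'] # the source and target alerts must have equal values for these labels
# Timing example
# group_wait: 10s # when the first alert arrives, wait 10s; alerts that join the group in that window are sent together, otherwise it is sent alone
# group_interval: 2m # the group is then re-evaluated every 2m; new alerts in the group are sent at the next evaluation
# repeat_interval: 5m # an unchanged, still-firing group is re-sent only once 5m have passed since the last notification
With the configuration above, an unresolved alert is therefore re-sent roughly 6 to 7 minutes after the first notification rather than after exactly 5 minutes, because repeat_interval is only checked when the group is evaluated every group_interval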
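The interaction of the three timers can be approximated with a few lines of arithmetic. This is a simplified model, not Alertmanager's implementation: it assumes a single group whose alerts keep firing and never change, and it assumes a repeat is only considered on group_interval ticks after the first notification. The function name is hypothetical:

```python
import math

def notification_offsets(group_wait, group_interval, repeat_interval, count):
    """Approximate seconds (from alert arrival) of each notification for one
    group whose alerts keep firing and never change."""
    # Repeats are only considered when the group is flushed, which this
    # simplified model places every group_interval after the first send.
    ticks = max(1, math.ceil(repeat_interval / group_interval))
    step = ticks * group_interval
    return [group_wait + i * step for i in range(count)]

# group_wait=10s, group_interval=2m, repeat_interval=5m
print(notification_offsets(10, 120, 300, 3))
```

Under these assumptions the repeats land on multiples of group_interval, which is one reason to choose a repeat_interval that is a multiple of group_interval.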
- Configure the alerting rules
groups:
- name: alertmanager_pod.rules
  rules:
  - alert: Pod_all_cpu_usage
    expr: (sum by(name)(rate(container_cpu_usage_seconds_total{image!=""}[5m]))*100) > 10
    for: 2m
    labels:
      severity: critical
      service: pods
    annotations:
      description: Container {{ $labels.name }} CPU utilization is above 10% (current value is {{ $value }})
      summary: Dev CPU load alert
  - alert: Pod_all_memory_usage
    expr: avg by(name)(container_memory_usage_bytes{name!=""}) > 2147483648 # memory above 2G
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} memory usage is above 2G (current value is {{ $value }})
      summary: Dev memory load alert
  - alert: Pod_all_network_receive_usage
    expr: sum by(name)(irate(container_network_receive_bytes_total{container_name="POD"}[1m])) > 52428800 # above 50M
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Container {{ $labels.name }} network_receive usage is above 50M (current value is {{ $value }})
  - alert: Node_memory_free
    expr: node_memory_MemFree_bytes < 4294967296 # free memory below 4G
    for: 2m
    labels:
      severity: critical
    annotations:
      description: Node free memory is below 4G
Create a rules directory under /apps/prometheus/ and create the file pods_rule.yaml in it with the content above
Pay attention to the indentation. If the file is malformed, Prometheus fails to start after a restart. Check the configuration with promtool first: because prometheus.yml references this file when loading the alerting configuration, checking prometheus.yml also checks the custom pods_rule.yaml
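The byte thresholds in the rules are binary multiples, which is easy to get wrong by hand. A small sketch that reproduces them; the helper names are hypothetical:

```python
def gib(n):
    """Number of bytes in n GiB."""
    return n * 1024 ** 3

def mib(n):
    """Number of bytes in n MiB."""
    return n * 1024 ** 2

# Thresholds used by the memory, network, and node rules.
print(gib(2), mib(50), gib(4))
```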
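- Load the alerting configuration in Prometheus
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - IP:9093
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
- "/apps/prometheus/rules/pods_rule.yaml"
Note: after changing the content referenced by rule_files, restart Prometheus first so that it loads the modified rules, and only then change Alertmanager; otherwise the modified alerts do not take effect
- Restart prometheus
systemctl restart prometheus.service
- Verify the Prometheus status
On the Prometheus page, click Alerts to see the alert state. PENDING means the condition has been detected but the for duration has not yet elapsed, so no notification is sent yet
FIRING means the alert has fired and been handed to Alertmanager; the email should arrive shortly afterwards
- View the alerts in alertmanager
- Check the alert email
The Source link points to the hostname and port of the Prometheus server; if the hostname cannot be resolved, the link does not open
- View alerts with amtool
./amtool alert --alertmanager.url=http://IP:9093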
3.3 DingTalk
- Add a robot in DingTalk
Official guide for creating a robot: https://open.dingtalk.com/document/robots/custom-robot-access
- Message-sending script
vim /data/scripts/dingding-keywords.sh
MESSAGE=$1
/usr/bin/curl -X POST 'https://oapi.dingtalk.com/robot/send?access_token=TOKEN' \
-H 'Content-Type: application/json' \
-d '{"msgtype": "text",
"text": {
"content": "'"${MESSAGE}"'"
}
}'
- Test sending a message
The message content must contain the custom keyword, otherwise sending fails. Once the script sends a message successfully, it appears in the group
- Deploy webhook-dingtalk
# Download
wget https://github.com/timonwong/prometheus-webhook-dingtalk/releases/download/v1.4.0/prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
# Extract
tar xvf prometheus-webhook-dingtalk-1.4.0.linux-amd64.tar.gz
# Create a symlink
ln -sv /apps/prometheus-webhook-dingtalk-1.4.0.linux-amd64 prometheus-webhook-dingtalk
Preferably use the same webhook-dingtalk version as here; newer versions are incompatible in some places
- Start webhook-dingtalk
cd /apps/prometheus-webhook-dingtalk
./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --ding.profile="KEYWORD=https://oapi.dingtalk.com/robot/send?access_token=TOKEN"
Listens on port 8060 as specified
KEYWORD must be the custom keyword chosen when creating the robot, otherwise sending the alert fails with an error
- Configure alertmanager
- name: 'dingding'
  webhook_configs:
  - url: 'http://IP:8060/dingtalk/KEYWORD/send' # KEYWORD is the profile name passed to --ding.profile
    send_resolved: true
- Restart alertmanager
systemctl restart alertmanager.service
The alerting rules loaded by Prometheus must also contain the custom keyword for the alert to be delivered, otherwise the error "keywords not in content" is reported; the keyword may appear in either a key or a value
- Verify that the alert is sent
- Check the dingtalk log
The return code is 200
- Check the DingTalk message
- Python script for the DingTalk signature
Official documentation for the signature script: https://open.dingtalk.com/document/robots/customize-robot-security-settings
#python 3.8
import time
import hmac
import hashlib
import base64
import urllib.parse
timestamp = str(round(time.time() * 1000))
secret = 'this is secret'
secret_enc = secret.encode('utf-8')
string_to_sign = '{}\n{}'.format(timestamp, secret)
string_to_sign_enc = string_to_sign.encode('utf-8')
hmac_code = hmac.new(secret_enc, string_to_sign_enc, digestmod=hashlib.sha256).digest()
sign = urllib.parse.quote_plus(base64.b64encode(hmac_code))
print(timestamp)
print(sign)
Mind the Python version: use the version named in the example and run the script with "/usr/bin/python3.8 SCRIPT_NAME"
- Shell script for sending DingTalk messages
#!/bin/bash
source /etc/profile
MESSAGE=$1
secret='SECxxxxxx'
getkey=$(/usr/bin/python3.8 SCRIPT)
timestamp=${getkey:0:13}
sign=$(echo "${getkey:13:100}"|tr -d '\n')
DateStamp=$(date -d @${getkey:0:10} "+%F %H:%M:%S")
/usr/bin/curl -X POST "https://oapi.dingtalk.com/robot/send?access_token=xxxxxx&timestamp=${timestamp}&sign=${sign}" \
-H 'Content-Type: application/json' \
-d '{"msgtype":"text",
"text":{
"content": "'"${MESSAGE}"'"
}
}'
- Send alerts from alertmanager
prometheus-webhook-dingtalk configuration file
targets:
  webhook1:
    url: https://oapi.dingtalk.com/robot/send?access_token=xxxxxx
    # secret for signature
    secret: SECxxxxxx
Specify the token and the secret
alertmanager configuration file
- name: 'dingding'
  webhook_configs:
  - url: 'http://IP:8060/dingtalk/webhook1/send' # webhook1 matches the target name in the prometheus-webhook-dingtalk configuration file
    send_resolved: true
- Start prometheus-webhook-dingtalk
./prometheus-webhook-dingtalk --web.listen-address="0.0.0.0:8060" --config.file=config.yml
"--config.file" can be omitted when the program is started from the directory that contains the configuration file
The prometheus configuration does not need to change
- dingtalk log
- DingTalk
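3.4 Routing messages by category
Define rules based on the attributes of a message to send it to different receivers; edit the alertmanager configuration file
route:
  group_by: [alertname]
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 2m
  receiver: 'dingding'
  # Add child routes
  routes:
  - receiver: email
    group_wait: 10s
    match_re:
      instance: "IP:PORT" # matching alerts are sent by email, everything else goes to DingTalk
Restart alertmanager after editing the configuration file
- Verify message delivery
DingTalk receives the alerts of two hosts
Email receives the alerts of one host (the one whose instance matched; the value must include the port)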
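Why the instance value must include the port follows from how match_re works: the regular expression has to match the whole label value. A minimal sketch of that selection logic, assuming anchored matching and using Python's re module in place of Alertmanager's regex engine; the function name, the ROUTES structure, and the addresses are hypothetical:

```python
import re

def pick_receiver(labels, routes, default_receiver):
    """Return the receiver of the first child route whose match_re fits."""
    for route in routes:
        matchers = route['match_re']
        # Every regex must match the whole label value (anchored match).
        if all(re.fullmatch(pattern, labels.get(name, ''))
               for name, pattern in matchers.items()):
            return route['receiver']
    return default_receiver

ROUTES = [{'receiver': 'email', 'match_re': {'instance': r'10\.0\.0\.1:9100'}}]

print(pick_receiver({'instance': '10.0.0.1:9100'}, ROUTES, 'dingding'))
print(pick_receiver({'instance': '10.0.0.1'}, ROUTES, 'dingding'))
```

An alert whose instance lacks the port falls through to the default receiver, which matches the behavior described above.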
3.5 Custom alert templates
- Alerting rules
groups:
- name: 'node running status'
  rules:
  - alert: 'Instance Down'
    expr: 'up == 0'
    for: 5s
    annotations:
      title: 'Instance Down'
      description: "{{ $labels.instance }} down"
    labels:
      robot: 'jcss'
      severity: 'warning'
      owner: 'xxxxxxxxxxx'
- name: 'node memory usage'
  rules:
  - alert: 'memory usage'
    expr: '((node_memory_MemTotal_bytes - node_memory_MemFree_bytes) / node_memory_MemTotal_bytes * 100)> 85'
    for: 5s
    annotations:
      title: 'Mem'
      description: '{{ $labels.instance }} Memusage {{ $value }}'
    labels:
      robot: 'jcss'
      ops: 'true'
      severity: 'warning'
      owner: "xxxxxxxxxxx"
- Custom template (wechat)
{{ define "wechat.default.message" }}
{{ range $i, $alert :=.Alerts }}
=======alertmanager alert======
Status: {{ .Status }}
Severity: {{ $alert.Labels.severity }}
Alert name: {{ $alert.Labels.alertname }}
Application: {{ $alert.Annotations.summary }}
Host: {{ $alert.Labels.instance }}
Subject: {{ $alert.Annotations.summary }}
Threshold: {{ $alert.Annotations.value }}
Details: {{ $alert.Annotations.description }}
Time: {{ $alert.StartsAt.Format "2006-01-02 15:04:05" }}
================end===============
{{ end }}
{{ end }}
- Custom template (dingding)
{{ define "__subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ end }}
{{ define "__alert_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
**Alert name**: {{ index .Annotations "title" }}
**Severity**: {{ .Labels.severity }}
**Host**: {{ .Labels.instance }}
**Message**: {{ index .Annotations "description" }}
**Alert time**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
{{ define "__resolved_list" }}{{ range . }}
---
{{ if .Labels.owner }}@{{ .Labels.owner }}{{ end }}
**Alert name**: {{ index .Annotations "title" }}
**Severity**: {{ .Labels.severity }}
**Host**: {{ .Labels.instance }}
**Message**: {{ index .Annotations "description" }}
**Alert time**: {{ dateInZone "2006.01.02 15:04:05" (.StartsAt) "Asia/Shanghai" }}
**Resolved time**: {{ dateInZone "2006.01.02 15:04:05" (.EndsAt) "Asia/Shanghai" }}
{{ end }}{{ end }}
{{ define "default.title" }}
{{ template "__subject" . }}
{{ end }}
{{ define "default.content" }}
{{ if gt (len .Alerts.Firing) 0 }}
**====Detected {{ .Alerts.Firing | len }} problem(s)====**
{{ template "__alert_list" .Alerts.Firing }}
---
{{ end }}
{{ if gt (len .Alerts.Resolved) 0 }}
**====Resolved {{ .Alerts.Resolved | len }} problem(s)====**
{{ template "__resolved_list" .Alerts.Resolved }}
{{ end }}
{{ end }}
{{ define "ding.link.title" }}{{ template "default.title" . }}{{ end }}
{{ define "ding.link.content" }}{{ template "default.content" . }}{{ end }}
{{ template "default.title" . }}
{{ template "default.content" . }}
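The template file is defined under the prometheus-webhook-dingtalk directory (the paths below use templates/default.templ)
- Reference the template
Edit the alertmanager configuration file to reference the template
templates:
- "/apps/prometheus-webhook-dingtalk/templates/default.templ" # reference the template
When this template is referenced from alertmanager, alertmanager does not recognize the dingtalk time function: with the "dateInZone" lines present alertmanager fails to start, and it only starts once they are removed
- Verify in DingTalk
Edit the config.yml of prometheus-webhook-dingtalk to reference the template
templates:
- /apps/prometheus-webhook-dingtalk/templates/default.templ
Referenced here, the template file can render the alert time
- Verify in DingTalk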
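3.6 Alert inhibition and silencing
- Alert inhibition
Based on the alerting rules: once resource usage exceeds 80%, the 60% alert is no longer sent, that is, the alert triggered by the 60% expression is inhibited
- name: alertmanager_node.rules
  rules:
  - alert: disk_capacity
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 80 # disk utilization above 80%
    for: 2s
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.mountpoint }} disk partition usage is above 80% (current usage: {{$value}}%)"
  - alert: disk_capacity
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes {fstype=~"ext4|xfs"}*100) > 60 # disk utilization above 60%
    for: 2s
    labels:
      severity: warning
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high!"
      description: "{{$labels.mountpoint }} disk partition usage is above 60% (current usage: {{$value}}%)"
- Verify the DingTalk alert
- Verify the DingTalk recovery
Only the >80% alert is sent while firing, but two recovery messages arrive: >60% and >80%
- Alert silencing
Manual silencing: find the alert event to silence, then silence that event manually
View the silenced events
Click Expire to end the silence; alerts are sent again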
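The inhibition decision can be sketched as a small predicate. This is an illustrative model of a rule with source_match, target_match, and equal, not Alertmanager's code; the function name and the example alerts are hypothetical:

```python
def is_inhibited(target, firing, rule):
    """True if some firing alert matches the rule's source side and agrees
    with the target on every label listed in rule['equal']."""
    def matches(alert, wanted):
        return all(alert.get(k) == v for k, v in wanted.items())
    if not matches(target, rule['target_match']):
        return False
    for source in firing:
        if source is target or not matches(source, rule['source_match']):
            continue
        # A label missing on both sides counts as equal (both empty).
        if all(source.get(k, '') == target.get(k, '') for k in rule['equal']):
            return True
    return False

RULE = {'source_match': {'severity': 'critical'},
        'target_match': {'severity': 'warning'},
        'equal': ['alertname', 'dev', 'instance']}
warning = {'alertname': 'disk_capacity', 'instance': 'node1:9100', 'severity': 'warning'}
critical = {'alertname': 'disk_capacity', 'instance': 'node1:9100', 'severity': 'critical'}
print(is_inhibited(warning, [critical, warning], RULE))
```

Because both disk rules share the same alert name and instance, the critical alert suppresses the warning one only for the same host.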
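3.7 Alertmanager high availability
- Based on load balancing
Alertmanager handles single HTTP calls and needs no session persistence, so multiple alertmanager instances can be deployed behind a load balancer, using a VIP in active/standby or round-robin mode
- Based on the Gossip mechanism
The Gossip mechanism lets multiple Alertmanager instances exchange information. It ensures that even when several Alertmanagers each receive the same alert, only one notification is sent to the receiver
Gossip protocol: Gossip is a protocol widely used in distributed systems to exchange information and synchronize state between nodes. State spreads in a way similar to rumors or a virus
- Set up a local cluster environment
To let Alertmanager nodes communicate with each other, set the corresponding flags when starting Alertmanager. The main flags are:
--cluster.listen-address string: cluster listen address of the current instance
--cluster.peer value: cluster addresses of the other instances to join at startup
3.8 PrometheusAlert
GitHub: https://github.com/feiyu563/PrometheusAlert
PrometheusAlert is an open-source alert-forwarding hub for operations. It accepts alert messages from mainstream monitoring systems (Prometheus, Zabbix), logging systems (Graylog2, Graylog3), the visualization system Grafana, SonarQube, Alibaba Cloud Monitoring, and any system with a WebHook interface, and forwards them to DingTalk, WeChat, email, Feishu, Tencent SMS, Tencent voice calls, Alibaba Cloud SMS, Alibaba Cloud voice calls, Huawei SMS, Baidu Cloud SMS, Ronglian Cloud voice calls, 7moor SMS, 7moor voice, Telegram, Baidu Hi (Ruliu), and others.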
4. Pushgateway
4.1 Introduction to pushgateway
GitHub: https://github.com/prometheus/pushgateway
pushgateway works by passively receiving pushes, unlike the Prometheus server, which actively connects to exporters to fetch monitoring data. pushgateway can run on a separate node; custom monitoring scripts push the data to be monitored to pushgateway's API, and pushgateway then waits for the Prometheus server to scrape it. In other words, pushgateway itself has no ability to collect monitoring data; it only waits for clients to push data to it
--persistence.file="" # file in which to persist the data; by default the data is kept only in memory
--persistence.interval=5m # interval between persistence writes
4.2 Deploying pushgateway
- Download and start
# Download
wget https://github.com/prometheus/pushgateway/releases/download/v1.4.2/pushgateway-1.4.2.linux-amd64.tar.gz
# Extract
tar xvf pushgateway-1.4.2.linux-amd64.tar.gz
# Create a symlink
ln -sv pushgateway-1.4.2.linux-amd64 pushgateway
# Start
cd pushgateway
./pushgateway
Listens on port 9091 by default
- Verify the pushgateway page
4.3 Prometheus scrape configuration
  - job_name: "pushgateway-monitor"
    scrape_interval: 5s
    static_configs:
    - targets: ["IP:9091"]
    honor_labels: true # keep the original labels of the scraped data
Add this to prometheus.yml and restart prometheus
honor_labels controls how Prometheus handles conflicts between labels already present in the scraped data and the labels Prometheus would attach on the server side (the "job" and "instance" labels, manually configured target labels, and labels generated by service discovery). If honor_labels is "true", conflicts are resolved by keeping the label values from the scraped data and ignoring the conflicting server-side labels. If it is "false", conflicts are resolved by renaming the conflicting labels in the scraped data to "exported_<original-label>" (for example "exported_instance", "exported_job") and then attaching the server-side labels
- Verify the Prometheus page
4.4 Pushing data
- Push a single sample
To push data to pushgateway, use its standard API. The default URL is:
http://ip:9091/metrics/job/JOBNAME{/LABEL_NAME/LABEL_VALUE}
JOBNAME is required and becomes the value of the job label. Any number of label pairs may follow; usually an instance/INSTANCE_NAME label is added to tell the metrics apart
# Push metric test_job with value 111 under job test_job
echo "test_job 111" | curl --data-binary @- http://PUSHGATEWAY_IP:9091/metrics/job/test_job
# Push metric test_job with value 111 under job test_job and instance UPLOAD_IP
echo "test_job 111" | curl --data-binary @- http://PUSHGATEWAY_IP:9091/metrics/job/test_job/instance/UPLOAD_IP
Pushing to the same job and label group again overwrites the previously pushed values
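The two honor_labels behaviors can be modelled as a label merge. A minimal sketch of the rule described above, not Prometheus code; the function name and the label values are hypothetical:

```python
def merge_labels(scraped, server_side, honor_labels):
    """Combine labels from a scraped sample with labels Prometheus attaches."""
    merged = dict(scraped)
    for name, value in server_side.items():
        if name in scraped:
            if honor_labels:
                continue  # keep the scraped value, ignore the server-side one
            merged['exported_' + name] = scraped[name]  # rename the scraped label
        merged[name] = value
    return merged

scraped = {'job': 'custom_memory_monitor', 'instance': '10.0.0.5'}
server = {'job': 'pushgateway-monitor', 'instance': '10.0.0.9:9091'}
print(merge_labels(scraped, server, honor_labels=True))
print(merge_labels(scraped, server, honor_labels=False))
```

With pushgateway, honor_labels: true is what keeps the job and instance labels of the pushing client instead of those of the pushgateway target.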
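The URL scheme can be assembled programmatically. A minimal sketch that only builds the path; the function name and the example address are hypothetical, and label values containing "/" are outside the scope of this sketch:

```python
from urllib.parse import quote

def push_url(base, job, labels=None):
    """Build http://HOST:9091/metrics/job/JOB{/LABEL_NAME/LABEL_VALUE}."""
    parts = [base.rstrip('/'), 'metrics', 'job', quote(job, safe='')]
    for name, value in (labels or {}).items():
        parts += [quote(name, safe=''), quote(value, safe='')]
    return '/'.join(parts)

print(push_url('http://127.0.0.1:9091', 'test_job', {'instance': '10.0.0.5'}))
```

Each distinct combination of job and label pairs forms its own group, which is why a second push to the same URL replaces the earlier values.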
- Verify the data in pushgateway
- Push multiple samples
cat << EOF | curl --data-binary @- http://PUSHGATEWAY_IP:9091/metrics/job/test_job/instance/UPLOAD_IP
# TYPE node_memory_usage gauge
node_memory_usage 4311744512
# TYPE node_memory_total gauge
node_memory_total 103481868288
EOF
- Verify the data in pushgateway
- Verify in Prometheus
4.5 Collecting custom data
- Custom script
#!/bin/bash
# free -b reports bytes: column 2 is total memory, column 3 is used memory
total_memory=$(free -b | awk '/Mem/{print $2}')
used_memory=$(free -b | awk '/Mem/{print $3}')
job_name="custom_memory_monitor"
instance_name=$(ifconfig ens33 | grep -w inet | awk '{print $2}')
pushgateway_server="http://PUSHGATEWAY_IP:9091/metrics/job"
cat <<EOF | curl --data-binary @- ${pushgateway_server}/${job_name}/instance/${instance_name}
# TYPE node_memory_usage gauge
node_memory_usage ${used_memory}
# TYPE node_memory_total gauge
node_memory_total ${total_memory}
EOF
Memory monitoring script mem_monitor.sh. Run it on different hosts to verify that the metrics are collected and pushed
- Verify in pushgateway
4.6 Deleting data
First write data for several instances into one group
- Verify the multiple entries in pushgateway
- Delete the data of a given instance in a given group through the API
curl -X DELETE http://PUSHGATEWAY_IP:9091/metrics/job/custom_memory_monitor/instance/UPLOAD_IP
- Verify in pushgateway that the instance was removed from the group
- Delete through the web UI
5. Federation
One server, two federation Prometheus instances, and two nodes; configure prometheus and node_exporter
- Add a target on each federation instance
  - job_name: "prometheus-node1"
    static_configs:
    - targets: ["PROMETHEUS1:9100"]
The second federation Prometheus is configured in the same way
- Verify the data in Prometheus
The other instance is verified in the same way
- Configure the prometheus-server that collects the data
  - job_name: "prometheus-federate-1"
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
      - '{job="prometheus"}'
      - '{__name__=~"job:.*"}'
      - '{__name__=~"node.*"}'
    static_configs:
    - targets: ["PROMETHEUS1:9090"]
  - job_name: "prometheus-federate-2"
    scrape_interval: 10s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
      - '{job="prometheus"}'
      - '{__name__=~"job:.*"}'
      - '{__name__=~"node.*"}'
    static_configs:
    - targets: ["PROMETHEUS2:9090"]
- Verify on prometheus-server
- Verify by querying data on prometheus-server
The data of the prometheus-node1 and prometheus-node2 jobs is collected by the federation instances and then collected again by prometheus-server
- Import the Grafana dashboard
Dashboard ID: 8919
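The federation job boils down to an HTTP GET on /federate with one match[] parameter per selector. A minimal sketch that builds this URL so it can be tried with curl or a browser; the function name is hypothetical and PROMETHEUS1 is the placeholder used in the configuration:

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def federate_url(host, matchers):
    """Build the /federate URL that the server scrapes for the given selectors."""
    query = urlencode([('match[]', m) for m in matchers])
    return 'http://{}/federate?{}'.format(host, query)

MATCHERS = ['{job="prometheus"}', '{__name__=~"job:.*"}', '{__name__=~"node.*"}']
url = federate_url('PROMETHEUS1:9090', MATCHERS)
print(url)
```

Only series matching at least one selector are returned, so node_exporter data reaches prometheus-server through the third selector.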
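6. Prometheus storage
6.1 Local storage
Prometheus stores time series very efficiently: each sample takes only about 3.5 bytes on average, and millions of time series at a 30s interval retained for 60 days take roughly 200 GB (figures quoted from the official documentation)
- Introduction to blocks
Each block is a directory whose name starts with 01 inside the data directory
# ll data
drwxr-xr-x 3 root root 4096 Mar 31 12:39 01FZFZSNX1SJFCKQYZAAN2J6AG/ # block
drwxr-xr-x 2 root root 4096 Mar 31 12:39 wal/ # write ahead log
By default Prometheus stores the collected data in its local TSDB, in the data directory under the Prometheus installation directory. Incoming data is first written to the WAL and kept in memory; after 2 hours the in-memory data is saved to a new block, while newly collected data continues to be written to memory and is saved to another new block 2 hours later, and so on
- Block characteristics
Blocks of historical data are compacted and merged, and expired blocks are deleted. As compaction and merging proceed, the number of blocks decreases. Three things happen during compaction: compaction runs periodically, small blocks are merged into larger ones, and expired blocks are cleaned up
tree 01FZFZSNX1SJFCKQYZAAN2J6AG/
01FZFZSNX1SJFCKQYZAAN2J6AG/
├── chunks
│ └── 000001 # data segment files, each up to 512M; larger data is split into several files
├── index # index file, records index information for the stored data; time series are looked up through several tables in it
├── meta.json # block metadata, including the number of samples, the start and end time of the data, and the compaction history
└── tombstones # deletion markers: records deletions and data marked for deletion, so that those samples are excluded when the block is queried
- Local storage flags
--config.file="prometheus.yml" # configuration file
--web.listen-address="0.0.0.0:9090" # listen address
--storage.tsdb.path="data/" # data storage directory
--storage.tsdb.retention.time=y, w, d, h, m, s, ms # how long to retain data, default 15 days
--storage.tsdb.retention.size=B, KB, MB, GB, TB, PB, EB. # maximum total size of stored blocks to retain; disabled (0) by default
--query.timeout=2m # default maximum query timeout
--query.max-concurrency=20 # default maximum number of concurrent queries
--web.read-timeout=5m # default maximum read/idle timeout
--web.max-connections=512 # default maximum number of simultaneous connections
--web.enable-lifecycle # enable reloading the configuration and shutting down through the HTTP API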
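The capacity figure can be checked with the sizing formula from the Prometheus storage documentation, needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample. A minimal sketch, assuming exactly one million series and one sample per series per scrape; the function name is hypothetical:

```python
def needed_disk_bytes(series, scrape_interval_s, retention_days, bytes_per_sample):
    """retention_time_seconds * ingested_samples_per_second * bytes_per_sample"""
    samples_per_second = series / scrape_interval_s
    retention_seconds = retention_days * 24 * 3600
    return retention_seconds * samples_per_second * bytes_per_sample

estimate = needed_disk_bytes(1_000_000, 30, 60, 3.5)
print(round(estimate / 1e9, 1), "GB")
```

Under these assumptions 3.5 bytes per sample gives roughly 600 GB, whereas a total near 200 GB corresponds to a little over 1 byte per sample, so the two quoted numbers probably come from different storage versions or assumptions and are worth re-measuring on your own data.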
6.2 Remote storage with VictoriaMetrics
GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics
Official documentation: https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html
- Single-node deployment
# Download
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.71.0/victoria-metrics-amd64-v1.71.0.tar.gz
# Extract
tar xvf victoria-metrics-amd64-v1.71.0.tar.gz
Startup flags
-httpListenAddr=0.0.0.0:8428 # default listen address and port
-storageDataPath # data directory; by default victoria-metrics-data is created in the startup directory
-retentionPeriod # units h (hour), d (day), w (week), y (year); without a unit the value is in months, and the default retention is one month
Start
mv victoria-metrics-prod /usr/local/bin/
# Create the service file
cat > /etc/systemd/system/victoria-metrics-prod.service <<EOF
[Unit]
Description=victoria-metrics-prod
Documentation=https://docs.victoriametrics.com/Single-server-VictoriaMetrics.html
After=network.target
[Service]
Restart=on-failure
ExecStart=/usr/local/bin/victoria-metrics-prod -storageDataPath=/data/victoria -retentionPeriod=3
[Install]
WantedBy=multi-user.target
EOF
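# Reload the systemd configuration
systemctl daemon-reload
# Start
systemctl start victoria-metrics-prod.service
Verify the victoria page
victoria ui page
prometheus configuration
remote_write:
- url: http://192.168.96.161:8428/api/v1/write
Restart prometheus after the configuration is complete
Data shown in the victoria ui
If no data ever shows up, enable the Auto-refresh button in the navigation bar and select the last 5 minutes as the time range
The data queried here comes from the victoria storage directory, not from the local Prometheus data directory
grafana configuration
Add a data source: type prometheus, with the IP and port of the VictoriaMetrics server as the address
Dashboard ID: 8919
- docker-compose
GitHub: https://github.com/VictoriaMetrics/VictoriaMetrics/tree/cluster/deployment/docker
Install docker-compose, clone the repository, enter the directory, and run docker-compose up -d
This starts the single-node version
- Cluster deployment
Components
vminsert: write component. Accepts incoming data and, based on a consistent hash of the metric name and all its labels, spreads the data across the backend vmstorage nodes, which persist it. vminsert listens on port 8480 by default
vmstorage: stores the raw data and returns the queried data for a given time range and label filters. Default ports: 8482 (HTTP API), 8400 (writes from vminsert), 8401 (reads from vmselect)
vmselect: query component (read). Connects to vmstorage; listens on port 8481 by default
Other optional components
vmagent: a small but powerful agent that collects metrics from various sources such as node_exporter and stores them in VictoriaMetrics or any other Prometheus-compatible storage system that supports the remote write protocol
vmalert: takes over the alerting role of the Prometheus server: it uses VictoriaMetrics as the data source, evaluates Prometheus-compatible alerting rules to decide whether the data is abnormal, and sends the resulting notifications to alertmanager
vmgateway: a proxy gateway for reading and writing VictoriaMetrics data that can enforce rate limiting and access control; currently an enterprise component
vmctl: the VictoriaMetrics command-line tool, currently used mainly to migrate data from sources such as Prometheus and OpenTSDB to VictoriaMetrics
Download
# Download
wget https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/v1.71.0/victoria-metrics-amd64-v1.71.0-cluster.tar.gz
# Extract
tar xvf victoria-metrics-amd64-v1.71.0-cluster.tar.gz
mv vm* /usr/local/bin/
Pick three hosts; the download and installation are needed on every host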
Main flags
-httpListenAddr string # vmselect
Address to listen for http connections (default ":8481")
-vminsertAddr string # vmstorage
TCP address to accept connections from vminsert services (default ":8400")
-vmselectAddr string # vmstorage
TCP address to accept connections from vmselect services (default ":8401")
Deploy the vmstorage-prod component
vmstorage persists the data. Listening ports: API 8482, data write port 8400, data read port 8401
# Create the service file
cat > /etc/systemd/system/vmstorage.service <<EOF
[Unit]
Description=Vmstorage Server
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vmstorage-prod --loggerTimezone Asia/Shanghai -storageDataPath /data/victoriametrics-cluster/vmstorage -httpListenAddr :8482 -vminsertAddr :8400 -vmselectAddr :8401
[Install]
WantedBy=multi-user.target
EOF
# Start
systemctl daemon-reload
systemctl start vmstorage.service
Note that the location of unit files may differ between operating systems
In cluster mode vmstorage uses three ports: 8482 is the API port, 8400 accepts data written by vminsert, and 8401 serves queries from vmselect
Deploy the vminsert-prod component
# Create the service file
cat > /etc/systemd/system/vminsert.service <<EOF
[Unit]
Description=Vminsert Server
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vminsert-prod -httpListenAddr :8480 -storageNode=IP1:8400,IP2:8400,IP3:8400
[Install]
WantedBy=multi-user.target
EOF
# Start
systemctl daemon-reload
systemctl start vminsert.service
Do not forget the -prod suffix after the component name
vminsert listens on port 8480
Deploy the vmselect-prod component
# Create the service file
cat > /etc/systemd/system/vmselect.service <<EOF
[Unit]
Description=Vmselect Server
After=network.target
[Service]
Restart=on-failure
WorkingDirectory=/tmp
ExecStart=/usr/local/bin/vmselect-prod -httpListenAddr :8481 -storageNode=IP1:8401,IP2:8401,IP3:8401
[Install]
WantedBy=multi-user.target
EOF
# Start
systemctl daemon-reload
systemctl start vmselect.service
vmselect listens on port 8481
Verify the cluster service ports
curl http://IP:8480/metrics # vminsert
curl http://IP:8481/metrics # vmselect
curl http://IP:8482/metrics # vmstorage
Verify all three servers
Verify the page on the vmselect listening port
prometheus configuration
remote_write:
- url: http://IP1:8480/insert/0/prometheus # the port is the vminsert listening port
- url: http://IP2:8480/insert/0/prometheus
- url: http://IP3:8480/insert/0/prometheus
The port configured here is the vminsert listening port
Configure grafana and add a data source
http://IP1:8481/select/0/prometheus
grafana accepts only one address; you can set up a load balancer and enter its IP. The read path must match the one in the prometheus configuration file: insert is for writes and select is for reads, and the two URI segments after them must be identical, otherwise no data can be read
grafana dashboards
8919, 13824
- Enable data replication
Official documentation: https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#replication-and-data-safety
By default vminsert uses a hash to spread the data across the vmstorage nodes, so each sample is persisted on one node only
Data replication is enabled with the -replicationFactor=n flag of vminsert (n is the number of copies): each sample is then stored on n different vmstorage nodes, which keeps the data available when a node fails
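The pairing of write and read paths can be made explicit by deriving both URLs from the same tenant and suffix. A minimal sketch using the ports from this setup; the function name and its parameters are hypothetical:

```python
def cluster_urls(insert_host, select_host, account_id=0, suffix='prometheus'):
    """Return (write_url, read_url) that share the same tenant and suffix."""
    write_url = 'http://{}:8480/insert/{}/{}'.format(insert_host, account_id, suffix)
    read_url = 'http://{}:8481/select/{}/{}'.format(select_host, account_id, suffix)
    return write_url, read_url

print(cluster_urls('IP1', 'IP1'))
```

If the account ID differs between the two URLs, Grafana queries a tenant that Prometheus never wrote to and sees no data.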
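The idea of hash-based sharding with replication can be illustrated in a few lines. This is only a conceptual sketch and not the hashing scheme VictoriaMetrics actually uses; the function name, the SHA-256 choice, and the node addresses are assumptions made for illustration:

```python
import hashlib

def pick_nodes(series_key, nodes, replication_factor=1):
    """Choose replication_factor distinct nodes for one series.
    Illustrative only: not the hashing scheme VictoriaMetrics really uses."""
    digest = hashlib.sha256(series_key.encode('utf-8')).digest()
    start = int.from_bytes(digest[:8], 'big') % len(nodes)
    count = min(replication_factor, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(count)]

NODES = ['IP1:8400', 'IP2:8400', 'IP3:8400']
print(pick_nodes('node_load1{instance="a"}', NODES, replication_factor=2))
```

The same series always maps to the same nodes, and with a factor of 2 the loss of a single node still leaves one copy of every sample.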