This article walks through common Prometheus configuration and service discovery, in four parts: overview, configuration details, service discovery, and common scenarios.
1. Overview
Prometheus can be configured with command-line flags or a configuration file. In a Kubernetes cluster the configuration usually lives in a ConfigMap (everything below refers to Prometheus 2.7).
To see the available command-line flags, run ./prometheus -h
You can also point to a configuration file with --config.file, typically prometheus.yml
If the configuration changes, for example a new scrape job is added, Prometheus can reload it without a restart: send SIGHUP to the process (e.g. kill -HUP <pid>) or send an HTTP POST to the /-/reload endpoint (in Prometheus 2.x this endpoint must be enabled with --web.enable-lifecycle). For example:
curl -X POST http://localhost:9090/-/reload
2. Configuration Details
2.1 Command-line flags
Running ./prometheus -h shows what each flag means, for example:
--web.listen-address="0.0.0.0:9090"  Listen address, default 0.0.0.0:9090. You can restrict it to localhost, or change the port; the built-in web UI has no authentication, so leaving the default exposed is a risk.
--web.max-connections=512  Maximum number of simultaneous connections, default 512.
--storage.tsdb.path="data/"  Storage path, default is the data/ directory.
--storage.tsdb.retention.time=15d  Data retention period, default 15 days. The older storage.tsdb.retention flag is deprecated.
--alertmanager.timeout=10s  Timeout for sending alerts to Alertmanager, default 10s.
--query.timeout=2m  Query timeout, default 2 minutes; queries that exceed it are killed. Combine this with Grafana's own query timeout, e.g. 60s.
--query.max-concurrency=20  Maximum number of concurrent queries. Prometheus's own metrics include prometheus_engine_queries_concurrent_max, which exposes the configured limit and helps you see how queries are doing.
--log.level=info  Log level, one of [debug, info, warn, error]; switch to debug when troubleshooting.
.....
On the Prometheus web UI, under Status → Command-Line Flags, you can see the values currently in effect; for example, a prometheus-operator deployment shows its flags there.
2.2 prometheus.yml
The Prometheus binary downloaded from the official download page ships with a default prometheus.yml:
-rw-r--r--@ LICENSE
-rw-r--r--@ NOTICE
drwxr-xr-x@ console_libraries
drwxr-xr-x@ consoles
-rwxr-xr-x@ prometheus
-rw-r--r--@ prometheus.yml
-rwxr-xr-x@ promtool
prometheus.yml covers a lot of ground, including remote storage, alerting, and more; the main settings are explained below:
# Default global configuration
global:
  scrape_interval: 15s     # scrape interval, default is 1m
  evaluation_interval: 15s # rule evaluation interval, default is 1m
  scrape_timeout: 10s      # scrape timeout, default is 10s
  external_labels:         # labels attached when talking to external systems, e.g. remote storage or federation
    prometheus: monitoring/k8s        # e.g. the prometheus-operator configuration
    prometheus_replica: prometheus-k8s-1
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - 127.0.0.1:9093 # Alertmanager address, e.g. 127.0.0.1:9093
  alert_relabel_configs: # relabel alert labels before alerts are sent to Alertmanager
    - separator: ;
      regex: prometheus_replica
      replacement: $1
      action: labeldrop
# Once alerting rule files are loaded, they are evaluated every evaluation_interval (15s here); multiple rule files are allowed
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
# scrape_configs holds the scrape configuration and contains at least one job
scrape_configs:
  # Prometheus monitoring itself; the label job=<job_name> is attached to all scraped series
  - job_name: 'prometheus'
    # the default metrics path is /metrics, e.g. localhost:9090/metrics
    # the default scheme is http
    static_configs:
      - targets: ['localhost:9090']
# Remote write (optional), e.g. ship samples to a remote store such as InfluxDB; by default data stays local
remote_write:
  - url: http://127.0.0.1:8090
# Remote read (optional)
remote_read:
  - url: http://127.0.0.1:8090
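The rule_files entries above are commented out; as a minimal sketch of what such a file contains (the group name, alert name, and threshold are made up for illustration), a hypothetical first_rules.yml could look like this:
groups:
  - name: example            # hypothetical group name
    rules:
      - alert: InstanceDown  # hypothetical alert
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} has been down for 5 minutes"
Each group's rules are evaluated every evaluation_interval, and firing alerts are sent to the Alertmanager configured above.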
2.3 scrape_configs
The part of the Prometheus configuration you will touch most often is scrape_configs, for example to add a new scrape job or to change the address or scrape frequency of an existing one.
The minimal configuration is:
scrape_configs:
  - job_name: prometheus
    metrics_path: /metrics
    scheme: http
    static_configs:
      - targets:
          - localhost:9090
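Adding a new target is just a matter of appending another job to this list; for example, scraping a hypothetical node-exporter instance (the job name and the address 192.168.1.10:9100 are assumptions for illustration):
scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']
  - job_name: node-exporter            # hypothetical additional job
    scrape_interval: 30s
    static_configs:
      - targets: ['192.168.1.10:9100'] # hypothetical exporter address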
A full configuration looks like this (the prometheus-operator defaults are used as a reference):
# The job name appears as a label on the scraped metrics, e.g. data from node-exporter gets job=node-exporter
job_name: node-exporter
# scrape interval: 30s
scrape_interval: 30s
# scrape timeout: 10s
scrape_timeout: 10s
# path to scrape on the target
metrics_path: /metrics
# scrape scheme: http or https
scheme: https
# optional URL parameters appended to the scrape request (values are lists)
params:
  name: ['demo']
# how to handle conflicts between server-side labels and labels already present in the scraped data;
# by default (false) conflicting scraped labels are renamed to exported_<name>
honor_labels: false
# credentials for targets that require authentication
# (password and password_file are mutually exclusive; both are shown here only for reference)
basic_auth:
  username: admin
  password: admin
  password_file: /etc/pwd
# bearer token, inline or read from a file (bearer_token and bearer_token_file are likewise mutually exclusive)
bearer_token: kferkhjktdgjwkgkrwg
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
# TLS settings for https, e.g. skip verification or point at certificate files
tls_config:
  # insecure_skip_verify: true
  ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  server_name: kubernetes
  insecure_skip_verify: false
# proxy address
proxy_url: 127.9.9.0:9999
# Azure service discovery
azure_sd_configs:
# Consul service discovery
consul_sd_configs:
# DNS service discovery
dns_sd_configs:
# EC2 service discovery
ec2_sd_configs:
# OpenStack service discovery
openstack_sd_configs:
# file-based service discovery
file_sd_configs:
# GCE service discovery
gce_sd_configs:
# Marathon service discovery
marathon_sd_configs:
# Nerve (AirBnB) service discovery
nerve_sd_configs:
# Serverset (ZooKeeper) service discovery
serverset_sd_configs:
# Triton service discovery
triton_sd_configs:
# Kubernetes service discovery
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- monitoring
# static configuration of scrape targets, e.g. to attach fixed labels to them
static_configs:
- targets: ['localhost:9090', 'localhost:9191']
labels:
my: label
your: label
# Before scraping, relabel_configs rewrites label values dynamically based on the target's metadata,
# e.g. writing the raw __meta_kubernetes_namespace into a plain, readable namespace label
relabel_configs:
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: service
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_pod_name]
separator: ;
regex: (.*)
target_label: pod
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: job
replacement: ${1}
action: replace
- separator: ;
regex: (.*)
target_label: endpoint
replacement: web
action: replace
# relabeling applied to the scraped metrics themselves, e.g. to drop series that are not needed
metric_relabel_configs:
- source_labels: [__name__]
separator: ;
regex: etcd_(debugging|disk|request|server).*
replacement: $1
action: drop
# maximum number of samples accepted per scrape; if exceeded, the scrape fails; default 0 means no limit
sample_limit: 0
3. Service Discovery
The configuration above contains a number of *_sd_configs entries, such as kubernetes_sd_configs; these are the scrape configurations that use service discovery.
The supported service discovery types are:
// prometheus/discovery/config/config.go
type ServiceDiscoveryConfig struct {
StaticConfigs []*targetgroup.Group `yaml:"static_configs,omitempty"`
DNSSDConfigs []*dns.SDConfig `yaml:"dns_sd_configs,omitempty"`
FileSDConfigs []*file.SDConfig `yaml:"file_sd_configs,omitempty"`
ConsulSDConfigs []*consul.SDConfig `yaml:"consul_sd_configs,omitempty"`
ServersetSDConfigs []*zookeeper.ServersetSDConfig `yaml:"serverset_sd_configs,omitempty"`
NerveSDConfigs []*zookeeper.NerveSDConfig `yaml:"nerve_sd_configs,omitempty"`
MarathonSDConfigs []*marathon.SDConfig `yaml:"marathon_sd_configs,omitempty"`
KubernetesSDConfigs []*kubernetes.SDConfig `yaml:"kubernetes_sd_configs,omitempty"`
GCESDConfigs []*gce.SDConfig `yaml:"gce_sd_configs,omitempty"`
EC2SDConfigs []*ec2.SDConfig `yaml:"ec2_sd_configs,omitempty"`
OpenstackSDConfigs []*openstack.SDConfig `yaml:"openstack_sd_configs,omitempty"`
AzureSDConfigs []*azure.SDConfig `yaml:"azure_sd_configs,omitempty"`
TritonSDConfigs []*triton.SDConfig `yaml:"triton_sd_configs,omitempty"`
}
Because Prometheus pulls metrics, the server side has to decide which targets to scrape, i.e. the jobs configured in scrape_configs. The main drawback of the pull model is that it cannot by itself notice new services coming online, so most monitoring setups rely on a service discovery mechanism that automatically finds new endpoints in the cluster and adds them to the configuration.
Prometheus supports many service discovery mechanisms: file, DNS, Consul, Kubernetes, OpenStack, EC2, and so on. The discovery process itself is not complicated: through the API provided by the third party, Prometheus queries the list of targets that need to be monitored, then polls those targets for metrics.
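As a small illustration of one of these mechanisms, file-based discovery only needs a targets file that Prometheus re-reads periodically; a minimal sketch, with the file path and address made up for the example:
# prometheus.yml (excerpt)
scrape_configs:
  - job_name: file-sd-demo                  # hypothetical job
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.yml   # hypothetical path
        refresh_interval: 5m
# /etc/prometheus/targets/demo.yml
- targets: ['192.168.1.20:9100']            # hypothetical target
  labels:
    env: demo
Targets added to or removed from the file are picked up without restarting Prometheus.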
For Kubernetes, Prometheus talks to the Kubernetes API and then polls the resulting resource endpoints. Five discovery roles are currently supported: Node, Service, Pod, Endpoints, and Ingress, selected in the configuration with role: node, role: service, and so on.
For example, to discover all nodes dynamically, add a configuration like the following:
- job_name: kubernetes-nodes
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: null
role: node
namespaces:
names: []
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
action: replace
The discovered nodes then show up on the Targets page. (The relabel rules above rewrite __address__ to kubernetes.default.svc:443 and __metrics_path__ to /api/v1/nodes/<node>/proxy/metrics, so node metrics are scraped through the API server proxy.)
Services and pods are discovered in the same way.
Note that Prometheus must be authorized to access the Kubernetes API, which means giving it a ServiceAccount; without one, the discovery configuration alone has no permission to fetch anything.
The Prometheus RBAC setup is a ClusterRole + ClusterRoleBinding + ServiceAccount, with the ServiceAccount then referenced in the Deployment or StatefulSet.
ClusterRole.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
namespace: kube-system
name: prometheus
rules:
- apiGroups: [""]
resources:
- configmaps
- secrets
- nodes
- pods
- nodes/proxy
- services
- resourcequotas
- replicationcontrollers
- limitranges
- persistentvolumeclaims
- persistentvolumes
- namespaces
- endpoints
verbs: ["get", "list", "watch"]
- apiGroups: ["extensions"]
resources:
- daemonsets
- deployments
- replicasets
- ingresses
verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
resources:
- daemonsets
- deployments
- replicasets
- statefulsets
verbs: ["get", "list", "watch"]
- apiGroups: ["batch"]
resources:
- cronjobs
- jobs
verbs: ["get", "list", "watch"]
- apiGroups: ["autoscaling"]
resources:
- horizontalpodautoscalers
verbs: ["get", "list", "watch"]
- apiGroups: ["policy"]
resources:
- poddisruptionbudgets
verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
verbs: ["get"]
ClusterRoleBinding.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
namespace: kube-system
name: prometheus
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: ClusterRole
name: prometheus
subjects:
- kind: ServiceAccount
name: prometheus
namespace: kube-system
ServiceAccount.yaml
apiVersion: v1
kind: ServiceAccount
metadata:
namespace: kube-system
name: prometheus
prometheus.yaml
....
spec:
serviceAccountName: prometheus
....
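Putting it together, a minimal Deployment that references the ServiceAccount might look like the sketch below; the namespace matches the RBAC objects above, while the image tag is an assumption and volumes/resources are omitted:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus
  namespace: kube-system
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      serviceAccountName: prometheus        # the ServiceAccount defined above
      containers:
        - name: prometheus
          image: prom/prometheus:v2.7.1     # hypothetical image tag
          args:
            - --config.file=/etc/prometheus/prometheus.yml
          ports:
            - containerPort: 9090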
The complete Kubernetes scrape configuration looks like this:
- job_name: kubernetes-apiservers
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: null
role: endpoints
namespaces:
names: []
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
relabel_configs:
- source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
separator: ;
regex: default;kubernetes;https
replacement: $1
action: keep
- job_name: kubernetes-nodes
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: null
role: node
namespaces:
names: []
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: true
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics
action: replace
- job_name: kubernetes-cadvisor
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: https
kubernetes_sd_configs:
- api_server: null
role: node
namespaces:
names: []
bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
tls_config:
ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
insecure_skip_verify: false
relabel_configs:
- separator: ;
regex: __meta_kubernetes_node_label_(.+)
replacement: $1
action: labelmap
- separator: ;
regex: (.*)
target_label: __address__
replacement: kubernetes.default.svc:443
action: replace
- source_labels: [__meta_kubernetes_node_name]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
action: replace
- job_name: kubernetes-service-endpoints
scrape_interval: 1m
scrape_timeout: 10s
metrics_path: /metrics
scheme: http
kubernetes_sd_configs:
- api_server: null
role: endpoints
namespaces:
names: []
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
separator: ;
regex: "true"
replacement: $1
action: keep
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
separator: ;
regex: (https?)
target_label: __scheme__
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
separator: ;
regex: (.+)
target_label: __metrics_path__
replacement: $1
action: replace
- source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
separator: ;
regex: ([^:]+)(?::\d+)?;(\d+)
target_label: __address__
replacement: $1:$2
action: replace
- separator: ;
regex: __meta_kubernetes_service_label_(.+)
replacement: $1
action: labelmap
- source_labels: [__meta_kubernetes_namespace]
separator: ;
regex: (.*)
target_label: kubernetes_namespace
replacement: $1
action: replace
- source_labels: [__meta_kubernetes_service_name]
separator: ;
regex: (.*)
target_label: kubernetes_name
replacement: $1
action: replace
Once the configuration is in place, the corresponding targets show up on the Targets page.
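Note that the kubernetes-service-endpoints job above only keeps endpoints whose Service carries the prometheus.io/scrape annotation (the keep rule on __meta_kubernetes_service_annotation_prometheus_io_scrape). A Service annotated as in the sketch below is picked up automatically; the service name and port are assumptions:
apiVersion: v1
kind: Service
metadata:
  name: demo-app                     # hypothetical service
  annotations:
    prometheus.io/scrape: "true"     # required by the keep rule
    prometheus.io/port: "8080"       # rewritten into __address__
    prometheus.io/path: "/metrics"   # rewritten into __metrics_path__
spec:
  selector:
    app: demo-app
  ports:
    - name: http
      port: 8080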
4. Common Scenarios
- 1. Get information about every node in the cluster and group it by availability zone or region
When scraping node data with the Kubernetes role: node, the node's zone can be read from the failure-domain.beta.kubernetes.io/zone label, which is applied to each node when the cluster is created and is visible with kubectl describe node; service discovery exposes it as the meta label __meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone.
relabel_configs can then write it into a new label:
relabel_configs:
- source_labels: ["__meta_kubernetes_node_label_failure_domain_beta_kubernetes_io_zone"]
  regex: "(.*)"
  replacement: $1
  action: replace
  target_label: "zone"
Afterwards you can filter by zone directly, e.g. node{zone="XX"}.
- 2. Filter metrics, or split monitoring by role (development, operations)
People in different roles (development, test, operations) often care only about part of the monitoring data and may each run their own Prometheus server for the metrics they are interested in; data that is not needed should be filtered out so resources are not wasted, with a configuration like the following:
metric_relabel_configs:
- source_labels: [__name__]
separator: ;
regex: etcd_(debugging|disk|request|server).*
replacement: $1
action: drop
action: drop discards the matching series instead of ingesting them.
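Conversely, a team that only cares about a known whitelist of metrics can invert the logic with action: keep; the metric name pattern below is only an assumption for illustration:
metric_relabel_configs:
  - source_labels: [__name__]
    separator: ;
    regex: (node_cpu.*|node_memory.*)   # hypothetical whitelist
    action: keep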
- 3. Build a Prometheus federation to manage monitoring instances per IDC (region)
If there are multiple regions, each with many nodes or clusters, the standard federation setup works well: each region runs its own Prometheus server that scrapes its local data, and a global server then scrapes all the regional servers, presents everything in one place, and groups it by region.
Configuration:
scrape_configs:
- job_name: 'federate'
scrape_interval: 15s
honor_labels: true
metrics_path: '/federate'
params:
'match[]':
- '{job="prometheus"}'
- '{__name__=~"job:.*"}'
- '{__name__=~"node.*"}'
static_configs:
- targets:
- '192.168.77.11:9090'
- '192.168.77.12:9090'
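To make the per-region grouping explicit, each regional Prometheus can set an external label; the /federate endpoint attaches external labels to the exported samples, and with honor_labels: true the global server keeps them. The label name and value below are assumptions:
# prometheus.yml of a regional server
global:
  external_labels:
    region: bj-idc-01   # hypothetical region name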
This article is part of the Container Monitoring in Practice series; the full content is available at container-monitor-book.