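Official reference: https://nacos.io/zh-cn/docs/monitor-guide.html
Exposing metrics data requires a configuration change. Spring Boot allows system configuration to be overridden through environment variables, so in a Kubernetes environment the metrics can be exposed by adding an environment variable.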
kind: StatefulSet
apiVersion: apps/v1
metadata:
  name: nacos
  namespace: public
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nacos
  template:
    metadata:
      labels:
        app: nacos
    spec:
      containers:
        - name: nacos
          image: nacos/nacos-server:v2.1.0
          env:
            ...
            # Add this environment variable
            - name: management.endpoints.web.exposure.include
              value: '*'
            ...
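After the environment variable is added, the nacos Pods restart. Once they are running again, enter a Pod and run the following command to check whether metrics data is returned:

curl localhost:8848/nacos/actuator/prometheus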
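To confirm that the endpoint exposes Nacos metrics and not only JVM ones, the output can be scanned for metric names. The following is a minimal sketch rather than part of the original setup: the `metric_names` helper and the `SAMPLE_OUTPUT` text are hypothetical, and the sample values are made up. In practice the text would be the output of the curl command above.

```python
import re

# Matches the metric name at the start of a sample line,
# e.g. "nacos_monitor{...} 3.0" or "jvm_threads_daemon 42.0".
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{|\s)")


def metric_names(exposition_text):
    """Return the set of metric names found in Prometheus text-format output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank, HELP and TYPE lines
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


# Hypothetical sample shaped like the endpoint's output (values are made up).
SAMPLE_OUTPUT = """\
# HELP jvm_threads_daemon The current number of live daemon threads
# TYPE jvm_threads_daemon gauge
jvm_threads_daemon 42.0
nacos_monitor{module="naming",name="serviceCount",} 3.0
nacos_exception_total{module="config",name="db",} 0.0
"""

if __name__ == "__main__":
    found = metric_names(SAMPLE_OUTPUT)
    # The Nacos-specific metrics should be present once the endpoint is exposed.
    print("nacos_monitor" in found, "nacos_exception_total" in found)
```

If `nacos_monitor` is missing from the real output, the environment variable has most likely not taken effect in that Pod.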
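Ingesting the data into Prometheus
Choose either automatic or manual ingestion, depending on your requirements.
Automatic ingestion
Automatic ingestion relies on Prometheus service discovery: targets are found through the Kubernetes endpoints role. Prometheus configuration: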
global:
  scrape_interval: 30s
  evaluation_interval: 30s
scrape_configs:
  - job_name: 'service_endpoints_metrics'
    kubernetes_sd_configs:
      - role: endpoints
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
      # Extra labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_service_name]
        target_label: service_name
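The `__address__` rule is the least obvious part of this configuration. Prometheus joins the values of `source_labels` with `;` by default, matches the fully anchored regex against the joined string, and writes `$1:$2` into the target label. The sketch below mimics that behaviour in Python to show what the rule produces. The `rewrite_address` function and the IP addresses are hypothetical, and Python's `re` stands in for the RE2 engine that Prometheus uses.

```python
import re

# Same pattern as the __address__ relabel rule. Prometheus anchors relabel
# regexes at both ends, so fullmatch() is used here.
ADDRESS_RULE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address, port_annotation):
    """Mimic the rule: join the source labels with ';', then apply '$1:$2'."""
    joined = f"{address};{port_annotation}"
    match = ADDRESS_RULE.fullmatch(joined)
    if match is None:
        # With action 'replace', a non-matching regex leaves the label unchanged.
        return address
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    # An endpoint discovered on port 9848 is redirected to the annotated port.
    print(rewrite_address("10.0.0.5:9848", "8848"))
    # Without a prometheus.io/port annotation the address stays as discovered.
    print(rewrite_address("10.0.0.5:9848", ""))
```

Because the Service exposes several ports, every discovered endpoint address is rewritten to the annotated port, and the scrape always goes to the port named in `prometheus.io/port`.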
Modify the nacos Service:
apiVersion: v1
kind: Service
metadata:
  name: nacos-svc
  namespace: public
  labels:
    app: nacos
  annotations:
    # Add the following annotations so that Prometheus discovers the Service automatically
    prometheus.io/path: /nacos/actuator/prometheus
    prometheus.io/port: "8848"
    prometheus.io/scrape: "true"
spec:
  ports:
    - port: 8848
      name: server
      targetPort: 8848
    - port: 9848
      name: client-rpc
      targetPort: 9848
    - port: 9849
      name: raft-rpc
      targetPort: 9849
    ## Election port kept for compatibility with 1.4.x
    - port: 7848
      name: old-raft-rpc
      targetPort: 7848
  type: ClusterIP
  selector:
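Then check the Prometheus targets page to confirm that the nacos endpoints appear.
Manual ingestion
If automatic discovery is not possible for whatever reason, the targets can be added to Prometheus through static configuration: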
scrape_configs:
  - job_name: nacos
    honor_timestamps: true
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: '/nacos/actuator/prometheus'
    scheme: http
    follow_redirects: true
    static_configs:
      - targets:
          - '{ip1}:8848'
          - '{ip2}:8848'
          - '{ip3}:8848'
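Meaning of the Nacos metrics
JVM metrics

| Metric | Meaning |
| --- | --- |
| system_cpu_usage | CPU usage |
| system_load_average_1m | System load |
| jvm_memory_used_bytes | Memory used in bytes, across all memory areas |
| jvm_memory_max_bytes | Maximum memory in bytes, across all memory areas |
| jvm_gc_pause_seconds_count | Number of GCs, across all GC types |
| jvm_gc_pause_seconds_sum | Time spent in GC, across all GC types |
| jvm_threads_daemon | Number of threads |

Nacos monitoring metrics

| Metric | Meaning |
| --- | --- |
| http_server_requests_seconds_count | Number of HTTP requests, broken down by (url, method, code) |
| http_server_requests_seconds_sum | Total time spent on HTTP requests, broken down by (url, method, code) |
| nacos_timer_seconds_sum | Nacos config: time spent on horizontal notifications |
| nacos_timer_seconds_count | Nacos config: number of horizontal notifications |
| nacos_monitor{name='longPolling'} | Nacos config: number of long connections |
| nacos_monitor{name='configCount'} | Nacos config: number of configurations |
| nacos_monitor{name='dumpTask'} | Nacos config: backlog of dump-to-disk tasks |
| nacos_monitor{name='notifyTask'} | Nacos config: backlog of horizontal notification tasks |
| nacos_monitor{name='getConfig'} | Nacos config: configuration read count |
| nacos_monitor{name='publish'} | Nacos config: configuration write count |
| nacos_monitor{name='ipCount'} | Nacos naming: number of IPs |
| nacos_monitor{name='domCount'} | Nacos naming: number of domains (1.x) |
| nacos_monitor{name='serviceCount'} | Nacos naming: number of domains (2.x) |
| nacos_monitor{name='failedPush'} | Nacos naming: number of failed pushes |
| nacos_monitor{name='avgPushCost'} | Nacos naming: average push time |
| nacos_monitor{name='leaderStatus'} | Nacos naming: role status |
| nacos_monitor{name='maxPushCost'} | Nacos naming: maximum push time |
| nacos_monitor{name='mysqlhealthCheck'} | Nacos naming: number of MySQL health checks |
| nacos_monitor{name='httpHealthCheck'} | Nacos naming: number of HTTP health checks |
| nacos_monitor{name='tcpHealthCheck'} | Nacos naming: number of TCP health checks |
| nacos_monitor{module="naming",name="serviceCount",} | Number of services registered in Nacos |

Nacos exception metrics

| Metric | Meaning |
| --- | --- |
| nacos_exception_total{name='db'} | Database exception |
| nacos_exception_total{name='configNotify'} | Nacos config: horizontal notification failure |
| nacos_exception_total{name='unhealth'} | Nacos config: health check exception between servers |
| nacos_exception_total{name='disk'} | Nacos naming: disk write exception |
| nacos_exception_total{name='leaderSendBeatFailed'} | Nacos naming: leader failed to send heartbeat |
| nacos_exception_total{name='illegalArgument'} | Illegal request parameter |
| nacos_exception_total{name='nacos'} | Nacos internal error while handling a request (read/write failure, no permission, invalid parameter) |

Client metrics

| Metric | Meaning |
| --- | --- |
| nacos_monitor{name='subServiceCount'} | Number of subscribed services |
| nacos_monitor{name='pubServiceCount'} | Number of published services |
| nacos_monitor{name='configListenSize'} | Number of configurations being listened to |
| nacos_client_request_seconds_count | Number of requests, broken down by (url, method, code) |
| nacos_client_request_seconds_sum | Total time spent on requests, broken down by (url, method, code) |

Defining alert rules
Alerts are defined on the exception metrics by editing the Prometheus alert rule file. The rules below have not been verified yet.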
groups:
  - name: nacos alerts
    rules:
      - alert: "Nacos registered service count abnormal"
        expr: nacos_monitor{module="naming",name="serviceCount",} < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}}, services currently registered in nacos: {{$value}}"
          summary: "Registered service count abnormal"

      - alert: "Nacos naming leader failed to send heartbeat"
        expr: increase(nacos_exception_total{name='leaderSendBeatFailed'}[1m]) != 0
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}}, heartbeat send failures: {{$value}}"
          summary: "Nacos naming leader failed to send heartbeat"

      - alert: "Nacos config long connections above 5000"
        expr: sum by (project,instance,dept) (irate(nacos_monitor{name='longPolling'}[5m])) > 5000
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}}, long connections: {{$value}}"
          summary: "Nacos config long connections above 5000"

      - alert: "Nacos config health check exception between servers"
        expr: sum by (project,instance,dept) (rate(nacos_exception_total{name='unhealth'}[1m])) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}} config unhealth exception alert"
          summary: "Nacos config health check exception between servers"

      - alert: "Nacos naming failed pushes above 1"
        expr: sum by (project,instance,dept) (irate(nacos_monitor{name='failedPush'}[5m])) > 1
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}}, failed pushes: {{$value}}"
          summary: "Nacos naming failed pushes above 1"

      - alert: "Nacos naming disk write exception"
        expr: sum by (project,instance,dept) (rate(nacos_exception_total{name='disk'}[1m])) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}} disk write exception"
          summary: "Nacos naming disk write exception"

      - alert: "Nacos config horizontal notification failure"
        expr: sum by (project,instance,dept) (rate(nacos_exception_total{name='configNotify'}[1m])) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}} horizontal notification failure"
          summary: "Nacos config horizontal notification failure"

      - alert: "Nacos internal request error (read/write failure, no permission, invalid parameter)"
        expr: sum by (project,instance,dept) (rate(nacos_exception_total{name='nacos'}[1m])) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          description: "Project: {{$labels.project}} read/write failure"
          summary: "Nacos internal request error (read/write failure, no permission, invalid parameter)"
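Since these rules are still unverified, it helps to be clear about what a threshold such as `rate(...[1m]) > 1` means: `rate()` returns a per-second average, so `> 1` fires only above roughly 60 exceptions per minute. The sketch below illustrates the arithmetic under simplifying assumptions. `simple_rate` is a hypothetical helper, the sample values are made up, and the real Prometheus `rate()` additionally extrapolates to the edges of the time window.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        # A drop means the counter was reset (for example after a restart),
        # so the new value is counted as growth from zero.
        increase += current - previous if current >= previous else current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


if __name__ == "__main__":
    # 60 exceptions in 60 seconds: exactly 1 per second, so "> 1" does not fire.
    print(simple_rate([(0, 10), (30, 40), (60, 70)]))
    # 90 exceptions in 60 seconds: 1.5 per second, so "> 1" would fire.
    print(simple_rate([(0, 0), (30, 45), (60, 90)]))
```

If the intent is to alert on any exception at all, a threshold of `> 0` or the `increase(...) != 0` form used in the heartbeat rule may be closer to what is wanted; which one fits depends on how noisy the exception counters are in the target environment.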