[libbeat] Add a metrics observer to the queue (#39774)
Add a metrics observer to the queue, reporting the metrics:
- `queue.added.{events, bytes}`, the number of events/bytes added to the queue
- `queue.consumed.{events, bytes}`, the number of events/bytes sent to the outputs
- `queue.removed.{events, bytes}`, the number of events/bytes removed from the queue after acknowledgment (`queue.removed.events` is an alias for the existing `queue.acked`).
- `queue.filled.{events, bytes}`, the current number of events and bytes in the queue (gauges)
It also fixes the behavior of `queue.filled.pct.events`, renaming it to `queue.filled.pct`.
All byte values reported by the memory queue are 0 if the output doesn't support early encoding.
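For readers unfamiliar with how such metrics are wired up, here is a rough, hypothetical sketch of registering counters and gauges like these against a libbeat monitoring registry. The registry layout, helper name, and field set are illustrative assumptions, not the PR's actual code:

```go
// Illustrative sketch only: registering queue counters and gauges with
// libbeat's monitoring package. The PR's real wiring may differ.
package queuemetrics

import "github.com/elastic/elastic-agent-libs/monitoring"

type metrics struct {
	addedEvents    *monitoring.Uint  // queue.added.events
	addedBytes     *monitoring.Uint  // queue.added.bytes
	consumedEvents *monitoring.Uint  // queue.consumed.events
	removedEvents  *monitoring.Uint  // queue.removed.events (alias of queue.acked)
	filledEvents   *monitoring.Uint  // queue.filled.events (gauge: current fill level)
	filledPct      *monitoring.Float // queue.filled.pct (gauge: 0.0 to 1.0)
}

func newMetrics(parent *monitoring.Registry) *metrics {
	reg := parent.NewRegistry("queue")
	return &metrics{
		addedEvents:    monitoring.NewUint(reg, "added.events"),
		addedBytes:     monitoring.NewUint(reg, "added.bytes"),
		consumedEvents: monitoring.NewUint(reg, "consumed.events"),
		removedEvents:  monitoring.NewUint(reg, "removed.events"),
		filledEvents:   monitoring.NewUint(reg, "filled.events"),
		filledPct:      monitoring.NewFloat(reg, "filled.pct"),
	}
}
```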
This required some refactoring of the pipeline, which previously used a single custom callback in `outputObserver` to track its only queue metric (`queue.acked`), and also used that callback to manage a wait group that drained the queue on pipeline shutdown. The main changes (sketched roughly after this list) are:
- A new interface type, `queue.Observer`, with an implementation `queueObserver` for standard metrics reporting.
- `queueMaxEvents` and `queueACKed` were removed from `pipeline.outputObserver`, since their logic is now handled by `queue.Observer`.
- A queue factory now takes a `queue.Observer` instead of an ACK callback.
- The queue API now includes a `Done()` channel that signals when all events are acked / shutdown is complete, so shutdown handling now waits on that channel in `outputController.Close` instead of the shared waitgroup in `Pipeline.Close`.
- `pipeline.outputObserver` was renamed `pipeline.retryObserver` since its only remaining functions track retries and retry failures. It is now owned by `eventConsumer` (its only caller) instead of `pipeline.outputController`.
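A minimal sketch of the shapes described above; the method names and signatures here are hypothetical, not the PR's literal definitions:

```go
// Hypothetical sketch of the interfaces described in the list above.
package queue

// Observer is notified as events move through the queue, so the
// queue.added/consumed/removed/filled metrics can be updated in one place.
type Observer interface {
	AddEvent(byteSize int)              // an event was inserted by a producer
	ConsumeEvents(count int, bytes int) // a batch was handed to the output workers
	RemoveEvents(count int, bytes int)  // acknowledged events were deleted from the queue
}

// Queue shows the shutdown-related part of the queue API.
type Queue interface {
	Close() error
	// Done is closed once all events are acknowledged and shutdown is
	// complete, replacing the shared wait group previously owned by
	// Pipeline.Close.
	Done() <-chan struct{}
}
```

With this shape, shutdown in `outputController.Close` reduces to calling `Close()` and then blocking on `<-q.Done()`.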
The queue previously had a `Metrics()` call that was used in the shipper but didn't integrate with Beats metrics. It had no remaining callers, so I deleted it while adding the new helpers.
Changed file: `libbeat/docs/metrics-in-logs.asciidoc` (44 additions, 5 deletions)
```diff
@@ -2,11 +2,11 @@
 
 Every 30 seconds (by default), {beatname_uc} collects a _snapshot_ of metrics about itself. From this snapshot, {beatname_uc} computes a _delta snapshot_; this delta snapshot contains any metrics that have _changed_ since the last snapshot. Note that the values of the metrics are the values when the snapshot is taken, _NOT_ the _difference_ in values from the last snapshot.
 
-If this delta snapshot contains _any_ metrics (indicating at least one metric that has changed since the last snapshot), this delta snapshot is serialized as JSON and emitted in {beatname_uc}'s logs at the `INFO` log level. Here is an example of such a log entry:
+If this delta snapshot contains _any_ metrics (indicating at least one metric that has changed since the last snapshot), this delta snapshot is serialized as JSON and emitted in {beatname_uc}'s logs at the `INFO` log level. Most snapshot fields report the change in the metric since the last snapshot, however some fields are _gauges_, which always report the current value. Here is an example of such a log entry:
 
 [source,json]
 ----
-{"log.level":"info","@timestamp":"2023-07-14T12:50:36.811Z","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":692690,"time":{"ms":60}},"total":{"ticks":3167250,"time":{"ms":150},"value":3167250},"user":{"ticks":2474560,"time":{"ms":90}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":32},"info":{"ephemeral_id":"2bab8688-34c0-4522-80af-db86948d547d","uptime":{"ms":617670096},"version":"8.6.2"},"memstats":{"gc_next":57189272,"memory_alloc":43589824,"memory_total":275281335792,"rss":183574528},"runtime":{"goroutines":212}},"filebeat":{"events":{"active":5,"added":52,"done":49},"harvester":{"open_files":6,"running":6,"started":1}},"libbeat":{"config":{"module":{"running":15}},"output":{"events":{"acked":48,"active":0,"batches":6,"total":48},"read":{"bytes":210},"write":{"bytes":26923}},"pipeline":{"clients":15,"events":{"active":5,"filtered":1,"published":51,"total":52},"queue":{"acked":48}}},"registrar":{"states":{"current":14,"update":49},"writes":{"success":6,"total":6}},"system":{"load":{"1":0.91,"15":0.37,"5":0.4,"norm":{"1":0.1138,"15":0.0463,"5":0.05}}}},"ecs.version":"1.6.0"}}
+{"log.level":"info","@timestamp":"2023-07-14T12:50:36.811Z","log.logger":"monitoring","log.origin":{"file.name":"log/log.go","file.line":187},"message":"Non-zero metrics in the last 30s","service.name":"filebeat","monitoring":{"metrics":{"beat":{"cgroup":{"memory":{"mem":{"usage":{"bytes":0}}}},"cpu":{"system":{"ticks":692690,"time":{"ms":60}},"total":{"ticks":3167250,"time":{"ms":150},"value":3167250},"user":{"ticks":2474560,"time":{"ms":90}}},"handles":{"limit":{"hard":1048576,"soft":1048576},"open":32},"info":{"ephemeral_id":"2bab8688-34c0-4522-80af-db86948d547d","uptime":{"ms":617670096},"version":"8.6.2"},"memstats":{"gc_next":57189272,"memory_alloc":43589824,"memory_total":275281335792,"rss":183574528},"runtime":{"goroutines":212}},"filebeat":{"events":{"active":5,"added":52,"done":49},"harvester":{"open_files":6,"running":6,"started":1}},"libbeat":{"config":{"module":{"running":15}},"output":{"events":{"acked":48,"active":0,"batches":6,"total":48},"read":{"bytes":210},"write":{"bytes":26923}},"pipeline":{"clients":15,"events":{"active":5,"filtered":1,"published":51,"total":52},"queue":{"max_events":3500,"filled":{"events":5,"bytes":6425,"pct":0.0014},"added":{"events":52,"bytes":65702},"consumed":{"events":52,"bytes":65702},"removed":{"events":48,"bytes":59277},"acked":48}}},"registrar":{"states":{"current":14,"update":49},"writes":{"success":6,"total":6}},"system":{"load":{"1":0.91,"15":0.37,"5":0.4,"norm":{"1":0.1138,"15":0.0463,"5":0.05}}}},"ecs.version":"1.6.0"}}
 ----
 
 [discrete]
```
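As an aside on the sentence added above, here is a loose illustration of the difference between counter-like fields, which are reported as the change since the previous snapshot, and gauge fields, which always report their current value. This is an assumption-laden sketch, not Beats' actual monitoring code:

```go
// Loose illustration of delta-snapshot reporting: counters report the change
// since the last snapshot, gauges report their current value.
package main

import "fmt"

func reportField(prev, cur float64, isGauge bool) (value float64, include bool) {
	if isGauge {
		return cur, cur != prev // e.g. queue.filled.pct: emit the current value when it changed
	}
	return cur - prev, cur != prev // e.g. queue.added.events: emit the increase over the interval
}

func main() {
	v, ok := reportField(100, 148, false)
	fmt.Println(v, ok) // 48 true: 48 events were added during the last interval
}
```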
```diff
@@ -113,6 +113,24 @@ Focussing on the `.monitoring.metrics` field, and formatting the JSON, it's valu
         "total": 52
       },
       "queue": {
+        "max_events": 3500,
+        "filled": {
+          "events": 5,
+          "bytes": 6425,
+          "pct": 0.0014
+        },
+        "added": {
+          "events": 52,
+          "bytes": 65702
+        },
+        "consumed": {
+          "events": 52,
+          "bytes": 65702
+        },
+        "removed": {
+          "events": 48,
+          "bytes": 59277
+        },
         "acked": 48
       }
     }
```
```diff
@@ -130,12 +148,12 @@ Focussing on the `.monitoring.metrics` field, and formatting the JSON, it's valu
     "system": {
       "load": {
         "1": 0.91,
-        "5": 0.4,
         "15": 0.37,
+        "5": 0.4,
         "norm": {
           "1": 0.1138,
-          "5": 0.05,
-          "15": 0.0463
+          "15": 0.0463,
+          "5": 0.05
         }
       }
     }
```
```diff
@@ -170,9 +188,30 @@ endif::[]
 | `.output.events.total` | Integer | Number of events currently being processed by the output. | If this number grows over time, it may indicate that the output destination (e.g. {ls} pipeline or {es} cluster) is not able to accept events at the same or faster rate than what {beatname_uc} is sending to it.
 | `.output.events.acked` | Integer | Number of events acknowledged by the output destination. | Generally, we want this number to be the same as `.output.events.total` as this indicates that the output destination has reliably received all the events sent to it.
 | `.output.events.failed` | Integer | Number of events that {beatname_uc} tried to send to the output destination, but the destination failed to receive them. | Generally, we want this field to be absent or its value to be zero. When the value is greater than zero, it's useful to check {beatname_uc}'s logs right before this log entry's `@timestamp` to see if there are any connectivity issues with the output destination. Note that failed events are not lost or dropped; they will be sent back to the publisher pipeline for retrying later.
+| `.output.events.dropped` | Integer | Number of events that {beatname_uc} gave up sending to the output destination because of a permanent (non-retryable) error.
+| `.output.events.dead_letter` | Integer | Number of events that {beatname_uc} successfully sent to a configured dead letter index after they failed to ingest in the primary index.
 | `.output.write.latency` | Object | Reports statistics on the time to send an event to the connected output, in milliseconds. This can be used to diagnose delays and performance issues caused by I/O or output configuration. This metric is available for the Elasticsearch, file, redis, and logstash outputs.
 |===
 
+[cols="1,1,2,2"]
+|===
+| Field path (relative to `.monitoring.metrics.libbeat.pipeline`) | Type | Meaning | Troubleshooting hints
+
+| `.queue.max_events` | Integer (gauge) | The queue's maximum event count if it has one, otherwise zero.
+| `.queue.max_bytes` | Integer (gauge) | The queue's maximum byte count if it has one, otherwise zero.
+| `.queue.filled.events` | Integer (gauge) | Number of events currently stored by the queue. |
+| `.queue.filled.bytes` | Integer (gauge) | Number of bytes currently stored by the queue. |
+| `.queue.filled.pct` | Float (gauge) | How full the queue is relative to its maximum size, as a fraction from 0 to 1. | Low throughput while `queue.filled.pct` is low means congestion in the input. Low throughput while `queue.filled.pct` is high means congestion in the output.
+| `.queue.added.events` | Integer | Number of events added to the queue by input workers. |
+| `.queue.added.bytes` | Integer | Number of bytes added to the queue by input workers. |
+| `.queue.consumed.events` | Integer | Number of events sent to output workers. |
+| `.queue.consumed.bytes` | Integer | Number of bytes sent to output workers. |
+| `.queue.removed.events` | Integer | Number of events removed from the queue after being processed by output workers. |
+| `.queue.removed.bytes` | Integer | Number of bytes removed from the queue after being processed by output workers. |
+|===
+
+When using the memory queue, byte metrics are only set if the output supports them. Currently only the Elasticsearch output supports byte metrics.
```
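To make the `.queue.filled.pct` row concrete, a quick worked example using the values from the sample log entry above. It assumes an event-limited memory queue, where the fraction corresponds to `filled.events / max_events`:

```go
// Worked example: the sample log entry reports filled.events=5 and
// max_events=3500, so filled.pct ≈ 5/3500 ≈ 0.0014, matching the example.
package main

import "fmt"

func main() {
	filledEvents, maxEvents := 5.0, 3500.0
	fmt.Printf("queue.filled.pct ≈ %.4f\n", filledEvents/maxEvents) // 0.0014
	// Rule of thumb from the table above:
	//   low throughput with a low filled.pct  -> congestion in the input
	//   low throughput with a high filled.pct -> congestion in the output
}
```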