[libbeat] Add a metrics observer to the queue #39774

Merged 40 commits on Jun 11, 2024

@faec faec commented May 30, 2024

Add a metrics observer to the queue, reporting the metrics:

  • queue.added.{events, bytes}, the number of events/bytes added to the queue
  • queue.consumed.{events, bytes}, the number of events/bytes sent to the outputs
  • queue.removed.{events, bytes}, the number of events/bytes removed from the queue after acknowledgment (queue.removed.events is an alias for the existing queue.acked).
  • queue.filled.{events, bytes}, the current number of events and bytes in the queue (gauges)

It also fixes the behavior of queue.filled.pct.events, renaming it queue.filled.pct.

All byte values reported by the memory queue are 0 if the output doesn't support early encoding.
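As a rough illustration of how these metrics relate to each other, here is a minimal Go sketch. It is not the code from this PR: the type queueMetrics and its methods are hypothetical, and it assumes that the filled gauges equal added minus removed (events stay in the queue until acknowledged), that the queue is limited by event count, and that the fill level is expressed as a fraction between 0 and 1.

```go
package main

import "fmt"

// queueMetrics is a hypothetical stand-in for the counters described above.
// It is not the observer type added by the PR.
type queueMetrics struct {
	maxEvents int // assumed capacity, in events

	addedEvents, addedBytes       int
	consumedEvents, consumedBytes int
	removedEvents, removedBytes   int
}

// add records one event entering the queue (queue.added.*).
func (m *queueMetrics) add(byteCount int) {
	m.addedEvents++
	m.addedBytes += byteCount
}

// consume records events handed to the outputs (queue.consumed.*).
func (m *queueMetrics) consume(eventCount, byteCount int) {
	m.consumedEvents += eventCount
	m.consumedBytes += byteCount
}

// remove records events dropped from the queue after acknowledgment
// (queue.removed.*).
func (m *queueMetrics) remove(eventCount, byteCount int) {
	m.removedEvents += eventCount
	m.removedBytes += byteCount
}

// filledEvents and filledBytes model the queue.filled.* gauges under the
// assumption that an event occupies the queue until it is removed.
func (m *queueMetrics) filledEvents() int { return m.addedEvents - m.removedEvents }
func (m *queueMetrics) filledBytes() int  { return m.addedBytes - m.removedBytes }

// filledPct models queue.filled.pct as a fraction of the event capacity.
// Both the unit and the denominator are assumptions of this sketch.
func (m *queueMetrics) filledPct() float64 {
	if m.maxEvents == 0 {
		return 0
	}
	return float64(m.filledEvents()) / float64(m.maxEvents)
}

func main() {
	m := &queueMetrics{maxEvents: 10}
	for i := 0; i < 3; i++ {
		m.add(100) // three events of 100 bytes each
	}
	m.consume(2, 200) // two events sent to the output
	m.remove(1, 100)  // one of them acknowledged
	fmt.Printf("filled: %d events, %d bytes, fraction %.2f\n",
		m.filledEvents(), m.filledBytes(), m.filledPct())
}
```

If the output does not support early encoding, the byte arguments in this sketch would simply all be 0, which matches the behavior described above for the memory queue.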

This required some refactoring of the pipeline. Previously, the pipeline used a single custom callback to track its only queue metric (queue.acked) from outputObserver, and the same callback managed a wait group that was used to drain the queue on pipeline shutdown. The main changes are:

  • A new interface type, queue.Observer, with an implementation queueObserver for standard metrics reporting.
  • queueMaxEvents and queueACKed were removed from pipeline.outputObserver, since their logic is now handled by queue.Observer.
  • A queue factory now takes a queue.Observer instead of an ACK callback
  • The queue API now includes a Done() channel that signals when all events are acked / shutdown is complete, so shutdown handling now waits on that channel in outputController.Close instead of the shared waitgroup in Pipeline.Close.
  • pipeline.outputObserver was renamed pipeline.retryObserver since its only remaining functions track retries and retry failures. It is now owned by eventConsumer (its only caller) instead of pipeline.outputController.
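The shape of this refactoring can be sketched in Go as follows. This is a simplified illustration under stated assumptions, not the PR's code: the Observer method names, toyQueue, and countingObserver are all hypothetical. It shows a queue constructed with an observer instead of an ACK callback, and a Done() channel that closes once the queue has been closed and every pending event is acknowledged.

```go
package main

import (
	"fmt"
	"sync"
)

// Observer is a hypothetical reduction of the queue.Observer idea:
// the queue reports its own activity instead of invoking an ACK callback.
type Observer interface {
	AddEvent(byteCount int)
	ConsumeEvents(eventCount, byteCount int)
	RemoveEvents(eventCount, byteCount int)
}

// countingObserver is an illustrative implementation that only counts events.
type countingObserver struct {
	mu                       sync.Mutex
	added, consumed, removed int
}

func (o *countingObserver) AddEvent(int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.added++
}

func (o *countingObserver) ConsumeEvents(eventCount, _ int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.consumed += eventCount
}

func (o *countingObserver) RemoveEvents(eventCount, _ int) {
	o.mu.Lock()
	defer o.mu.Unlock()
	o.removed += eventCount
}

// toyQueue tracks only how many events are still unacknowledged.
type toyQueue struct {
	mu       sync.Mutex
	observer Observer
	pending  int
	closing  bool
	done     chan struct{}
	doneOnce sync.Once
}

// newToyQueue plays the role of a queue factory that takes an Observer.
func newToyQueue(o Observer) *toyQueue {
	return &toyQueue{observer: o, done: make(chan struct{})}
}

// Publish adds one event; it is rejected once the queue is closing.
func (q *toyQueue) Publish(byteCount int) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.closing {
		return false
	}
	q.pending++
	q.observer.AddEvent(byteCount)
	return true
}

// Consume reports events handed to an output; they remain pending.
func (q *toyQueue) Consume(eventCount, byteCount int) {
	q.observer.ConsumeEvents(eventCount, byteCount)
}

// Ack removes acknowledged events and may complete shutdown.
func (q *toyQueue) Ack(eventCount, byteCount int) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.pending -= eventCount
	q.observer.RemoveEvents(eventCount, byteCount)
	q.finishIfDrained()
}

// Close stops intake; Done() fires once nothing is pending.
func (q *toyQueue) Close() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.closing = true
	q.finishIfDrained()
}

// finishIfDrained must be called with q.mu held.
func (q *toyQueue) finishIfDrained() {
	if q.closing && q.pending <= 0 {
		q.doneOnce.Do(func() { close(q.done) })
	}
}

// Done is closed when the queue is closed and fully acknowledged.
func (q *toyQueue) Done() <-chan struct{} { return q.done }

func main() {
	obs := &countingObserver{}
	q := newToyQueue(obs)
	q.Publish(10)
	q.Publish(20)
	q.Close()
	q.Consume(2, 30)
	q.Ack(2, 30)
	<-q.Done() // the caller waits here instead of on a shared wait group
	fmt.Println("added:", obs.added, "consumed:", obs.consumed, "removed:", obs.removed)
}
```

In this sketch the caller of Close only needs the queue itself to wait for the drain, which is the property the Done() channel is meant to provide in place of the shared wait group.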

The queue previously had a Metrics() call that was used in the shipper but didn't integrate with Beats metrics. It had no remaining callers, so I deleted it while adding the new helpers.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding changes to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@faec faec added enhancement Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team labels May 30, 2024
@faec faec self-assigned this May 30, 2024
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels May 30, 2024

mergify bot commented May 30, 2024

This pull request does not have a backport label.
If this is a bug or security fix, could you label this PR @faec? 🙏.
For such, you'll need to label your PR with:

  • The upcoming major version of the Elastic Stack
  • The upcoming minor version of the Elastic Stack (if you're not pushing a breaking change)

To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-v8./d.0 is the label to automatically backport to the 8./d branch. /d is the digit

@faec faec marked this pull request as ready for review June 5, 2024 21:12
@faec faec requested a review from a team as a code owner June 5, 2024 21:12
@elasticmachine

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)


mergify bot commented Jun 6, 2024

This pull request is now in conflicts. Could you fix it? 🙏
To fix up this pull request, you can check it out locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b queue-metrics upstream/queue-metrics
git merge upstream/main
git push upstream queue-metrics


@leehinman leehinman left a comment

Code looks ok, two requests.

  1. can we add some user facing documentation on the fields? might be /docs/metrics-in-logs.asciidoc or maybe there is someplace better?

  2. (optional) Diskqueue does support compression, might be worth a double check that the bytes reported make sense if compression is enabled.

@faec faec merged commit f8aedce into elastic:main Jun 11, 2024
107 of 109 checks passed
@faec faec deleted the queue-metrics branch June 11, 2024 20:51
@cmacknz

cmacknz commented Jun 12, 2024

It also fixes the behavior of queue.filled.pct.events, renaming it queue.filled.pct.

This change requires a small follow up in the elastic_agent package. See elastic/integrations#9765 which references queue.filled.pct.events

@cmacknz

cmacknz commented Jul 8, 2024

Looking at the metrics coming out of an 8.15.0-SNAPSHOT Filebeat I still see queue.filled.pct.events

      "pipeline": {
        "clients": 7,
        "events": {
          "active": 0,
          "dropped": 0,
          "failed": 0,
          "filtered": 0,
          "published": 246036,
          "retry": 0,
          "total": 246036
        },
        "queue": {
          "acked": 0,
          "filled": {
            "pct": {
              "events": 0
            }
          },
          "max_events": 0
        }
      }
    },
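For consumers that need to handle both field names during the transition, one possible approach is sketched below in Go. The function filledPctLayout and the struct queueSnapshot are hypothetical helpers for illustration, and the scalar form of filled.pct is an assumption based on the rename described in the PR, not a captured sample. The sketch inspects the "queue" object of a metrics snapshot and reports which layout it uses.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// queueSnapshot captures only the part of the "queue" object needed here.
// The raw message defers the decision about the shape of "pct".
type queueSnapshot struct {
	Filled struct {
		Pct json.RawMessage `json:"pct"`
	} `json:"filled"`
}

// filledPctLayout returns the metric name matching the layout found in
// queueJSON, together with its value.
func filledPctLayout(queueJSON []byte) (string, float64, error) {
	var snap queueSnapshot
	if err := json.Unmarshal(queueJSON, &snap); err != nil {
		return "", 0, err
	}
	if len(snap.Filled.Pct) == 0 {
		return "", 0, fmt.Errorf("no filled.pct field present")
	}

	// Assumed renamed layout: "pct" is a plain number.
	var scalar float64
	if err := json.Unmarshal(snap.Filled.Pct, &scalar); err == nil {
		return "queue.filled.pct", scalar, nil
	}

	// Older layout, as in the snapshot above: "pct" is an object
	// with an "events" member.
	var nested struct {
		Events *float64 `json:"events"`
	}
	if err := json.Unmarshal(snap.Filled.Pct, &nested); err == nil && nested.Events != nil {
		return "queue.filled.pct.events", *nested.Events, nil
	}
	return "", 0, fmt.Errorf("unrecognized filled.pct layout")
}

func main() {
	old := []byte(`{"acked": 0, "filled": {"pct": {"events": 0}}, "max_events": 0}`)
	name, value, err := filledPctLayout(old)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(name, value)
}
```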

@nimarezainia

@faec (thanks for these changes) QQ: 'queue.filled.pct' is a float (%). Is this a percentage of memory? I'd imagine it's memory, since the queue size is not defined in events.
