Skip to content

plugin: Add new reconcile metrics #1360

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

sharnoff
Copy link
Member

There's two metrics added here:

  1. Number of reconcile workers
    This mirrors the existing controller_runtime_max_concurrent_reconciles metric we have from neonvm-controller, which allows us to determine the fraction of total worker-time that we're using (a measure of saturation).

  2. Total duration with items in the queue
    This is roughly analogous to the Linux kernel's CPU PSI metric — other metrics of saturation (using total time spent reconciling or total time in the queue) are useful, but they tend to be easy to misinterpret when the amount of saturation is very skewed across the duration between metric samples. So the idea here is to help get a more accurate picture.

Resolves neondatabase/cloud#27613.

@sharnoff sharnoff force-pushed the sharnoff/plugin-reconcile-metrics branch from e4d8f3f to 02c4f80 Compare April 24, 2025 13:23
@sharnoff sharnoff changed the base branch from main to sharnoff/plugin-init-metrics April 24, 2025 13:24
Copy link
Contributor

@Omrigan Omrigan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some naming suggestions, otherwise LGTM.

@sharnoff sharnoff assigned sharnoff and unassigned Omrigan May 13, 2025
@sharnoff sharnoff force-pushed the sharnoff/plugin-init-metrics branch from e8be52f to c561790 Compare May 13, 2025 18:44
Base automatically changed from sharnoff/plugin-init-metrics to main May 13, 2025 19:08
sharnoff added 2 commits May 13, 2025 20:57
Implemented with a new WaitingTimer abstraction, which we use to track
the aggregate amount of time where the queue is non-empty.

The change to .golangci.yml is in order to exempt prometheus.Opts from
exhaustruct requirements.
@sharnoff sharnoff force-pushed the sharnoff/plugin-reconcile-metrics branch from 02c4f80 to d55af20 Compare May 13, 2025 19:58
@sharnoff
Copy link
Member Author

Realized this doesn't handle objects being retried with a delay correctly (if still waiting, they'd be incorrectly counted towards the queue being non-empty when technically they're not ready to be used yet).

Need to fix that before merging.

Copy link
Contributor

@Omrigan Omrigan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment from Em

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants