test: Add multi node consolidation metrics #2645
Conversation
This adds three new Prometheus metrics to monitor the efficiency of the binary search algorithm used in multi-node consolidation:

1. karpenter_voluntary_disruption_multi_node_consolidation_iterations
   - Histogram tracking the number of binary search iterations needed
   - Helps identify whether the algorithm is performing efficiently (O(log n))
2. karpenter_voluntary_disruption_multi_node_consolidation_batch_size
   - Gauge showing the actual number of nodes consolidated together
   - Labeled by decision type (delete vs replace)
   - Helps understand consolidation effectiveness
3. karpenter_voluntary_disruption_multi_node_consolidation_failed_iteration_duration_seconds
   - Histogram tracking cumulative time spent on failed iterations
   - Identifies computational waste in the binary search process

These metrics provide visibility into the multi-node consolidation algorithm's performance and help identify optimization opportunities.
Binary search over at most 100 candidates takes at most ⌈log₂(100)⌉ = 7 iterations, so the unnecessary buckets beyond 7 were removed.
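The iteration bound above can be checked with a short stdlib-only sketch (maxIterations is a hypothetical helper written for illustration, not part of this PR's code):

```go
package main

import (
	"fmt"
	"math"
)

// maxIterations returns the worst-case number of binary search
// iterations over a search space of n candidates: ceil(log2(n)).
func maxIterations(n int) int {
	if n <= 1 {
		return 0
	}
	return int(math.Ceil(math.Log2(float64(n))))
}

func main() {
	// For the 100-candidate cap, the worst case is 7 iterations,
	// so histogram buckets beyond 7 would never be populated.
	fmt.Println(maxIterations(100)) // 7
}
```

This is why the iterations histogram only needs buckets up to 7.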
Welcome @clbar-aws!
Hi @clbar-aws. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 19585306646

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
jamesmt-aws
left a comment
Thanks for sending this out! I had a couple of questions about how we're thinking about using these.
iterationCount := 0
var failedIterationTime time.Duration
Imagine a case where the binary search succeeds after two rounds. So the whole process could be encoded as FFS, for Fail Fail Succeed. The time for F and S are tracked separately under the current design, so FFS increments both timers even though the aggregate operation eventually succeeded.
So FFS today, there's allocation of time to both the F timer and the S timer. But I could imagine an argument where the metrics aren't per-call, but are per-search. So FFFFFFF would be a fail, but S, FS, FFS, etc would all be successes. What's the right thing to do from an analysis point of view with FFS?
I wonder if we want something more like total_failed_iteration_duration, and then have a second metric called something like wasted_search_duration that we only increment when lastSavedCommand.Decision() == NoOpDecision.
We could add a label for the Op decision "success" or "failure" as the final decision to allow decomposing both of the metrics. So we'd have (FFS time in F, "success") and (FFF time, "failure"). In some sense for FFS the time spent FF is still time wasted.
MultiNodeConsolidationBatchSize.Set(float64(batchSize), map[string]string{
	decisionLabel: decision,
})
Right now we won't set the batchsize if there's a NoOpDecision. Is that the right thing? Or should we still emit a zero for batchsize in that case?
I think we only want to record successful non-zero batch sizes so as to better pick a starting point for the binary search (we obviously wouldn't pick zero). I could see an argument for tracking the zeros, though...
	},
	[]string{},
)
MultiNodeConsolidationBatchSize = opmetrics.NewPrometheusGauge(
I think we want this to be a histogram instead of a gauge
Changes:

1. Convert batch_size from Gauge to Histogram with buckets [2-50]
   - Better tracks distribution of consolidation batch sizes over time
   - Buckets: 2, 3, 4, 5, 7, 10, 15, 20, 25, 30, 40, 50
2. Add sim_success label to failed_iteration_duration metric
   - Tracks whether binary search ultimately succeeded despite failed iterations
   - Values: "true" (found valid consolidation) or "false" (no valid consolidation)
   - Only records when failedIterationTime > 0
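Prometheus histograms have cumulative buckets: an observation counts toward every bucket whose upper bound is at or above its value, with an implicit +Inf bucket catching the rest. A stdlib-only sketch of where an observed batch size first lands among the buckets listed above (bucketFor is illustrative, not client_golang API):

```go
package main

import "fmt"

// batchSizeBuckets mirrors the histogram buckets proposed in the PR.
var batchSizeBuckets = []float64{2, 3, 4, 5, 7, 10, 15, 20, 25, 30, 40, 50}

// bucketFor returns the smallest bucket upper bound that an observed
// batch size falls under; ok=false means it only lands in the
// implicit +Inf bucket.
func bucketFor(v float64) (float64, bool) {
	for _, b := range batchSizeBuckets {
		if v <= b {
			return b, true
		}
	}
	return 0, false
}

func main() {
	// A consolidation of 6 nodes is counted by the le="7" bucket
	// (and, cumulatively, every larger bucket).
	b, _ := bucketFor(6)
	fmt.Println(b) // 7
}
```

Unlike the earlier gauge, which only retained the most recent batch size, the histogram preserves the distribution across scrapes.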
c35bcd5 to 37b0e27
Mark the three new multi-node consolidation metrics as alpha to signal they are experimental and may change:
- multi_node_consolidation_iterations
- multi_node_consolidation_batch_size
- multi_node_consolidation_failed_iteration_duration_seconds

This gives flexibility to refine buckets, labels, or metric definitions based on real-world usage patterns.
clbar-aws
left a comment
Changed from gauge to histogram for successful batch size metric. Added indicator of simulation success/failure with the time spent in failed simulations
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: clbar-aws. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Description
This PR adds three new Prometheus metrics to monitor the efficiency and performance of the binary search algorithm used in multi-node consolidation. These metrics provide visibility into iteration counts, batch sizes, and time spent in failed iterations.
How was this change tested?
Validated using KWOK provider by scaling a deployment from 50→5 replicas, which triggered multi-node consolidation that successfully consolidated 2 nodes. All three metrics correctly reported values showing 1 iteration per batch, average batch size of 2 nodes, and zero failed simulations.
Example Output
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.