test: Add multi node consolidation metrics #2645
Conversation
This adds three new Prometheus metrics to monitor the efficiency of the binary search algorithm used in multi-node consolidation:

1. karpenter_voluntary_disruption_multi_node_consolidation_iterations
   - Histogram tracking the number of binary search iterations needed
   - Helps identify whether the algorithm is performing efficiently (O(log n))
2. karpenter_voluntary_disruption_multi_node_consolidation_batch_size
   - Gauge showing the actual number of nodes consolidated together
   - Labeled by decision type (delete vs replace)
   - Helps understand consolidation effectiveness
3. karpenter_voluntary_disruption_multi_node_consolidation_failed_iteration_duration_seconds
   - Histogram tracking cumulative time spent on failed iterations
   - Identifies computational waste in the binary search process

These metrics provide visibility into the multi-node consolidation algorithm's performance and help identify optimization opportunities.
Binary search over at most 100 candidates takes at most ⌈log₂(100)⌉ = 7 iterations, so the unnecessary buckets beyond 7 were removed.
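The iteration bound above can be checked with a short stdlib-only sketch (maxIterations is a hypothetical helper written for illustration, not part of this PR's code):

```go
package main

import (
	"fmt"
	"math"
)

// maxIterations returns the worst-case number of binary search
// iterations over a search space of n candidates: ceil(log2(n)).
func maxIterations(n int) int {
	if n <= 1 {
		return 0
	}
	return int(math.Ceil(math.Log2(float64(n))))
}

func main() {
	// For the 100-candidate cap, the worst case is 7 iterations,
	// so histogram buckets beyond 7 would never be populated.
	fmt.Println(maxIterations(100)) // 7
}
```

This is why the iterations histogram only needs buckets up to 7.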
Welcome @clbar-aws!
Hi @clbar-aws. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 19585306646

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

💛 - Coveralls
jamesmt-aws
left a comment
Thanks for sending this out! I had a couple of questions about how we're thinking about using these.
iterationCount := 0
var failedIterationTime time.Duration
Imagine a case where the binary search succeeds after two rounds. So the whole process could be encoded as FFS, for Fail Fail Succeed. The time for F and S are tracked separately under the current design, so FFS increments both timers even though the aggregate operation eventually succeeded.
So FFS today, there's allocation of time to both the F timer and the S timer. But I could imagine an argument where the metrics aren't per-call, but are per-search. So FFFFFFF would be a fail, but S, FS, FFS, etc would all be successes. What's the right thing to do from an analysis point of view with FFS?
I wonder if we want something more like total_failed_iteration_duration, and then have a second metric called something like wasted_search_duration that we only increment when lastSavedCommand.Decision() == NoOpDecision.
We could add a label for the Op decision "success" or "failure" as the final decision to allow decomposing both of the metrics. So we'd have (FFS time in F, "success") and (FFF time, "failure"). In some sense for FFS the time spent FF is still time wasted.
MultiNodeConsolidationBatchSize.Set(float64(batchSize), map[string]string{
	decisionLabel: decision,
})
Right now we won't set the batchsize if there's a NoOpDecision. Is that the right thing? Or should we still emit a zero for batchsize in that case?
I think we only want to record successful non-zero batch sizes so as to better pick a starting point for the binary search (we obviously wouldn't pick zero). I could see an argument for tracking the zeros, though...
	},
	[]string{},
)
MultiNodeConsolidationBatchSize = opmetrics.NewPrometheusGauge(
I think we want this to be a histogram instead of a gauge
Changes:

1. Convert batch_size from Gauge to Histogram with buckets [2-50]
   - Better tracks distribution of consolidation batch sizes over time
   - Buckets: 2, 3, 4, 5, 7, 10, 15, 20, 25, 30, 40, 50
2. Add sim_success label to failed_iteration_duration metric
   - Tracks whether binary search ultimately succeeded despite failed iterations
   - Values: "true" (found valid consolidation) or "false" (no valid consolidation)
   - Only records when failedIterationTime > 0
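Prometheus histograms have cumulative buckets: an observation counts toward every bucket whose upper bound is at or above its value, with an implicit +Inf bucket catching the rest. A stdlib-only sketch of where an observed batch size first lands among the buckets listed above (bucketFor is illustrative, not client_golang API):

```go
package main

import "fmt"

// batchSizeBuckets mirrors the histogram buckets proposed in the PR.
var batchSizeBuckets = []float64{2, 3, 4, 5, 7, 10, 15, 20, 25, 30, 40, 50}

// bucketFor returns the smallest bucket upper bound that an observed
// batch size falls under; ok=false means it only lands in the
// implicit +Inf bucket.
func bucketFor(v float64) (float64, bool) {
	for _, b := range batchSizeBuckets {
		if v <= b {
			return b, true
		}
	}
	return 0, false
}

func main() {
	// A consolidation of 6 nodes is counted by the le="7" bucket
	// (and, cumulatively, every larger bucket).
	b, _ := bucketFor(6)
	fmt.Println(b) // 7
}
```

Unlike the earlier gauge, which only retained the most recent batch size, the histogram preserves the distribution across scrapes.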
c35bcd5 to 37b0e27
Mark the three new multi-node consolidation metrics as alpha to signal they are experimental and may change:
- multi_node_consolidation_iterations
- multi_node_consolidation_batch_size
- multi_node_consolidation_failed_iteration_duration_seconds

This gives flexibility to refine buckets, labels, or metric definitions based on real-world usage patterns.
clbar-aws
left a comment
Changed from gauge to histogram for successful batch size metric. Added indicator of simulation success/failure with the time spent in failed simulations
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: clbar-aws. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Description
This PR adds three new Prometheus metrics to monitor the efficiency and performance of the binary search algorithm used in multi-node consolidation. These metrics provide visibility into iteration counts, batch sizes, and time spent in failed iterations.
How was this change tested?
Validated using KWOK provider by scaling a deployment from 50→5 replicas, which triggered multi-node consolidation that successfully consolidated 2 nodes. All three metrics correctly reported values showing 1 iteration per batch, average batch size of 2 nodes, and zero failed simulations.
Example Output
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.