
Conversation

@jukie commented Oct 5, 2025

Description
This is a proposal to introduce a cost savings threshold when making consolidation decisions in order to reduce churn.

Related issue: aws/karpenter-provider-aws#7146

How was this change tested?
RFC only, but a draft PR is open at #2561.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: jukie
Once this PR has been reviewed and has the lgtm label, please assign gjtempleton for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the "cncf-cla: yes" label (Indicates the PR's author has signed the CNCF CLA) on Oct 5, 2025
@k8s-ci-robot added the "needs-ok-to-test" label (Indicates a PR that requires an org member to verify it is safe to test) on Oct 5, 2025
@k8s-ci-robot (Contributor)

Hi @jukie. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files) on Oct 5, 2025
@coveralls

Pull Request Test Coverage Report for Build 18253400908

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 4 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.07%) to 81.678%

Files with Coverage Reduction:
  • pkg/controllers/provisioning/scheduling/topologydomaingroup.go: 4 new missed lines (86.21%)

Totals:
  • Change from base Build 18201406291: 0.07%
  • Covered Lines: 11577
  • Relevant Lines: 14174

💛 - Coveralls

@jukie changed the title from "RFC: consolidation price improvement factor" to "docs: RFC for consolidation price improvement factor" on Oct 5, 2025
@jukie (Author) commented Oct 5, 2025

/retest

@k8s-ci-robot (Contributor)

@jukie: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

The price improvement factor will be configurable at **two levels** with the following precedence:

1. **NodePool-level** (highest priority) - Per-workload control
2. **Operator-level** (fallback) - Cluster-wide default

Contributor:

We generally avoid global configuration of Karpenter and recommend things like Kyverno to control this type of behavior.

Author:

Okay, no problem. By this do you mean drop the operator-level config and keep it NodePool-only?

Contributor:

yep.

Comment on lines +52 to +55
// 1.0 = Consolidate for any cost savings (legacy behavior)
// 0.8 = Require 20% cost savings
// 0.5 = Require 50% cost savings (very conservative)
// 0.0 = Disable price-based consolidation

Contributor:

Suggestion: reorient this around a percentage, expressed as an integer in [0, 100]. The Kubernetes ecosystem avoids floats in APIs due to precision and serialization challenges.
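
As a rough illustration of that suggestion, a NodePool field could look something like the sketch below. The field name, its placement, and the exact semantics are assumptions for illustration only, not the actual Karpenter API.

```go
package v1

// Hypothetical sketch only: the field name and semantics below are assumptions, not Karpenter's real API.
type Disruption struct {
	// ConsolidationSavingsThresholdPercent is the minimum price improvement, expressed as an
	// integer percentage in [0, 100], required before consolidation replaces capacity.
	// 0 preserves the current "consolidate on any savings" behavior; 20 requires a 20% saving.
	// +kubebuilder:validation:Minimum=0
	// +kubebuilder:validation:Maximum=100
	// +optional
	ConsolidationSavingsThresholdPercent *int32 `json:"consolidationSavingsThresholdPercent,omitempty"`
}
```

An integer percent also maps cleanly onto the factor values quoted above: a threshold of 20 corresponds to the RFC's 0.8 factor, and 0 to the legacy 1.0 behavior.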


2. **Calculate threshold**: `maxAllowedPrice = currentPrice × priceImprovementFactor`

3. **Filter instances**: Only consider instances where `launchPrice < maxAllowedPrice`
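
To make the quoted threshold-and-filter steps concrete, here is a minimal sketch that keeps the RFC's `priceImprovementFactor` naming; the `instanceOption` type, the function, and the prices are illustrative stand-ins, not Karpenter's actual scheduling code.

```go
package main

import "fmt"

// instanceOption is a hypothetical stand-in for a launchable instance type and its hourly price.
type instanceOption struct {
	name  string
	price float64
}

// filterByPriceImprovement keeps only candidates that satisfy
// launchPrice < currentPrice * priceImprovementFactor.
func filterByPriceImprovement(currentPrice, priceImprovementFactor float64, candidates []instanceOption) []instanceOption {
	maxAllowedPrice := currentPrice * priceImprovementFactor
	var out []instanceOption
	for _, c := range candidates {
		if c.price < maxAllowedPrice {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	candidates := []instanceOption{
		{"m5.large", 0.096},
		{"t3.large", 0.0832},
		{"t3.medium", 0.0416},
	}
	// With a factor of 0.8, a node currently costing $0.10/hr only considers replacements
	// under $0.08/hr, so only t3.medium survives the filter here.
	fmt.Println(filterByPriceImprovement(0.10, 0.8, candidates))
}
```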

Contributor:

Make sure you're factoring multi-node consolidation into this. There are a few variants:

1 -> 0 (remove)
1 -> 1 (replace)
n -> 1 (multi node)

We don't do this today:

1 -> 2 (split)

But if we ever did, we'd need to make sure we were handling price improvement appropriately. I can't think of any reason that this would be particularly difficult, though.

Contributor:

One complication: we need to define how multi-node consolidation works when the nodes come from different NodePools with different price thresholds configured.

2 -> 1 Multi-Node
Node A = from nodepool with 10% threshold
Node B = from nodepool with 5% threshold

Replacement Node is 4% cheaper than A+B, should clearly not replace
Replacement Node is 6% cheaper than A+B, ??
Replacement Node is 10% cheaper than A+B, should clearly replace

I would guess it should take the most conservative value and not replace for the 6% cheaper case.

Author:

Yep, I'd agree with taking the most conservative value. Said another way: each node's improvement threshold must be met in order for the replacement to proceed.
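
A minimal sketch of that "most conservative threshold wins" rule for the n -> 1 case, assuming per-node integer-percent thresholds; the types, function, and numbers are hypothetical, not Karpenter's implementation:

```go
package main

import "fmt"

// candidateNode is a hypothetical stand-in for a node being considered in a multi-node consolidation.
type candidateNode struct {
	price                   float64 // current hourly price of the node
	savingsThresholdPercent int     // required savings from its NodePool, e.g. 10 for 10%
}

// replacementAllowed applies the strictest (largest) savings threshold among the nodes being
// consolidated: the replacement must beat that threshold against their combined price.
func replacementAllowed(nodes []candidateNode, replacementPrice float64) bool {
	var totalPrice float64
	strictest := 0
	for _, n := range nodes {
		totalPrice += n.price
		if n.savingsThresholdPercent > strictest {
			strictest = n.savingsThresholdPercent
		}
	}
	maxAllowedPrice := totalPrice * (1 - float64(strictest)/100)
	return replacementPrice < maxAllowedPrice
}

func main() {
	// Node A from a NodePool with a 10% threshold, Node B from one with a 5% threshold.
	nodes := []candidateNode{
		{price: 0.10, savingsThresholdPercent: 10},
		{price: 0.10, savingsThresholdPercent: 5},
	}
	fmt.Println(replacementAllowed(nodes, 0.188)) // 6% cheaper than A+B: false, rejected
	fmt.Println(replacementAllowed(nodes, 0.176)) // 12% cheaper than A+B: true, allowed
}
```

Under this rule the 6%-cheaper replacement in the example above is rejected, and anything more than 10% cheaper goes through, which matches taking the most conservative value.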


## Backward Compatibility

- **No breaking changes**: Default value of `1.0` maintains existing behavior

Contributor:

I am curious if we should explore a better default (e.g. .05%). I'd love to see some analysis on consolidations across a cluster over time, with a histogram of price improvement for each change.

There is a reasonable argument that Karpenter's consolidation algorithm is a heuristic and that we could launch with a low value for this setting without it being a breaking change. We make implementation changes to consolidation all the time and don't guarantee any deterministic decision making around cost/disruption tradeoffs.

@jukie (Author) Oct 6, 2025

I would love to apply a different default, but I initially assumed the maintainers would prefer matching existing behavior in the initial release. Starting with .05% seems like a safe initial value, and it could be tuned higher after user feedback.

@ellistarn (Contributor) commented Oct 8, 2025

@jukie, I like the thinking here, but I want to make sure we have exhausted "zero config" options. One question I always ask myself when adding new API surface is "how would a customer decide how much to configure this".

  1. If the answer is "vibes", then we probably shouldn't expose it and should just take a heuristic opinion.
  2. If the answer is "do XYZ calculation", we should explore automating that calculation.
  3. If the answer is "there is a fundamental tradeoff that only the customer can know, and this is how they know", then we add the config.

@jukie (Author) commented Oct 8, 2025

Thanks @ellistarn. For workloads that don't have any negative experience associated with node churn and the resulting pod interruptions (long startup times, cache warmups, etc.), users might prefer the current state of consolidating on any form of savings. But for workloads that do, it's a tradeoff between cost savings and interruptions, so I think it's justified to expose this as a configuration option.

@ellistarn (Contributor)

> It's a tradeoff between cost savings and interruptions, so I think it's justified to expose this as a configuration option.

How will you decide to set this for your organization?

@jukie (Author) commented Oct 8, 2025

An org would have the option of running multiple NodePools, each with a different value based on the workloads that target them. However, that likely means running more nodes than actually necessary, so the next consideration might be to run a single NodePool (or some smaller number of NodePools) instead, still capture the savings, and adjust the improvement factor accordingly.

The decision would be a balance between cost savings and causing interruptions to workloads. Things like pod-deletion-cost or do-not-disrupt annotations could help here too.

@ellistarn (Contributor)

So you would expect organizations to run multiple load tests with different values and measure their overall costs in each version?

@jukie (Author) commented Oct 9, 2025

If an org wants it perfectly tailored to their needs, that might be a reasonable approach. However, I suspect users seeking to tweak this are coming from the other side of the spectrum, where the stronger desire is to reduce node churn and workload interruption. Node cost is a factor and should still be minimized, which Karpenter solves for, but node churn and workload interruption incur another form of cost.

An org could directly measure all 3 parameters and use whatever works best for them.

I don't think there's a single value that's guaranteed to work for all use cases (even within the same org), so I think it should be exposed, similar to how ClusterAutoscaler handles it.

@ellistarn (Contributor)

I'm open-minded. I'm going to see if we can gather some data on this and whether it can help inform our decisions. Will get back soon.
