
Conversation

@moko-poi (Contributor)

Fixes #2084

Description
Fixes an issue where Karpenter consolidates nodes running workload pods instead of empty nodes when the empty node has a lower remaining lifetime.

The root cause was that disruption cost calculation included DaemonSet pods, even though:

  • DaemonSet pods are not rescheduled; they are automatically recreated by the DaemonSet controller
  • All nodes run the same DaemonSet pods
  • This caused the cost to be dominated by node age rather than by the actual workload

Changes:

  • Use reschedulablePods (excluding DaemonSet pods) instead of all pods when computing the disruption cost in pkg/controllers/disruption/types.go (see the sketch after this list)
  • This ensures that empty nodes (running only DaemonSet pods) have a rescheduling cost of zero and are prioritized for consolidation
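
A minimal sketch of the filter, assuming the samber/lo and Karpenter pod-utils helpers that appear in the diff further down; the wrapper function and its name are hypothetical, and the actual change lives inside the candidate construction in types.go:

```go
package disruption // illustrative placement only

import (
	"github.com/samber/lo"
	corev1 "k8s.io/api/core/v1"
	podutils "sigs.k8s.io/karpenter/pkg/utils/pod"
)

// countReschedulable (hypothetical helper) returns how many pods the
// scheduler would actually have to move off the node. DaemonSet pods are
// excluded: their controller recreates them, so Karpenter never
// reschedules them.
func countReschedulable(pods []*corev1.Pod) int {
	return len(lo.Filter(pods, func(p *corev1.Pod, _ int) bool {
		return podutils.IsReschedulable(p)
	}))
}
```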

How was this change tested?

  • All existing disruption tests pass (232 tests)
  • The existing test at emptiness_test.go:555-603 validates that DaemonSet-only nodes are treated as empty
  • No regressions introduced

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels on Nov 12, 2025.
@k8s-ci-robot (Contributor)

Hi @moko-poi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Nov 12, 2025.
@coveralls commented on Nov 12, 2025

Pull Request Test Coverage Report for Build 19354214253

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 32 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.7%) to 81.009%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/node/termination/controller.go | 2 | 77.14% |
| pkg/controllers/static/provisioning/controller.go | 2 | 58.54% |
| pkg/controllers/nodeoverlay/controller.go | 4 | 75.28% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 88.76% |
| pkg/controllers/state/informer/nodeclaim.go | 7 | 72.73% |
| pkg/controllers/controllers.go | 10 | 0.0% |
Totals:

  • Change from base Build 19250612880: -0.7%
  • Covered Lines: 11773
  • Relevant Lines: 14533

💛 - Coveralls

```go
			return nil, err
		}
	}
	reschedulablePods := lo.Filter(pods, func(p *corev1.Pod, _ int) bool { return pod.IsReschedulable(p) })
```
@DerekFrank (Contributor)

I think this change makes sense logically, but I'm not sure it actually solves #2084. @rschalo's comment in the original issue is correct: emptiness should be considered prior to single-node consolidation. Also, won't this only affect cases where some nodes have more DaemonSets than others? Is there a common case for that?

@moko-poi (Contributor, Author) commented on Nov 13, 2025

@DerekFrank Great question! Let me address both points.

On DaemonSet count differences:

No, this fix doesn't require nodes to have different DaemonSet counts. The issue occurs even when all nodes have the same number of DaemonSets, which is the typical case.

The bug isn't about differences in DaemonSet counts between nodes. It's about DaemonSets inflating the DisruptionCost, which causes node age to dominate the sorting instead of actual workload.

Example: All nodes have the SAME DaemonSet count (5 DaemonSets each, ExpireAfter=90d):

  • Empty node (80d old): 0 workload pods + 5 DaemonSets
  • Node with pod (80d old): 1 workload pod + 5 DaemonSets

Current behavior (the 0.11 factor is the remaining-lifetime multiplier: (90d − 80d) / 90d ≈ 0.11):

Empty node:      Cost = (5 DaemonSets + 0 pods) × 0.11 = 0.55
Node with pod:   Cost = (5 DaemonSets + 1 pod)  × 0.11 = 0.66
Difference: only 0.11

The DaemonSets dominate the cost (5 of the 5 pods counted vs 5 of the 6), making the difference tiny, so node age becomes the primary sorting factor.

After this PR:

Empty node:      Cost = 0 pods × 0.11 = 0.00
Node with pod:   Cost = 1 pod  × 0.11 = 0.11

Empty nodes are now clearly prioritized regardless of age.
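
To make the arithmetic concrete, here is a self-contained model of the calculation sketched above. It is an illustration only: the pod type, the one-cost-unit-per-pod assumption, and the function names are hypothetical stand-ins rather than Karpenter's actual code:

```go
package main

import "fmt"

// pod is a hypothetical stand-in, not Karpenter's type.
type pod struct {
	ownedByDaemonSet bool
}

// lifetimeRemaining models (expireAfter - age) / expireAfter,
// e.g. (90d - 80d) / 90d ≈ 0.11 in the example above.
func lifetimeRemaining(ageDays, expireAfterDays float64) float64 {
	return (expireAfterDays - ageDays) / expireAfterDays
}

// disruptionCost counts one cost unit per pod, scaled by remaining
// lifetime. includeDaemonSets=true models the behavior before this PR.
func disruptionCost(pods []pod, includeDaemonSets bool, lifetime float64) float64 {
	count := 0.0
	for _, p := range pods {
		if p.ownedByDaemonSet && !includeDaemonSets {
			continue // recreated by the DaemonSet controller, never rescheduled
		}
		count++
	}
	return count * lifetime
}

func main() {
	lifetime := lifetimeRemaining(80, 90)

	daemonSets := make([]pod, 5)
	for i := range daemonSets {
		daemonSets[i].ownedByDaemonSet = true
	}
	emptyNode := daemonSets                                      // 5 DaemonSet pods, no workload
	nodeWithPod := append(append([]pod{}, daemonSets...), pod{}) // 5 DaemonSet pods + 1 workload pod

	fmt.Printf("before: empty=%.2f with-pod=%.2f\n",
		disruptionCost(emptyNode, true, lifetime),
		disruptionCost(nodeWithPod, true, lifetime)) // 0.56 vs 0.67 (the text rounds 10/90 to 0.11)
	fmt.Printf("after:  empty=%.2f with-pod=%.2f\n",
		disruptionCost(emptyNode, false, lifetime),
		disruptionCost(nodeWithPod, false, lifetime)) // 0.00 vs 0.11
}
```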

On whether this solves #2084:

You're absolutely right that Emptiness runs before Single-node consolidation. However, this DisruptionCost bug affects both phases, not just Single-node consolidation.

The Emptiness phase also uses DisruptionCost for sorting:

```go
func (e *Emptiness) ComputeCommands(ctx context.Context, disruptionBudgetMapping map[string]int, candidates ...*Candidate) ([]Command, error) {
	if e.IsConsolidated() {
		return []Command{}, nil
	}
	candidates = e.sortCandidates(candidates)
	// ...
```

This means that even in the Emptiness phase, when there are multiple empty nodes with the same DaemonSet count but different ages, the current bug causes them to be sorted by age (due to different LifetimeRemaining values) rather than all having the same cost of zero.
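
For reference, a minimal sketch of what that sorting step looks like; the real sortCandidates lives in the disruption package, and the exported DisruptionCost field is assumed from the Candidate type above:

```go
import "sort"

// Candidates with the lowest disruption cost are disrupted first, so an
// inflated cost on empty nodes changes which node gets picked. (Sketch
// only; not the actual implementation.)
func sortCandidates(candidates []*Candidate) []*Candidate {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].DisruptionCost < candidates[j].DisruptionCost
	})
	return candidates
}
```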

Additionally, there are scenarios where empty nodes aren't deleted in the Emptiness phase and proceed to Single-node consolidation:

  1. The cluster is already marked as consolidated (`if e.IsConsolidated() {`)
  2. Disruption budget constraints (`if disruptionBudgetMapping[candidate.NodePool.Name] == 0 {`)
  3. Validation errors during command execution

In these cases, the node proceeds to Single-node consolidation, where the incorrect DisruptionCost causes the wrong node to be selected.

This PR fixes the DisruptionCost calculation in both Emptiness and Single-node consolidation phases, addressing the root cause described by @JeremyBolster in #2084:

> if a nodepool has a large number of daemonsets compared to the number of "workload pods" on the nodes, then the pod eviction cost is dominated by the daemonsets, leading to the sorting being approximately the same as just sorting the nodes by age.

@DerekFrank (Contributor) commented on Nov 14, 2025

Disruption cost is just used to sort the nodes; the relative difference between two nodes shouldn't matter. We only use the value ordinally, and we don't care about the magnitude of the difference.

There is also an unintended side effect: this prevents empty nodes from having a meaningful metric to sort by. If you have two empty nodes, their lifetime will no longer be taken into account, because 0 × anything is still 0. I don't know if that is an issue; I haven't fully considered it.
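
To illustrate with the numbers from the example above: two empty nodes with 5 DaemonSet pods each, one at lifetimeRemaining ≈ 0.11 and one at ≈ 0.89, currently sort as 5 × 0.11 = 0.55 vs 5 × 0.89 = 4.45, so remaining lifetime breaks the tie; after this PR both cost 0 × 0.11 = 0 × 0.89 = 0, and the ordering between them is arbitrary.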

As for the ways that nodes can be skipped in Emptiness:

  1. If the cluster is marked as consolidated for emptiness, it should also be marked as consolidated for single-node consolidation. A race condition there could be the bug, but then that is what the fix should target.
  2. If the disruption budgets for a nodepool are blocking emptiness, they should also block single-node consolidation for the same nodepool.
  3. If the emptiness command is invalid, then we couldn't have consolidated that node anyway. If something is invalidating the command when it shouldn't, that is a bug, and we should fix it directly.

If you can reproduce this issue regularly and demonstrate that the disruption cost is indeed to blame, I would love to see that reproduction, but as it stands I fail to see how this change will solve the linked issue.

@moko-poi (Contributor, Author)

You are absolutely right that we should first confirm whether we can reproduce this issue, and only then decide on the actual fix.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment



Development

Successfully merging this pull request may close these issues.

Karpenter consolidates node with pod over empty node
