
Conversation

@moko-poi (Contributor)

Fixes #2084

Description
Fixes an issue where Karpenter consolidates nodes running workload pods instead of empty nodes when the empty node has a lower remaining lifetime.

The root cause was that disruption cost calculation included DaemonSet pods, even though:

  • DaemonSet pods are not rescheduled; they are automatically recreated by the DaemonSet controller
  • All nodes run the same DaemonSet pods
  • This caused the cost to be dominated by node age rather than by the actual workload

Changes:

  • Use reschedulablePods (excluding DaemonSet pods) instead of all pods when computing the disruption cost in pkg/controllers/disruption/types.go (see the sketch after this list)
  • This ensures that empty nodes (running only DaemonSet pods) have a rescheduling cost of zero and are prioritized for consolidation
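
A minimal sketch of the filter, assuming the samber/lo and Karpenter pod-utils helpers that appear in the diff further down; the wrapper function and its name are hypothetical, and the actual change lives inside the candidate construction in types.go:

```go
package disruption // illustrative placement only

import (
	"github.com/samber/lo"
	corev1 "k8s.io/api/core/v1"
	podutils "sigs.k8s.io/karpenter/pkg/utils/pod"
)

// countReschedulable (hypothetical helper) returns how many pods the
// scheduler would actually have to move off the node. DaemonSet pods are
// excluded: their controller recreates them, so Karpenter never
// reschedules them.
func countReschedulable(pods []*corev1.Pod) int {
	return len(lo.Filter(pods, func(p *corev1.Pod, _ int) bool {
		return podutils.IsReschedulable(p)
	}))
}
```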

How was this change tested?

  • All existing disruption tests pass (232 tests)
  • The existing test at emptiness_test.go:555-603 validates that DaemonSet-only nodes are treated as empty
  • No regressions introduced

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and needs-ok-to-test (indicates a PR that requires an org member to verify it is safe to test) labels on Nov 12, 2025.
@k8s-ci-robot (Contributor)

Hi @moko-poi. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Nov 12, 2025.
@coveralls commented on Nov 12, 2025

Pull Request Test Coverage Report for Build 19354214253

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 32 unchanged lines in 6 files lost coverage.
  • Overall coverage decreased (-0.7%) to 81.009%

Files with coverage reduction:

| File | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/node/termination/controller.go | 2 | 77.14% |
| pkg/controllers/static/provisioning/controller.go | 2 | 58.54% |
| pkg/controllers/nodeoverlay/controller.go | 4 | 75.28% |
| pkg/controllers/provisioning/scheduling/preferences.go | 7 | 88.76% |
| pkg/controllers/state/informer/nodeclaim.go | 7 | 72.73% |
| pkg/controllers/controllers.go | 10 | 0.0% |
Totals:

  • Change from base Build 19250612880: -0.7%
  • Covered Lines: 11773
  • Relevant Lines: 14533

💛 - Coveralls

```go
			return nil, err
		}
	}
	reschedulablePods := lo.Filter(pods, func(p *corev1.Pod, _ int) bool { return pod.IsReschedulable(p) })
```
@DerekFrank (Contributor)

I think this change makes sense logically, but I'm not sure it actually solves #2084. @rschalo's comment in the original issue is correct: emptiness should be considered prior to single-node consolidation. Also, won't this only affect cases where some nodes have more DaemonSets than others? Is there a common case for that?

@moko-poi (Contributor, Author) commented on Nov 13, 2025

@DerekFrank Great question! Let me address both points.

On DaemonSet count differences:

No, this fix doesn't require nodes to have different DaemonSet counts. The issue occurs even when all nodes have the same number of DaemonSets, which is the typical case.

The bug isn't about differences in DaemonSet counts between nodes. It's about DaemonSets inflating the DisruptionCost, which causes node age to dominate the sorting instead of actual workload.

Example: All nodes have the SAME DaemonSet count (5 DaemonSets each, ExpireAfter=90d):

  • Empty node (80d old): 0 workload pods + 5 DaemonSets
  • Node with pod (80d old): 1 workload pod + 5 DaemonSets

Current behavior (the 0.11 factor is the remaining-lifetime multiplier: (90d − 80d) / 90d ≈ 0.11):

Empty node:      Cost = (5 DaemonSets + 0 pods) × 0.11 = 0.55
Node with pod:   Cost = (5 DaemonSets + 1 pod)  × 0.11 = 0.66
Difference: only 0.11

The DaemonSets dominate the cost (5 of the 5 pods counted vs 5 of the 6), making the difference tiny, so node age becomes the primary sorting factor.

After this PR:

Empty node:      Cost = 0 pods × 0.11 = 0.00
Node with pod:   Cost = 1 pod  × 0.11 = 0.11

Empty nodes are now clearly prioritized regardless of age.
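
To make the arithmetic concrete, here is a self-contained model of the calculation sketched above. It is an illustration only: the pod type, the one-cost-unit-per-pod assumption, and the function names are hypothetical stand-ins rather than Karpenter's actual code:

```go
package main

import "fmt"

// pod is a hypothetical stand-in, not Karpenter's type.
type pod struct {
	ownedByDaemonSet bool
}

// lifetimeRemaining models (expireAfter - age) / expireAfter,
// e.g. (90d - 80d) / 90d ≈ 0.11 in the example above.
func lifetimeRemaining(ageDays, expireAfterDays float64) float64 {
	return (expireAfterDays - ageDays) / expireAfterDays
}

// disruptionCost counts one cost unit per pod, scaled by remaining
// lifetime. includeDaemonSets=true models the behavior before this PR.
func disruptionCost(pods []pod, includeDaemonSets bool, lifetime float64) float64 {
	count := 0.0
	for _, p := range pods {
		if p.ownedByDaemonSet && !includeDaemonSets {
			continue // recreated by the DaemonSet controller, never rescheduled
		}
		count++
	}
	return count * lifetime
}

func main() {
	lifetime := lifetimeRemaining(80, 90)

	daemonSets := make([]pod, 5)
	for i := range daemonSets {
		daemonSets[i].ownedByDaemonSet = true
	}
	emptyNode := daemonSets                                      // 5 DaemonSet pods, no workload
	nodeWithPod := append(append([]pod{}, daemonSets...), pod{}) // 5 DaemonSet pods + 1 workload pod

	fmt.Printf("before: empty=%.2f with-pod=%.2f\n",
		disruptionCost(emptyNode, true, lifetime),
		disruptionCost(nodeWithPod, true, lifetime)) // 0.56 vs 0.67 (the text rounds 10/90 to 0.11)
	fmt.Printf("after:  empty=%.2f with-pod=%.2f\n",
		disruptionCost(emptyNode, false, lifetime),
		disruptionCost(nodeWithPod, false, lifetime)) // 0.00 vs 0.11
}
```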

On whether this solves #2084:

You're absolutely right that Emptiness runs before Single-node consolidation. However, this DisruptionCost bug affects both phases, not just Single-node consolidation.

The Emptiness phase also uses DisruptionCost for sorting:

```go
func (e *Emptiness) ComputeCommands(ctx context.Context, disruptionBudgetMapping map[string]int, candidates ...*Candidate) ([]Command, error) {
	if e.IsConsolidated() {
		return []Command{}, nil
	}
	candidates = e.sortCandidates(candidates)
	// ...
```

This means that even in the Emptiness phase, when there are multiple empty nodes with the same DaemonSet count but different ages, the current bug causes them to be sorted by age (due to different LifetimeRemaining values) rather than all having the same cost of zero.
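
For reference, a minimal sketch of what that sorting step looks like; the real sortCandidates lives in the disruption package, and the exported DisruptionCost field is assumed from the Candidate type above:

```go
import "sort"

// Candidates with the lowest disruption cost are disrupted first, so an
// inflated cost on empty nodes changes which node gets picked. (Sketch
// only; not the actual implementation.)
func sortCandidates(candidates []*Candidate) []*Candidate {
	sort.Slice(candidates, func(i, j int) bool {
		return candidates[i].DisruptionCost < candidates[j].DisruptionCost
	})
	return candidates
}
```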

Additionally, there are scenarios where empty nodes aren't deleted in the Emptiness phase and proceed to Single-node consolidation:

  1. The cluster is already marked as consolidated (`if e.IsConsolidated() {`)
  2. Disruption budget constraints (`if disruptionBudgetMapping[candidate.NodePool.Name] == 0 {`)
  3. Validation errors during command execution

In these cases, the node proceeds to Single-node consolidation, where the incorrect DisruptionCost causes the wrong node to be selected.

This PR fixes the DisruptionCost calculation in both Emptiness and Single-node consolidation phases, addressing the root cause described by @JeremyBolster in #2084:

> if a nodepool has a large number of daemonsets compared to the number of "workload pods" on the nodes, then the pod eviction cost is dominated by the daemonsets, leading to the sorting being approximately the same as just sorting the nodes by age.

@DerekFrank (Contributor) commented on Nov 14, 2025

Disruption cost is just used to sort the nodes; the relative difference between two nodes shouldn't matter. We only use the value ordinally, and we don't care about the magnitude of the difference.

There is also an unintended side effect: this prevents empty nodes from having a meaningful metric to sort by. If you have two empty nodes, their lifetime will no longer be taken into account, because 0 × anything is still 0. I don't know if that is an issue; I haven't fully considered it.
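
To illustrate with the numbers from the example above: two empty nodes with 5 DaemonSet pods each, one at lifetimeRemaining ≈ 0.11 and one at ≈ 0.89, currently sort as 5 × 0.11 = 0.55 vs 5 × 0.89 = 4.45, so remaining lifetime breaks the tie; after this PR both cost 0 × 0.11 = 0 × 0.89 = 0, and the ordering between them is arbitrary.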

As for the ways that nodes can be skipped in Emptiness:

  1. If the cluster is marked as consolidated for emptiness, it should also be marked as consolidated for single-node consolidation. A race condition there could be the bug, but then that is what the fix should target.
  2. If the disruption budgets for a nodepool are blocking emptiness, they should also block single-node consolidation for the same nodepool.
  3. If the emptiness command is invalid, then we couldn't have consolidated that node anyway. If something is invalidating the command when it shouldn't, that is a bug, and we should fix it directly.

If you can reproduce this issue regularly and demonstrate that the disruption cost is indeed to blame, I would love to see that reproduction, but as it stands I fail to see how this change will solve the linked issue.

@moko-poi (Contributor, Author)

You are absolutely right that we should first confirm whether we can reproduce this issue, and only then decide on the actual fix.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment



Development

Successfully merging this pull request may close these issues.

Karpenter consolidates node with pod over empty node
