fix: Exclude DaemonSet pods from disruption cost calculation #2624
base: main
Conversation
Hi @moko-poi. Thanks for your PR. I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Pull Request Test Coverage Report for Build 19354214253

Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

💛 - Coveralls
pkg/controllers/disruption/types.go (outdated)

```go
			return nil, err
		}
	}
	reschedulablePods := lo.Filter(pods, func(p *corev1.Pod, _ int) bool { return pod.IsReschedulable(p) })
```
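For context, here is a minimal, self-contained sketch of the idea behind this line: compute a disruption cost only from pods that would actually need to be rescheduled, skipping DaemonSet-owned pods. The helper names (`ownedByDaemonSet`, `disruptionCost`) and the flat per-pod cost are illustrative stand-ins, not Karpenter's actual API.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ownedByDaemonSet reports whether the pod is controlled by a DaemonSet
// (a stand-in for the DaemonSet check inside pod.IsReschedulable).
func ownedByDaemonSet(p *corev1.Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "DaemonSet" && ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

// disruptionCost sums a flat eviction cost over non-DaemonSet pods only,
// then scales by the node's remaining-lifetime factor.
func disruptionCost(pods []*corev1.Pod, lifetimeRemaining float64) float64 {
	cost := 0.0
	for _, p := range pods {
		if ownedByDaemonSet(p) {
			continue // DaemonSet pods are not rescheduled elsewhere, so they add no cost
		}
		cost += 1.0 // illustrative flat cost per workload pod
	}
	return cost * lifetimeRemaining
}

func main() {
	controller := true
	ds := &corev1.Pod{ObjectMeta: metav1.ObjectMeta{
		OwnerReferences: []metav1.OwnerReference{{Kind: "DaemonSet", Controller: &controller}},
	}}
	workload := &corev1.Pod{}
	// Two DaemonSet pods and one workload pod: only the workload pod counts.
	fmt.Println(disruptionCost([]*corev1.Pod{ds, ds, workload}, 0.11)) // 0.11
}
```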
I think this change makes sense logically, but I'm not sure it actually solves #2084. @rschalo's comment in the original issue is correct: emptiness should be considered prior to single-node consolidation. Also, won't this only affect cases where some nodes have more DaemonSets than others? Is there a common case for that?
@DerekFrank Great question! Let me address both points.
On DaemonSet count differences:
No, this fix doesn't require nodes to have different DaemonSet counts. The issue occurs even when all nodes have the same number of DaemonSets, which is the typical case.
The bug isn't about differences in DaemonSet counts between nodes. It's about DaemonSets inflating the DisruptionCost, which causes node age to dominate the sorting instead of actual workload.
Example: All nodes have the SAME DaemonSet count (5 DaemonSets each, ExpireAfter=90d):
- Empty node (80d old): 0 workload pods + 5 DaemonSets
- Node with pod (80d old): 1 workload pod + 5 DaemonSets
Current behavior:
Empty node: Cost = (5 DaemonSets + 0 pods) × 0.11 = 0.55
Node with pod: Cost = (5 DaemonSets + 1 pod) × 0.11 = 0.66
Difference: only 0.11
The DaemonSets dominate the cost (5 of the 5 counted pods vs 5 of 6), making the difference tiny, so node age becomes the primary sorting factor.
After this PR:
Empty node: Cost = 0 pods × 0.11 = 0.00
Node with pod: Cost = 1 pod × 0.11 = 0.11
Empty nodes are now clearly prioritized regardless of age.
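To make the age interaction concrete, here is a purely illustrative calculation (hypothetical ages, a flat per-pod eviction cost of 1, ExpireAfter=90d; these numbers are not from the PR) of how the current formula can prefer an older node that still has a workload pod over a younger empty node, and how the fix changes that:

```go
package main

import "fmt"

func main() {
	// Assumed shape of the calculation: cost = podCount × perPodCost × lifetimeRemaining.
	const perPod = 1.0

	// Current behavior: DaemonSet pods are counted.
	emptyYoung := (5 + 0) * perPod * (70.0 / 90.0) // empty node, 20d old       -> ~3.89
	loadedOld := (5 + 1) * perPod * (10.0 / 90.0)  // node with 1 pod, 80d old  -> ~0.67
	fmt.Println(loadedOld < emptyYoung) // true: the node with a workload pod sorts first (lower cost wins)

	// After this PR: DaemonSet pods are excluded.
	emptyFixed := 0 * perPod * (70.0 / 90.0)
	loadedFixed := 1 * perPod * (10.0 / 90.0)
	fmt.Println(emptyFixed < loadedFixed) // true: the empty node sorts first
}
```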
On whether this solves #2084:
You're absolutely right that Emptiness runs before Single-node consolidation. However, this DisruptionCost bug affects both phases, not just Single-node consolidation.
The Emptiness phase also uses DisruptionCost for sorting:

karpenter/pkg/controllers/disruption/emptiness.go, lines 58 to 62 in 7de3ced:

```go
func (e *Emptiness) ComputeCommands(ctx context.Context, disruptionBudgetMapping map[string]int, candidates ...*Candidate) ([]Command, error) {
	if e.IsConsolidated() {
		return []Command{}, nil
	}
	candidates = e.sortCandidates(candidates)
```
This means that even in the Emptiness phase, when there are multiple empty nodes with the same DaemonSet count but different ages, the current bug causes them to be sorted by age (due to different LifetimeRemaining values) rather than all having the same cost of zero.
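As a rough sketch of that sorting step (assuming candidates are simply ordered by ascending disruption cost; this is not the actual `sortCandidates` implementation), two equally DaemonSet-loaded empty nodes end up ordered purely by age:

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a simplified stand-in for the disruption Candidate type.
type candidate struct {
	name           string
	disruptionCost float64
}

// sortByDisruptionCost orders candidates cheapest-first, the assumed behavior
// of the emptiness sorter.
func sortByDisruptionCost(cs []candidate) []candidate {
	sort.SliceStable(cs, func(i, j int) bool { return cs[i].disruptionCost < cs[j].disruptionCost })
	return cs
}

func main() {
	// Two empty nodes with 5 DaemonSet pods each, different ages: with DaemonSets
	// counted today, the only cost difference comes from lifetimeRemaining, so the
	// older node always sorts first even though both are equally empty.
	cs := sortByDisruptionCost([]candidate{
		{name: "empty-20d", disruptionCost: 5 * (70.0 / 90.0)},
		{name: "empty-80d", disruptionCost: 5 * (10.0 / 90.0)},
	})
	fmt.Println(cs[0].name) // empty-80d
}
```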
Additionally, there are scenarios where empty nodes aren't deleted in the Emptiness phase and proceed to Single-node consolidation:
- Cluster already marked as consolidated (`if e.IsConsolidated() {`)
- Disruption budget constraints (`if disruptionBudgetMapping[candidate.NodePool.Name] == 0 {`)
- Validation errors during command execution
In these cases, the node proceeds to Single-node consolidation, where the incorrect DisruptionCost causes the wrong node to be selected.
This PR fixes the DisruptionCost calculation in both Emptiness and Single-node consolidation phases, addressing the root cause described by @JeremyBolster in #2084:
if a nodepool has a large number of daemonsets compared to the number of "workload pods" on the nodes, then the pod eviction cost is dominated by the daemonsets, leading to the sorting being approximately the same as just sorting the nodes by age.
Disruption cost is just used to sort the nodes; the relative difference between two nodes shouldn't matter. We only use the value ordinally, so we don't care about the size of the difference.

There is also the unintended side effect of this preventing empty nodes from having a meaningful metric to sort by. If you have two empty nodes, their lifetime will no longer be taken into account, because 0 × anything is still 0. I don't know if that is an issue; I haven't fully considered it.
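A minimal numeric illustration of that side effect (hypothetical lifetime values): once DaemonSet pods are excluded, every empty node's cost collapses to zero, so lifetime no longer orders them.

```go
package main

import "fmt"

func main() {
	// cost = reschedulablePodCount × lifetimeRemaining; both nodes are empty.
	costOldEmpty := 0.0 * (10.0 / 90.0) // 80d-old empty node
	costNewEmpty := 0.0 * (70.0 / 90.0) // 20d-old empty node
	fmt.Println(costOldEmpty == costNewEmpty) // true: no tie-breaker left between empty nodes
}
```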
As for the ways that nodes can be skipped in Emptiness:
- If the cluster is marked as consolidated for emptiness, it should also be marked as consolidated for single-node. If there is a race condition there, that could be the bug, and the fix should target that race.
- If the disruption budgets for a nodepool are blocking emptiness, they should also block single-node consolidation for the same nodepool.
- If the emptiness command is invalid, then we couldn't consolidate that node anyway. If something is invalidating the command when it shouldn't, that is a bug and we should fix it.
If you can reproduce this issue regularly and demonstrate that the disruption cost is indeed to blame, I would love to see that reproduction, but as it stands I fail to see how this change will solve the linked issue.
You are absolutely right that we should first confirm whether we can reproduce this issue, and then decide on the appropriate fix.
This reverts commit 1b46be2.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
Fixes #2084
Description
Fixes an issue where Karpenter consolidates nodes with workload pods instead of empty nodes when the empty node has a lower remaining lifetime.

The root cause was that the disruption cost calculation included DaemonSet pods, even though they do not represent reschedulable workload.
Changes:
- Use `reschedulablePods` (excluding DaemonSet pods) instead of all `pods` for the disruption cost calculation in `pkg/controllers/disruption/types.go`

How was this change tested?
- `emptiness_test.go:555-603` validates that DaemonSet-only nodes are treated as empty

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.