feat: Add karpenter_pods_disrupted_total metric #2253


Open · omerap12 wants to merge 4 commits into main from karpenter_pods_disrupted_total
Conversation

omerap12 (Member)

Fixes #2020

Description
Added a new metric karpenter_pods_disrupted_total to track the number of pods disrupted during node termination events.
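For illustration, a counter along these lines could back the new metric. This is a minimal sketch, not the code in this PR: the reason and nodepool labels and the registration via controller-runtime's global registry are assumptions.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// PodsDisruptedTotal counts pods disrupted during node termination events.
// Namespace/Subsystem/Name combine into karpenter_pods_disrupted_total.
var PodsDisruptedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "pods",
		Name:      "disrupted_total",
		Help:      "Number of pods disrupted during node termination events.",
	},
	[]string{"reason", "nodepool"}, // label set is illustrative, not the PR's final API
)

func init() {
	// Expose the counter on the controller manager's /metrics endpoint.
	crmetrics.Registry.MustRegister(PodsDisruptedTotal)
}
```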

How was this change tested?
Added unit tests for the new metric
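A test for the counter sketched above might look like this (a sketch using prometheus' testutil package; the label values are made up):

```go
import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestPodsDisruptedTotal(t *testing.T) {
	// Simulate a termination that disrupted three pods on a "default" NodePool.
	PodsDisruptedTotal.WithLabelValues("Drifted", "default").Add(3)

	got := testutil.ToFloat64(PodsDisruptedTotal.WithLabelValues("Drifted", "default"))
	if got != 3 {
		t.Fatalf("karpenter_pods_disrupted_total = %v, want 3", got)
	}
}
```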

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 24, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from engedaam and tallaxes May 24, 2025 17:04
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 24, 2025
@omerap12 omerap12 force-pushed the karpenter_pods_disrupted_total branch from ea3549d to 201d456 Compare May 24, 2025 17:08
@coveralls commented May 24, 2025

Pull Request Test Coverage Report for Build 15783145450

Details

  • 34 of 39 (87.18%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 82.066%

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|---|
| pkg/controllers/node/health/controller.go | 7 | 9 | 77.78% |
| pkg/controllers/disruption/queue.go | 7 | 10 | 70.0% |

Totals Coverage Status:
Change from base Build 15718404125: +0.03%
Covered Lines: 10273
Relevant Lines: 12518

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2025
Signed-off-by: Omer Aplatony <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 5, 2025
@omerap12
Copy link
Member Author

Friendly ping @jonathan-innis @engedaam

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 15, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2025
@@ -99,6 +99,11 @@ func (c *Controller) Reconcile(ctx context.Context) (reconcile.Result, error) {
errs[i] = client.IgnoreNotFound(err)
return
}
pods := &corev1.PodList{}
var podCount int
if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err == nil {
Member

Should we include DaemonSet pods in this calculation, or should we just stick to reschedulable (running) pods?

Member Author

I think we should only include reschedulable pods. DaemonSet pods are handled automatically and are meant to run on every node (or on specific ones via nodeSelector or affinity), so including them in the disruption metric could give a misleading picture of real workload disruption.
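A sketch of that filter (the helper name is hypothetical; it skips any pod whose controller owner is a DaemonSet):

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// countReschedulablePods counts the pods that would actually be disrupted,
// skipping DaemonSet-owned pods, which the DaemonSet controller recreates
// on other nodes automatically.
func countReschedulablePods(pods []corev1.Pod) int {
	count := 0
	for i := range pods {
		if owner := metav1.GetControllerOf(&pods[i]); owner != nil && owner.Kind == "DaemonSet" {
			continue
		}
		count++
	}
	return count
}
```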

@@ -131,6 +131,10 @@ func (c *Controller) deleteNodeClaim(ctx context.Context, nodeClaim *v1.NodeClai
if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
pods := &corev1.PodList{}
if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
Member

Since Pod objects are quite big, I think we should either create a helper that returns a count of the reschedulable pods currently bound to a node and uses the client.UnsafeDisableDeepCopy list option, or use a metav1.PartialObjectMetadataList -- I'm personally more in favor of the first one.

Member Author

Got it, will adjust
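A sketch of the first option, reusing countReschedulablePods from the earlier sketch (the helper name and signature are assumptions; client.UnsafeDisableDeepCopy is the controller-runtime list option the reviewer mentions, and it requires treating the listed objects as read-only):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeReschedulablePodCount lists the pods bound to a node without deep-copying
// them out of the informer cache, then counts only the reschedulable ones.
func nodeReschedulablePodCount(ctx context.Context, c client.Client, nodeName string) (int, error) {
	pods := &corev1.PodList{}
	// UnsafeDisableDeepCopy avoids copying each (potentially large) Pod object;
	// the trade-off is that pods.Items must never be mutated.
	if err := c.List(ctx, pods,
		client.MatchingFields{"spec.nodeName": nodeName},
		client.UnsafeDisableDeepCopy,
	); err != nil {
		return 0, err
	}
	return countReschedulablePods(pods.Items), nil
}
```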

@@ -28,6 +28,7 @@ import (
"github.com/awslabs/operatorpkg/singleton"
Member

Is there some way we could avoid duplicating the pod-list logic and instead handle the pod disruption metric inside the termination controller? One thought I had while considering the implementation: we could make sure the Disrupted status condition is always added to the NodeClaim (expiration and health included), and then, when we look up the disruption reason for pods, we can just read the status condition's reason to determine which reason to fire for that metric.

Member Author

Yeah, I agree, this is a better idea.
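Roughly, the idea would look like this (a sketch under assumptions: the "Disrupted" condition type and the example reason values are taken from the discussion above, not from merged code):

```go
import (
	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// disruptionReason reads the metric label off the NodeClaim's Disrupted status
// condition, so the termination controller can fire the metric without each
// disruption path duplicating pod-list logic.
func disruptionReason(nodeClaim *v1.NodeClaim) string {
	for _, cond := range nodeClaim.Status.Conditions {
		if cond.Type == "Disrupted" { // condition type name is an assumption
			return cond.Reason // e.g. "Drifted", "Expired", "Unhealthy"
		}
	}
	return "Unknown"
}
```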

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add karpenter_pods_disrupted_total when NodeClaims are disrupted
4 participants