feat: Add karpenter_pods_disrupted_total metric #2253


Open · omerap12 wants to merge 4 commits into main from karpenter_pods_disrupted_total
Conversation

omerap12 (Member)

Fixes #2020

Description
Added a new metric karpenter_pods_disrupted_total to track the number of pods disrupted during node termination events.
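For illustration, a counter along these lines could back the new metric. This is a minimal sketch, not the code in this PR: the reason and nodepool labels and the registration via controller-runtime's global registry are assumptions.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	crmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

// PodsDisruptedTotal counts pods disrupted during node termination events.
// Namespace/Subsystem/Name combine into karpenter_pods_disrupted_total.
var PodsDisruptedTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "karpenter",
		Subsystem: "pods",
		Name:      "disrupted_total",
		Help:      "Number of pods disrupted during node termination events.",
	},
	[]string{"reason", "nodepool"}, // label set is illustrative, not the PR's final API
)

func init() {
	// Expose the counter on the controller manager's /metrics endpoint.
	crmetrics.Registry.MustRegister(PodsDisruptedTotal)
}
```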

How was this change tested?
Added unit tests for the new metric
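A test for the counter sketched above might look like this (a sketch using prometheus' testutil package; the label values are made up):

```go
import (
	"testing"

	"github.com/prometheus/client_golang/prometheus/testutil"
)

func TestPodsDisruptedTotal(t *testing.T) {
	// Simulate a termination that disrupted three pods on a "default" NodePool.
	PodsDisruptedTotal.WithLabelValues("Drifted", "default").Add(3)

	got := testutil.ToFloat64(PodsDisruptedTotal.WithLabelValues("Drifted", "default"))
	if got != 3 {
		t.Fatalf("karpenter_pods_disrupted_total = %v, want 3", got)
	}
}
```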

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label May 24, 2025
@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull request has been approved by: omerap12
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested review from engedaam and tallaxes May 24, 2025 17:04
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label May 24, 2025
@omerap12 omerap12 force-pushed the karpenter_pods_disrupted_total branch from ea3549d to 201d456 Compare May 24, 2025 17:08
@coveralls commented May 24, 2025

Pull Request Test Coverage Report for Build 15783145450

Details

  • 34 of 39 (87.18%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.03%) to 82.066%

| File | Covered Lines | Changed/Added Lines | % |
|------|---------------|---------------------|---|
| pkg/controllers/node/health/controller.go | 7 | 9 | 77.78% |
| pkg/controllers/disruption/queue.go | 7 | 10 | 70.0% |

Totals Coverage Status:
Change from base Build 15718404125: +0.03%
Covered Lines: 10273
Relevant Lines: 12518

💛 - Coveralls

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 3, 2025
Signed-off-by: Omer Aplatony <[email protected]>
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 5, 2025
@omerap12
Copy link
Member Author

Friendly ping @jonathan-innis @engedaam

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2025
@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jun 15, 2025
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 20, 2025
@@ -99,6 +99,11 @@ func (c *Controller) Reconcile(ctx context.Context) (reconcile.Result, error) {
errs[i] = client.IgnoreNotFound(err)
return
}
pods := &corev1.PodList{}
var podCount int
if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err == nil {
Member

Should we include DaemonSet pods in this calculation, or should we just stick to reschedulable (running) pods?

Member Author

I think we should only include reschedulable pods. DaemonSet pods are handled automatically and are meant to run on every node (or on specific ones via nodeSelector or affinity), so including them in the disruption metric could give a misleading picture of real workload disruption.
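A sketch of that filter (the helper name is hypothetical; it skips any pod whose controller owner is a DaemonSet):

```go
import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// countReschedulablePods counts the pods that would actually be disrupted,
// skipping DaemonSet-owned pods, which the DaemonSet controller recreates
// on other nodes automatically.
func countReschedulablePods(pods []corev1.Pod) int {
	count := 0
	for i := range pods {
		if owner := metav1.GetControllerOf(&pods[i]); owner != nil && owner.Kind == "DaemonSet" {
			continue
		}
		count++
	}
	return count
}
```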

@@ -131,6 +131,10 @@ func (c *Controller) deleteNodeClaim(ctx context.Context, nodeClaim *v1.NodeClai
if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
return reconcile.Result{}, client.IgnoreNotFound(err)
}
pods := &corev1.PodList{}
if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
Member

Since Pod objects are quite big, I think we should either create a helper that returns a count of the reschedulable pods currently bound to a node and uses the client.UnsafeDisableDeepCopy list option, or use a metav1.PartialObjectMetadataList -- I'm personally more in favor of the first one.

Member Author

Got it, will adjust
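A sketch of the first option, reusing countReschedulablePods from the earlier sketch (the helper name and signature are assumptions; client.UnsafeDisableDeepCopy is the controller-runtime list option the reviewer mentions, and it requires treating the listed objects as read-only):

```go
import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// nodeReschedulablePodCount lists the pods bound to a node without deep-copying
// them out of the informer cache, then counts only the reschedulable ones.
func nodeReschedulablePodCount(ctx context.Context, c client.Client, nodeName string) (int, error) {
	pods := &corev1.PodList{}
	// UnsafeDisableDeepCopy avoids copying each (potentially large) Pod object;
	// the trade-off is that pods.Items must never be mutated.
	if err := c.List(ctx, pods,
		client.MatchingFields{"spec.nodeName": nodeName},
		client.UnsafeDisableDeepCopy,
	); err != nil {
		return 0, err
	}
	return countReschedulablePods(pods.Items), nil
}
```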

@@ -28,6 +28,7 @@ import (
"github.com/awslabs/operatorpkg/singleton"
Member

Is there some way we could avoid duplicating the pod-list logic and instead handle the pod disruption metric inside the termination controller? One thought I had while considering the implementation: we could make sure the Disrupted status condition is always added to the NodeClaim (expiration and health included), and then, when we look up the disruption reason for pods, we can just read the status condition's reason to determine which reason to fire for that metric.

Member Author

Yeah, I agree, this is a better idea.
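Roughly, the idea would look like this (a sketch under assumptions: the "Disrupted" condition type and the example reason values are taken from the discussion above, not from merged code):

```go
import (
	v1 "sigs.k8s.io/karpenter/pkg/apis/v1"
)

// disruptionReason reads the metric label off the NodeClaim's Disrupted status
// condition, so the termination controller can fire the metric without each
// disruption path duplicating pod-list logic.
func disruptionReason(nodeClaim *v1.NodeClaim) string {
	for _, cond := range nodeClaim.Status.Conditions {
		if cond.Type == "Disrupted" { // condition type name is an assumption
			return cond.Reason // e.g. "Drifted", "Expired", "Unhealthy"
		}
	}
	return "Unknown"
}
```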

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 25, 2025
@k8s-ci-robot (Contributor)

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Labels
cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
needs-rebase (Indicates a PR cannot be merged because it has merge conflicts with HEAD.)
size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add karpenter_pods_disrupted_total when NodeClaims are disrupted
4 participants