feat: Add karpenter_pods_disrupted_total metric #2253
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: omerap12. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing /approve in a comment.
Signed-off-by: Omer Aplatony <[email protected]>
Force-pushed from ea3549d to 201d456
Pull Request Test Coverage Report for Build 15783145450 (Details)
💛 - Coveralls
Friendly ping @jonathan-innis @engedaam
@@ -99,6 +99,11 @@ func (c *Controller) Reconcile(ctx context.Context) (reconcile.Result, error) {
		errs[i] = client.IgnoreNotFound(err)
		return
	}
	pods := &corev1.PodList{}
	var podCount int
	if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err == nil {
Should we include DaemonSet pods in this calculation, or should we just stick to reschedulable (running) pods?
I think we should only include reschedulable pods. DaemonSet pods are handled automatically and are meant to run on every node (or on specific nodes via nodeSelector or affinity), so including them in the disruption metric could misrepresent real workload disruptions.
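The filtering discussed above could look something like the following. This is a minimal, illustrative sketch: the podInfo struct and field names are stand-ins for the real corev1.Pod API, not the PR's actual implementation.

```go
package main

import "fmt"

// podInfo is a simplified stand-in for the pod fields relevant here
// (illustrative only; the real code would inspect corev1.Pod).
type podInfo struct {
	OwnerKind string // controller kind, e.g. "DaemonSet", "ReplicaSet"
	Phase     string // pod phase, e.g. "Running", "Succeeded", "Failed"
}

// countReschedulable counts pods that would actually be rescheduled when a
// node is disrupted: DaemonSet-owned pods are excluded (the DaemonSet
// controller recreates them on matching nodes anyway), as are pods that
// have already terminated.
func countReschedulable(pods []podInfo) int {
	n := 0
	for _, p := range pods {
		if p.OwnerKind == "DaemonSet" {
			continue
		}
		if p.Phase == "Succeeded" || p.Phase == "Failed" {
			continue
		}
		n++
	}
	return n
}

func main() {
	pods := []podInfo{
		{OwnerKind: "ReplicaSet", Phase: "Running"},
		{OwnerKind: "DaemonSet", Phase: "Running"},
		{OwnerKind: "Job", Phase: "Succeeded"},
	}
	fmt.Println(countReschedulable(pods)) // only the ReplicaSet pod counts
}
```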
@@ -131,6 +131,10 @@ func (c *Controller) deleteNodeClaim(ctx context.Context, nodeClaim *v1.NodeClaim
	if err := c.kubeClient.Delete(ctx, nodeClaim); err != nil {
		return reconcile.Result{}, client.IgnoreNotFound(err)
	}
	pods := &corev1.PodList{}
	if err := c.kubeClient.List(ctx, pods, client.MatchingFields{"spec.nodeName": node.Name}); err != nil {
Since Pod objects are quite large, I think we should either create a helper that returns a count of the reschedulable pods currently bound to a node and uses the client.UnsafeDisableDeepCopy list option, OR we should use a PartialObjectMetadataList -- I'm personally more in favor of the first one.
Got it, will adjust
@@ -28,6 +28,7 @@ import (
	"github.com/awslabs/operatorpkg/singleton"
Is there some way we could avoid duplicating the pod list logic and instead handle the pod disruption metric inside the termination controller? One thought I had when thinking about the implementation: we could make sure the Disrupted status condition is always added to the NodeClaim (expiration and health included), and then, when we look up the disruption reason for pods, we can just read the status condition's reason to determine which reason label to fire for that metric.
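The condition-based lookup suggested above could be sketched roughly like this. Again illustrative only: the condition struct stands in for the real NodeClaim status condition type, and the "Disrupted" type name and "Drifted" reason are assumptions for the example, not the PR's final API.

```go
package main

import "fmt"

// condition is a simplified stand-in for a NodeClaim status condition.
type condition struct {
	Type   string
	Reason string
}

// disruptionReason returns the Reason of the "Disrupted" condition, if set.
// If the termination controller can rely on this condition always being
// present (including for expiration and health), the metric label can be
// derived here instead of duplicating pod-list logic in each controller.
func disruptionReason(conds []condition) (string, bool) {
	for _, c := range conds {
		if c.Type == "Disrupted" {
			return c.Reason, true
		}
	}
	return "", false
}

func main() {
	conds := []condition{
		{Type: "Initialized", Reason: ""},
		{Type: "Disrupted", Reason: "Drifted"},
	}
	if reason, ok := disruptionReason(conds); ok {
		fmt.Println(reason) // label value for the disruption metric
	}
}
```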
Yeah, I agree this is a better idea.
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Fixes #2020
Description
Added a new metric, karpenter_pods_disrupted_total, to track the number of pods disrupted during node termination events.
How was this change tested?
Added unit tests for the new metric.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.