perf: Update the Node Repair Controller for requeue time #2286

Open
wants to merge 1 commit into
base: main

Conversation

engedaam
Contributor

@engedaam engedaam commented Jun 4, 2025

Fixes #N/A

Description

  • Implemented 5-minute requeue interval when >20% of node pools or clusters are unhealthy
  • Added filtering to process node updates only when status conditions change
  • Added unit tests to verify node disruption protection when cluster health is degraded (>20% unhealthy)

How was this change tested?

  • Unit tests verify nodes are not disrupted when cluster health threshold is exceeded
  • Validated requeue behavior with unhealthy node pools

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 4, 2025
@k8s-ci-robot k8s-ci-robot requested review from mwielgus and tallaxes June 4, 2025 06:18
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 4, 2025
@coveralls

Pull Request Test Coverage Report for Build 15435053326

Details

  • 2 of 7 (28.57%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.3%) to 82.318%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
| pkg/controllers/node/health/controller.go | 2 | 7 | 28.57% |

| Totals | Coverage Status |
| Change from base Build 15430250801 | 0.3% |
| Covered Lines | 10298 |
| Relevant Lines | 12510 |

💛 - Coveralls

@jonathan-innis
Member

/assign @jonathan-innis

- For(&corev1.Node{}, builder.WithPredicates(nodeutils.IsManagedPredicateFuncs(c.cloudProvider))).
+ For(&corev1.Node{}, builder.WithPredicates(nodeutils.IsManagedPredicateFuncs(c.cloudProvider), predicate.Funcs{
+     UpdateFunc: func(e event.UpdateEvent) bool {
+         return !equality.Semantic.DeepEqual(e.ObjectOld.(*corev1.Node).Status.Conditions, e.ObjectNew.(*corev1.Node).Status.Conditions)
Member

Just sanity-check how this handles arrays. I mainly want to make sure it compares the contents of each element, and not just which items are in the array.
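
One quick way to check the slice behaviour is with the standard library's reflect.DeepEqual, which walks slices element by element and compares every field. To my understanding equality.Semantic.DeepEqual takes the same element-wise approach, with extra handling for a few Kubernetes types, but that is worth confirming against the apimachinery source. The `condition` struct below is a hypothetical stand-in for corev1.NodeCondition, so this is a sketch of the comparison semantics rather than a test of the real API types.

```go
package main

import (
	"fmt"
	"reflect"
)

// condition is a hypothetical stand-in for corev1.NodeCondition.
type condition struct {
	Type              string
	Status            string
	LastHeartbeatUnix int64 // stand-in for LastHeartbeatTime
}

// conditionsEqual reports whether two slices hold the same elements,
// in the same order, with every field equal.
func conditionsEqual(a, b []condition) bool {
	return reflect.DeepEqual(a, b)
}

func main() {
	old := []condition{{Type: "Ready", Status: "True", LastHeartbeatUnix: 100}}
	same := []condition{{Type: "Ready", Status: "True", LastHeartbeatUnix: 100}}
	flipped := []condition{{Type: "Ready", Status: "False", LastHeartbeatUnix: 100}}

	fmt.Println(conditionsEqual(old, same))    // same contents, different backing arrays
	fmt.Println(conditionsEqual(old, flipped)) // one field differs inside the element
}
```

The first comparison is on two distinct slices with identical contents, so it exercises content comparison rather than identity.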

UpdateFunc: func(e event.UpdateEvent) bool {
return !equality.Semantic.DeepEqual(e.ObjectOld.(*corev1.Node).Status.Conditions, e.ObjectNew.(*corev1.Node).Status.Conditions)
},
DeleteFunc: func(e event.DeleteEvent) bool { return true },
Member

Not defining a func means it defaults to true. If you want it to be false, then you should make that explicit.
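
For readers less familiar with the pattern, here is a minimal sketch of the default-to-true behaviour. The `funcs` and `updateEvent` types are hypothetical stand-ins written for this illustration, not the controller-runtime types themselves.

```go
package main

import "fmt"

// updateEvent is a hypothetical stand-in for an update event.
type updateEvent struct {
	oldStatus, newStatus string
}

// funcs is a hypothetical stand-in for a predicate struct with optional hooks.
type funcs struct {
	UpdateFunc func(updateEvent) bool
	DeleteFunc func() bool
}

// Update admits the event when no UpdateFunc is set.
func (f funcs) Update(e updateEvent) bool {
	if f.UpdateFunc != nil {
		return f.UpdateFunc(e)
	}
	return true
}

// Delete admits the event when no DeleteFunc is set.
func (f funcs) Delete() bool {
	if f.DeleteFunc != nil {
		return f.DeleteFunc()
	}
	return true
}

func main() {
	onlyUpdate := funcs{
		UpdateFunc: func(e updateEvent) bool { return e.oldStatus != e.newStatus },
	}
	fmt.Println(onlyUpdate.Delete()) // no DeleteFunc set, so the event is admitted

	dropDeletes := funcs{DeleteFunc: func() bool { return false }}
	fmt.Println(dropDeletes.Delete()) // explicit false filters the event out
}
```

Under that default, a `DeleteFunc` that only returns true behaves the same as leaving the field unset.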

@@ -104,7 +111,7 @@ func (c *Controller) Reconcile(ctx context.Context, node *corev1.Node) (reconcil
      return reconcile.Result{}, client.IgnoreNotFound(err)
  }
  if !nodePoolHealthy {
-     return reconcile.Result{}, c.publishNodePoolHealthEvent(ctx, node, nodeClaim, nodePoolName)
+     return reconcile.Result{RequeueAfter: 5 * time.Minute}, c.publishNodePoolHealthEvent(ctx, node, nodeClaim, nodePoolName)
Member

This alone won't work, right? A heartbeat event comes in every 30s or so, and that is going to requeue us more aggressively than the 5-minute interval, even with the event filter that you have.

Member

We either need a cache rate limiter that ensures we don't check as aggressively as we do today, OR we need a better event filter that checks something like whether the status state has actually changed.
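
A rough sketch of the second option: compare only each condition's type and status, and ignore timestamps. corev1.NodeCondition has a LastHeartbeatTime field, so if a heartbeat refreshes it, a deep comparison of the whole slice would treat every heartbeat as a change. The `condition` struct and `statusChanged` helper below are hypothetical names for illustration, and the helper assumes condition types are unique within a slice.

```go
package main

import "fmt"

// condition is a hypothetical stand-in for corev1.NodeCondition.
type condition struct {
	Type              string
	Status            string
	LastHeartbeatUnix int64 // stand-in for LastHeartbeatTime
}

// statusChanged reports whether a condition was added or removed, or whether
// any condition's Status differs. Heartbeat timestamps are ignored.
// Assumes each Type appears at most once per slice.
func statusChanged(oldConds, newConds []condition) bool {
	if len(oldConds) != len(newConds) {
		return true
	}
	oldStatus := make(map[string]string, len(oldConds))
	for _, c := range oldConds {
		oldStatus[c.Type] = c.Status
	}
	for _, c := range newConds {
		if s, ok := oldStatus[c.Type]; !ok || s != c.Status {
			return true
		}
	}
	return false
}

func main() {
	old := []condition{{Type: "Ready", Status: "True", LastHeartbeatUnix: 100}}
	heartbeat := []condition{{Type: "Ready", Status: "True", LastHeartbeatUnix: 130}}
	flipped := []condition{{Type: "Ready", Status: "False", LastHeartbeatUnix: 130}}

	fmt.Println(statusChanged(old, heartbeat)) // only the timestamp moved
	fmt.Println(statusChanged(old, flipped))   // Ready went from True to False
}
```

A filter built this way would drop heartbeat-only updates and still pass real transitions, which is what would let the 5-minute requeue take effect.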

@@ -104,7 +111,7 @@ func (c *Controller) Reconcile(ctx context.Context, node *corev1.Node) (reconcil
return reconcile.Result{}, client.IgnoreNotFound(err)
}
if !nodePoolHealthy {
return reconcile.Result{}, c.publishNodePoolHealthEvent(ctx, node, nodeClaim, nodePoolName)
Member

Separately from reconciling less often than we do today, I think we should also rate limit the cluster health check itself. We shouldn't be listing as often as we do just to check whether the cluster is healthy.
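
One possible shape for that rate limit is a small time-based cache around the health check, so the expensive list runs at most once per interval no matter how many reconciles ask. This is only a sketch: `healthCache`, `newHealthCache`, and the 5-minute TTL in the example are hypothetical choices, not something taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// healthCache is a hypothetical helper that memoizes an expensive health
// check and re-runs it only after ttl has elapsed.
type healthCache struct {
	mu      sync.Mutex
	ttl     time.Duration
	check   func() bool // the expensive check, e.g. a list over all nodes
	healthy bool
	expires time.Time
}

func newHealthCache(ttl time.Duration, check func() bool) *healthCache {
	return &healthCache{ttl: ttl, check: check}
}

// Healthy returns the cached result, refreshing it when it has expired.
func (c *healthCache) Healthy() bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if now := time.Now(); !now.Before(c.expires) {
		c.healthy = c.check()
		c.expires = now.Add(c.ttl)
	}
	return c.healthy
}

func main() {
	calls := 0
	cache := newHealthCache(5*time.Minute, func() bool {
		calls++ // stands in for the list call
		return false
	})

	cache.Healthy()
	cache.Healthy()
	fmt.Println(calls) // the second call is served from the cache
}
```

The trade-off is staleness: a pool that recovers mid-interval would still be reported unhealthy until the cached result expires.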

@engedaam engedaam changed the title from "chore: Update the Node Repair Controller for requeue time" to "perf: Update the Node Repair Controller for requeue time" Jun 20, 2025
@engedaam engedaam force-pushed the update-node-repair-controller branch from 96f9118 to 8fb8d49 Compare June 20, 2025 17:53
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam
Once this PR has been reviewed and has the lgtm label, please ask for approval from jonathan-innis. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment
