
Conversation

elmiko
Contributor

@elmiko elmiko commented Sep 12, 2025

What type of PR is this?

/kind bug

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #8494

Special notes for your reviewer:

this is a challenging scenario to debug; please see the related issue.

Does this PR introduce a user-facing change?

The ClusterAPI provider will not scale down a MachineDeployment that is undergoing an update.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the release-note, do-not-merge/work-in-progress, kind/bug, cncf-cla: yes, and do-not-merge/needs-area labels on Sep 12, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elmiko
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the area/cluster-autoscaler, area/provider/alicloud, area/provider/aws, area/provider/azure, area/provider/cluster-api, and area/provider/coreweave labels and removed the do-not-merge/needs-area label on Sep 12, 2025
@k8s-ci-robot added the size/L, area/provider/digitalocean, area/provider/equinixmetal, area/provider/externalgrpc, area/provider/gce, area/provider/hetzner, area/provider/huaweicloud, area/provider/ionoscloud, area/provider/kwok, area/provider/linode, area/provider/magnum, area/provider/oci, area/provider/rancher, and area/provider/utho labels on Sep 12, 2025
@elmiko
Contributor Author

elmiko commented Sep 12, 2025

i'm still working on some clusterapi-specific unit tests, but they are quite challenging given the mocks that are needed.

@elmiko
Contributor Author

elmiko commented Sep 12, 2025

cc @sbueringer @fabriziopandini this isn't quite done yet, but the business logic seems to be working as expected.

    } else if err != nil && err != cloudprovider.ErrNotImplemented {
        klog.Warningf("Error while checking if node is a candidate for deletion %s: %v", node.Name, err)
        continue
    }
    nodeGroup, err := ctx.CloudProvider.NodeGroupForNode(node)
Contributor

According to NodeGroupForNode()'s interface comment ("nil if the node should not be processed by cluster autoscaler"), it looks like we could also put the capi rollout logic into this interface's implementation?
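
Concretely, that suggestion would amount to something roughly like the following (a hypothetical sketch, not code from this PR; the lookup helpers are placeholders):

    package example

    import (
        corev1 "k8s.io/api/core/v1"

        "k8s.io/autoscaler/cluster-autoscaler/cloudprovider"
    )

    // provider is a stand-in for the clusterapi cloud provider; the two function
    // fields are placeholders for the existing node-group lookup and for a rollout
    // check on the MachineDeployment that owns the node.
    type provider struct {
        lookupNodeGroup               func(*corev1.Node) (cloudprovider.NodeGroup, error)
        machineDeploymentIsRollingOut func(*corev1.Node) (bool, error)
    }

    // NodeGroupForNode sketches the suggestion above: return nil while the node's
    // MachineDeployment is rolling out, so the autoscaler skips the node entirely.
    func (p *provider) NodeGroupForNode(node *corev1.Node) (cloudprovider.NodeGroup, error) {
        rolling, err := p.machineDeploymentIsRollingOut(node)
        if err != nil {
            return nil, err
        }
        if rolling {
            // Per the interface comment, nil means "do not process this node".
            return nil, nil
        }
        return p.lookupNodeGroup(node)
    }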

Contributor Author

@elmiko elmiko Sep 15, 2025

ah, good suggestion. that might be a much simpler way to solve this. i'll research that approach, thank you!

Member

Wouldn't this also break scale up's and potentially a lot of other places where this func is used?

Contributor Author

Wouldn't this also break scale up's and potentially a lot of other places where this func is used?

yes, unfortunately after doing more research i don't think this would work for us, for a couple of reasons:

  1. there are other uses of NodeGroupForNode that should always return accurate information
  2. clusterapi, and potentially other providers that perform node updates, need to know that the autoscaler will be deleting a node in order to make decisions about the deletion process. NodeGroupForNode does not pass this context.

this would only work if we were to assume that any node undergoing an update should be ignored completely by the autoscaler. i'm not sure we can make that assertion.

my hope is that in the future, once something like the Declarative Node Maintenance api has been accepted, we will be able to coordinate using that api.

Member

@sbueringer sbueringer Sep 16, 2025

2. clusterapi, and potentially other providers that perform node updates, need to know that the autoscaler will be deleting a node in order to make decisions about the deletion process. NodeGroupForNode does not pass this context.

Honestly, I don't know. But NodeGroupForNode is called in 21 places, so I don't know what other impact this would have or whether it would produce new issues elsewhere. I think this would require extensive research and testing.

Our idea so far was to make a surgical change to ensure GetScaleDownCandidates does not return Nodes/Machines of an MD in rollout as scale-down candidates. This means that Nodes/Machines of such an MD are simply not considered for scale down, while everything else keeps working as it does today.

Modifying NodeGroupForNode would significantly increase the blast radius of this fix.
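
To make the intent concrete, a minimal sketch of that filtering step could look like this (the helper name is hypothetical and this is not the code in this PR):

    package example

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // filterScaleDownCandidates sketches the surgical change described above: drop
    // Nodes whose owning MachineDeployment is mid-rollout before they ever become
    // scale-down candidates, and leave every other code path untouched. The
    // isOwnedByRollingMachineDeployment helper is a placeholder.
    func filterScaleDownCandidates(nodes []*corev1.Node, isOwnedByRollingMachineDeployment func(*corev1.Node) (bool, error)) []*corev1.Node {
        candidates := make([]*corev1.Node, 0, len(nodes))
        for _, node := range nodes {
            rolling, err := isOwnedByRollingMachineDeployment(node)
            if err != nil {
                // On lookup errors, keep today's behavior and leave the node in the list.
                candidates = append(candidates, node)
                continue
            }
            if rolling {
                // Deleting a Machine mid-rollout can remove the wrong Node (see #8494),
                // so exclude it from scale-down consideration.
                continue
            }
            candidates = append(candidates, node)
        }
        return candidates
    }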

    })
    objs, err := c.machineSetInformer.Lister().ByNamespace(r.GetNamespace()).List(selector)
    if err != nil {
        return nil, err
Member

Maybe wrap the error here to provide a bit of context
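
For example, something along these lines (an illustrative suggestion, not the PR's actual change):

    objs, err := c.machineSetInformer.Lister().ByNamespace(r.GetNamespace()).List(selector)
    if err != nil {
        // Wrap the error with the namespace and resource being listed so failures
        // are easier to trace back to this lookup (assumes "fmt" is imported).
        return nil, fmt.Errorf("failed to list MachineSets in namespace %q: %w", r.GetNamespace(), err)
    }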

Member

@elmiko I think you missed this one (but up to you of course :))

@aleksandra-malinowska
Contributor

If I understand #8494 correctly, this change is meant to prevent CA from attempting to scale down a node that has already been cordoned and drained by Cluster API (and it's up to Cluster API to remove it).

Can Cluster API apply the scale-down disabled annotation when draining? https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/eligibility/eligibility.go#L39
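
For reference, applying that annotation from the Cluster API side could look roughly like this (a hypothetical sketch; the helper and client names are placeholders):

    package example

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    // ScaleDownDisabledAnnotation is the annotation checked by the autoscaler's
    // scale-down eligibility code linked above.
    const ScaleDownDisabledAnnotation = "cluster-autoscaler.kubernetes.io/scale-down-disabled"

    // markNodeScaleDownDisabled is a hypothetical helper showing what "apply the
    // annotation when draining" could look like.
    func markNodeScaleDownDisabled(ctx context.Context, client kubernetes.Interface, nodeName string) error {
        node, err := client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
        if err != nil {
            return fmt.Errorf("getting node %q: %w", nodeName, err)
        }
        if node.Annotations == nil {
            node.Annotations = map[string]string{}
        }
        node.Annotations[ScaleDownDisabledAnnotation] = "true"
        _, err = client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
        return err
    }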

@sbueringer
Member

sbueringer commented Sep 16, 2025

If I understand #8494 correctly, this change is meant to prevent CA from attempting to scale down a node that has already been cordoned and drained by Cluster API (and it's up to Cluster API to remove it).

Can Cluster API apply the scale-down disabled annotation when draining? master/cluster-autoscaler/core/scaledown/eligibility/eligibility.go#L39

No, this is about preventing the cluster autoscaler from deleting / scaling down a Node altogether while Cluster API is doing a rollout.
The problem is that if the autoscaler tries to delete / scale down a Node during a rollout, there is a high chance it will end up deleting the wrong Node (and that then repeats until there are no Nodes left for the node group).

@elmiko
Contributor Author

elmiko commented Sep 16, 2025

+1 to what @sbueringer is saying. also, this problem is currently confined to the clusterapi provider, but it could affect any provider that performs node updates in a similar fashion. i think we need to make the autoscaler smarter in these scenarios where a cloud provider needs more control over which nodes are being marked for removal during a maintenance window.

This function allows cloud providers to specify when a node is not a
good candidate for scaling down. This will occur before the autoscaler has
begun to cordon, drain, and taint any node for scale down.

Also adds a unit test for the prefiltering node processor.
The initial implementation of this function for clusterapi will return
that a node is not a good candidate for scale down when it belongs to a
MachineDeployment that is currently rolling out an upgrade.
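
A rough sketch of what such a hook could look like (interface and method names here are illustrative, not necessarily the ones used in this PR):

    package example

    import (
        corev1 "k8s.io/api/core/v1"
    )

    // ScaleDownCandidateChecker is an illustrative shape for the hook described in
    // the commit messages above: cloud providers get a chance to veto a node as a
    // scale-down candidate before the autoscaler cordons, drains, or taints it.
    type ScaleDownCandidateChecker interface {
        IsGoodScaleDownCandidate(node *corev1.Node) (bool, error)
    }

    // clusterAPIChecker sketches the clusterapi behavior described above: a node is
    // not a good candidate while its MachineDeployment is rolling out an upgrade.
    type clusterAPIChecker struct {
        // machineDeploymentIsRollingOut is a placeholder for the provider's lookup
        // of the rollout status of the MachineDeployment that owns the node.
        machineDeploymentIsRollingOut func(node *corev1.Node) (bool, error)
    }

    func (c *clusterAPIChecker) IsGoodScaleDownCandidate(node *corev1.Node) (bool, error) {
        rolling, err := c.machineDeploymentIsRollingOut(node)
        if err != nil {
            return false, err
        }
        return !rolling, nil
    }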
@elmiko elmiko force-pushed the add-good-candidate-node-interface branch from cae87c7 to 150cf78 on September 16, 2025 at 19:16
@elmiko
Contributor Author

elmiko commented Sep 16, 2025

updated with @sbueringer's suggestions.

@sbueringer
Member

Answered above

Successfully merging this pull request may close these issues.

CA ClusterAPI provider can delete wrong node when scale-down occurs during MachineDeployment upgrade