feat: ca: do not backoff scale up on specified errors #7777
base: master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: Chase-Marino
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Welcome @Chase-Marino!
Hi @Chase-Marino. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
maxNodeProvisionTime = flag.Duration("max-node-provision-time", 15*time.Minute, "The default maximum time CA waits for node to be provisioned - the value can be overridden per node group")
maxPodEvictionTime = flag.Duration("max-pod-eviction-time", 2*time.Minute, "Maximum time CA tries to evict a pod before giving up")
nodeGroupsFlag = multiStringFlag(
maxTotalUnreadyPercentage = flag.Float64("max-total-unready-percentage", 45, "Maximum percentage of unready nodes in the cluster. After this is exceeded, CA halts operations")
the whitespace change here is a bit sus and could result in merge conflicts, see if that can be avoided
Weird, it doesn't show up on my local; will check it out later.
backoff := true
for _, part := range e.autoscalingContext.AutoscalingOptions.ScaleUpIgnoreBackoffErrors {
	if strings.Contains(err.Error(), part) {
		e.autoscalingContext.LogRecorder.Eventf(apiv1.EventTypeWarning, "ScaledUpGroup", "Scale-up: retriable error %s", aerr.Error())
would fit better into the else of if backoff
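A rough sketch of that suggestion, assuming the ignore-list loop simply sets backoff = false on a match and that err and aerr in the quoted hunks refer to the same underlying error (aerr is used throughout below); the surrounding executor.go context is taken from the quoted diff and may differ from the actual PR:

backoff := true
for _, part := range e.autoscalingContext.AutoscalingOptions.ScaleUpIgnoreBackoffErrors {
	if strings.Contains(aerr.Error(), part) {
		backoff = false
		break
	}
}
if backoff {
	e.autoscalingContext.LogRecorder.Eventf(apiv1.EventTypeWarning, "FailedToScaleUpGroup", "Scale-up failed for group %s: %v", info.Group.Id(), aerr)
	e.scaleStateNotifier.RegisterFailedScaleUp(info.Group, string(aerr.Type()), aerr.Error(), gpuResourceName, gpuType, now)
} else {
	// Event emission moved out of the matching loop, so a hit only flips the flag.
	e.autoscalingContext.LogRecorder.Eventf(apiv1.EventTypeWarning, "ScaledUpGroup", "Scale-up: retriable error %s", aerr.Error())
}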
if backoff {
	e.autoscalingContext.LogRecorder.Eventf(apiv1.EventTypeWarning, "FailedToScaleUpGroup", "Scale-up failed for group %s: %v", info.Group.Id(), err)
	e.scaleStateNotifier.RegisterFailedScaleUp(info.Group, string(aerr.Type()), aerr.Error(), gpuResourceName, gpuType, now)
FYI we might want to still call this to get these bumped:
csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: reason, Time: currentTime})
metrics.RegisterFailedScaleUp(reason, gpuResourceName, gpuType)
but not call the csr.backoffNodeGroup
... RegisterFailedScaleUp says "It will mark this group as not safe to autoscale", so maybe avoiding it completely is the right call, but I'm not sure how the metrics and failures are used
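One possible shape of that middle ground, sketched very roughly: a registry-side helper that still appends to scaleUpFailures and bumps the metric, but never calls csr.backoffNodeGroup. The method name and signature below are hypothetical, not the existing clusterstate API, and the locking and other bookkeeping of the real registerFailedScaleUp are omitted:

// Hypothetical helper: record the failed scale-up for metrics and scaleUpFailures
// without backing the group off, so it stays eligible for further scale-up attempts.
func (csr *ClusterStateRegistry) registerFailedScaleUpNoBackoff(nodeGroup cloudprovider.NodeGroup, reason metrics.FailedScaleUpReason, gpuResourceName, gpuType string, currentTime time.Time) {
	csr.scaleUpFailures[nodeGroup.Id()] = append(csr.scaleUpFailures[nodeGroup.Id()], ScaleUpFailure{NodeGroup: nodeGroup, Reason: reason, Time: currentTime})
	metrics.RegisterFailedScaleUp(reason, gpuResourceName, gpuType)
	// Intentionally no csr.backoffNodeGroup(...) call here.
}

The executor could then call a variant like this from the non-backoff branch instead of skipping registration entirely, keeping the failure counters accurate.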
What type of PR is this?
/kind feature
Which component this PR applies to?
cluster-autoscaler
What this PR does / why we need it:
This PR adds an option to specify errors that should not trigger backoff during scale-up. When my cluster needs to scale up and hits a throttling error, I want the autoscaler to keep retrying.
Currently, such an error can leave ASGs scaled imbalanced, or block scaling entirely if it hits all ASGs, and AWS throttling can block scale-up for 10+ minutes.
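For illustration, the option could be wired up as another multiStringFlag next to the existing flags quoted above; the variable name, flag name, and help text here are placeholders, not necessarily what this PR uses:

// Hypothetical flag registration, mirroring the multiStringFlag style in main.go.
scaleUpIgnoreBackoffErrorsFlag = multiStringFlag(
	"scale-up-ignore-backoff-errors",
	"Error message substrings for which a failed scale-up should not put the node group into backoff. Can be used multiple times.")

With something like --scale-up-ignore-backoff-errors=Throttling, an AWS rate-limit failure would still be logged, but the ASG would not be marked as backed off, so the next loop iteration can retry it.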
Which issue(s) this PR fixes:
#5271
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: