
Add option to skip similar nodegroup recomputation #6926

Open
rrangith wants to merge 2 commits into master from skip-similar-nodegroup-recomputation
Conversation

@rrangith (Contributor) commented Jun 14, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Related to #6940

This recomputation used to only occur when the bestOption NodeGroup did not exist, but was changed in #5802. There are cases where an expander could modify the bestOption's similar nodegroups, such as custom logic in the gRPC expander.

In cases like this, we should have a CLI option to trust the expander's similar nodegroups and skip the recomputation.

If a user does not enable this option, then by default the behaviour will stay the same. This will only skip similar nodegroup recomputation for users who enable this option.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Added `skip-similar-node-group-recomputation` flag. If enabled, skips similar NodeGroup recomputation for the best option returned by the expander during scaleup.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 14, 2024
@k8s-ci-robot (Contributor):

Hi @rrangith. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added area/cluster-autoscaler size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jun 14, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 12, 2024
@k8s-triage-robot:

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 10, 2024
@rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 32774c7 to 450a753 on November 6, 2024 21:11
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 6, 2024
@rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 450a753 to 35c514d on November 6, 2024 21:14
@rrangith (Contributor Author) commented Nov 6, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 6, 2024
@elmiko (Contributor) left a comment:

there had been some discussions about trying to limit the new flags we are adding to the autoscaler, but i'm not sure if there was ever any guidance about that. regardless, i think this new flag should also be mentioned in the FAQ, see https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/FAQ.md#what-are-the-parameters-to-ca

@rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 35c514d to 2627751 on December 2, 2024 15:51
@rrangith rrangith requested a review from elmiko December 2, 2024 15:51
@elmiko (Contributor) commented Dec 2, 2024

just a question as i'm reviewing, is there any relationship or interaction between this flag and the balance-similar-node-groups flag? (eg does the latter need to be enabled or anything special like that)

@rrangith (Contributor Author) commented Dec 2, 2024

just a question as i'm reviewing, is there any relationship or interaction between this flag and the balance-similar-node-groups flag? (eg does the latter need to be enabled or anything special like that)

There is no strict relationship between the two flags for things to function properly. However, if you enable skip-similar-node-group-recomputation but have balance-similar-node-groups disabled, then it will just do nothing: https://github.com/rrangith/autoscaler/blob/26277519bfe0728a532076f22e4485f1e7ba4adb/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L722-L730
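For illustration, the gating described above can be sketched as follows. This is a hypothetical model, not the actual orchestrator code; the function and parameter names are invented:

```go
package main

import "fmt"

// similarGroupsForScaleUp models how the two flags interact: the skip flag
// only matters once balancing is enabled, because similar node groups are
// never consulted otherwise.
func similarGroupsForScaleUp(balanceEnabled, skipRecompute bool, fromExpander, recomputed []string) []string {
	if !balanceEnabled {
		// balance-similar-node-groups is off: similar node groups are
		// never used, so the skip flag has no effect.
		return nil
	}
	if skipRecompute {
		// Trust the similar node groups the expander returned.
		return fromExpander
	}
	// Default behaviour: use the freshly recomputed set.
	return recomputed
}

func main() {
	fromExpander := []string{"B"}
	recomputed := []string{"A", "B"}
	fmt.Println(similarGroupsForScaleUp(false, true, fromExpander, recomputed))  // []
	fmt.Println(similarGroupsForScaleUp(true, true, fromExpander, recomputed))   // [B]
	fmt.Println(similarGroupsForScaleUp(true, false, fromExpander, recomputed))  // [A B]
}
```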

I was debating mentioning that in the FAQ or cli arg description, but wasn't sure if I should add even more words to it

@elmiko (Contributor) commented Dec 2, 2024

I was debating mentioning that in the FAQ or cli arg description, but wasn't sure if I should add even more words to it

i think it's worth mentioning, if only so people know they need to have balance-similar-node-groups enabled to get this behavior.

@rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 2627751 to 0a2c91f on December 2, 2024 22:58
@elmiko (Contributor) left a comment:

/lgtm

we will need a review from a core maintainer for the flag change.

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 3, 2024
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 6, 2025
@rrangith force-pushed the skip-similar-nodegroup-recomputation branch from 0a2c91f to 3f9efa1 on January 9, 2025 16:36
@k8s-ci-robot k8s-ci-robot removed lgtm "Looks good to me", indicates that a PR is ready to be merged. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Jan 9, 2025
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Jan 9, 2025
@jackfrancis (Contributor):

/test pull-cluster-autoscaler-e2e-azure-master

Reviewed diff:

	newNodes int,
	nodeInfos map[string]*framework.NodeInfo,
	schedulablePodGroups map[string][]estimator.PodEquivalenceGroup,
 ) ([]nodegroupset.ScaleUpInfo, errors.AutoscalerError) {
-	// Recompute similar node groups in case they need to be updated
-	similarNodeGroups := o.ComputeSimilarNodeGroups(nodeGroup, nodeInfos, schedulablePodGroups, now)
+	similarNodeGroups := bestOption.SimilarNodeGroups
Contributor review comment:
It does seem like we're reliably populating bestOption.SimilarNodeGroups as part of the flow before invoking this balanceScaleUps method (L148 where we invoke o.ComputeExpansionOption).

That said, do we want to do some checking here, to make sure we have a good default value if "skip similar nodegroup recomputation" is enabled?

@rrangith (Contributor Author) replied:

I don't think there is a good default value, since my thinking with this feature was to fully rely on what the expander deems the similar nodegroups here. So for example, if the bestOption is nodegroup A and it has similar nodegroups B and C, but the expander removes both B and C as similar nodegroups, then CA should respect that and only scale up nodegroup A.

Contributor replied:

makes sense, thx

@jackfrancis (Contributor):

/lgtm

/assign @x13n @towca

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@towca (Collaborator) commented Jan 29, 2025

I'm afraid I'm a bit lost here. Why should Expander be the source of truth for computing similar node groups, don't we have NodeGroupSetProcessor for that?

Expander modifying the options it gets is IMO very surprising and could lead to subtle bugs down the road. Expander's responsibility is to pick between pre-computed options.

I also don't get the reasoning for moving the recalculation in #5802, the only cases I can think of where we'd want to recalculate is if a node group needs to be created. @BigDarkClown Do you remember the context for this one?

@rrangith (Contributor Author):

I'm afraid I'm a bit lost here. Why should Expander be the source of truth for computing similar node groups, don't we have NodeGroupSetProcessor for that?

Expander modifying the options it gets is IMO very surprising and could lead to subtle bugs down the road. Expander's responsibility is to pick between pre-computed options.

@towca my thinking was that this option could be used to give more power to the expander, beyond simply picking between pre-computed options.

With the gRPC expander for example, you can have any custom logic you want. So if for example I want to filter out any nodegroups that have the label "foo", I could have logic in my gRPC expander to filter that out as a best option.

The problem occurs when balancing happens. Let's say I have nodegroups A and B, nodegroup A has the label "foo", and nodegroup B does not have this label. Let's say they are both similar nodegroups, so scaleups would be balanced across the 2 nodegroups.

Currently (without changes from my PR), Expander can omit nodegroup A and return nodegroup B as the best option (with nodegroup A as a similar nodegroup). Then scaleups will be balanced and nodegroup A will get scaled up even though I didn't want scaleups on any nodegroup with label "foo".

However if we give the power to Expander to not only control the best option, but also the best option's similar nodegroups, then we can completely block scaleups on all nodegroups with label "foo". That is what my PR will allow. And it is behind a flag so that it is completely opt-in, by default nothing will change.
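The scenario above can be modeled with a small, self-contained sketch. The `Option` type, `expanderBestOption`, and `recompute` are illustrative stand-ins, not the real cluster-autoscaler API:

```go
package main

import "fmt"

// Option mirrors the idea of an expansion option: a node group plus the
// similar node groups that scale-ups would be balanced across.
type Option struct {
	NodeGroup string
	Similar   []string
}

// expanderBestOption models a gRPC expander that refuses to scale any node
// group carrying label "foo": it drops such groups from the similar list.
func expanderBestOption(opt Option, labelled map[string]bool) Option {
	var kept []string
	for _, ng := range opt.Similar {
		if !labelled[ng] {
			kept = append(kept, ng)
		}
	}
	return Option{NodeGroup: opt.NodeGroup, Similar: kept}
}

// recompute models the post-expander recomputation: it restores the
// similarity result, discarding whatever the expander filtered out.
func recompute(opt Option, allSimilar []string) Option {
	return Option{NodeGroup: opt.NodeGroup, Similar: allSimilar}
}

func main() {
	labelled := map[string]bool{"A": true} // node group A carries label "foo"
	best := expanderBestOption(Option{NodeGroup: "B", Similar: []string{"A"}}, labelled)
	fmt.Println("expander's similar groups:", best.Similar) // []

	// Without the proposed flag, recomputation re-adds A, and balancing
	// would scale it up anyway.
	best = recompute(best, []string{"A"})
	fmt.Println("after recomputation:", best.Similar) // [A]
}
```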

I also don't get the reasoning for moving the recalculation in #5802, the only cases I can think of where we'd want to recalculate is if a node group needs to be created. @BigDarkClown Do you remember the context for this one?

Yea if #5802 was never merged, then we wouldn't need this PR. The recalculation is what is preventing the expander from having full power over the similar nodegroups, but I don't think this was the intent of the original PR

@towca (Collaborator) commented Jan 31, 2025

With the gRPC expander for example, you can have any custom logic you want. So if for example I want to filter out any nodegroups that have the label "foo", I could have logic in my gRPC expander to filter that out as a best option.

You certainly can have any custom logic behind the RPC, but if that logic breaks some assumptions that CA makes about Expanders, it'll lead to bugs. Balancing is one example, but I'm not convinced that there aren't more, or that we don't introduce more down the road. Especially because all of the custom logic is out-of-tree, so it's hard to validate if it breaks while doing changes in-tree.

The problem occurs when balancing happens. Let's say I have nodegroups A and B, nodegroup A has the label "foo", and nodegroup B does not have this label. Let's say they are both similar nodegroups, so scaleups would be balanced across the 2 nodegroups.

Currently (without changes from my PR), Expander can omit nodegroup A and return nodegroup B as the best option (with nodegroup A as a similar nodegroup). Then scaleups will be balanced and nodegroup A will get scaled up even though I didn't want scaleups on any nodegroup with label "foo".

However if we give the power to Expander to not only control the best option, but also the best option's similar nodegroups, then we can completely block scaleups on all nodegroups with label "foo". That is what my PR will allow. And it is behind a flag so that it is completely opt-in, by default nothing will change.

I get the motivation, but why can't the gRPC provider just implement its own NodeGroupSetProcessor (or customize the current implementation), and let the user hook into its FindSimilarNodeGroups()? Wouldn't this achieve the same thing without changing component responsibilities?

Yea if #5802 was never merged, then we wouldn't need this PR. The recalculation is what is preventing the expander from having full power over the similar nodegroups, but I don't think this was the intent of the original PR

I synced with @BigDarkClown offline, and it was actually the intent of that PR. We had a bug in our GKE-specific Expander that resulted in clearing the SimilarNodeGroups field. Since Expanders aren't supposed to do that, the recalculation was added as a bit of defensive programming to guard against similar bugs in Expander implementations (in addition to fixing the Expander implementation itself of course).

@rrangith (Contributor Author) commented Feb 3, 2025

With the gRPC expander for example, you can have any custom logic you want. So if for example I want to filter out any nodegroups that have the label "foo", I could have logic in my gRPC expander to filter that out as a best option.

You certainly can have any custom logic behind the RPC, but if that logic breaks some assumptions that CA makes about Expanders, it'll lead to bugs. Balancing is one example, but I'm not convinced that there aren't more, or that we don't introduce more down the road. Especially because all of the custom logic is out-of-tree, so it's hard to validate if it breaks while doing changes in-tree.

I am curious about other types of bugs that could eventually occur from this? Similar nodegroups are exclusively used for balancing, and by default balancing is not enabled and similar nodegroups are not used at all, see here. A user would have to opt in to have similar nodegroups be considered at all, and then opt in again to skip similar nodegroup recomputation, so from my point of view the blast radius seems very small.

I get the motivation, but why can't the gRPC provider just implement its own NodeGroupSetProcessor (or customize the current implementation), and let the user hook into its FindSimilarNodeGroups()? Wouldn't this achieve the same thing without changing component responsibilities?

We aren't using the gRPC provider, we use the regular cloud providers, for example the gce or aws providers. But is your general idea to have an additional processor that a user could opt in to, which would then make a gRPC call to get the similar nodegroups for a given nodegroup? This makes sense in theory, but leads to a lot of extra network requests when in reality all we need to know is the best option's similar nodegroups. If I have 1000 nodegroups in my cluster, for example, I'd now need to make 1000 gRPC calls (one FindSimilarNodeGroups call per nodegroup) for each scaleup request.

Yea if #5802 was never merged, then we wouldn't need this PR. The recalculation is what is preventing the expander from having full power over the similar nodegroups, but I don't think this was the intent of the original PR

I synced with @BigDarkClown offline, and it was actually the intent of that PR. We had a bug in our GKE-specific Expander that resulted in clearing the SimilarNodeGroups field. Since Expanders aren't supposed to do that, the recalculation was added as a bit of defensive programming to guard against similar bugs in Expander implementations (in addition to fixing the Expander implementation itself of course).

Yea so my argument to that would be that #5802 was a breaking change and I see value in the original behaviour. I think CA should allow users who know what they are doing to remove the defensive programming and go back to the original CA behaviour pre #5802 and allow the expander to have control over the similar nodegroups. Since ultimately this was a bug in the expander implementation and not CA.

Right now the purpose of the expander is to choose the best option, but my proposal is to allow the expander to have a say on what the best option's similar nodegroups are too, since the similar nodegroups could play a part in choosing the best option. And the best option might not actually be the best due to its similar nodegroups, so it may need some extra filtering to remove certain similar nodegroups.

Happy to discuss more over a call or at the next SIG Autoscaling meeting. We are using this change alongside #6941 in our fork and it has been working well for us and would like others to benefit from it too

@x13n (Member) commented Feb 7, 2025

The discussion in #6940 seems to be going in a quite different direction, so closing this PR for now. Feel free to reopen if you want to revisit this approach.

/close

@k8s-ci-robot (Contributor):

@x13n: Closed this PR.

In response to this:

The discussion in #6940 seems to be going in a quite different direction, so closing this PR for now. Feel free to reopen if you want to revisit this approach.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@rrangith (Contributor Author) commented Feb 7, 2025

/reopen

@x13n as I said in #6940 (comment)

We are currently using #6926 and #6941 in our fork and it is working perfectly fine, but I'd be interested in seeing the alternative way

I am still advocating for this PR as it is working for us. Can discuss more at the next SIG autoscaling.

@k8s-ci-robot k8s-ci-robot reopened this Feb 7, 2025
@k8s-ci-robot (Contributor):

@rrangith: Reopened this PR.

In response to this:

/reopen

@x13n as I said in #6940 (comment)

We are currently using #6926 and #6941 in our fork and it is working perfectly fine, but I'd be interested in seeing the alternative way

I am still advocating for this PR as it is working for us. Can discuss more at the next SIG autoscaling.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: rrangith
Once this PR has been reviewed and has the lgtm label, please ask for approval from towca. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@x13n (Member) commented Feb 7, 2025

Ack, apologies for premature closing then! I can't join SIG meeting on Monday, will be following the discussion on the PR & issue though.

@towca (Collaborator) commented Feb 10, 2025

You certainly can have any custom logic behind the RPC, but if that logic breaks some assumptions that CA makes about Expanders, it'll lead to bugs. Balancing is one example, but I'm not convinced that there aren't more, or that we don't introduce more down the road. Especially because all of the custom logic is out-of-tree, so it's hard to validate if it breaks while doing changes in-tree.

I am curious about other types of bugs that could eventually occur from this? Similar nodegroups are exclusively used for balancing, and by default balancing is not enabled and similar nodegroups are not used at all, see here. A user would have to opt in to have similar nodegroups be considered at all, and then opt in again to skip similar nodegroup recomputation, so from my point of view the blast radius seems very small.

I would really want to avoid adding a separate config option for this. It doesn't feel like something that should be configurable, the flag would essentially control a very low-level implementation detail.

I'm less worried that this PR changes behavior for existing clusters. I'm mostly worried that future changes to the balancing and adjacent logic will break your use case at some point.

I get the motivation, but why can't the gRPC provider just implement its own NodeGroupSetProcessor (or customize the current implementation), and let the user hook into its FindSimilarNodeGroups()? Wouldn't this achieve the same thing without changing component responsibilities?

We aren't using the gRPC provider, we use the regular cloud providers, for example the gce or aws providers. But is your general idea to have an additional processor that a user could opt in to, which would then make a gRPC call to get the similar nodegroups for a given nodegroup? This makes sense in theory, but leads to a lot of extra network requests when in reality all we need to know is the best option's similar nodegroups. If I have 1000 nodegroups in my cluster, for example, I'd now need to make 1000 gRPC calls (one FindSimilarNodeGroups call per nodegroup) for each scaleup request.

Ah, so it's just the Expander part that's behind an RPC? That makes sense, thanks for the clarification. But yeah, I'm mostly advocating for keeping component responsibilities tightly-scoped.

I know it's more work, but we could extend the NodeGroupSetProcessor to have a separate call for the best option. Or maybe we could implement caching on the Expander side?
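The caching idea could look roughly like this. `cachedSimilarFinder` and its fields are hypothetical, intended only to show how memoization would bound the number of remote calls within one scale-up loop:

```go
package main

import "fmt"

// cachedSimilarFinder memoizes similar-node-group lookups so that repeated
// queries for the same node group do not each cost a gRPC round trip.
type cachedSimilarFinder struct {
	calls int                 // counts simulated remote calls
	cache map[string][]string // node group -> similar node groups
	fetch func(nodeGroup string) []string
}

// FindSimilarNodeGroups returns the cached result if present, otherwise
// performs (and counts) one simulated remote fetch.
func (c *cachedSimilarFinder) FindSimilarNodeGroups(ng string) []string {
	if similar, ok := c.cache[ng]; ok {
		return similar
	}
	c.calls++
	similar := c.fetch(ng)
	c.cache[ng] = similar
	return similar
}

func main() {
	finder := &cachedSimilarFinder{
		cache: map[string][]string{},
		fetch: func(ng string) []string { return []string{ng + "-sibling"} },
	}
	// 1000 lookups for the same best-option node group hit the remote
	// endpoint only once.
	for i := 0; i < 1000; i++ {
		finder.FindSimilarNodeGroups("best-option")
	}
	fmt.Println("remote calls made:", finder.calls) // 1
}
```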

Yea if #5802 was never merged, then we wouldn't need this PR. The recalculation is what is preventing the expander from having full power over the similar nodegroups, but I don't think this was the intent of the original PR

I synced with @BigDarkClown offline, and it was actually the intent of that PR. We had a bug in our GKE-specific Expander that resulted in clearing the SimilarNodeGroups field. Since Expanders aren't supposed to do that, the recalculation was added as a bit of defensive programming to guard against similar bugs in Expander implementations (in addition to fixing the Expander implementation itself of course).

Yea so my argument to that would be that #5802 was a breaking change and I see value in the original behaviour. I think CA should allow users who know what they are doing to remove the defensive programming and go back to the original CA behaviour pre #5802 and allow the expander to have control over the similar nodegroups. Since ultimately this was a bug in the expander implementation and not CA.

Right now the purpose of the expander is to choose the best option, but my proposal is to allow the expander to have a say on what the best option's similar nodegroups are too, since the similar nodegroups could play a part in choosing the best option. And the best option might not actually be the best due to its similar nodegroups, so it may need some extra filtering to remove certain similar nodegroups.

Happy to discuss more over a call or at the next SIG Autoscaling meeting. We are using this change alongside #6941 in our fork and it has been working well for us and would like others to benefit from it too

Similar node groups definitely can play a role in choosing the best option, that's why they're passed to Expander at all. But I don't see a good case for changing the similar node groups while trying to choose the best one.

Extending Expander semantics is certainly an option we could go with, but then:

  1. We'd have to document them clearly.
  2. How would we want the semantics to work? We can't just say "Expander can modify the passed Options arbitrarily", because the scale-up execution logic depends on their contents.
  3. If we say "Expander can modify only the similar node groups field in the passed Options", then we're back to "we already have a component that's supposed to do that".

Happy to discuss further during the meeting!
