vpa-recommender: Add support for configuring global max allowed resources #7560

ialidzhikov · 2024-12-04T13:48:22Z

What type of PR is this?

/kind feature

What this PR does / why we need it:

See #7147

Which issue(s) this PR fixes:

Fixes #7147

Special notes for your reviewer:

N/A

Does this PR introduce a user-facing change?

vpa-recommender now supports two new flags - `--container-recommendation-max-allowed-cpu` and `--container-recommendation-max-allowed-memory`. The flags make possible to configure the global max allowed resources the vpa-recommender can recommend for a container. The flags aim to address a known limitation of VPA (vpa-recommender, in particular) that it is not aware of the cluster's capacity and can make Pod unschedulable after applying a recommendation that cannot fit in the cluster (recommendation higher than largest Node's allocatable).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

adrianmoisey · 2024-12-04T14:18:05Z

Sorry, but I've just gotten a PR merged (#7548) that will merge conflict with yours.
It's only documentation related.

vertical-pod-autoscaler/pkg/recommender/main.go

…14e06e613e531f7dd This PR adopts the changes from kubernetes/autoscaler#7560.

vertical-pod-autoscaler/pkg/utils/vpa/capping.go

adrianmoisey · 2024-12-08T12:19:03Z

vertical-pod-autoscaler/pkg/recommender/main.go

+	var globalMaxAllowed apiv1.ResourceList
+	if !maxAllowedCPU.Quantity.IsZero() {
+		setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceCPU, maxAllowedCPU.Quantity)
+	}
+	if !maxAllowedMemory.Quantity.IsZero() {
+		setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceMemory, maxAllowedMemory.Quantity)


Should there be some logic to ensure that the global max is greater than the global min?
It may be a bit strange since one is per container and the other per pod

In a perfect world, I agree that validation should exists.
However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

they are on Pod-level, and not on container-level. The recommendation is on container-level. Right now, the recommender splits the Pod-level min allowed values to the number of containers in a Pod.

they are implemented as ResourceEstimator. This causes the VPA .status.uncappedTarget to be capped as well which is a contradiction of the definition of this field. This also causes the global min-allowed flags to overwrite the VPA minAllowed field (if specified), there is also no merge between the global Pod-level min allowed flags and the VPA's minAllowed field.

If you agree, I can open a dedicated issue for deprecated the global Pod-level min allowed flags and introduce new container-level min allowed equivalents. And validation can be added between the global container-level min and max allowed flags. WDYT?

However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

Yeah, I agree. I can't figure out a way to make the validation work.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

I don't think deprecation is necessary yet. Pod level resources may be in the VPA's future, so those flags may be used for Pod level resources.

adrianmoisey · 2024-12-08T12:23:41Z

vertical-pod-autoscaler/docs/examples.md

 Please use this option with caution as it may be possible to break Pod creation if there is a failure with the VPA.
 Using it in conjunction with `--ignored-vpa-object-namespaces=kube-system` or `--vpa-object-namespace` to reduce risk.
+
+### Specifying global maximum allowed resources to prevent pods from being unschedulable


What do you think about moving this section move to the "features.md" page?

There are many sections in examples.md which actually describe features of VPA (Starting multiple recommenders, Custom memory bump-up after OOMKill, Using CPU management with static policy, Controlling eviction behavior based on scaling direction and resource, etc.). I don't think the newly introduced section is different from the existing section.
IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

I agree with you here. The features page is new, and I want to start moving the examples across to the features page, with more description about that feature.

I thought it would be nice for the "global max" feature to be added to features from the start, since it's a better fit there, and will require work to move at a later stage.

(but it's up to you though, the documentation needs a lot of work in general)

I would prefer to keep it consistent with the existing doc. In another PR all the examples doc can be reworked as features sections.

adrianmoisey · 2024-12-08T12:25:35Z

Unsure if you want to do this, but an e2e test for the feature would be great!
If it's not something you want do, I'm happy to try figure it out in a different PR

ialidzhikov · 2024-12-13T14:15:42Z

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options.
The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

ialidzhikov · 2024-12-13T14:16:51Z

ERROR: (gcloud.compute.networks.create) Could not fetch resource:
 - <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>

/test pull-kubernetes-e2e-autoscaling-vpa-full

adrianmoisey · 2024-12-15T12:53:45Z

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options. The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

It seems that at the moment we don't have a similar test to base yours off, so I think one would need to be written to deploy the recommender with a global max, and ensure that it doesn't go above that max.

It's going to be annoying to write such a test, I think, but I think the test will help us in the long run.

What do you think?

adrianmoisey · 2024-12-15T12:54:33Z

I'm going to give this a
/lgtm

I like the idea of an e2e test, but will let an approver decide if it's needed or not

raywainman

Thanks for this!! Sorry for the delay!

vertical-pod-autoscaler/pkg/recommender/main.go

vertical-pod-autoscaler/pkg/utils/vpa/capping.go

raywainman · 2024-12-16T18:44:17Z

vertical-pod-autoscaler/pkg/utils/vpa/capping.go

+				} else {
+					// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.
+					for resourceName, quantity := range globalMaxAllowed {
+						if _, ok := maxAllowed[resourceName]; !ok {


Let's add a comment here that we only override this if the user did not explicitly set a maximum in their container policy in the VPA.

There is already

autoscaler/vertical-pod-autoscaler/pkg/utils/vpa/capping.go

Line 210 in 04e3340

// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.

which is more or less stating the same thing.
Let me know if you have suggestions for improvements.

vertical-pod-autoscaler/pkg/utils/vpa/capping_test.go

raywainman

I sincerely apologize for the long lag time here, please take a look at one suggestion below. I think otherwise this is good to go :)

Thanks for all of this!!

raywainman · 2025-01-31T19:14:01Z

vertical-pod-autoscaler/pkg/utils/vpa/capping.go

-			cappedToMin, _ := maybeCapToPolicyMin(recommended, resourceName, containerPolicy)
-			recommendation[resourceName] = cappedToMin
-			cappedToMax, _ := maybeCapToPolicyMax(cappedToMin, resourceName, containerPolicy)
+			var maxAllowed apiv1.ResourceList


Can we compute maxAllowed outside the for-loop? I think that simplifies the loop a little bit...

var minAllowed apiv1.ResourceList if containerPolicy != nil { minAllowed = containerPolicy.MinAllowed } // Container Policy always takes precedence over the global max allowed value. var maxAllowed apiv1.ResourceList if containerPolicy != nil { maxAllowed = containerPolicy.MaxAllowed } else if globalMaxAllowed != nil { maxAllowed = globalMaxAllowed } ... for resourceName, recommended := range recommendation { if minAllowed != nil { cappedToMin, _ := maybeCapToMin(recommendation[resourceName], resourceName, minAllowed) recommendation[resourceName] = cappedToMin } if maxAllowed != nil { cappedToMax, _ := maybeCapToMax(recommendation[resourceName], resourceName, maxAllowed) recommendation[resourceName] = cappedToMax } } }

What do you think?

Hi @raywainman ,

In general, pretty valid comment. It totally makes sense to compute the minAllowed and maxAllowed values outside of the for loop.

In your suggestion, there is no merge between the containerPolicy max allowed and the global max allowed.

The equivalent suggestion with the proper merge would be:

var minAllowed apiv1.ResourceList if containerPolicy != nil { minAllowed = containerPolicy.MinAllowed } var maxAllowed apiv1.ResourceList if containerPolicy != nil { maxAllowed = containerPolicy.MaxAllowed } if maxAllowed == nil { maxAllowed = globalMaxAllowed } else { // Set resources from the global max allowed if the VPA max allowed is missing them. // Container policy max allowed always takes precedence over the global max allowed value. for resourceName, quantity := range globalMaxAllowed { if _, ok := maxAllowed[resourceName]; !ok { maxAllowed[resourceName] = quantity } } } process := func(recommendation apiv1.ResourceList) { for resourceName, _ := range recommendation { if minAllowed != nil { cappedToMin, _ := maybeCapToMin(recommendation[resourceName], resourceName, minAllowed) recommendation[resourceName] = cappedToMin } if maxAllowed != nil { cappedToMax, _ := maybeCapToMax(recommendation[resourceName], resourceName, maxAllowed) recommendation[resourceName] = cappedToMax } } }

Tomorrow, I will perform a final test and if everything is fine, I will push a commit that incorporates the feedback.
Thank you very much!

Very nice suggestion, I like it!

After a call to maxAllowed = globalMaxAllowed if maxAllowed is modified, also the globalMaxAllowed will be modified (currently this can't happen due to the if checks, but who knows for the future 😅). Similarly, a call to maxAllowed = containerPolicy.MaxAllowed could make it so that modifying maxAllowed will modify the .containerPoliciy.maxAllowed field in the vpa object as containerPolicy is returned as a pointer from GetContainerResourcePolicy(containerName, policy) (note that I did not have time to check the entire chain of pointers here).

So maybe, to be safe, you could use something like this:

var maxAllowed = make(apiv1.ResourceList) // First we copy over any values from the globalMaxAllowed maps.Copy(maxAllowed, globalMaxAllowed) if containerPolicy != nil { // Here we copy values from the containerPolicy.MaxAllowed, overwriting any keys copied from the globalMaxAllowed. maps.Copy(maxAllowed, containerPolicy.MaxAllowed) }

@raywainman , I addressed your suggestion in 055affa.

@plkokanov , I think you are right. In

autoscaler/vertical-pod-autoscaler/pkg/utils/vpa/capping.go

Lines 201 to 204 in 055affa

var maxAllowed apiv1.ResourceList

if containerPolicy != nil {

maxAllowed = containerPolicy.MaxAllowed

}

I assign the maxAllowed var to containerPolicy.MaxAllowed. Later on, in

autoscaler/vertical-pod-autoscaler/pkg/utils/vpa/capping.go

Lines 208 to 213 in 055affa

// Set resources from the global max allowed if the VPA max allowed is missing them.

for resourceName, quantity := range globalMaxAllowed {

if _, ok := maxAllowed[resourceName]; !ok {

maxAllowed[resourceName] = quantity

}

}

, I modify the maxAllowed var with content from the globalMaxAllowed, if needed.

Let me double check this.

A snippet that demonstrates the problem that @plkokanov discovered: https://go.dev/play/p/TGgl4FA8-5w

Example output:

maxAllowed=map[cpu:200m memory:300Mi], 0xc0000a0120 containerPolicy.MaxAllowed=map[cpu:200m memory:300Mi], 0xc0000a0120

containerPolicy.MaxAllowed is modified in unexpected manner.

@plkokanov , I addressed your suggestion in bab77cd.

Good call on the deep copy. This LGTM now, thank you!

…rces

…x-allowed-{cpu,memory}`

ialidzhikov · 2025-02-10T15:10:50Z

With the latest force push, I only rebased the PR. There were no conflicts but I did a rebase because the PR was not rebased since longtime.

adrianmoisey · 2025-02-13T04:54:21Z

/lgtm

plkokanov

/lgtm

k8s-ci-robot · 2025-02-14T07:16:44Z

@plkokanov: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

raywainman · 2025-02-14T21:46:22Z

/approve

k8s-ci-robot · 2025-02-14T21:46:32Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ialidzhikov, raywainman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~vertical-pod-autoscaler/OWNERS~~ [raywainman]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 4, 2024

k8s-ci-robot requested a review from jbartosik December 4, 2024 13:48

k8s-ci-robot added the area/vertical-pod-autoscaler label Dec 4, 2024

k8s-ci-robot requested a review from raywainman December 4, 2024 13:48

k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 4, 2024

ialidzhikov mentioned this pull request Dec 4, 2024

vpa-recommender: Make max allowed recommendation configurable #7147

Closed

ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch 2 times, most recently from 9967539 to 0a79197 Compare December 4, 2024 14:02

k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024

ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 0a79197 to af3a158 Compare December 4, 2024 14:29

k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024

adrianmoisey reviewed Dec 4, 2024

View reviewed changes

vertical-pod-autoscaler/pkg/recommender/main.go Outdated Show resolved Hide resolved

ialidzhikov added a commit to ialidzhikov/gardener that referenced this pull request Dec 4, 2024

[drop me] Update VPA components to v1.3.0-dev-af3a158def012d5aab41e61…

b620ce2

…14e06e613e531f7dd This PR adopts the changes from kubernetes/autoscaler#7560.

ialidzhikov requested a review from adrianmoisey December 6, 2024 08:06

adrianmoisey reviewed Dec 8, 2024

View reviewed changes

vertical-pod-autoscaler/pkg/utils/vpa/capping.go Show resolved Hide resolved

adrianmoisey reviewed Dec 8, 2024

View reviewed changes

ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 12f42ee to 04e3340 Compare December 13, 2024 12:33

ialidzhikov requested a review from adrianmoisey December 13, 2024 14:17

k8s-ci-robot assigned adrianmoisey Dec 15, 2024

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2024

raywainman reviewed Dec 16, 2024

View reviewed changes

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2024

k8s-ci-robot added this to the vertical-pod-autoscaler-1.4.0 milestone Jan 14, 2025

raywainman reviewed Jan 31, 2025

View reviewed changes

voelzmo mentioned this pull request Feb 4, 2025

[VPA] Enhancement proposal: Max allowed recommendation #7397

Closed

ialidzhikov added 4 commits February 10, 2025 17:09

vpa-recommender: Add support for configuring global max allowed resou…

2191145

…rces

Rename --max-allowed-{cpu,memory} to `--container-recommendation-ma…

47402af

…x-allowed-{cpu,memory}`

Address review comment from adrianmoisey

3314cf9

Address review feedback from raywainman

b7b30d4

ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from caa47f9 to b7b30d4 Compare February 10, 2025 15:10

k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025

Address review feedback from raywainman (2)

055affa

ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 7d93b76 to 055affa Compare February 11, 2025 14:33

Address review feedback from plkokanov

bab77cd

ialidzhikov requested review from plkokanov and raywainman February 11, 2025 15:23

plkokanov mentioned this pull request Feb 12, 2025

Add minAllowed configuration option to Shoot gardener/gardener#11252

Merged

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025

plkokanov reviewed Feb 14, 2025

View reviewed changes

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 14, 2025

k8s-ci-robot merged commit edfda70 into kubernetes:master Feb 14, 2025
7 checks passed

raywainman mentioned this pull request Feb 25, 2025

AEP-7862: Support CPU Startup Boost in VPA #7863

Merged

toredash mentioned this pull request Mar 24, 2025

flags documentation doesn't point out that flags may not have seen a release yet #7968

Closed

vitanovs mentioned this pull request May 23, 2025

VPA 1.4.0 release gardener/gardener#12027

Closed

ialidzhikov mentioned this pull request Jul 8, 2025

Add VPA maxAllowed field to the Shoot, Seed and Garden resources gardener/gardener#12481

Merged

ialidzhikov deleted the enh/global-max-allowed-flags-post-processor branch July 8, 2025 14:32

ialidzhikov mentioned this pull request Jul 9, 2025

docs: Unify the header for a section #8307

Merged

	var maxAllowed apiv1.ResourceList
	if containerPolicy != nil {
	maxAllowed = containerPolicy.MaxAllowed
	}

	// Set resources from the global max allowed if the VPA max allowed is missing them.
	for resourceName, quantity := range globalMaxAllowed {
	if _, ok := maxAllowed[resourceName]; !ok {
	maxAllowed[resourceName] = quantity
	}
	}

vpa-recommender: Add support for configuring global max allowed resources #7560

vpa-recommender: Add support for configuring global max allowed resources #7560

Uh oh!

Conversation

ialidzhikov commented Dec 4, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

Uh oh!

adrianmoisey commented Dec 4, 2024

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ialidzhikov Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adrianmoisey commented Dec 8, 2024

Uh oh!

ialidzhikov commented Dec 13, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ialidzhikov commented Dec 13, 2024

Uh oh!

adrianmoisey commented Dec 15, 2024

Uh oh!

adrianmoisey commented Dec 15, 2024

Uh oh!

raywainman left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

raywainman left a comment

Choose a reason for hiding this comment

Uh oh!

raywainman Jan 31, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ialidzhikov Feb 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

ialidzhikov commented Dec 4, 2024 •

edited

Loading

ialidzhikov Dec 13, 2024 •

edited

Loading

ialidzhikov commented Dec 13, 2024 •

edited

Loading

raywainman left a comment •

edited

Loading

raywainman Jan 31, 2025 •

edited

Loading

ialidzhikov Feb 11, 2025 •

edited

Loading

ialidzhikov commented Feb 10, 2025 •

edited

Loading