Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vpa-recommender: Add support for configuring global max allowed resources #7560

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

ialidzhikov
Copy link
Contributor

@ialidzhikov ialidzhikov commented Dec 4, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

See #7147

Which issue(s) this PR fixes:

Fixes #7147

Special notes for your reviewer:

N/A

Does this PR introduce a user-facing change?

vpa-recommender now supports two new flags - `--container-recommendation-max-allowed-cpu` and `--container-recommendation-max-allowed-memory`. The flags make possible to configure the global max allowed resources the vpa-recommender can recommend for a container. The flags aim to address a known limitation of VPA (vpa-recommender, in particular) that it is not aware of the cluster's capacity and can make Pod unschedulable after applying a recommendation that cannot fit in the cluster (recommendation higher than largest Node's allocatable).

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Dec 4, 2024
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Dec 4, 2024
@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch 2 times, most recently from 9967539 to 0a79197 Compare December 4, 2024 14:02
@adrianmoisey
Copy link
Member

Sorry, but I've just gotten a PR merged (#7548) that will merge conflict with yours.
It's only documentation related.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024
@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 0a79197 to af3a158 Compare December 4, 2024 14:29
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Dec 4, 2024
ialidzhikov added a commit to ialidzhikov/gardener that referenced this pull request Dec 4, 2024
Comment on lines 224 to 229
var globalMaxAllowed apiv1.ResourceList
if !maxAllowedCPU.Quantity.IsZero() {
setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceCPU, maxAllowedCPU.Quantity)
}
if !maxAllowedMemory.Quantity.IsZero() {
setGlobalMaxAllowed(&globalMaxAllowed, apiv1.ResourceMemory, maxAllowedMemory.Quantity)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should there be some logic to ensure that the global max is greater than the global min?
It may be a bit strange since one is per container and the other per pod

Copy link
Contributor Author

@ialidzhikov ialidzhikov Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In a perfect world, I agree that validation should exists.
However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

  • they are on Pod-level, and not on container-level. The recommendation is on container-level. Right now, the recommender splits the Pod-level min allowed values to the number of containers in a Pod.
  • they are implemented as ResourceEstimator. This causes the VPA .status.uncappedTarget to be capped as well which is a contradiction of the definition of this field. This also causes the global min-allowed flags to overwrite the VPA minAllowed field (if specified), there is also no merge between the global Pod-level min allowed flags and the VPA's minAllowed field.

If you agree, I can open a dedicated issue for deprecated the global Pod-level min allowed flags and introduce new container-level min allowed equivalents. And validation can be added between the global container-level min and max allowed flags. WDYT?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However, I am not sure how to add such validation when the global min allowed flags are on Pod-level and global max allowed flags are on container-level.

Yeah, I agree. I can't figure out a way to make the validation work.

In #7147 (comment) I suggested to deprecate the global Pod-level min allowed flags because:

I don't think deprecation is necessary yet. Pod level resources may be in the VPA's future, so those flags may be used for Pod level resources.

@@ -108,3 +109,16 @@ These options cannot be used together and are mutually exclusive.
It is possible to set the failurePolicy of the webhook to `Fail` by passing `--webhook-failure-policy-fail=true` to the VPA admission controller.
Please use this option with caution as it may be possible to break Pod creation if there is a failure with the VPA.
Using it in conjunction with `--ignored-vpa-object-namespaces=kube-system` or `--vpa-object-namespace` to reduce risk.

### Specifying global maximum allowed resources to prevent pods from being unschedulable
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about moving this section move to the "features.md" page?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many sections in examples.md which actually describe features of VPA (Starting multiple recommenders, Custom memory bump-up after OOMKill, Using CPU management with static policy, Controlling eviction behavior based on scaling direction and resource, etc.). I don't think the newly introduced section is different from the existing section.
IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, examples and features are overlapping conceptually. If you describe a feature, you usually also add example(s) of how the feature can be used. examples.md and features.md should be merged IMO. This is out-of-scope of the existing PR.

I agree with you here. The features page is new, and I want to start moving the examples across to the features page, with more description about that feature.

I thought it would be nice for the "global max" feature to be added to features from the start, since it's a better fit there, and will require work to move at a later stage.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(but it's up to you though, the documentation needs a lot of work in general)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would prefer to keep it consistent with the existing doc. In another PR all the examples doc can be reworked as features sections.

@adrianmoisey
Copy link
Member

Unsure if you want to do this, but an e2e test for the feature would be great!
If it's not something you want do, I'm happy to try figure it out in a different PR

@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 12f42ee to 04e3340 Compare December 13, 2024 12:33
@ialidzhikov
Copy link
Contributor Author

ialidzhikov commented Dec 13, 2024

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options.
The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

@ialidzhikov
Copy link
Contributor Author

ERROR: (gcloud.compute.networks.create) Could not fetch resource:
 - <!DOCTYPE html>
<html lang=en>
  <meta charset=utf-8>
  <meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
  <title>Error 502 (Server Error)!!1</title>
  <style>
    *{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
  </style>
  <a href=//www.google.com/><span id=logo aria-label=Google></span></a>
  <p><b>502.</b> <ins>That’s an error.</ins>
  <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That’s all we know.</ins>

/test pull-kubernetes-e2e-autoscaling-vpa-full

@adrianmoisey
Copy link
Member

Unsure if you want to do this, but an e2e test for the feature would be great!

Do you have a suggestion for the e2e test? It seems that we deploy vpa components to a kind cluster and then create/update/delete VPA objects and finally assert their state. It is relatively easy to test a field in the VPA spec. But I don't find example of e2e tests for such global flags/options. The only option I see is to deploy VPA initially with the global max allowed flags. I am not sure if this is desired or not. It would also affect other recommender tests. Or do you suggest to patch the vpa-recommender deployment as part of the e2e test?

It seems that at the moment we don't have a similar test to base yours off, so I think one would need to be written to deploy the recommender with a global max, and ensure that it doesn't go above that max.

It's going to be annoying to write such a test, I think, but I think the test will help us in the long run.

What do you think?

@adrianmoisey
Copy link
Member

I'm going to give this a
/lgtm

I like the idea of an e2e test, but will let an approver decide if it's needed or not

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 15, 2024
Copy link
Contributor

@raywainman raywainman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this!! Sorry for the delay!

vertical-pod-autoscaler/pkg/recommender/main.go Outdated Show resolved Hide resolved
vertical-pod-autoscaler/pkg/utils/vpa/capping.go Outdated Show resolved Hide resolved
} else {
// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.
for resourceName, quantity := range globalMaxAllowed {
if _, ok := maxAllowed[resourceName]; !ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a comment here that we only override this if the user did not explicitly set a maximum in their container policy in the VPA.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already

// Set resources from the global maxAllowed if the VPA maxAllowed is missing them.
which is more or less stating the same thing.
Let me know if you have suggestions for improvements.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Dec 18, 2024
@ialidzhikov
Copy link
Contributor Author

Hi @raywainman ,

I addressed your review feedback in caa47f9.

@adrianmoisey
Copy link
Member

/lgtm

@adrianmoisey
Copy link
Member

/milestone vertical-pod-autoscaler-1.4.0

Copy link
Contributor

@raywainman raywainman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sincerely apologize for the long lag time here, please take a look at one suggestion below. I think otherwise this is good to go :)

Thanks for all of this!!

cappedToMin, _ := maybeCapToPolicyMin(recommended, resourceName, containerPolicy)
recommendation[resourceName] = cappedToMin
cappedToMax, _ := maybeCapToPolicyMax(cappedToMin, resourceName, containerPolicy)
var maxAllowed apiv1.ResourceList
Copy link
Contributor

@raywainman raywainman Jan 31, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we compute maxAllowed outside the for-loop? I think that simplifies the loop a little bit...

var minAllowed apiv1.ResourceList
if containerPolicy != nil {
  minAllowed = containerPolicy.MinAllowed
}

// Container Policy always takes precedence over the global max allowed value.
var maxAllowed apiv1.ResourceList
if containerPolicy != nil {
	maxAllowed = containerPolicy.MaxAllowed
} else if globalMaxAllowed != nil {
	maxAllowed = globalMaxAllowed
}
...

for resourceName, recommended := range recommendation {
	if minAllowed != nil {
		cappedToMin, _ := maybeCapToMin(recommendation[resourceName], resourceName, minAllowed)
		recommendation[resourceName] = cappedToMin
	}
	if maxAllowed != nil {
		cappedToMax, _ := maybeCapToMax(recommendation[resourceName], resourceName, maxAllowed)
		recommendation[resourceName] = cappedToMax
	}
    }
}

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @raywainman ,

In general, pretty valid comment. It totally makes sense to compute the minAllowed and maxAllowed values outside of the for loop.

In your suggestion, there is no merge between the containerPolicy max allowed and the global max allowed.

The equivalent suggestion with the proper merge would be:

var minAllowed apiv1.ResourceList
if containerPolicy != nil {
	minAllowed = containerPolicy.MinAllowed
}

var maxAllowed apiv1.ResourceList
if containerPolicy != nil {
	maxAllowed = containerPolicy.MaxAllowed
}
if maxAllowed == nil {
	maxAllowed = globalMaxAllowed
} else {
	// Set resources from the global max allowed if the VPA max allowed is missing them.
	// Container policy max allowed always takes precedence over the global max allowed value.
	for resourceName, quantity := range globalMaxAllowed {
		if _, ok := maxAllowed[resourceName]; !ok {
			maxAllowed[resourceName] = quantity
		}
	}
}

process := func(recommendation apiv1.ResourceList) {
	for resourceName, _ := range recommendation {
		if minAllowed != nil {
			cappedToMin, _ := maybeCapToMin(recommendation[resourceName], resourceName, minAllowed)
			recommendation[resourceName] = cappedToMin
		}
		if maxAllowed != nil {
			cappedToMax, _ := maybeCapToMax(recommendation[resourceName], resourceName, maxAllowed)
			recommendation[resourceName] = cappedToMax
		}
	}
}

Tomorrow, I will perform a final test and if everything is fine, I will push a commit that incorporates the feedback.
Thank you very much!

Very nice suggestion, I like it!

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After a call to maxAllowed = globalMaxAllowed if maxAllowed is modified, also the globalMaxAllowed will be modified (currently this can't happen due to the if checks, but who knows for the future 😅). Similarly, a call to maxAllowed = containerPolicy.MaxAllowed could make it so that modifying maxAllowed will modify the .containerPoliciy.maxAllowed field in the vpa object as containerPolicy is returned as a pointer from GetContainerResourcePolicy(containerName, policy) (note that I did not have time to check the entire chain of pointers here).

So maybe, to be safe, you could use something like this:

var maxAllowed = make(apiv1.ResourceList)
// First we copy over any values from the globalMaxAllowed
maps.Copy(maxAllowed, globalMaxAllowed)
if containerPolicy != nil {
        // Here we copy values from the containerPolicy.MaxAllowed, overwriting any keys copied from the globalMaxAllowed.
	maps.Copy(maxAllowed, containerPolicy.MaxAllowed)
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@raywainman , I addressed your suggestion in 055affa.

Copy link
Contributor Author

@ialidzhikov ialidzhikov Feb 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plkokanov , I think you are right. In

var maxAllowed apiv1.ResourceList
if containerPolicy != nil {
maxAllowed = containerPolicy.MaxAllowed
}
I assign the maxAllowed var to containerPolicy.MaxAllowed. Later on, in
// Set resources from the global max allowed if the VPA max allowed is missing them.
for resourceName, quantity := range globalMaxAllowed {
if _, ok := maxAllowed[resourceName]; !ok {
maxAllowed[resourceName] = quantity
}
}
, I modify the maxAllowed var with content from the globalMaxAllowed, if needed.

Let me double check this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A snippet that demonstrates the problem that @plkokanov discovered: https://go.dev/play/p/TGgl4FA8-5w

Example output:

maxAllowed=map[cpu:200m memory:300Mi], 0xc0000a0120
containerPolicy.MaxAllowed=map[cpu:200m memory:300Mi], 0xc0000a0120

containerPolicy.MaxAllowed is modified in unexpected manner.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@plkokanov , I addressed your suggestion in bab77cd.

@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from caa47f9 to b7b30d4 Compare February 10, 2025 15:10
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 10, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: ialidzhikov
Once this PR has been reviewed and has the lgtm label, please assign kwiesmueller for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ialidzhikov
Copy link
Contributor Author

ialidzhikov commented Feb 10, 2025

With the latest force push, I only rebased the PR. There were no conflicts but I did a rebase because the PR was not rebased since longtime.

@ialidzhikov ialidzhikov force-pushed the enh/global-max-allowed-flags-post-processor branch from 7d93b76 to 055affa Compare February 11, 2025 14:33
@adrianmoisey
Copy link
Member

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Feb 13, 2025
Copy link

@plkokanov plkokanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot
Copy link
Contributor

@plkokanov: changing LGTM is restricted to collaborators

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/vertical-pod-autoscaler cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

vpa-recommender: Make max allowed recommendation configurable
5 participants