
Conversation

@abdelrahman882 abdelrahman882 commented Sep 16, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR completes the basic logic for the Capacity Buffers API, covering scalable objects, resource limits, and fake pod injection.

Which issue(s) this PR fixes:

none

Special notes for your reviewer:

The new parts in this PR:

  • Handle scalable object references in the buffer controller
  • Handle resource limits in the buffer controller
  • Refactor client calls to be more efficient
  • Add a pod list processor that injects fake pods to trigger scale up
  • Add flags for the controller and the fake pod injector to main.go

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

proposal doc: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/proposals/buffers.md

- [AEP]: https://docs.google.com/document/d/1bcct-luMPP51YAeUhVuFV7MIXud5wqHsVBDah9WuKeo/edit?tab=t.0

@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-area area/cluster-autoscaler and removed do-not-merge/needs-area labels Sep 16, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: abdelrahman882
Once this PR has been reviewed and has the lgtm label, please assign feiskyer for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Sep 16, 2025
@abdelrahman882 abdelrahman882 changed the title Add capacity buffers scalable objects, limits and integration logic with cluster autoscaler loop Add capacity buffers scalable objects, limits and fake pods injection Sep 16, 2025
@abdelrahman882 abdelrahman882 force-pushed the capacity-buffer-ca branch 2 times, most recently from e547307 to f9f833b Compare September 16, 2025 03:03
Contributor

@jbtk jbtk left a comment


Have you checked that the eventing processor does not try to send events for fake pods created from a buffer?

client: client,
toProvisionFilter: buffersfilter.NewStatusFilter(map[string]string{
common.ReadyForProvisioningCondition: common.ConditionTrue,
common.ProvisioningCondition: common.ConditionTrue,
Contributor


We should filter out buffers whose generation id does not match the pod template's generation id.

To avoid scaling the cluster down when the buffers simply have not been updated yet, we should keep some kind of pod template cache for a short period of time.

Contributor Author


If we excluded buffers with a stale generation id, we would see scale downs, as you mentioned, until the controller reacts; and if we cached for some period, we would need the cache only until the controller updates the generation.

So my suggestion is to:

  1. Have CA not react to generation changes (for buffers and for pod templates)
  2. The controller will pick those up and filter them to be processed as soon as a loop kicks in
  3. The controller will fix and update the buffer status, and CA will react correctly

I think this way it will be smoother, as CA will most probably have no loop without injection, since the fake pod count will change as soon as the controller updates the buffer status.
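The suggested flow (inject the stale count, let the controller reconcile) can be sketched as follows; the type and function names are illustrative stand-ins, not the actual cluster-autoscaler API:

```go
package main

import "fmt"

// BufferStatus is a minimal stand-in for the buffer status fields
// relevant here: the pod-template generation the status was computed
// against, and the replica count to inject.
type BufferStatus struct {
	PodTemplateGeneration int64
	Replicas              int32
}

// replicasToInject returns the replica count to inject for a buffer.
// Intentionally, there is no comparison of status.PodTemplateGeneration
// against liveTemplateGeneration (step 1 of the suggestion above): even
// a stale count is injected, the buffer controller reconciles the
// status shortly after, and the next loop injects the corrected count.
func replicasToInject(status BufferStatus, liveTemplateGeneration int64) int32 {
	return status.Replicas
}

func main() {
	stale := BufferStatus{PodTemplateGeneration: 3, Replicas: 5}
	// Live template is at generation 4, yet the stale count is used,
	// so no scale down happens while the controller catches up.
	fmt.Println(replicasToInject(stale, 4)) // 5
}
```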

Contributor


What do you mean by "CA not reacting on generation change"? What if the autoscaler starts and these do not match from the very start?

Contributor Author

@abdelrahman882 abdelrahman882 Sep 17, 2025


What do you mean by "CA not reacting on generation change"?

By CA I meant the fake pod injector, and by not reacting I mean it just injects the stale number of replicas.

What if the autoscaler starts and these do not match from the start of cluster autoscaler?

If the autoscaler starts and the generations do not match, we would have stale injected fake pods until the controller fixes that, in ~5s.

samplePod := getPodFromTemplate(samplePodTemplate)

for i := 1; i <= podCount; i++ {
newPod := fake.WithFakePodAnnotation(samplePod)
Contributor


Is it possible to somehow mark a fake pod as originating from a buffer vs, for example, a provisioning request?

Contributor Author


We could add a separate annotation, but we also mark the pods injected for proactive scale up this way, so I believe it's better to do it the same way so that these fake pods are handled like all the others.
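For illustration, the shared marking could look like the sketch below; the annotation key and the helper shape are assumptions for this sketch, not the actual constants used by the cluster-autoscaler fake-pod helpers:

```go
package main

import "fmt"

// Illustrative annotation key; the real key lives in the
// cluster-autoscaler fake-pod helpers.
const fakePodAnnotationKey = "cluster-autoscaler.kubernetes.io/fake-pod"

// Pod is a minimal stand-in for the Kubernetes Pod type, carrying
// only the metadata needed for this sketch.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// withFakePodAnnotation returns a copy of the pod marked as fake, the
// same way pods injected for proactive scale up are marked, so all
// injected fake pods are handled uniformly downstream.
func withFakePodAnnotation(p Pod) Pod {
	annotations := make(map[string]string, len(p.Annotations)+1)
	for k, v := range p.Annotations {
		annotations[k] = v
	}
	annotations[fakePodAnnotationKey] = "true"
	p.Annotations = annotations
	return p
}

func main() {
	sample := Pod{Name: "buffer-sample-pod"}
	fakePod := withFakePodAnnotation(sample)
	fmt.Println(fakePod.Annotations[fakePodAnnotationKey]) // true
	fmt.Println(len(sample.Annotations))                   // 0: original pod untouched
}
```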

Contributor


But in the end it would be nice to emit an event for a buffer if it triggered a scale up in the eventing processor. There we need to differentiate, not only omit these: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/processors/status/eventing_scale_up_processor.go#L39 (and also check which buffer they were generated from)
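A sketch of what the eventing processor could do, assuming fake pods carry a marker annotation that also records the originating buffer (both annotation keys below are illustrative, not the real constants):

```go
package main

import "fmt"

// Illustrative annotation keys; the real markers would be defined by
// the fake-pod helpers in cluster-autoscaler.
const (
	fakePodAnnotationKey      = "cluster-autoscaler.kubernetes.io/fake-pod"
	sourceBufferAnnotationKey = "cluster-autoscaler.kubernetes.io/source-buffer"
)

// Pod is a minimal stand-in for the Kubernetes Pod type.
type Pod struct {
	Name        string
	Annotations map[string]string
}

// splitPodsForEventing separates real unschedulable pods (which get
// TriggeredScaleUp events as today) from buffer-injected fakes, and
// groups the fakes by originating buffer so a per-buffer event could
// be emitted instead of per-pod events.
func splitPodsForEventing(pods []Pod) (real []Pod, byBuffer map[string][]Pod) {
	byBuffer = map[string][]Pod{}
	for _, p := range pods {
		if p.Annotations[fakePodAnnotationKey] == "true" {
			buffer := p.Annotations[sourceBufferAnnotationKey]
			byBuffer[buffer] = append(byBuffer[buffer], p)
			continue
		}
		real = append(real, p)
	}
	return real, byBuffer
}

func main() {
	pods := []Pod{
		{Name: "real-pod"},
		{Name: "fake-1", Annotations: map[string]string{
			fakePodAnnotationKey:      "true",
			sourceBufferAnnotationKey: "buffer-a",
		}},
	}
	real, byBuffer := splitPodsForEventing(pods)
	fmt.Println(len(real), len(byBuffer["buffer-a"])) // 1 1
}
```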

@abdelrahman882 abdelrahman882 force-pushed the capacity-buffer-ca branch 7 times, most recently from c742f9c to 6e7caa0 Compare September 17, 2025 05:55