
Conversation

@moko-poi (Contributor) commented Nov 3, 2025

Description

This is a proposal to add support for the standard Kubernetes CapacityBuffer API (autoscaling.x-k8s.io/v1alpha1) to enable pre-provisioned spare capacity in Karpenter clusters.

The RFC introduces a virtual pod approach that integrates buffer capacity into Karpenter's scheduling and consolidation algorithms while maintaining compatibility with the existing Cluster Autoscaler Buffer API.
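
As a concrete illustration of the virtual pod approach, here is a sketch of the buffer-to-pod translation. It assumes details the RFC has not settled: the function name, the label key, and the single-container shape are all hypothetical.

```go
package buffer

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// makeVirtualPod (hypothetical) builds an in-memory pod whose requests
// represent one unit of a CapacityBuffer. The pod is never created in
// the API server; it is only fed to Karpenter's scheduling simulation
// so that capacity is provisioned, and kept, for it.
func makeVirtualPod(bufferName string, cpu, memory resource.Quantity) *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name: fmt.Sprintf("virtual-%s", bufferName),
			// Hypothetical label so buffer-derived pods are recognizable.
			Labels: map[string]string{"autoscaling.x-k8s.io/buffer": bufferName},
		},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{{
				Name: "capacity-reservation",
				Resources: corev1.ResourceRequirements{
					Requests: corev1.ResourceList{
						corev1.ResourceCPU:    cpu,
						corev1.ResourceMemory: memory,
					},
				},
			}},
		},
	}
}
```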

Related issue: #2571

How was this change tested?

RFC only - implementation will follow in subsequent PRs

Key Features

  • Standard CapacityBuffer CRD support
  • Virtual pod generation for buffer capacity
  • Integration with Karpenter's NodeClaim-based architecture
  • Consolidation protection for buffer capacity (see the note after this list)
  • Cross-autoscaler compatibility (Cluster Autoscaler → Karpenter migration)
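
On the consolidation-protection item above, one plausible mechanism, offered only as an assumption and continuing the sketch earlier in this description, is to tag buffer-derived virtual pods with Karpenter's existing do-not-disrupt annotation so consolidation treats their capacity as non-evictable:

```go
// protectFromConsolidation marks a virtual pod with Karpenter's
// documented karpenter.sh/do-not-disrupt annotation. Using it for
// buffer capacity is a hypothetical design choice, not something the
// RFC has settled.
func protectFromConsolidation(pod *corev1.Pod) {
	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	pod.Annotations["karpenter.sh/do-not-disrupt"] = "true"
}
```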

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot (Contributor) commented

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: moko-poi
Once this PR has been reviewed and has the lgtm label, please assign njtran for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the cncf-cla: yes (indicates the PR's author has signed the CNCF CLA) and needs-ok-to-test (an org member must verify it is safe to test) labels on Nov 3, 2025
@k8s-ci-robot (Contributor) commented

Hi @moko-poi. Thanks for your PR.

I'm waiting for a github.com member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot added the size/XL (denotes a PR that changes 500-999 lines, ignoring generated files) label on Nov 3, 2025
@coveralls commented

Pull Request Test Coverage Report for Build 19022966184

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.09%) to 81.706%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| pkg/controllers/node/termination/controller.go | 2 | 77.14% |

Totals:
- Change from base Build 18947767733: +0.09%
- Covered Lines: 11581
- Relevant Lines: 14174

💛 - Coveralls

@@ -0,0 +1,594 @@
# Capacity Buffer API Support
Contributor commented:

Thanks for the design! A couple of high-level notes:

  • I don't think the CapacityBuffer API exists yet. The sig-autoscaling RFC was merged, but I don't think the API itself has been released. That doesn't mean we can't think ahead, but implementation will have to wait until the API exists.
  • Overall I think the doc is a bit messy. I think it would be a stronger proposal if you started from the CX and derived the implementation from that. Similarly, I think the implementation section could be much stronger if it started from the requirements of the existing controllers and worked backwards to the APIs that the capacity buffer controller should provide.

- Using pause containers with resource requests to reserve capacity
- Over-provisioning through static NodePools

The Kubernetes SIG Autoscaling has standardized a CapacityBuffer API to declare spare capacity/headroom in clusters. Cluster Autoscaler supports this API (autoscaling.x-k8s.io/v1alpha1), providing a vendor-agnostic way to express capacity requirements.
Contributor commented:

I'm not sure CAS has support for that API yet

1. **Performance-critical applications** where just-in-time provisioning latency is unacceptable
2. **Burst workloads** that need immediate scheduling for CI/CD, batch jobs, or event-driven applications
3. **High-availability services** that require buffer capacity to handle traffic spikes or node failures
4. **Consistent user experience** across different autoscaling solutions in the Kubernetes ecosystem
Contributor commented:

I don't think we care all that much about consistent UX. In fact, the two autoscaling solutions work very differently. I do think we could say we care about intent-driven configuration, though.


## Proposal

Extend Karpenter to support the standard CapacityBuffer API (autoscaling.x-k8s.io/v1alpha1) by integrating buffer capacity into scheduling and consolidation algorithms.
Contributor commented:

Given that the API is alpha, whatever design we create should include the standard set of alpha protections we use. I don't see where you've discussed feature gating this and the opt-in/opt-out behavior; the RFC should include details on that.
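
For reference, the standard alpha protection pattern in the Kubernetes ecosystem looks roughly like the sketch below. It uses the real k8s.io/component-base/featuregate package, but the gate name CapacityBuffers is hypothetical.

```go
package features

import (
	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/component-base/featuregate"
)

// CapacityBuffers (hypothetical gate name) would guard all buffer handling.
const CapacityBuffers featuregate.Feature = "CapacityBuffers"

// Gates is the process-wide feature gate registry.
var Gates featuregate.MutableFeatureGate = featuregate.NewFeatureGate()

func init() {
	// Alpha features default to off, so buffer support stays opt-in.
	utilruntime.Must(Gates.Add(map[featuregate.Feature]featuregate.FeatureSpec{
		CapacityBuffers: {Default: false, PreRelease: featuregate.Alpha},
	}))
}
```

Controllers would then check Gates.Enabled(CapacityBuffers) before watching or reconciling buffer objects, which gives the opt-in/opt-out behavior the comment asks about.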


Key aspects:
1. **Virtual Pod Approach**: Follow Cluster Autoscaler's pattern using in-memory virtual pods
Contributor commented:

I think we can reduce this to a single goal, something along the lines of 'Karpenter respects configured CapacityBuffers, maintaining additional capacity as if they were pods'. Items 1-3 in this list are implementation details that get us toward that goal.


5. **Graceful Degradation**: If buffer capacity cannot be maintained, prioritize user workloads and log buffer capacity warnings

### API Integration
Contributor commented:

I think this section is repeated


**Revised Protection Strategy**:

1. **NodeClaim-Level Tracking**: Buffer capacity is tracked at the NodeClaim level, not just pod level
Contributor commented:

I don't think this is correct. What are you trying to say with this?

5. **Update buffer status** with translation results
6. **Inject virtual pods** into Karpenter's scheduling pipeline

### Implementation Phases
Contributor commented:

Could this section enumerate what requirements must be met before implementation can begin, and then all of the functionality required for the alpha release of buffer support?
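
For the two quoted steps above (updating buffer status and injecting virtual pods), the controller flow might look roughly like the following sketch. Every type, field, and hook here is hypothetical; the real CapacityBuffer type does not exist yet, so stand-in stubs are used.

```go
package buffer

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// CapacityBuffer stands in for the not-yet-released v1alpha1 type;
// only the pieces this sketch touches are stubbed.
type CapacityBuffer struct {
	Name   string
	Status struct{ VirtualPods int32 }
}

// Provisioner stands in for whatever injection hook Karpenter's
// provisioner would expose to the buffer controller.
type Provisioner interface {
	InjectVirtualPods(ctx context.Context, pods []*corev1.Pod)
}

type Controller struct {
	provisioner Provisioner
	translate   func(*CapacityBuffer) ([]*corev1.Pod, error)
}

// reconcile covers the two quoted steps: update the buffer's status
// with the translation result, then inject the virtual pods into the
// scheduling pipeline. Persisting status via the API server is elided.
func (c *Controller) reconcile(ctx context.Context, buf *CapacityBuffer) error {
	pods, err := c.translate(buf)
	if err != nil {
		return fmt.Errorf("translating buffer %s: %w", buf.Name, err)
	}
	buf.Status.VirtualPods = int32(len(pods))  // step 5: record results
	c.provisioner.InjectVirtualPods(ctx, pods) // step 6: feed the scheduler
	return nil
}
```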

- Memory overhead of virtual pods
- Watch performance with many buffers

## Migration & Compatibility
Contributor commented:

I think this section is missing a feature flag discussion

**A**: Follow Karpenter's provisioning behavior - create suitable NodeClaims through scheduler constraint solving

3. **Q**: How does buffer capacity interact with NodePool limits?
**A**: Buffer NodeClaims must respect NodePool resource limits and budget constraints
Contributor commented:

Again, what is a buffer NodeClaim?
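
Whatever the buffer-backed capacity ends up being called, the constraint in the quoted answer is mechanically checkable. A minimal sketch, assuming a hypothetical helper that compares a buffer's requested resources against a NodePool's limits:

```go
package buffer

import (
	corev1 "k8s.io/api/core/v1"
)

// fitsNodePoolLimits (hypothetical helper) reports whether adding the
// buffer's requested resources to what the NodePool already uses would
// stay within the NodePool's configured limits.
func fitsNodePoolLimits(limits, inUse, buffer corev1.ResourceList) bool {
	for name, limit := range limits {
		used := inUse[name] // copy; Quantity methods mutate in place
		need := buffer[name]
		used.Add(need)
		if used.Cmp(limit) > 0 {
			return false
		}
	}
	return true
}
```

A resource absent from limits is treated as unbounded, which matches how NodePool limits are usually interpreted; whether budgets need a separate check is left open here.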
