docs: Node Overlay RFC #2166
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: engedaam. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Pull Request Test Coverage Report for Build 15567714722 — Details
💛 - Coveralls
designs/node-overlay.md
Outdated
- key: karpenter.sh/capacity-type
  operator: In
  values: ["on-demand"]
priceAdjustment: 0.101
I'd avoid floats in APIs. They're notorious for math issues: https://stackoverflow.com/questions/36043597/why-is-1-001-0-001-not-equal-to-1-002
For this reason, you'd be hard pressed to see them in any k8s API.
Agreed, ideally you would keep this in the integer numbers world. Maybe use something like basis point https://en.wikipedia.org/wiki/Basis_point
I'm using an integer value to denote the notion of cents, and that will be used as a signed value to reflect the price adjustment.
Instead of using an integer value, it would make sense to have the floats as a string, since the field will accept both percentage and float values. The conversion would then be done within the controller.
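A rough sketch of what the string-based field discussed above might look like (field names and values are illustrative, not the final API):

```yaml
apiVersion: karpenter.sh/v1alpha1
kind: NodeOverlay
metadata:
  name: on-demand-adjustment
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["on-demand"]
  # A string avoids floats in the API; the controller would parse
  # either a percentage ("-10%") or a decimal value ("0.101").
  priceAdjustment: "-10%"
```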
@@ -0,0 +1,305 @@
# Karpenter - Node Overlay
Excited to see this rev!
We skip over any discussion of why this can't be included in the Node Pool API. I'd be curious to explore those tradeoffs. I wrote the original RFC, but I've forgotten much of the why behind some of these implicit decisions.
Off the top of my head, I think there's a classic simplicity / flexibility tradeoff.
Including in the NodePool API could lead customers to create more NodePools to handle different override values for different shapes of nodes. There's nothing fundamentally wrong with this from an API perspective, though I know our scheduler isn't particularly good at simulating multiple node pools at the same time (with the same weight). This may be a decent impetus to drive that work.
Adding a new API adds nontrivial complexity to determining the factors that go into a scheduling decision. We're going from 2 -> N (overlapping overlays) different CRs that will be factored in.
I'm not suggesting you're going the wrong direction with the current proposal, but I do think it's a foundational question that needs some analysis in the RFC.
Thank you for the thoughtful feedback! I've added a new "Alternative Proposal" section that explores the tradeoffs of integrating this functionality into the NodePool API.
The key consideration that led us away from the NodePool integration was the balance between API simplicity and operational flexibility. While incorporating these features into NodePool might seem simpler initially (one CRD vs. two), it would likely lead to NodePool proliferation as users create separate pools for different override scenarios. This not only increases operational complexity but also puts additional strain on our scheduler, which currently has limitations handling multiple NodePools with similar weights.
I agree that our current proposal of introducing a new CRD does add complexity to the scheduling decision process. However, we believe this complexity is justified by the benefits:
- Better separation of concerns between selecting instance types (NodePool) and modifying instance type properties (NodeOverlay)
- More flexible overlay stacking capabilities
- Easier API evolution path
- Simpler implementation of cross-cutting modifications
I've documented these tradeoffs in detail in the new section, including a sample of how the NodePool API extension might have looked.
Thanks for exploring this @engedaam! A few more thoughts on this.
I'm reminded of this exact discussion when we defined Disruption Budgets. We faced a similar opportunity to define NodeDisruptionBudgets as a separate CRD, which was really attractive given the flexibility it afforded (e.g. limit 1 disruption globally) and given the analogy to PDB. There were similar analogs for how to deal with overlapping NDBs, etc. Ultimately, we decided that the reliability and simplicity benefits outweighed the flexibility.
For 1 (Separation of concerns), I'd suggest that this is not a value prop in and of itself. Separation of concerns is a tool that you use to achieve an outcome -- I'd argue that 2-4 are outcomes of 1.
For 2 (Flexibility), I'd argue that you can achieve the exact same flexibility through multiple overlays within NodePool, e.g. NodePool.spec.overlays (properties? overrides?). Yes, there is a 1MB limit that may impact extreme use cases. Yes, the configuration would need to be duplicated as needed across multiple NodePools, mitigated by client-side templating (e.g. helm, cdk8s, kustomize, kro, etc.) in the same way it's done for configuration duplication across many other Kubernetes APIs.
For 3 (Compatibility), API maturity is a huge value prop, but again I'd suggest that this can be achieved in NodePool. Kubernetes regularly supports alpha/beta fields in core APIs like Pod. I don't think Karpenter has done one of these yet, but in and of itself, I think it would be a win for the project to iron out what that looks like.
For 4 (Simplicity), I think we're at the crux of what makes me uneasy with the separate CRD approach. There is a significant reliability tradeoff to the simplicity that's suggested here.
NodeOverlays introduce cluster-wide blast radius. A user with permission to modify any NodeOverlay gains access to change scheduling calculations for every NodePool. For customers that treat NodePools as a tenancy boundary, this is a big risk vector. Before changing a NodeOverlay, you have to understand all NodePools and all NodeOverlays.
Many customers use NodePools to gradually roll out changes across a cluster. They trial a setting on a specific NodePool, and if things are going as expected, it's rolled out to more NodePools. This is especially important to get right if there's any dependency between the NodeOverlay and the NodePool (e.g. AMI settings).
It's straightforward to imagine cases where cluster-wide NodeOverlays break in surprising ways. Imagine two teams running on two NodePools using different AMI configurations. One team adds huge-pages support to all instance types of a specific family, but the other team's AMIs aren't configured to support it. Imagine an overlay that changes the memory overhead calculations while all teams are using AL2023. If someone wants to try Bottlerocket (different memory overhead), they'd have to go back and carve out support for the new NodePool in the central NodeOverlay. As the number of NodePools and NodeOverlays grows, this problem gets multiplicatively harder.
I suspect that most customers would need to implement (or enforce w/ a policy engine) a best practice of "NodeOverlay requirements should be scoped to a specific NodePool". At this point, wouldn't we be better off colocating this functionality in a single API?
Ultimately, I'm torn. For simple use cases, I like how minimal and expressive the separate CRD approach is. As configurations get more complicated, I worry that customers will rapidly lose the ability to reason about how layered/weighted overlays might apply to many NodePools in a cluster.
I'd love to hear perspectives from the customers and use cases that this RFC is targeting.
e.g.
kind: NodePool
spec:
  template:
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.12xlarge"]
  overlays:
  - capacity:
      smarter-devices/fuse: 1
  - capacity:
      smarter-devices/fuse: 2
    weight: 1
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.12xlarge"]
Thank you for your thoughtful analysis and perspective. While I appreciate the concerns raised about potential complexity and reliability issues with a separate NodeOverlay CRD, I respectfully maintain that keeping this functionality separate from NodePool is the better approach. The flexibility and granular control offered by distinct NodeOverlays outweigh the potential drawbacks, in my view.

It's crucial to recognize that the complexity of overlays extends beyond simple scenarios. There are use cases that require modifying all instance types within a given NodePool, and maintaining this functionality within the NodePool itself would actually increase complexity for users, IMHO. The separate NodeOverlay approach provides a more elegant solution for these broader modifications.

Regarding blast radius and security concerns, these can be mitigated through careful RBAC policies and best-practices documentation. The ability to apply overlays across multiple NodePools without duplicating configuration is a significant advantage, especially for large-scale or multi-team environments. Moreover, keeping NodeOverlays separate allows for more rapid iteration and feature development without impacting the core NodePool API.

It's also important to consider organizations with centralized planning teams who manage infrastructure for multiple application teams. These central teams would benefit from the ability to apply consistent policies across various NodePools without the need to modify each one individually, which the separate NodeOverlay approach facilitates. While your points about gradual rollouts and potential conflicts between teams are valid, I believe these challenges can be addressed through proper planning, communication, and tooling rather than constraining the API design. Ultimately, the separate CRD approach provides a more scalable and adaptable solution for complex Kubernetes environments, particularly for organizations with centralized infrastructure management and those requiring broad instance type modifications.
I agree that additional customer feedback would be instrumental for the graduation of this API, however, I think we need teams to actually work with and implement this API in real-world scenarios to provide us with meaningful direction going forward.
Regarding blast radius and security concerns, these can be mitigated through careful RBAC policies and best practices documentation.
RBAC is not flexible enough to limit the blast radius problem I described. If a NodeOverlay can impact simulation for multiple NodePools, there's nothing RBAC can do. Even a policy engine won't be able to do anything beyond enforcing that NodeOverlays are scoped to NodePools.
Moreover, keeping NodeOverlays separate allows for more rapid iteration and feature development without impacting the core NodePool API.
As I say above, I don't think this is material. You can have new alpha CRDs or new alpha fields in a v1 CRD with similar compatibility guarantees.
I think we can agree that all functional use cases can be met with both approaches. It's really a question of ergonomics vs risk. Simply put, "Is it worth it to deduplicate scheduling configurations if it means increasing the risk of changes." Perhaps we can agree to disagree on this statement.
Luckily, this is an alpha, and Karpenter can always change course after a feedback period.
designs/node-overlay.md
Outdated
### Example

**Price Adjustment**: Define a pricing adjustment for all instance types that match the labels field. Users can set a scalar value or a percentage.
Can you give more examples to make price-adjustment clearer.
I'm guessing the current example means 90% of the OD price
(ie 10% off)?
So a scaler would always be negative and look like -0.1
or something?
👍 you should explicitly state if it is an increase or decrease in price either with a sign (-/+) or a diff field for the sign
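For instance, an explicit sign on the value would make the direction unambiguous (illustrative field name and values only):

```yaml
# A leading "-" means a decrease; "+" (or no sign) means an increase:
priceAdjustment: "-10%"    # 10% off the listed price
# priceAdjustment: "+0.25" # raise the price by $0.25
```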
Yeah, I added a section on how exactly this would reflect and removed the notion of floating point values
The integration of Node Overlay adjustment values and deeper observability mechanisms will be deferred until clear user requirements emerge. This decision is partly driven by concerns that excessive events or logs produced by some controllers could be highly noisy or negatively impact the API server. By deferring these features, we can focus on delivering core functionality and validation tooling first, while leaving room for future enhancements. These enhancements will be based on real-world usage patterns and specific user needs for additional visibility into Node Overlay operations.

## Launch Scope
I think you should also add something on the NodeClaim that calls out the overlay used + the hash of the overlay
the hash of the overlay
What do you mean by the hash of the overlay?
We're currently deferring the implementation of overlay status reporting until we receive more customer feedback to better understand the most valuable approach. While NodeClaim is one potential option, we're also considering exposing overlay information through NodePool status or the NodeOverlay status itself, including details about which instance types had overlays applied. We're particularly mindful of the signal-to-noise ratio, especially if we implement event or log-based tracking. What kind of visibility would be most useful for your use case?
Thanks for tackling this problem with a new revision!
designs/node-overlay.md
Outdated
### Example

**Price Adjustment:** Define a pricing adjustment for instance types that match the specified labels. Users can adjust prices using either:
- A signed integer representing the price adjustment in cents (e.g., 500 for $5.00)
Have we considered using a string representation for currency as well, that might be more natural than the number of cents (at least when using USD).
Can you elaborate more on what you mean here? Using a string representation gets us away from having to serialize and deserialize the floating-point values that would be passed, as the ultimate goal is to focus on using integer values.
cents isn't granular enough. For example, a t4g.small on Spot is currently $0.0026
Hmm, that's a fair point. Instead of using a float, I think we can just use a string and do a conversion within the controller.
sounds good!
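A decimal string would keep sub-cent prices representable, e.g. (illustrative value):

```yaml
# Sub-cent granularity, e.g. matching a $0.0026/hr Spot price,
# is expressible as a plain decimal string:
priceAdjustment: "-0.0026"
```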
designs/node-overlay.md
Outdated
**Fail Open Approach (recommended):**

* Using alphabetical ordering or validate last applied resource to resolve conflicts when weights are equal
What does "validate last applied resource" mean here? Are you able to elaborate?
I dropped that part. The idea there was that Karpenter would be able to understand changes between node overlays to determine whether we should keep the current overlay or block the update to the overlay due to a conflict.
**Fail Open Approach (recommended):**

* Using alphabetical ordering or validate last applied resource to resolve conflicts when weights are equal
* Setting overlay Ready status condition to false when conflicts occur
What happens when you have a NodeOverlay which results in a conflict for one instance type but not another? Consider this scenario:
- NodeOverlay A + B applies to instance type A (No Conflicts)
- NodeOverlay C + B applies to instance type B (Conflict)
In this example, NodeOverlay B is the lower priority overlay in both cases. I imagine we'd still want to use it for instance type A and not B, but I think the recommendation suggests we wouldn't want to use it for either. I think this approach simplifies a fair bit, but I also think it will be surprising when applying an overlay which results in a conflict for InstanceType B also causes changes for InstanceType A. This isn't something that necessarily needs to block this v1alpha1 revision, but is something we can iterate on and get feedback on.
In the current proposal, we would apply the overlays based on alphabetical ordering. So in the scenario you have outlined, overlays A and B would be applied and NodeOverlay C would not be applied. The recommended approach would give the behavior you described.
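As a sketch of equal-weight conflict resolution (overlay names and fields are illustrative): with equal weights, alphabetical ordering would keep overlay-b and drop overlay-c where the two conflict.

```yaml
kind: NodeOverlay
metadata:
  name: overlay-b   # equal weight; wins alphabetically over overlay-c
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["instance-type-a", "instance-type-b"]
  capacity:
    smarter-devices/fuse: 1
---
kind: NodeOverlay
metadata:
  name: overlay-c   # conflicts with overlay-b on instance-type-b; not applied
spec:
  requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["instance-type-b"]
  capacity:
    smarter-devices/fuse: 2
```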
* Can be implemented at two levels:

1. Global: Stops all provisioning across the cluster
2. NodePool-specific: Only affects misconfigured NodePools
I am curious how this would work given that NodeOverlays aren't directly linked to NodePools. Would this affect any NodePool which has overlapping instance type selection with the impacted NodeOverlay?
This would resolve to all the NodePools that the overlays apply to, either through the requirements or by understanding which instance types match.
How would someone use this in a Reserved Instances (RIs) setup if the Overlay isn't tied to the NodePool?
For RIs, I'd want a weighted NodePool with a limit based on how many RIs I have. The instance types launched by that NodePool should be cheaper. But other NodePools that launch those instance types should not be as cheap. Consolidation should also understand the price differences between the same instance-type/capacity-type.
Yeah, that would be the current expectation for how users would use reserved instances and compute savings plans. The interesting aspect here is that there is a follow-up we can add to overlays to only apply the overlay to a certain number of instances. That way customers can collapse their NodePools and have reserved instances and normally priced instances in one NodePool.
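If overlays can select on the karpenter.sh/nodepool label (an assumption on my part, not something the RFC confirms), an RI setup might be approximated by scoping the discount to the weighted NodePool:

```yaml
kind: NodeOverlay
metadata:
  name: reserved-instance-pricing
spec:
  requirements:
    # Assumed: overlays may scope to a NodePool via its well-known label
    - key: karpenter.sh/nodepool
      operator: In
      values: ["reserved-pool"]
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.xlarge"]
  # Reflect the effective RI rate so consolidation compares real costs
  priceAdjustment: "-60%"
```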
1. Global: Stops all provisioning across the cluster
2. NodePool-specific: Only affects misconfigured NodePools

* Similar to Karpenter's existing behavior with .spec.limits and InsufficientCapacityErrors
How is this comparable to ICE errors?
Similar to hitting capacity limits or ICEd instance types today, this gives precise control over provisioning by allowing Karpenter to halt operations either across entire NodePools or for specific instance types affecting all NodePools. The solution provides flexible management of Karpenter's scheduling behavior, ensuring that resource allocation strictly adheres to intended configurations while preventing unplanned provisioning outside defined parameters.
Hello all, I wanted to provide a short consumer perspective on this proposed feature. Our organization has been in demand for several changes that would benefit from the proposed
We're still exploring where new NodePools would be necessary, but if these features were implemented directly into the NodePool API, I fear it would near-exponentially increase the number of node pools we'd need to maintain as Karpenter admins. I would welcome the ease of an overlay, especially one that targets specific instances through a familiar spec. I would accept the increased blast radius and put it on our teams to ensure proper testing is done in case poor changes are made to an overlay. Just the opinion of one Karpenter admin, of course. 👍

In short, it's fantastic that Karpenter can smartly choose the right instance for the demand, but we're starting to see some shortcomings when it comes to more complex logic. An optional API to implement complex logic on a subset of nodes in a given NodePool seems like a good potential path forward.

In transparency, we are also looking for the ability to dynamically adjust EBS throughput and IOPS depending on instance size, but I don't think NodeOverlay would apply here since this is part of the EC2NodeClass.
Great feedback @Beardface123! Can you expand a little bit on why you'd imagine this would increase the number of NodePools? I initially thought this, but realized an alternative path where
Your comment here seems to suggest that you would scope these overrides to a NodePool anyway:
It's a bit nuanced, so I'm very curious to hear your thinking on this.
What I wanted to avoid is an implementation that requires NodePools to propagate out exponentially. Given an example of a general node pool:
Given the scenario, we might need to take a single NodePool and split it in two to support the
I didn't express myself correctly, sorry about that. To correct myself, I should've said "An optional API to implement complex logic on a subset of nodes across specified NodePools." If I were to try to represent my opinions concisely in a few points, now that I've re-read the threads and given it another round of thought:

Ultimately, the overlay concept in general is very appealing in either implementation, but I think the preference is to use a separate API object to define overlays and have that propagate to all NodePools that reference it. The alternative is a more cumbersome experience if there are many NodePools with diverse overlays. If set manually, I'd worry about user error. If overlays are templated with Helm or other tools, well, that's time my team needs to spend figuring it out. Again, sorry for the misunderstanding up front.
Out of curiosity, where do the Karpenter logic and the NodeOverlay fit in this schema?
Fixes
Description
This RFC is an update to the original Node Overlay RFC, opened 6 months ago. This RFC plans to go over what is planned
How was this change tested?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.