
docs: Node Overlay RFC #2166


Open · wants to merge 1 commit into base: main

Conversation

Contributor
@engedaam engedaam commented Apr 25, 2025


Description
This RFC is an updated version of the Node Overlay RFC originally opened six months ago. It goes over what is planned for the feature, for example:

kind: NodeOverlay
metadata:
  name: default
spec:
  priceDiscount: "90%"
  capacity:
    smarter-devices/fuse: 1
---
kind: NodeOverlay
metadata:
  name: memory
spec:
  requirements:
  - key: karpenter.k8s.aws/instance-memory
    operator: Gt
    values: ["2048"]
  priceAdjustment: "0.109"
  capacity:
    smarter-devices/fuse: 23

How was this change tested?

  • N/A

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Apr 25, 2025
@k8s-ci-robot
Contributor
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam
Once this PR has been reviewed and has the lgtm label, please assign mwielgus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 25, 2025
@engedaam engedaam mentioned this pull request Apr 25, 2025
@engedaam engedaam changed the title RFC: Node Overlay docs: Node Overlay RFC Apr 25, 2025
@coveralls commented Apr 25, 2025

Pull Request Test Coverage Report for Build 15567714722

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage decreased (-0.02%) to 82.051%

Files with Coverage Reduction New Missed Lines %
pkg/controllers/provisioning/scheduling/nodeclaim.go 3 89.66%
Totals Coverage Status
Change from base Build 15452759565: -0.02%
Covered Lines: 10263
Relevant Lines: 12508

💛 - Coveralls

@engedaam engedaam force-pushed the node-overlay-rfc branch 3 times, most recently from ce506d2 to c155548 Compare April 25, 2025 22:22
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  priceAdjustment: 0.101
Contributor

I'd avoid floats in APIs. They're notorious for math issues: https://stackoverflow.com/questions/36043597/why-is-1-001-0-001-not-equal-to-1-002

For this reason, you'd be hard pressed to see them in any k8s API.

kubernetes-sigs/controller-tools#245

@GnatorX GnatorX May 5, 2025

Agreed, ideally you would keep this in the world of integers. Maybe use something like basis points: https://en.wikipedia.org/wiki/Basis_point
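For illustration, a minimal sketch of what a basis-point field might look like; the priceAdjustmentBasisPoints name is hypothetical and not part of the proposal:

kind: NodeOverlay
metadata:
  name: spot-discount
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["spot"]
  # -1000 basis points would mean a 10% price reduction; integers avoid float precision issues
  priceAdjustmentBasisPoints: -1000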

Contributor Author

I'm using an integer value to denote cents, and it will be used as a signed value to reflect the price adjustment.

Contributor Author

Instead of using an integer value, it would make sense to represent the value as a string, since the field will accept both a percentage and a decimal value, and to do the conversion within the controller.
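A hedged sketch of the string-typed field described here, accepting either a percentage or a decimal value (the exact syntax is still being discussed in this thread):

kind: NodeOverlay
metadata:
  name: on-demand-adjustment
spec:
  requirements:
  - key: karpenter.sh/capacity-type
    operator: In
    values: ["on-demand"]
  # quoted strings sidestep float serialization; the controller would parse either form
  priceAdjustment: "-10%"     # percentage form
  # priceAdjustment: "0.101"  # decimal form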

@@ -0,0 +1,305 @@
# Karpenter - Node Overlay
Contributor
@ellistarn ellistarn Apr 27, 2025

Excited to see this rev!

We skip over any discussion of why this can't be included in the Node Pool API. I'd be curious to explore those tradeoffs. I wrote the original RFC, but I've forgotten much of the why behind some of these implicit decisions.

Off the top of my head, I think there's a classic simplicity / flexibility tradeoff.

Including in the NodePool API could lead customers to create more NodePools to handle different override values for different shapes of nodes. There's nothing fundamentally wrong with this from an API perspective, though I know our scheduler isn't particularly good at simulating multiple node pools at the same time (with the same weight). This may be a decent impetus to drive that work.

Adding a new API adds nontrivial complexity to determining the factors that will go into a scheduling decision. We're going from 2 -> N (overlapping overlays) different CRs that will be factored in.

I'm not suggesting you're going the wrong direction with the current proposal, but I do think it's a foundational question that needs some analysis in the RFC.

Contributor Author

Thank you for the thoughtful feedback! I've added a new "Alternative Proposal" section that explores the tradeoffs of integrating this functionality into the NodePool API.

The key consideration that led us away from the NodePool integration was the balance between API simplicity and operational flexibility. While incorporating these features into NodePool might seem simpler initially (one CRD vs. two), it would likely lead to NodePool proliferation as users create separate pools for different override scenarios. This not only increases operational complexity but also puts additional strain on our scheduler, which currently has limitations handling multiple NodePools with similar weights.

I agree that our current proposal of introducing a new CRD does add complexity to the scheduling decision process. However, we believe this complexity is justified by the benefits:

  1. Better separation of concerns between selecting instance types (NodePool) and modifying instance type properties (NodeOverlay)
  2. More flexible overlay stacking capabilities
  3. Easier API evolution path
  4. Simpler implementation of cross-cutting modifications

I've documented these tradeoffs in detail in the new section, including a sample of how the NodePool API extension might have looked.

Contributor
@ellistarn ellistarn Jun 4, 2025

Thanks for exploring this @engedaam! A few more thoughts on this.

I'm reminded of this exact discussion when we defined Disruption Budgets. We faced a similar opportunity to define NodeDisruptionBudgets as a separate CRD, which was really attractive given the flexibility it afforded (e.g. limit 1 disruption globally) and given the analogy to PDB. There were similar analogs for how to deal with overlapping NDBs, etc. Ultimately, we decided that the reliability and simplicity benefits outweighed the flexibility.

For 1 (Separation of concerns), I'd suggest that this is not a value prop in and of itself. Separation of concerns is a tool that you use to achieve an outcome -- I'd argue that 2-4 are outcomes of 1.

For 2 (Flexibility), I'd argue that you can achieve the exact same flexibility through multiple overlays within NodePool, e.g. NodePool.spec.overlays (properties? overrides?). Yes there is a 1MB limit that may impact extreme use cases. Yes the configuration would need to be duplicated as needed across multiple Node Pools, mitigated by client side templating (e.g. helm, cdk8s, kustomize, kro, etc) in the same way it's done for configuration duplication across many other Kubernetes APIs.

For 3 (Compatibility), API maturity is a huge value prop, but again I'd suggest that this can be achieved in NodePool. Kubernetes regularly supports alpha/beta fields in core APIs like Pod. I don't think Karpenter has done one of these yet, but in and of itself, I think it would be a win for the project to iron out what that looks like.

For 4 (Simplicity), I think we're at the crux of what makes me uneasy with the separate CRD approach. There is a significant reliability tradeoff to the simplicity that's suggested here.


NodeOverlays introduce cluster-wide blast radius. A user with permission to modify any NodeOverlay gains access to change scheduling calculations for every NodePool. For customers that treat NodePools as a tenancy boundary, this is a big risk vector. Before changing a NodeOverlay, you have to understand all NodePools and all NodeOverlays.

Many customers use NodePools to gradually roll out changes across a cluster. They trial a setting on a specific NodePool, and if things are going as expected, it's rolled out to more NodePools. This is especially important to get right if there's any dependency between the NodeOverlay and the NodePool (e.g. AMI settings).

It's straightforward to imagine cases where cluster-wide NodeOverlays break in surprising ways. Imagine two teams running on two NodePools using different AMI configurations. One team adds huge-pages support for all instance types of a specific family, but the other team's AMIs aren't configured to support it. Imagine an overlay that changes the memory overhead calculations while all teams are using AL2023. If someone wants to try Bottlerocket (different memory overhead), they'd have to go back and carve out support for the new NodePool in the central NodeOverlay. As the number of NodePools and NodeOverlays grows, this problem gets multiplicatively harder.

I suspect that most customers would need to implement (or enforce w/ a policy engine) a best practice of "NodeOverlay requirements should be scoped to a specific NodePool". At this point, wouldn't we be better off colocating this functionality in a single API?
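For illustration, a minimal sketch of the "scoped to a specific NodePool" best practice described above, assuming NodeOverlay requirements can match on the karpenter.sh/nodepool label (implied in this thread, but something the RFC would need to confirm):

kind: NodeOverlay
metadata:
  name: team-a-fuse
spec:
  requirements:
  - key: karpenter.sh/nodepool
    operator: In
    values: ["team-a"]        # keeps the blast radius to a single NodePool
  capacity:
    smarter-devices/fuse: 1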


Ultimately, I'm torn. For simple use cases, I like how minimal and expressive the separate CRD approach is. As configurations get more complicated, I worry that customers will rapidly lose the ability to reason about how layered/weighted overlays might apply to many NodePools in a cluster.

I'd love to hear perspectives from the customers and use cases that this RFC is targeting.

Contributor

e.g.

kind: NodePool
spec:
  template: 
    requirements:
    - key: node.kubernetes.io/instance-type
      operator: In
      values: ["m5.large", "m5.2xlarge", "m5.4xlarge", "m5.8xlarge", "m5.12xlarge"]
    overlays: 
    - capacity: 
        smarter-devices/fuse: 1
    - capacity:
        smarter-devices/fuse: 2
      weight: 1
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.12xlarge"]

Contributor Author

Thank you for your thoughtful analysis and perspective. While I appreciate the concerns raised about potential complexity and reliability issues with a separate NodeOverlay CRD, I respectfully maintain that keeping this functionality separate from NodePool is the better approach. The flexibility and granular control offered by distinct NodeOverlays outweigh the potential drawbacks, in my view.

It's crucial to recognize that the complexity of overlays extends beyond simple scenarios. There are use cases that require modifying all instance types within a given NodePool, and maintaining this functionality within the NodePool itself would actually increase complexity for users, IMHO. The separate NodeOverlay approach provides a more elegant solution for these broader modifications.

Regarding blast radius and security concerns, these can be mitigated through careful RBAC policies and best practices documentation. The ability to apply overlays across multiple NodePools without duplicating configuration is a significant advantage, especially for large-scale or multi-team environments. Moreover, keeping NodeOverlays separate allows for more rapid iteration and feature development without impacting the core NodePool API.

It's important to consider organizations with centralized planning teams who manage infrastructure for multiple application teams. These central teams would benefit from the ability to apply consistent policies across various NodePools without the need to modify each one individually, which the separate NodeOverlay approach facilitates. While your points about gradual rollouts and potential conflicts between teams are valid, I believe these challenges can be addressed through proper planning, communication, and tooling rather than constraining the API design. Ultimately, the separate CRD approach provides a more scalable and adaptable solution for complex Kubernetes environments, particularly for organizations with centralized infrastructure management and those requiring broad instance type modifications.

I agree that additional customer feedback would be instrumental for the graduation of this API; however, I think we need teams to actually work with and implement this API in real-world scenarios to provide us with meaningful direction going forward.

Contributor

Regarding blast radius and security concerns, these can be mitigated through careful RBAC policies and best practices documentation.

RBAC is not flexible enough to limit the blast radius problem I described. If a NodeOverlay can impact simulation for multiple NodePools, there's nothing RBAC can do. Even a policy engine won't be able to do anything beyond enforcing that NodeOverlays are scoped to NodePools.

Moreover, keeping NodeOverlays separate allows for more rapid iteration and feature development without impacting the core NodePool API.

As I say above, I don't think this is material. You can have new alpha CRDs or new alpha fields in a v1 CRD with similar compatibility guarantees.


I think we can agree that all functional use cases can be met with both approaches. It's really a question of ergonomics vs. risk. Simply put, "Is it worth it to deduplicate scheduling configurations if it means increasing the risk of changes?" Perhaps we can agree to disagree on this statement.

Luckily, this is an alpha, and Karpenter can always change course after a feedback period.


### Example

**Price Adjustment**: Define a pricing adjustment for all instance type options that fit the match labels field. Users can set a scalar value or a percentage.
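For illustration, a hedged sketch of the two forms described here, using the field names from the PR description example; whether the percentage means a fraction of the price or a discount off it is exactly what the comments below ask to clarify:

kind: NodeOverlay
metadata:
  name: percentage-adjustment
spec:
  priceDiscount: "90%"       # percentage form, as in the PR description example
---
kind: NodeOverlay
metadata:
  name: scalar-adjustment
spec:
  priceAdjustment: "0.109"   # scalar form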
Contributor

Can you give more examples to make price-adjustment clearer?

I'm guessing the current example means 90% of the OD price (i.e., 10% off)?

So a scalar would always be negative and look like -0.1 or something?


👍 you should explicitly state whether it is an increase or decrease in price, either with a sign (+/-) or a separate field for the sign

Contributor Author

Yeah, I added a section on how exactly this would be reflected and removed the notion of floating-point values.


The integration of Node Overlay adjustment values and deeper observability mechanisms will be deferred until clear user requirements emerge. This decision is partly driven by concerns that excessive events or logs produced by some controllers could be highly noisy or negatively impact the API server. By deferring these features, we can focus on delivering core functionality and validation tooling first, while leaving room for future enhancements. These enhancements will be based on real-world usage patterns and specific user needs for additional visibility into Node Overlay operations.

## Launch Scope
@GnatorX GnatorX May 5, 2025

I think you should also add something on the NodeClaim that calls out the overlay used + the hash of the overlay

Contributor Author

the hash of the overlay

What do you mean by the hash of the overlay?

We're currently deferring the implementation of overlay status reporting until we receive more customer feedback to better understand the most valuable approach. While NodeClaim is one potential option, we're also considering exposing overlay information through NodePool status or the NodeOverlay status itself, including details about which instance types had overlays applied. We're particularly mindful of the signal-to-noise ratio, especially if we implement event or log-based tracking. What kind of visibility would be most useful for your use case?
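Purely as a hypothetical sketch of the NodeOverlay-status option mentioned above (none of these status fields are part of the proposal yet):

kind: NodeOverlay
metadata:
  name: memory
status:
  conditions:
  - type: Ready
    status: "True"
  # hypothetical field: which instance types the overlay currently applies to
  appliedInstanceTypes: ["r5.large", "r5.xlarge"]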

@GnatorX commented May 5, 2025

Thanks for tackling this problem with a new revision!

@engedaam engedaam force-pushed the node-overlay-rfc branch 4 times, most recently from fa43c6a to e2e038b Compare June 4, 2025 15:46
### Example

**Price Adjustment:** Define a pricing adjustment for instance types that match the specified labels. Users can adjust prices using either:
- A signed integer representing the price adjustment in cents (e.g., 500 for $5.00)
Member

Have we considered using a string representation for currency as well? That might be more natural than the number of cents (at least when using USD).

Contributor Author

Can you elaborate more on what you mean here? Using a string representation gets us away from having to serialize and deserialize the floating-point values that would be passed, as the ultimate goal is to focus on using integer values.

Contributor

cents isn't granular enough. For example, a t4g.small on Spot is currently $0.0026

Contributor Author
@engedaam engedaam Jun 10, 2025

Hmm, that's a fair point. Instead of using a float, I think we can just use a string and do a conversion within the controller.

Contributor

sounds good!


**Fail Open Approach (recommended):**

* Using alphabetical ordering or validate last applied resource to resolve conflicts when weights are equal
Member

What does "validate last applied resource" mean here? Are you able to elaborate?

Contributor Author

I dropped that part. The idea there was that Karpenter would be able to understand changes between node overlays to determine whether we should accept the current overlay or block the update to the overlay due to a conflict.

**Fail Open Approach (recommended):**

* Using alphabetical ordering or validate last applied resource to resolve conflicts when weights are equal
* Setting overlay Ready status condition to false when conflicts occur
Member

What happens when you have a NodeOverlay which results in a conflict for one instance type but not another? Consider this scenario:

  • NodeOverlay A + B applies to instance type A (no conflict)
  • NodeOverlay C + B applies to instance type B (conflict)

In this example, NodeOverlay B is the lower priority overlay in both cases. I imagine we'd still want to use it for instance type A and not B, but I think the recommendation suggests we wouldn't want to use it for either. I think this approach simplifies a fair bit, but I also think it will be surprising when applying an overlay which results in a conflict for InstanceType B also causes changes for InstanceType A. This isn't something that necessarily needs to block this v1alpha1 revision, but is something we can iterate on and get feedback on.

Contributor Author

In the current proposal, we would apply the overlays based on alphabetical ordering. So in the scenario you have outlined, both overlays A and B would be applied, and NodeOverlay C would not be applied. The recommended approach would give the behavior you have outlined.
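For illustration, a minimal sketch of the equal-weight case under the alphabetical tie-break described here; both overlays target the same instance type and set the same field, so the alphabetically earlier name would win where they overlap:

kind: NodeOverlay
metadata:
  name: a-overlay            # alphabetically first, so it would win the tie-break
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.large"]
  capacity:
    smarter-devices/fuse: 1
---
kind: NodeOverlay
metadata:
  name: b-overlay            # conflicting value for the same instance type
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m5.large"]
  capacity:
    smarter-devices/fuse: 2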

* Can be implemented at two levels:

1. Global: Stops all provisioning across the cluster
2. NodePool-specific: Only affects misconfigured NodePools
Member

I am curious how this would work given that NodeOverlays aren't directly linked to NodePools. Would this affect any NodePool which has overlapping instance type selection with the impacted NodeOverlay?

Contributor Author
@engedaam engedaam Jun 10, 2025

This would resolve to all the NodePools that the overlays apply to, either through the requirements or by understanding the instance types that match.

Contributor

How would someone use this in a Reserved Instances (RIs) setup if the Overlay isn't tied to the NodePool?

For RIs, I'd want a weighted NodePool with a limit based on how many RIs I have. The InstanceTypes launched by that NodePool should be cheaper. But other NodePools that launch those instance types should not be as cheap. Consolidation should also understand the differences between prices of the same instance-type/capacity-type combination.

Contributor Author

Yeah, that would be the current expectation as to how users would use reserved instances and Compute Savings Plans. The interesting aspect here is that there is a follow-up we can add to overlays to say: only apply the overlay to a certain number of instances. That way customers can collapse their NodePools and have reserved instances and normally priced instances in one NodePool.
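For illustration, a hedged sketch of the RI pattern described here; the discount value and the karpenter.sh/nodepool-based scoping are assumptions for the example, not part of the proposal:

kind: NodePool
metadata:
  name: reserved
spec:
  weight: 100                # preferred over the regular on-demand NodePool
  limits:
    cpu: "64"                # roughly sized to the purchased RIs
  template:
    spec:
      requirements:
      - key: node.kubernetes.io/instance-type
        operator: In
        values: ["m5.2xlarge"]
---
kind: NodeOverlay
metadata:
  name: reserved-pricing
spec:
  requirements:
  - key: karpenter.sh/nodepool
    operator: In
    values: ["reserved"]
  priceDiscount: "60%"       # reflect the RI rate so scheduling and consolidation see the cheaper price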

1. Global: Stops all provisioning across the cluster
2. NodePool-specific: Only affects misconfigured NodePools

* Similar to Karpenter's existing behavior with .spec.limits and InsufficientCapacityErrors
Member

How is this comparable to ICE errors?

Contributor Author

Similar to hitting capacity limits or ICEd instance types today, this gives precise control over provisioning by allowing Karpenter to halt operations either across entire NodePools or for specific instance types affecting all NodePools. The solution provides flexible management of Karpenter's scheduling behavior, ensuring that resource allocation strictly adheres to intended configurations while preventing unplanned provisioning outside defined parameters.

@engedaam engedaam force-pushed the node-overlay-rfc branch 3 times, most recently from 9ce09f9 to 72592c5 Compare June 10, 2025 16:22
@Beardface123
Hello all, I wanted to provide a short consumer perspective on this proposed feature.

Our organization has been asking for several changes that would benefit from the proposed NodeOverlay feature.

  1. We're exploring nodepools to use ARM, with at least one nodepool requiring the fuse-plugin. (We're currently using a very undesirable workaround for this and are eagerly awaiting a solution for this in Karpenter).
  2. Cost is extremely important so any ability to overlay savings plans on top of particular instances for better node allocations is great.
  3. We're daemonset heavy, so including licensing costs of these daemonsets where applicable would be very helpful.

We're still exploring where new nodepools would be necessary, but if these features were to be implemented directly into the NodePool API I fear it would near exponentially increase the number of node pools we'd need to maintain as Karpenter admins. I would welcome the ease of an overlay, especially one that would target specific instances through a familiar spec. I would accept this increased blast radius and put it on our teams to ensure proper testing is done in case poor changes are performed on an overlay. Just an opinion of one Karpenter admin of course. 👍

In short, it's fantastic that Karpenter can smartly choose the right instance for the demand, but we're starting to see some shortcomings when it comes to more complex logic. An optional API to implement complex logic on a subset of nodes in a given nodepool seems like a good potential path forward.

In transparency, we are also looking for the ability to dynamically adjust EBS throughput and IOPS depending on instance size, but I don't think NodeOverlay would apply here since this is part of the EC2NodeClass.

@ellistarn
Contributor

but if these features were to be implemented directly into the NodePool API I fear it would near exponentially increase the number of node pools we'd need to maintain as Karpenter admins.

Great feedback @Beardface123! Can you expand a little bit on why you'd imagine this would increase the number of NodePools? I initially thought this, but realized there's an alternative path where .spec.overlays (plural) would let you do complex overrides within a single nodepool. The downside being that if you wanted this config everywhere, you'd need to duplicate it across nodepools.

Your comment here seems to suggest that you would scope these overrides to a nodepool anyways:

An optional API to implement complex logic on a subset of nodes in a given nodepool


It's a bit nuanced, so I'm very curious about your thinking on this.

@Beardface123
Can you expand a little bit on why you'd imagine this would increase the number of NodePools?

What I wanted to avoid is an implementation that would require NodePools to proliferate exponentially.

Given an example of a general node pool:

  • requires amd and arm architectures, and the arm nodes need a fuse-plugin custom resource
  • a wide range of sizes from 2xlarge to 16xlarge. The idea of adding estimated daemonset overhead to these nodes is appealing, but sometimes licensing varies by total CPU for the node. The flat cost adjustment may vary for every instance size.

Given the scenario, we might need to take a single NodePool, split it in two to support the fuse-plugin, and then multiply by 5 for each instance size available. This is what I had in mind in my first post. I think I would be accepting of several implementations as long as they keep the number of NodePools we manage under control. Also, upon re-reading the thread from a month ago, this might not be what was intended at all. 🙃
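As a hedged sketch of how overlays might cover that scenario without splitting the NodePool (the label keys, instance types, and adjustment values here are illustrative assumptions):

kind: NodeOverlay
metadata:
  name: arm-fuse
spec:
  requirements:
  - key: kubernetes.io/arch
    operator: In
    values: ["arm64"]
  capacity:
    smarter-devices/fuse: 1
---
kind: NodeOverlay
metadata:
  name: licensing-16xlarge
spec:
  requirements:
  - key: node.kubernetes.io/instance-type
    operator: In
    values: ["m6g.16xlarge"]
  priceAdjustment: "+1200"   # per-size licensing overhead, in whatever unit the final API settles on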

Your comment here seems to suggest that you would scope these overrides to a nodepool anyways

I didn't express myself correctly, sorry about that. To correct myself I should've said "An optional API to implement complex logic on a subset of nodes across specified NodePools"

If I were to try to represent my opinions concisely in a few points now that I've re-read the threads and given it another round of thought:

  1. I think the NodeOverlay API object is a good idea, because it will create reusable overlays that could be applied to multiple NodePools.
  2. In my opinion, NodeOverlay API objects should either have an opt-out concept in the NodePool or an opt-in, which would somewhat blend its concept with your proposed solution of .spec.overlay within the NodePool. I prefer the latter.
  3. It's up to people smarter than me, but if a NodeOverlay could be referenced in a NodePool, it could be an ordered list. This might invalidate the need for a NodeOverlay weight since different nodepools might have different priorities on overlays.

Ultimately, the overlay concept in general is very appealing in either implementation, but I think the preference is to use a separate API object to define overlays and have that propagate to all nodepools that reference it. The alternative is a more cumbersome experience if there are many NodePools with diverse overlays. If manually set, I'd worry about user error. If templating overlays with Helm or other tools is done, well that's time my team needs to figure it out.

Again, sorry for the misunderstanding up front.

@gillg commented Jul 3, 2025

Out of curiosity, where do the Karpenter logic and the NodeOverlay fit in this diagram?
Will it work properly with a special scheduler extension + device plugin?

                                                                        +----------------------------+
                                                                        | POD Manifest               |
                                                                        | with Request               |
                                                                        | aws.amazon.com/neuroncore:2|
                                                                        |                            |
                                                                        |                            |
                                                    2                   +-------------+--------------+
                                         +--------------------------------+           |
                                         |                                |           |
                                         |                                |           | 3
          +------------------------------+-----+                          |           |
          |           Kubelet in INF1/TRN1 Node|                          |           |
          |                                    +<-----------+             |           |
          +-----+---------------------+--------+            |       +-----v-----------v--------------+
                |                     ^                     |       |          Kube-Scheduler        |
                |                     |                     |       |                                |
                |                     |                     |       +--^------+---------------+------+
              9 |                  1  |                     |          |      |               |
                |                     |                    8|         5|      |4              |
                |                     |                     |          |      |               |
                |                     |                     |          |      |               |6
                v                     |                     |          |      |               |
          +-----+---------------------+--------+            |       +--+------v---------------v------+
          |    neuron-device-plugin            |            +-------+       neuron|scheduler|ext     |
          |    in INF1/TRN1 node               |                    +---------------------+----------+
          +----+----------------------+--------+                                          |
               |                      |                                                   |7
               |                      |10                                                 |
               |                      |                                                   v
             11|                      |                                         +---------+-------+
               |                      |                                         |POD Manifest:    |
               |                      |                                         |Annotation:      |
               |                      |                                         |NEURON_CORES:2,3 |
               v                      +---------------------------------------->+                 |
--device=/dev/neuron1 --env NEURON_RT_VISIBLE_CORES=2,3                         |                 |
                                                                                |                 |
                                                                                +-----------------+

1. neuron-device-plugin returns the list of Neuron cores/devices to kubelet
2. Kubelet advertises the Core/Device list to K8s API server (in turn to kube-scheduler)
3. POD Request for neuron cores/devices [Kube-Scheduler picks up the POD creation request]
4. kube-scheduler calls the neuron-scheduler-extn filter function with list of nodes and POD Specification
5. neuron-scheduler-extn scans through the nodes, filters out nodes with non-contiguous cores/devices, and returns the nodes that are capable of supporting the given POD specification
6. kube-scheduler calls the neuron-scheduler-extn bind function with pod and node
7. neuron-scheduler-extn updates the POD annotation with allocated neuron core/device Ids (contiguous)
8. neuron-scheduler-extn sends the bind request to kubelet of the selected node
9. Kubelet calls the Alloc function of the neuron-device-plugin
10. neuron-device-plugin queries the POD Annotation for allocated core/device Ids
11. neuron-device-plugin exports the devices & visible cores to container runtime

Source: https://awsdocs-neuron.readthedocs-hosted.com/en/latest/containers/tutorials/k8s-neuron-scheduler-flow.html#k8s-neuron-scheduler-flow
