WIP: Introduce Node Lifecycle WG #8396

Open: wants to merge 1 commit into base: master

atiratree
Member

No description provided.

@k8s-ci-robot added the cncf-cla: yes, area/community-management, and area/slack-management labels on Mar 24, 2025
@k8s-ci-robot requested review from ahg-g and ardaguclu on March 24, 2025, 12:17
@k8s-ci-robot added the committee/steering, sig/apps, sig/architecture, size/L, sig/autoscaling, sig/cli, sig/cloud-provider, sig/cluster-lifecycle, sig/contributor-experience, do-not-merge/invalid-owners-file, sig/node, and sig/scheduling labels on Mar 24, 2025
@github-project-automation bot moved this to Needs Triage in SIG Scheduling on Mar 24, 2025
@atiratree changed the title from "Introduce Node Lifecycle WG" to "WIP: Introduce Node Lifecycle WG" on Mar 24, 2025
@k8s-ci-robot added the do-not-merge/work-in-progress label on Mar 24, 2025
@atiratree
Member Author

/hold

@k8s-ci-robot added the do-not-merge/hold label on Mar 24, 2025
@rthallisey

Looks like I'm not a member of kubernetes org anymore. I was a few years back, but didn't keep up with contributions recently. You can remove me as a lead and I can reapply after some contributions to this WG.

@k8s-ci-robot removed the do-not-merge/invalid-owners-file label on Mar 24, 2025
@atiratree
Member Author

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz
Contributor

marquiz commented Mar 25, 2025

/cc

@k8s-ci-robot requested a review from marquiz on March 25, 2025, 17:09
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign pohly for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree force-pushed the wg-node-lifecycle branch 4 times, most recently from d725bb9 to a3da4df, on April 11, 2025, 09:08
controllers, API validation, integration with existing core components and extension points for the
ecosystem. This should be accompanied by E2E / Conformance tests.

## Relevant Projects
Member Author

For visibility: please let me know if you have a relevant project that you would like to see included here.

Comment on lines +36 to +38
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
Contributor

Suggested change
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.

Let's stick to general topics, w/o mentioning specific KEPs in the charter.

Member Author

@atiratree atiratree Apr 17, 2025

This was requested by SIG Node. @SergeyKanzhelev, can you please give us input on how you would like the goals to be defined?

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
any required manual interventions. I also want to be able to observe the node drain via the API
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
Contributor

Nit: this entire section has 3 separate use-cases:

  1. initiate
  2. observe
  3. discover

Can you split them accordingly? Shorter user stories are easier to read.

Member Author

Ack, I will go over the use cases and improve them.

DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve
Contributor

This statement doesn't seem to fit under "Area we expect to explore:". I'd drop it entirely.

Member Author

We still do not know what the integration will look like, but it will have to be explored and may result in other enhancements. I would still prefer that we be able to reference such future work.

Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
Contributor

Are you saying there will be separate APIs for descheduling any Pod and for descheduling a Pod with a DRA device? Why can't both just use /evict?
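
For readers less familiar with the existing mechanism referenced here, the following is a minimal sketch of what a request to the pod eviction subresource looks like. It only builds the request path and JSON body for a `policy/v1` `Eviction` object and sends nothing. The helper name `eviction_request` and the namespace and pod names are hypothetical, chosen for illustration; they do not come from this PR.

```python
from typing import Optional, Tuple


def eviction_request(
    namespace: str, pod: str, grace_period_seconds: Optional[int] = None
) -> Tuple[str, dict]:
    """Build (path, body) for a POST to a pod's eviction subresource.

    Hypothetical helper for illustration: it performs no network I/O.
    """
    # The eviction subresource lives under the pod it targets.
    path = f"/api/v1/namespaces/{namespace}/pods/{pod}/eviction"
    body = {
        "apiVersion": "policy/v1",
        "kind": "Eviction",
        "metadata": {"name": pod, "namespace": namespace},
    }
    # Optionally override the pod's termination grace period.
    if grace_period_seconds is not None:
        body["deleteOptions"] = {"gracePeriodSeconds": grace_period_seconds}
    return path, body


# Example with made-up names; a real client would POST `body` to `path`.
path, body = eviction_request("default", "web-0", grace_period_seconds=30)
```

The request shape is the same whether or not the pod uses a DRA device, which is the point the question above raises.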

Member Author

This just references an existing feature without specifying the implementation details here.

Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Contributor

Same question here, /evict isn't sufficient?

Member Author

This also references an existing doc, and I believe the /evict API is not sufficient in this scenario since it needs to apply to all workloads.

projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.
Contributor

I'd like to especially stress this section:

We will strive to make these solutions minimalistic and extensible to support advanced 
use cases across the ecosystem.

to ensure we first look into existing APIs and how we can expand them, rather than introducing new ones.

We already struggle with low usage of the Eviction API. Adding a new API will not resolve the problem; it will only make it harder for users to find the right one. I believe someone else has already stressed this, but I'd like to see it become one of the key goals for this WG.

Member Author

Ack, the goals should be stated in this fashion.

@kwilczynski
Member

@atiratree, even though I don't work for Red Hat any more, I would like to join this WG, this topic is still of interest to me.

@selansen

@atiratree, I would like to be part of this WG. Pls include me as well.

@evrardjp

I have written some PoC that might interest this wg, sign me up.

@evrardjp

/cc

@k8s-ci-robot
Contributor

@evrardjp: GitHub didn't allow me to request PR reviews from the following users: evrardjp.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kaushik229

/cc

@k8s-ci-robot
Contributor

@kaushik229: GitHub didn't allow me to request PR reviews from the following users: kaushik229.

Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to this:

/cc

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Contributor

@elmiko elmiko left a comment

i like how this is going and am excited to see the wg formed. thank you @atiratree !

@humblec
Contributor

humblec commented Apr 16, 2025

I have been exploring the APIs in this area and would like to help with this initiative. @atiratree, I would like to be part of this WG.

@atiratree force-pushed the wg-node-lifecycle branch 2 times, most recently from 43ff1f5 to c627543, on April 17, 2025, 15:41
@atiratree
Member Author

Thank you all for your interest!

Just to be on the same page for all visitors, this WG is open to everyone and we will announce the weekly meetings on the [email protected] mailing list as soon as the group is formed.

If you are interested in helping us organize or lead this group, please message me on Slack to discuss.

Co-authored-by: Ryan Hallisey <[email protected]>
Labels
- area/community-management
- area/slack-management (Issues or PRs related to the Slack Management subproject)
- cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.)
- committee/steering (Denotes an issue or PR intended to be handled by the steering committee.)
- do-not-merge/hold (Indicates that a PR should not merge because someone has issued a /hold command.)
- do-not-merge/work-in-progress (Indicates that a PR should not merge because it is a work in progress.)
- sig/apps (Categorizes an issue or PR as relevant to SIG Apps.)
- sig/architecture (Categorizes an issue or PR as relevant to SIG Architecture.)
- sig/autoscaling (Categorizes an issue or PR as relevant to SIG Autoscaling.)
- sig/cli (Categorizes an issue or PR as relevant to SIG CLI.)
- sig/cloud-provider (Categorizes an issue or PR as relevant to SIG Cloud Provider.)
- sig/cluster-lifecycle (Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.)
- sig/contributor-experience (Categorizes an issue or PR as relevant to SIG Contributor Experience.)
- sig/network (Categorizes an issue or PR as relevant to SIG Network.)
- sig/node (Categorizes an issue or PR as relevant to SIG Node.)
- sig/scheduling (Categorizes an issue or PR as relevant to SIG Scheduling.)
- sig/storage (Categorizes an issue or PR as relevant to SIG Storage.)
- size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)
Projects
Status: Needs Triage