Skip to content

feat: Introduce NodePool repair policies with toleration durations #2310

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 12 commits into
base: main
Choose a base branch
from

Conversation

matanbaruch
Copy link

@matanbaruch matanbaruch commented Jun 14, 2025

This pull request introduces a new repair policy framework for managing unhealthy nodes in Karpenter, along with a development container configuration for local development. The changes include the addition of a RepairSpec to NodePool configurations, updates to the design document for node repair, and deprecation of the RepairPolicy in favor of a more flexible RepairStatement. Key updates are grouped below.

Node Repair Policy Enhancements:

  • Introduced RepairSpec in NodePoolSpec, allowing users to define custom repair policies and default toleration durations for unhealthy nodes. This includes new fields like policies and defaultTolerationDuration in NodePoolSpec (pkg/apis/v1/nodepool.go, [1] [2].
  • Updated the design document to reflect the new repair policy structure, including resolution order for toleration durations and examples for CloudProvider and NodePool configurations (designs/node-repair.md, [1] [2].
  • Deprecated RepairPolicy in favor of RepairStatement, which separates condition monitoring from toleration duration configuration (pkg/cloudprovider/types.go, pkg/cloudprovider/types.goR52-R59).

CRD and Codebase Updates:

  • Updated CRDs to include the repair field in NodePool specifications, with validation for defaultTolerationDuration and policies (pkg/apis/crds/karpenter.sh_nodepools.yaml, [1]; kwok/charts/crds/karpenter.sh_nodepools.yaml, [2].
  • Added autogenerated deepcopy methods for RepairSpec and RepairPolicy to support the new framework (pkg/apis/v1/zz_generated.deepcopy.go, [1] [2].

Development Environment:

  • Added a new development container configuration (.devcontainer/devcontainer.json) to streamline local development with pre-configured tools like Go, Git, and kubectl (.devcontainer/devcontainer.json, .devcontainer/devcontainer.jsonR1-R29).

… durations

This update enhances the node repair mechanism by allowing users to define repair policies and default toleration durations within NodePool configurations. The changes include the introduction of RepairSpec and RepairPolicy types, which enable customization of repair behaviors based on node conditions. The controller logic has been updated to prioritize NodePool-specific configurations over cloud provider defaults, improving flexibility and user control over node repair processes.
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 14, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matanbaruch
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 14, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @matanbaruch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 14, 2025
…l configuration

This commit introduces a new RepairSpec and RepairPolicy structure to the NodePool configuration, allowing users to define custom repair policies and default toleration durations for node conditions. The controller logic has been updated to utilize these configurations, enhancing the flexibility and control over node repair processes.
…requeue logic

This commit adds debug logging to the node health controller, providing insights into the reconciliation process, including termination times and condition types. Additionally, it refines the logic for determining the earliest termination time among unhealthy conditions, ensuring more accurate requeue intervals based on the closest termination duration. These changes improve observability and correctness in handling node health states.
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025
@coveralls
Copy link

coveralls commented Jun 14, 2025

Pull Request Test Coverage Report for Build 15894134042

Details

  • 90 of 110 (81.82%) changed or added relevant lines in 3 files are covered.
  • 11 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 81.841%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/controllers/node/health/controller.go 55 61 90.16%
pkg/apis/v1/zz_generated.deepcopy.go 25 39 64.1%
Files with Coverage Reduction New Missed Lines %
pkg/controllers/disruption/consolidation.go 4 85.88%
pkg/controllers/provisioning/scheduling/preferences.go 7 88.76%
Totals Coverage Status
Change from base Build 15890000904: -0.1%
Covered Lines: 10276
Relevant Lines: 12556

💛 - Coveralls

This commit introduces a new test case for the node health controller, verifying that unhealthy nodes are deleted and metrics are recorded correctly. The test checks the deletion timestamp of the node claim and validates the metrics for disrupted node claims, enhancing the coverage of the node health functionality. Additionally, it includes improvements to pod scheduling tests by relaxing pod affinity and anti-affinity weights, ensuring better scheduling decisions.
…nd add post-create script for tool installations

This commit replaces the Git and kubectl-asdf features in the devcontainer configuration with Docker-in-Docker. Additionally, it introduces a new post-create script that installs essential Kubernetes tools such as kubectl, helm, kind, and kwok based on the system architecture, enhancing the development environment setup.
…ness

This commit modifies the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions to improve their logic by checking for equality instead of inequality when identifying nodes. Additionally, a new function, ExpectStateNodeExistsForCoreV1Node, is introduced to streamline the process of retrieving state nodes based on corev1.Node names, enhancing code readability and maintainability.
This commit refines the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions by adjusting the logic to ensure that nodes are correctly identified based on their names and provider IDs. The changes enhance the clarity and correctness of the node retrieval process, improving overall functionality.
…ings

This commit refines the NodePool repair configuration by introducing a new `repair` block that allows users to specify toleration durations for node conditions. The `TolerationDuration` field has been renamed to `toleration` for clarity, and the controller logic has been updated to utilize these new settings. This change enhances user control over node repair behaviors, allowing for more tailored responses to different workloads and conditions.
@matanbaruch matanbaruch requested a review from engedaam June 16, 2025 08:57
@jonathan-innis
Copy link
Member

/assign @engedaam

@matanbaruch
Copy link
Author

@engedaam Hey can you run tests please?

@@ -22,97 +22,176 @@ The alpha implementation will not consider these features:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@matanbaruch instead of changing the node repair doc, can you separate out your use cases and proposal into their own section. It might be better to write this as it's own RFC, since this a pretty major change to the NodePool API. Main reason here is that I would good for us to keep the historical context for the choices that were made with the feature.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are right. Fixed

@engedaam
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 25, 2025
…l over node repair behavior

This commit adds a new `repair` block to the NodePool API, allowing users to specify per-condition toleration durations and customize repair policies. The implementation enhances flexibility for different workloads, addressing the varying sensitivities to node issues. Additionally, it captures the evolution of the node repair feature and outlines the controller workflow for managing unhealthy nodes.
@matanbaruch
Copy link
Author

Resolve aws/karpenter-provider-aws#7491

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants