feat: Introduce NodePool repair policies with toleration durations #2310

matanbaruch · 2025-06-14T07:17:08Z

This pull request introduces a new repair policy framework for managing unhealthy nodes in Karpenter, along with a development container configuration for local development. The changes include the addition of a RepairSpec to NodePool configurations, updates to the design document for node repair, and deprecation of the RepairPolicy in favor of a more flexible RepairStatement. Key updates are grouped below.

Node Repair Policy Enhancements:

Introduced RepairSpec in NodePoolSpec, allowing users to define custom repair policies and default toleration durations for unhealthy nodes. This includes new fields like policies and defaultTolerationDuration in NodePoolSpec (pkg/apis/v1/nodepool.go, [1] [2].
Updated the design document to reflect the new repair policy structure, including resolution order for toleration durations and examples for CloudProvider and NodePool configurations (designs/node-repair.md, [1] [2].
Deprecated RepairPolicy in favor of RepairStatement, which separates condition monitoring from toleration duration configuration (pkg/cloudprovider/types.go, pkg/cloudprovider/types.goR52-R59).

CRD and Codebase Updates:

Updated CRDs to include the repair field in NodePool specifications, with validation for defaultTolerationDuration and policies (pkg/apis/crds/karpenter.sh_nodepools.yaml, [1]; kwok/charts/crds/karpenter.sh_nodepools.yaml, [2].
Added autogenerated deepcopy methods for RepairSpec and RepairPolicy to support the new framework (pkg/apis/v1/zz_generated.deepcopy.go, [1] [2].

Development Environment:

Added a new development container configuration (.devcontainer/devcontainer.json) to streamline local development with pre-configured tools like Go, Git, and kubectl (.devcontainer/devcontainer.json, .devcontainer/devcontainer.jsonR1-R29).

… durations This update enhances the node repair mechanism by allowing users to define repair policies and default toleration durations within NodePool configurations. The changes include the introduction of RepairSpec and RepairPolicy types, which enable customization of repair behaviors based on node conditions. The controller logic has been updated to prioritize NodePool-specific configurations over cloud provider defaults, improving flexibility and user control over node repair processes.

k8s-ci-robot · 2025-06-14T07:17:15Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: matanbaruch
Once this PR has been reviewed and has the lgtm label, please assign ellistarn for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot · 2025-06-14T07:17:17Z

Hi @matanbaruch. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

…l configuration This commit introduces a new RepairSpec and RepairPolicy structure to the NodePool configuration, allowing users to define custom repair policies and default toleration durations for node conditions. The controller logic has been updated to utilize these configurations, enhancing the flexibility and control over node repair processes.

…requeue logic This commit adds debug logging to the node health controller, providing insights into the reconciliation process, including termination times and condition types. Additionally, it refines the logic for determining the earliest termination time among unhealthy conditions, ensuring more accurate requeue intervals based on the closest termination duration. These changes improve observability and correctness in handling node health states.

coveralls · 2025-06-14T10:27:13Z

Pull Request Test Coverage Report for Build 15894134042

Details

90 of 110 (81.82%) changed or added relevant lines in 3 files are covered.
11 unchanged lines in 2 files lost coverage.
Overall coverage decreased (-0.1%) to 81.841%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/controllers/node/health/controller.go	55	61	90.16%
pkg/apis/v1/zz_generated.deepcopy.go	25	39	64.1%

Files with Coverage Reduction	New Missed Lines	%
pkg/controllers/disruption/consolidation.go	4	85.88%
pkg/controllers/provisioning/scheduling/preferences.go	7	88.76%

Totals
Change from base Build 15890000904:	-0.1%
Covered Lines:	10276
Relevant Lines:	12556

💛 - Coveralls

This commit introduces a new test case for the node health controller, verifying that unhealthy nodes are deleted and metrics are recorded correctly. The test checks the deletion timestamp of the node claim and validates the metrics for disrupted node claims, enhancing the coverage of the node health functionality. Additionally, it includes improvements to pod scheduling tests by relaxing pod affinity and anti-affinity weights, ensuring better scheduling decisions.

…nd add post-create script for tool installations This commit replaces the Git and kubectl-asdf features in the devcontainer configuration with Docker-in-Docker. Additionally, it introduces a new post-create script that installs essential Kubernetes tools such as kubectl, helm, kind, and kwok based on the system architecture, enhancing the development environment setup.

…ness This commit modifies the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions to improve their logic by checking for equality instead of inequality when identifying nodes. Additionally, a new function, ExpectStateNodeExistsForCoreV1Node, is introduced to streamline the process of retrieving state nodes based on corev1.Node names, enhancing code readability and maintainability.

This commit refines the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions by adjusting the logic to ensure that nodes are correctly identified based on their names and provider IDs. The changes enhance the clarity and correctness of the node retrieval process, improving overall functionality.

designs/node-repair.md

.devcontainer/devcontainer.json

…ings This commit refines the NodePool repair configuration by introducing a new `repair` block that allows users to specify toleration durations for node conditions. The `TolerationDuration` field has been renamed to `toleration` for clarity, and the controller logic has been updated to utilize these new settings. This change enhances user control over node repair behaviors, allowing for more tailored responses to different workloads and conditions.

jonathan-innis · 2025-06-17T20:36:19Z

/assign @engedaam

matanbaruch · 2025-06-23T11:47:22Z

@engedaam Hey can you run tests please?

engedaam · 2025-06-25T18:49:15Z

designs/node-repair.md

@@ -22,97 +22,176 @@ The alpha implementation will not consider these features:



@matanbaruch instead of changing the node repair doc, can you separate out your use cases and proposal into their own section. It might be better to write this as it's own RFC, since this a pretty major change to the NodePool API. Main reason here is that I would good for us to keep the historical context for the choices that were made with the feature.

You are right. Fixed

engedaam · 2025-06-25T18:50:17Z

/ok-to-test

…l over node repair behavior This commit adds a new `repair` block to the NodePool API, allowing users to specify per-condition toleration durations and customize repair policies. The implementation enhances flexibility for different workloads, addressing the varying sensitivities to node issues. Additionally, it captures the evolution of the node repair feature and outlines the controller workflow for managing unhealthy nodes.

matanbaruch · 2025-06-29T09:23:12Z

Resolve aws/karpenter-provider-aws#7491

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 14, 2025

k8s-ci-robot requested review from engedaam and tallaxes June 14, 2025 07:17

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 14, 2025

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 14, 2025

matanbaruch added 3 commits June 14, 2025 10:22

chore: Remove unused cloudprovider import from health suite test

f5046aa

k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 14, 2025

matanbaruch added 4 commits June 14, 2025 12:21

engedaam reviewed Jun 14, 2025

View reviewed changes

designs/node-repair.md Outdated Show resolved Hide resolved

designs/node-repair.md Outdated Show resolved Hide resolved

designs/node-repair.md Outdated Show resolved Hide resolved

.devcontainer/devcontainer.json Show resolved Hide resolved

matanbaruch requested a review from engedaam June 16, 2025 08:57

k8s-ci-robot assigned engedaam Jun 17, 2025

Merge branch 'main' into main

8a4afdf

engedaam reviewed Jun 25, 2025

View reviewed changes

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 25, 2025

Merge branch 'main' into main

2b74f60

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: Introduce NodePool repair policies with toleration durations #2310

feat: Introduce NodePool repair policies with toleration durations #2310

matanbaruch commented Jun 14, 2025 •

edited

Loading

Uh oh!

k8s-ci-robot commented Jun 14, 2025

Uh oh!

k8s-ci-robot commented Jun 14, 2025

Uh oh!

coveralls commented Jun 14, 2025 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonathan-innis commented Jun 17, 2025

Uh oh!

matanbaruch commented Jun 23, 2025

Uh oh!

engedaam Jun 25, 2025

Uh oh!

matanbaruch Jun 26, 2025

Uh oh!

engedaam commented Jun 25, 2025

Uh oh!

matanbaruch commented Jun 29, 2025

Uh oh!

Uh oh!

		@@ -22,97 +22,176 @@ The alpha implementation will not consider these features:

feat: Introduce NodePool repair policies with toleration durations #2310

Are you sure you want to change the base?

feat: Introduce NodePool repair policies with toleration durations #2310

Conversation

matanbaruch commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Node Repair Policy Enhancements:

CRD and Codebase Updates:

Development Environment:

Uh oh!

k8s-ci-robot commented Jun 14, 2025

Uh oh!

k8s-ci-robot commented Jun 14, 2025

Uh oh!

coveralls commented Jun 14, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 15894134042

Details

💛 - Coveralls

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

jonathan-innis commented Jun 17, 2025

Uh oh!

matanbaruch commented Jun 23, 2025

Uh oh!

engedaam Jun 25, 2025

Choose a reason for hiding this comment

Uh oh!

matanbaruch Jun 26, 2025

Choose a reason for hiding this comment

Uh oh!

engedaam commented Jun 25, 2025

Uh oh!

matanbaruch commented Jun 29, 2025

Uh oh!

Uh oh!

matanbaruch commented Jun 14, 2025 •

edited

Loading

coveralls commented Jun 14, 2025 •

edited

Loading