-
Notifications
You must be signed in to change notification settings - Fork 297
feat: Introduce NodePool repair policies with toleration durations #2310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
… durations This update enhances the node repair mechanism by allowing users to define repair policies and default toleration durations within NodePool configurations. The changes include the introduction of RepairSpec and RepairPolicy types, which enable customization of repair behaviors based on node conditions. The controller logic has been updated to prioritize NodePool-specific configurations over cloud provider defaults, improving flexibility and user control over node repair processes.
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: matanbaruch The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Hi @matanbaruch. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
…l configuration This commit introduces a new RepairSpec and RepairPolicy structure to the NodePool configuration, allowing users to define custom repair policies and default toleration durations for node conditions. The controller logic has been updated to utilize these configurations, enhancing the flexibility and control over node repair processes.
…requeue logic This commit adds debug logging to the node health controller, providing insights into the reconciliation process, including termination times and condition types. Additionally, it refines the logic for determining the earliest termination time among unhealthy conditions, ensuring more accurate requeue intervals based on the closest termination duration. These changes improve observability and correctness in handling node health states.
Pull Request Test Coverage Report for Build 15894134042Details
💛 - Coveralls |
This commit introduces a new test case for the node health controller, verifying that unhealthy nodes are deleted and metrics are recorded correctly. The test checks the deletion timestamp of the node claim and validates the metrics for disrupted node claims, enhancing the coverage of the node health functionality. Additionally, it includes improvements to pod scheduling tests by relaxing pod affinity and anti-affinity weights, ensuring better scheduling decisions.
…nd add post-create script for tool installations This commit replaces the Git and kubectl-asdf features in the devcontainer configuration with Docker-in-Docker. Additionally, it introduces a new post-create script that installs essential Kubernetes tools such as kubectl, helm, kind, and kwok based on the system architecture, enhancing the development environment setup.
…ness This commit modifies the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions to improve their logic by checking for equality instead of inequality when identifying nodes. Additionally, a new function, ExpectStateNodeExistsForCoreV1Node, is introduced to streamline the process of retrieving state nodes based on corev1.Node names, enhancing code readability and maintainability.
This commit refines the ExpectStateNodeExists and ExpectStateNodeExistsForNodeClaim functions by adjusting the logic to ensure that nodes are correctly identified based on their names and provider IDs. The changes enhance the clarity and correctness of the node retrieval process, improving overall functionality.
…ings This commit refines the NodePool repair configuration by introducing a new `repair` block that allows users to specify toleration durations for node conditions. The `TolerationDuration` field has been renamed to `toleration` for clarity, and the controller logic has been updated to utilize these new settings. This change enhances user control over node repair behaviors, allowing for more tailored responses to different workloads and conditions.
/assign @engedaam |
@engedaam Hey can you run tests please? |
@@ -22,97 +22,176 @@ The alpha implementation will not consider these features: | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@matanbaruch instead of changing the node repair doc, can you separate out your use cases and proposal into their own section. It might be better to write this as it's own RFC, since this a pretty major change to the NodePool API. Main reason here is that I would good for us to keep the historical context for the choices that were made with the feature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right. Fixed
/ok-to-test |
…l over node repair behavior This commit adds a new `repair` block to the NodePool API, allowing users to specify per-condition toleration durations and customize repair policies. The implementation enhances flexibility for different workloads, addressing the varying sensitivities to node issues. Additionally, it captures the evolution of the node repair feature and outlines the controller workflow for managing unhealthy nodes.
Resolve aws/karpenter-provider-aws#7491 |
This pull request introduces a new repair policy framework for managing unhealthy nodes in Karpenter, along with a development container configuration for local development. The changes include the addition of a
RepairSpec
to NodePool configurations, updates to the design document for node repair, and deprecation of theRepairPolicy
in favor of a more flexibleRepairStatement
. Key updates are grouped below.Node Repair Policy Enhancements:
RepairSpec
inNodePoolSpec
, allowing users to define custom repair policies and default toleration durations for unhealthy nodes. This includes new fields likepolicies
anddefaultTolerationDuration
inNodePoolSpec
(pkg/apis/v1/nodepool.go
, [1] [2].designs/node-repair.md
, [1] [2].RepairPolicy
in favor ofRepairStatement
, which separates condition monitoring from toleration duration configuration (pkg/cloudprovider/types.go
, pkg/cloudprovider/types.goR52-R59).CRD and Codebase Updates:
repair
field in NodePool specifications, with validation fordefaultTolerationDuration
andpolicies
(pkg/apis/crds/karpenter.sh_nodepools.yaml
, [1];kwok/charts/crds/karpenter.sh_nodepools.yaml
, [2].RepairSpec
andRepairPolicy
to support the new framework (pkg/apis/v1/zz_generated.deepcopy.go
, [1] [2].Development Environment:
.devcontainer/devcontainer.json
) to streamline local development with pre-configured tools like Go, Git, and kubectl (.devcontainer/devcontainer.json
, .devcontainer/devcontainer.jsonR1-R29).