Skip to content

Support maxUnhealthy to easily express N-1, N-2 scenarios #372

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jul 2, 2025

Conversation

tiraboschi
Copy link
Contributor

@tiraboschi tiraboschi commented Jun 25, 2025

Why we need this PR

The minHealthy parameter already allows cluster admins to prevent NHC from remediating too many nodes at once, protecting overall compute capacity.

However, expressing N-1 or N-2 scenarios, where only one or two nodes should be remediated at a time,
currently requires knowing the exact number of nodes in advance and updating the configuration as nodes are added.

This change introduces support for maxUnhealthy as an alternative to minHealthy.

Changes made

This change:

  • introduces a new maxUnhealthy API as a mutually exclusive alternative to minHealthy.
  • Only alters the logic used to compute the minHealthy value (similar to how percentages are handled) when the value is expressed with maxUnhealthy.
  • Keeps the internal processing and validation unchanged.

Which issue(s) this PR fixes

Fixes https://issues.redhat.com/browse/RHWA-88
It's an alternative to #371

Test plan

  • Existing Unit and E2E tests are passing.
  • New unit tests to cover maxUnhealthy.

Summary by CodeRabbit

  • New Features

    • Introduced a new maxUnhealthy field for NodeHealthCheck resources, allowing remediation based on the maximum number or percentage of unhealthy nodes.
    • Users can now choose between minHealthy and maxUnhealthy fields, which are mutually exclusive.
  • Bug Fixes

    • Improved validation to ensure only one of minHealthy or maxUnhealthy can be set, and neither can have negative values.
  • Documentation

    • Updated user guides and configuration documentation to explain the new maxUnhealthy option and its usage constraints.
  • Tests

    • Expanded test coverage to validate new field behaviors, mutual exclusivity, and error handling.

Copy link
Contributor

openshift-ci bot commented Jun 25, 2025

Hi @tiraboschi. Thanks for your PR.

I'm waiting for a medik8s member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

Copy link

coderabbitai bot commented Jun 25, 2025

Walkthrough

This update introduces a new MaxUnhealthy parameter to the NodeHealthCheck (NHC) custom resource, allowing configuration of the maximum number or percentage of unhealthy nodes tolerated before remediation is blocked. The change includes schema updates, validation logic enforcing mutual exclusivity with MinHealthy, controller logic refactoring to calculate minimum healthy nodes accordingly, documentation enhancements, and comprehensive tests covering validation, controller behavior, and E2E scenarios.

Changes

File(s) Change Summary
api/v1alpha1/nodehealthcheck_types.go,
api/v1alpha1/zz_generated.deepcopy.go
Added MaxUnhealthy field to NodeHealthCheckSpec, updated MinHealthy to allow non-negative values, and included deep copy logic for MaxUnhealthy.
api/v1alpha1/nodehealthcheck_webhook.go,
api/v1alpha1/nodehealthcheck_webhook_test.go
Replaced previous MinHealthy validation with combined ValidateMinHealthyMaxUnhealthy enforcing mutual exclusivity and presence of either field; added validation prohibiting MaxUnhealthy with node deletion remediators; expanded test coverage for these rules.
controllers/nodehealthcheck_controller.go,
controllers/nodehealthcheck_controller_test.go
Refactored reconciliation logic to use helper getMinHealthy that calculates minimum healthy nodes from either MinHealthy or MaxUnhealthy, with validation and error handling; added detailed table-driven tests for this logic and improved error logging.
e2e/nhc_e2e_test.go Updated E2E tests to initialize NodeHealthCheck with MaxUnhealthy instead of MinHealthy; added validation tests for negative values, mutual exclusivity, and required presence of one field; adjusted update validation tests accordingly.
README.md,
docs/configuration.md
Extended documentation to introduce MaxUnhealthy as an alternative to MinHealthy, clarify mutual exclusivity, update default values, and provide usage guidance including warnings about remediators that delete nodes.
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml,
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
Added spec descriptor for maxUnhealthy field; updated minHealthy descriptor to allow non-negative values and note mutual exclusivity; clarified allowed values and usage constraints in ClusterServiceVersion manifests.
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml,
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
Added maxUnhealthy field to NodeHealthCheck CRD schema with validation and documentation; updated minHealthy field by removing default, allowing zero, and adding mutual exclusivity note; both fields share validation patterns and usage guidance.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant API Server
    participant NHC Validator
    participant NHC Controller

    User->>API Server: Create/Update NodeHealthCheck (with MinHealthy or MaxUnhealthy)
    API Server->>NHC Validator: Validate NHC spec
    NHC Validator->>API Server: Accept/Reject (enforces mutual exclusivity, value checks)
    API Server->>NHC Controller: Store/Update NHC
    NHC Controller->>NHC Controller: getMinHealthy(spec, totalNodes)
    NHC Controller->>NHC Controller: Determine remediation eligibility
Loading

Poem

In the warren of code, a new field appears,
MaxUnhealthy now guides us through HA frontiers.
With checks and balances, no bunny confused,
Only one may be set, so the logic’s not abused.
Now our nodes hop safely, both healthy and spry—
Hooray for this patch, from a rabbit nearby!
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c094b8c and e50f4fa.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • api/v1alpha1/nodehealthcheck_types.go (1 hunks)
  • api/v1alpha1/nodehealthcheck_webhook.go (4 hunks)
  • api/v1alpha1/nodehealthcheck_webhook_test.go (1 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • controllers/nodehealthcheck_controller.go (2 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • docs/configuration.md (1 hunks)
  • e2e/nhc_e2e_test.go (3 hunks)
✅ Files skipped from review due to trivial changes (1)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml
🚧 Files skipped from review as they are similar to previous changes (12)
  • README.md
  • e2e/nhc_e2e_test.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • docs/configuration.md
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
  • api/v1alpha1/nodehealthcheck_webhook.go
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml
  • api/v1alpha1/nodehealthcheck_webhook_test.go
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
  • api/v1alpha1/nodehealthcheck_types.go
  • controllers/nodehealthcheck_controller.go
  • controllers/nodehealthcheck_controller_test.go
✨ Finishing Touches
  • 📝 Generate Docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Explain this complex logic.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai explain this code block.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and explain its main purpose.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR.
  • @coderabbitai generate sequence diagram to generate a sequence diagram of the changes in this PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
README.md (1)

64-67: Fix grammar and spelling issues in the documentation.

The documentation correctly explains the new maxUnhealthy parameter, but there are language issues that need to be addressed.

Apply this diff to fix the grammar and spelling:

-In alternative a complementary parameter can be used to specify the maximum
+As an alternative, a complementary parameter can be used to specify the maximum
-number of unhealthy nodes simplifying the configuration of N-1, N-2 scenario
+number of unhealthy nodes, simplifying the configuration of N-1, N-2 scenarios
-when the total number of selected nodes is not known in advance or is exected
+when the total number of selected nodes is not known in advance or is expected
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (1)

141-142: In-spec references should use camelCase for field names

The description refers to MinHealthy and MaxUnhealthy with an initial capital letter, whereas the CRD field names are minHealthy and maxUnhealthy. Using the exact casing makes grep-based searches and generated docs clearer.

-  MinHealthy and MaxUnhealthy are configuring the same aspect,
+  minHealthy and maxUnhealthy configure the same aspect,
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f7f2c1d and c5e3a73.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • api/v1alpha1/nodehealthcheck_types.go (1 hunks)
  • api/v1alpha1/nodehealthcheck_webhook.go (3 hunks)
  • api/v1alpha1/nodehealthcheck_webhook_test.go (1 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • controllers/nodehealthcheck_controller.go (2 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • docs/configuration.md (1 hunks)
  • e2e/nhc_e2e_test.go (1 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (3)
api/v1alpha1/nodehealthcheck_webhook.go (1)
api/v1alpha1/nodehealthcheck_types.go (1)
  • NodeHealthCheck (319-325)
controllers/nodehealthcheck_controller.go (1)
api/v1alpha1/nodehealthcheck_types.go (1)
  • NodeHealthCheckSpec (57-136)
controllers/nodehealthcheck_controller_test.go (1)
api/v1alpha1/nodehealthcheck_types.go (1)
  • NodeHealthCheckSpec (57-136)
🪛 LanguageTool
README.md

[uncategorized] ~64-~64: A comma might be missing here.
Context: ...s, by percentage or absolute number. In alternative a complementary parameter can be used t...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

🔇 Additional comments (24)
e2e/nhc_e2e_test.go (1)

39-45: LGTM! Correct usage of the new maxUnhealthy field.

The test properly demonstrates the new maxUnhealthy parameter as an alternative to minHealthy. Using intstr.FromInt(1) for maxUnhealthy effectively limits remediation to one unhealthy node at a time, which aligns with the PR's goal of simplifying N-1 scenarios.

api/v1alpha1/zz_generated.deepcopy.go (1)

120-124: LGTM! Correctly generated deep copy logic.

The auto-generated deep copy code properly handles the new MaxUnhealthy field, following the same pattern as other pointer fields in the struct. The nil check and new allocation are implemented correctly.

docs/configuration.md (1)

52-60: Excellent documentation update!

The table clearly documents the new maxUnhealthy field and its relationship with minHealthy. The mutual exclusivity is well explained, and the formatting improvements make the documentation more readable. The change from a default value to "n/a" correctly reflects that at least one of the fields must be explicitly set.

api/v1alpha1/nodehealthcheck_types.go (2)

86-87: LGTM! Clear documentation of mutual exclusivity.

The updated documentation for MinHealthy correctly explains its relationship with the new MaxUnhealthy field, making it clear that they cannot be used simultaneously.


94-104: Excellent implementation of the MaxUnhealthy field!

The new field is properly defined with:

  • Consistent validation annotations matching MinHealthy
  • Clear documentation explaining the 0% edge case
  • Proper kubebuilder and operator-sdk annotations
  • Explicit mention of mutual exclusivity with MinHealthy

The field definition aligns perfectly with the PR objectives of providing an alternative to MinHealthy for N-1, N-2 scenarios.

api/v1alpha1/nodehealthcheck_webhook.go (3)

43-43: LGTM: Consistent error constant addition

The new maxUnhealthyError constant follows the same pattern as the existing minHealthyError, maintaining consistency in the codebase.


109-109: LGTM: Function rename reflects expanded scope

The function call correctly reflects the expanded validation that now handles both MinHealthy and MaxUnhealthy fields.


129-144: LGTM: Comprehensive validation logic

The unified validation function correctly implements all requirements:

  • Enforces mutual exclusivity between MinHealthy and MaxUnhealthy
  • Requires at least one field to be specified
  • Validates both fields for negative integer values
  • Maintains consistent error messaging

The implementation follows established patterns and handles all edge cases properly.

config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2)

59-66: LGTM: Clear and comprehensive field descriptor

The new maxUnhealthy spec descriptor provides excellent documentation:

  • Clearly explains the field's purpose and behavior
  • Documents value constraints (positive integers/percentages, 100% cap)
  • Explicitly mentions mutual exclusivity with minHealthy
  • Consistent formatting with other descriptors

70-72: LGTM: Updated description maintains consistency

The updated minHealthy description correctly documents the mutual exclusivity constraint with maxUnhealthy, ensuring users understand the relationship between these fields.

bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2)

134-146: LGTM: Well-structured CRD schema definition

The maxUnhealthy field schema is correctly implemented:

  • Uses appropriate anyOf pattern for integer/string values
  • Includes comprehensive description with clear constraints
  • Validation pattern matches minHealthy for consistency
  • Properly configured with x-kubernetes-int-or-string: true
  • Documents mutual exclusivity relationship

156-157: LGTM: Consistent documentation update

The updated minHealthy description correctly documents the mutual exclusivity constraint, maintaining consistency with the maxUnhealthy field description.

bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2)

99-106: LGTM: Consistent spec descriptor across manifests

The maxUnhealthy spec descriptor matches the base manifest, ensuring consistent documentation across all ClusterServiceVersion files. The description is clear and comprehensive.


110-112: LGTM: Maintains manifest consistency

The minHealthy description update ensures consistency across all ClusterServiceVersion manifests, properly documenting the mutual exclusivity constraint.

api/v1alpha1/nodehealthcheck_webhook_test.go (4)

75-76: LGTM: Proper test isolation

Setting MaxUnhealthy to nil ensures the test properly isolates the MinHealthy negative value validation scenario.


83-93: LGTM: Comprehensive negative value validation test

The test properly validates negative maxUnhealthy values:

  • Correctly isolates the test by setting MinHealthy to nil
  • Uses the appropriate error constant
  • Follows the established test pattern

95-105: LGTM: Proper mutual exclusivity validation test

The test correctly validates that minHealthy and maxUnhealthy cannot be specified simultaneously, checking for the appropriate error message.


107-116: LGTM: Complete requirement validation test

The test correctly validates that at least one of minHealthy or maxUnhealthy must be specified, checking for the appropriate error message.

controllers/nodehealthcheck_controller_test.go (4)

122-148: LGTM: Well-organized test structure for minHealthy validation.

The reorganization of minHealthy tests into a proper Context("minHealthy") block improves test readability and structure. The new test cases appropriately cover percentage validation scenarios and maintain consistency with the existing skipped test for negative values.


150-176: LGTM: Comprehensive test coverage for maxUnhealthy validation.

The maxUnhealthy test cases appropriately mirror the minHealthy tests, which makes sense given their similar validation requirements. The tests cover the key scenarios: invalid percentages over 100%, negative values (properly skipped with explanation), and valid percentage ranges.


277-413: LGTM: Excellent table-driven test coverage for getMinHealthy function.

This comprehensive table test covers all the critical scenarios for the new getMinHealthy function:

Positive cases: Both absolute values and percentages for minHealthy and maxUnhealthy
Edge cases: 0%, 100%, and rounding scenarios
Error cases: Negative values, mutual exclusivity, neither field set, and maxUnhealthy exceeding total nodes
Calculation verification: Correct percentage calculations and minHealthy derivation from maxUnhealthy

The test entries correctly validate the mathematical relationships (e.g., 75% of 10 nodes = 8 after rounding, maxUnhealthy of 8 → minHealthy of 2).


684-687: LGTM: Clean helper function for test data creation.

The ptrToIntStr helper function provides a concise way to create *intstr.IntOrString pointers from string values, improving the readability of the table test entries.

controllers/nodehealthcheck_controller.go (2)

332-335: LGTM: Clean refactoring with enhanced error context.

The refactoring from direct intstr.GetScaledValueFromIntOrPercent call to the new getMinHealthy helper function is well-implemented. The enhanced error logging now includes both minHealthy and maxUnhealthy values, providing better debugging context for administrators.


883-905: LGTM: Robust implementation with comprehensive validation.

The getMinHealthy helper function correctly implements the mutual exclusivity requirement and provides thorough validation:

Mutual exclusivity enforcement (lines 884-886): Prevents both parameters from being set simultaneously
Negative value validation: Applied to both MinHealthy and MaxUnhealthy calculations
Upper bound validation (lines 899-901): Ensures MaxUnhealthy doesn't exceed total node count
Mandatory field validation (line 904): Requires at least one parameter to be specified
Correct mathematical conversion (line 902): total - maxUnhealthy accurately represents the minimum healthy nodes

The implementation aligns perfectly with the PR objectives of supporting N-1, N-2 scenarios through the MaxUnhealthy parameter.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
README.md (1)

64-67: Polish wording & fix typos in the new ‘maxUnhealthy’ paragraph
Current phrasing is a bit awkward (“In alternative”), has a comma missing, plural/typo issues (“scenario”, “exected”). The paragraph reads more smoothly and professionally with the tweaks below.

-In alternative a complementary parameter can be used to specify the maximum
-number of unhealthy nodes simplifying the configuration of N-1, N-2 scenario
-when the total number of selected nodes is not known in advance or is exected
-to change.
+Alternatively, a complementary `maxUnhealthy` parameter can be used to specify
+the maximum number of unhealthy nodes, simplifying the configuration of N-1 or
+N-2 scenarios when the total number of selected nodes is not known in advance
+or is expected to change.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c5e3a73 and 170d888.

📒 Files selected for processing (13)
  • README.md (1 hunks)
  • api/v1alpha1/nodehealthcheck_types.go (1 hunks)
  • api/v1alpha1/nodehealthcheck_webhook.go (3 hunks)
  • api/v1alpha1/nodehealthcheck_webhook_test.go (1 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • controllers/nodehealthcheck_controller.go (2 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • docs/configuration.md (1 hunks)
  • e2e/nhc_e2e_test.go (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (12)
  • e2e/nhc_e2e_test.go
  • api/v1alpha1/zz_generated.deepcopy.go
  • docs/configuration.md
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
  • api/v1alpha1/nodehealthcheck_types.go
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml
  • api/v1alpha1/nodehealthcheck_webhook.go
  • controllers/nodehealthcheck_controller.go
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
  • controllers/nodehealthcheck_controller_test.go
  • api/v1alpha1/nodehealthcheck_webhook_test.go
🧰 Additional context used
🪛 LanguageTool
README.md

[uncategorized] ~64-~64: A comma might be missing here.
Context: ...s, by percentage or absolute number. In alternative a complementary parameter can be used t...

(AI_EN_LECTOR_MISSING_PUNCTUATION_COMMA)

Copy link
Member

@slintes slintes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only alters the logic used to compute the minHealthy value

nice solution 👍

@@ -2243,6 +2273,144 @@ var _ = Describe("Node Health Check CR", func() {
})

})

Context("getAbsoluteMinHealthy", func() {
DescribeTable("should return absolute minHealthy",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice 👍

| _remediationTemplate_ | yes but mutually exclusive with below | n/a | A [ObjectReference](https://kubernetes.io/docs/reference/kubernetes-api/common-definitions/object-reference/) to a remediation template provided by a remediation provider. See details below. |
| _escalatingRemediations_ | yes but mutually exclusive with above | n/a | A list of ObjectReferences to a remediation template with order and timeout. See details below. |
| _minHealthy_ | one of _minHealthy_ or _maxUnhealthy_ should be set | n/a | The minimum number of healthy nodes selected by this CR for allowing further remediation. Percentage or absolute number. _minHealthy_ and _maxUnhealthy_ are mutually exclusive. |
| _maxUnhealthy_ | one of _minHealthy_ or _maxUnhealthy_ should be set | n/a | The maximum number of unhealthy nodes selected by this CR for allowing further remediation. Percentage or absolute number. _minHealthy_ and _maxUnhealthy_ are mutually exclusive. |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please note somewhere that maxUnhealthy should not be used with remediators which delete nodes, e.g. MachineDeletionRemediation, because deleting unhealthy nodes breaks counting (un)healthy nodes? Maybe here and also at new the field in types.go?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@slintes slintes changed the title Support manUnhealthy to easily express N-1, N-2 scenarios Support maxUnhealthy to easily express N-1, N-2 scenarios Jun 27, 2025
@slintes
Copy link
Member

slintes commented Jun 27, 2025

/ok-to-test

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7940c62 and 680e5ab.

📒 Files selected for processing (14)
  • README.md (1 hunks)
  • api/v1alpha1/nodehealthcheck_types.go (1 hunks)
  • api/v1alpha1/nodehealthcheck_webhook.go (4 hunks)
  • api/v1alpha1/nodehealthcheck_webhook_test.go (1 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (1 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (1 hunks)
  • config/manager/kustomization.yaml (1 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (1 hunks)
  • controllers/nodehealthcheck_controller.go (2 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • docs/configuration.md (1 hunks)
  • e2e/nhc_e2e_test.go (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (13)
  • api/v1alpha1/zz_generated.deepcopy.go
  • api/v1alpha1/nodehealthcheck_webhook_test.go
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml
  • docs/configuration.md
  • api/v1alpha1/nodehealthcheck_webhook.go
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
  • api/v1alpha1/nodehealthcheck_types.go
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
  • controllers/nodehealthcheck_controller.go
  • e2e/nhc_e2e_test.go
  • README.md
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml
  • controllers/nodehealthcheck_controller_test.go
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-335
Timestamp: 2025-05-28T08:18:35.543Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, the UpdateStatusNodeDelayedHealthy call with unsafe type cast to *NodeHealthCheck is actually safe because HealthyDelayContextKey is only set for NodeHealthCheck controllers, not MachineHealthCheck controllers. This means shortestDelay will always be 0 for MachineHealthCheck, preventing the unsafe cast line from being reached.
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Comment on lines 12 to 15
images:
- name: controller
newName: quay.io/medik8s/node-healthcheck-operator
newTag: latest
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Pin the controller image to a fixed tag or digest instead of latest

Using latest makes deployments non-deterministic and risks pulling incompatible images when new versions are pushed. Pin to a released semantic version or, preferably, an immutable digest.

-  newName: quay.io/medik8s/node-healthcheck-operator
-  newTag: latest
+  newName: quay.io/medik8s/node-healthcheck-operator
+  # pin to a specific release tag or digest for reproducibility
+  newTag: v0.8.1        # example – update to the actual release
+# digest: sha256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx  # optional but safer
🤖 Prompt for AI Agents
In config/manager/kustomization.yaml lines 12 to 15, the controller image tag is
set to 'latest', which causes non-deterministic deployments. Replace 'latest'
with a fixed semantic version tag or an immutable image digest to ensure
consistent and reliable deployments.

@tiraboschi
Copy link
Contributor Author

/test 4.18-openshift-e2e

@tiraboschi
Copy link
Contributor Author

/test 4.17-openshift-e2e

@tiraboschi
Copy link
Contributor Author

/test 4.16-openshift-e2e

The minHealthy parameter already allows cluster admins
to prevent NHC from remediating too many nodes at once,
protecting overall compute capacity.

However, expressing N-1 or N-2 scenarios, where only
one or two nodes should be remediated at a time,
currently requires knowing the exact number of nodes
in advance and updating the configuration as nodes are added.

This change introduces support for maxUnhealthy as an alternative
to minHealthy.

This change:
- introduces a new maxUnhealthy API as a mutually exclusive
  alternative to minHealthy.
- Only alters the logic used to compute the minHealthy value
  (similar to how percentages are handled) when the value is
  expressed with maxUnhealthy.
- Keeps the internal processing and validation unchanged.

Signed-off-by: Simone Tiraboschi <[email protected]>
@tiraboschi
Copy link
Contributor Author

/test 4.18-openshift-e2e

@slintes
Copy link
Member

slintes commented Jul 2, 2025

/lgtm
/approve

Copy link
Contributor

openshift-ci bot commented Jul 2, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: slintes, tiraboschi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Jul 2, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 576e81a into medik8s:main Jul 2, 2025
22 checks passed
@tiraboschi tiraboschi deleted the maxUnhealthy branch July 3, 2025 07:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants