NHC healthy delay #365
Skipping CI for Draft Pull Request.
802200b to 5bb81fc
/test 4.17-openshift-e2e
/test 4.17-openshift-e2e
448f583 to abe8a9d
abe8a9d to e2fab75
e2fab75 to 83a26f8
Walkthrough
This update introduces a configurable "healthy delay" feature to the NodeHealthCheck (NHC) system. A new field allows specifying a delay before a node is considered healthy after recovery, with corresponding API, CRD, and status schema changes. The reconciliation logic, resource manager, and tests are updated to support and verify delayed healthy state recognition and remediation CR deletion.
Changes
Sequence Diagram(s)
sequenceDiagram
    participant Reconciler
    participant ResourceManager
    participant Node
    participant RemediationCR
    Reconciler->>ResourceManager: HandleHealthyNode(nodeName, crName, owner)
    ResourceManager->>RemediationCR: List remediation CRs for node
    loop For each remediation CR
        ResourceManager->>RemediationCR: Check for healthy delay annotation
        alt No annotation
            ResourceManager->>RemediationCR: Set delay start annotation (now)
            ResourceManager-->>Reconciler: Return delay duration
        else Annotation present
            ResourceManager->>ResourceManager: Calculate time left
            alt Delay not expired
                ResourceManager-->>Reconciler: Return remaining delay
            else Delay expired
                ResourceManager->>RemediationCR: Delete CR
            end
        end
    end
    ResourceManager-->>Reconciler: Return shortest delay for requeue
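The flow in the diagram can be sketched in plain Go roughly as follows. This is a minimal illustration, not the PR's implementation: the `remediationCR` struct, the `handleHealthyNode` signature, and the `demoStart`/`demoDelay` values are all hypothetical stand-ins, the delay start is modeled as a pointer field instead of a real annotation, and deletion is modeled as a boolean flag rather than an API call.

```go
package main

import (
	"fmt"
	"time"
)

// remediationCR is a hypothetical, simplified stand-in for a remediation CR.
// DelayStart models the "delay start" annotation; nil means it is not set yet.
type remediationCR struct {
	Name       string
	DelayStart *time.Time
	Deleted    bool
}

// Demo values used by main; they are assumptions for illustration only.
var (
	demoStart = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)
	demoDelay = 10 * time.Minute
)

// handleHealthyNode walks the CRs of a node that is healthy again and returns
// the shortest remaining delay to requeue after (0 means nothing is pending).
func handleHealthyNode(crs []*remediationCR, delay time.Duration, now time.Time) time.Duration {
	// Per the PR description, a negative delay means the node is not considered
	// healthy without manual intervention, so nothing is deleted or requeued.
	if delay < 0 {
		return 0
	}
	var requeue time.Duration
	for _, cr := range crs {
		if cr.Deleted {
			continue
		}
		var left time.Duration
		if cr.DelayStart == nil {
			// No annotation yet: record the start time and wait the full delay.
			start := now
			cr.DelayStart = &start
			left = delay
		} else {
			// Annotation present: compute the time left.
			left = delay - now.Sub(*cr.DelayStart)
		}
		if left <= 0 {
			// Delay expired: the CR can be deleted.
			cr.Deleted = true
			continue
		}
		// Keep the shortest positive remaining delay for the requeue.
		if requeue == 0 || left < requeue {
			requeue = left
		}
	}
	return requeue
}

func main() {
	crs := []*remediationCR{{Name: "remediation-a"}, {Name: "remediation-b"}}
	// First pass: no CR carries a start time yet.
	fmt.Println("requeue after:", handleHealthyNode(crs, demoDelay, demoStart))
	// Second pass, once the full delay has elapsed.
	fmt.Println("requeue after:", handleHealthyNode(crs, demoDelay, demoStart.Add(demoDelay)))
	fmt.Println("deleted:", crs[0].Deleted, crs[1].Deleted)
}
```

Passing `now` in as a parameter, rather than calling `time.Now()` inside the function, keeps the sketch deterministic and easy to test.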
📜 Recent review details
Configuration used: CodeRabbit UI
📒 Files selected for processing (13)
🚧 Files skipped from review as they are similar to previous changes (10)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
controllers/shared.go (3)
docs/configuration.md (3)
controllers/resources/manager.go (3)
🧬 Code Graph Analysis (1)
controllers/shared.go (1)
🪛 LanguageTool
docs/configuration.md
[uncategorized] ~202-~202: Possible missing comma found. (AI_HYDRA_LEO_MISSING_COMMA)
🔇 Additional comments (9)
83a26f8 to 226c757
/lgtm
Giving others the chance to review further.
/hold
//cc @slintes
@coderabbitai review
✅ Actions performed
Review triggered.
@coderabbitai full review
✅ Actions performed
Full review triggered.
nodeUpdateNeedsReconcile() needs to handle annotations
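One way to picture the requested change is an update filter that also reacts to annotation changes. The sketch below is an assumption-laden illustration, not the actual `nodeUpdateNeedsReconcile()` body: `annotationChanged`, `needsReconcile`, and the `conditionsChanged` flag are hypothetical, annotations are modeled as plain maps instead of Kubernetes objects, and only the annotation key is taken from the PR's commit message.

```go
package main

import "fmt"

// Annotation key named in the PR's commit message. Which object carries it
// is an assumption in this sketch.
const manuallyConfirmedHealthyAnnotation = "remediation.medik8s.io/manually-confirmed-healthy"

// annotationChanged reports whether key was added, removed, or had its value
// changed between the old and new annotation maps.
func annotationChanged(oldAnn, newAnn map[string]string, key string) bool {
	oldVal, oldOK := oldAnn[key]
	newVal, newOK := newAnn[key]
	return oldOK != newOK || oldVal != newVal
}

// needsReconcile is a hypothetical, simplified update filter. conditionsChanged
// stands in for whatever checks the real function already performs.
func needsReconcile(conditionsChanged bool, oldAnn, newAnn map[string]string) bool {
	return conditionsChanged ||
		annotationChanged(oldAnn, newAnn, manuallyConfirmedHealthyAnnotation)
}

func main() {
	before := map[string]string{}
	after := map[string]string{manuallyConfirmedHealthyAnnotation: "true"}
	fmt.Println("annotation added:", needsReconcile(false, before, after))
	fmt.Println("nothing changed:", needsReconcile(false, after, after))
}
```

Comparing both presence and value means that removing the annotation also triggers a reconcile, not only adding it.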
ed25fb9 to b170c73
Good point.
64be754 to 1afee04
I left a note
1afee04 to 203a991
/test 4.17-openshift-e2e
/test 4.16-openshift-e2e
/test 4.16-openshift-e2e
… period of time has passed. An annotation on the CR is used to manage that, and a status update is added on the CR to note that node healthiness is delayed.
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
- Requeue a reconcile in case CR deletion is delayed
- Fix CSV descriptions
- Some refactoring for better readability and usage
- Fix status update
- Update Healthy Delay validation to allow negative values
- Use pointers for new API fields
- Update md with info regarding new spec
- Add remediation.medik8s.io/manually-confirmed-healthy annotation support, in order to enable the user to manually terminate the delay for specific nodes
- Trigger reconcile upon change of RemediationManuallyConfirmedHealthy Annotation
- Move annotation removal to cleanup phase
Signed-off-by: Michael Shitrit <[email protected]>
Signed-off-by: Michael Shitrit <[email protected]>
203a991 to 1ec21b4
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: mshitrit, slintes
/retest
/override 4.20-openshift-e2e
@mshitrit: /override requires failed status contexts, check run or a prowjob name to operate on.
Only the following failed contexts/checkruns were expected:
If you are trying to override a checkrun that has a space in it, you must put a double quote on the context.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/override ci/prow/4.20-openshift-e2e
@mshitrit: Overrode contexts on behalf of mshitrit: ci/prow/4.20-openshift-e2e
Why we need this PR
We'd like to add a configuration option that delays a node's return to healthy status.
When a delay is configured, NHC deletes the remediation CR only after the node has been healthy for the configured time.
A negative value means the node will never be considered healthy automatically, and manual intervention is expected.
The motivation came from several customers who need more control over when taints are removed from a node, and who have seen nodes regain health for only a short period of time.
Changes made
Added a configuration option that keeps a node considered unhealthy until a period of time has passed.
An annotation on the CR is used to manage the delay, and a status update is added on the CR to note that the node isn't considered healthy yet because of the delay.
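The semantics of the delay value described above can be illustrated with a short Go sketch. Everything here is hypothetical and simplified: `classifyDelay`, `canDeleteRemediation`, and `demoDelay` are illustration-only names, a nil pointer models an unset optional field, treating zero the same as unset is an assumption, and checking only for the presence of the `remediation.medik8s.io/manually-confirmed-healthy` annotation (ignoring its value) is also an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Annotation key named in the PR's commit message.
const manuallyConfirmedHealthyAnnotation = "remediation.medik8s.io/manually-confirmed-healthy"

type delayMode int

const (
	noDelay         delayMode = iota // field unset (or zero, by assumption)
	timedDelay                       // positive value: wait this long
	indefiniteDelay                  // negative value: wait for manual intervention
)

// demoDelay is an arbitrary value used by main for illustration.
var demoDelay = 5 * time.Minute

// classifyDelay maps the optional delay setting to one of the three modes.
func classifyDelay(d *time.Duration) delayMode {
	switch {
	case d == nil || *d == 0:
		return noDelay
	case *d < 0:
		return indefiniteDelay
	default:
		return timedDelay
	}
}

// canDeleteRemediation reports whether the remediation CR may be deleted for
// a node that has been continuously healthy for healthyFor.
func canDeleteRemediation(d *time.Duration, healthyFor time.Duration, annotations map[string]string) bool {
	// Manual confirmation ends the delay regardless of the configured value.
	if _, ok := annotations[manuallyConfirmedHealthyAnnotation]; ok {
		return true
	}
	switch classifyDelay(d) {
	case noDelay:
		return true
	case indefiniteDelay:
		return false
	default:
		return healthyFor >= *d
	}
}

func main() {
	indefinite := -demoDelay
	confirmed := map[string]string{manuallyConfirmedHealthyAnnotation: "true"}
	fmt.Println("timed, too early:", canDeleteRemediation(&demoDelay, demoDelay/2, nil))
	fmt.Println("timed, elapsed:", canDeleteRemediation(&demoDelay, demoDelay, nil))
	fmt.Println("indefinite:", canDeleteRemediation(&indefinite, 100*demoDelay, nil))
	fmt.Println("indefinite, confirmed:", canDeleteRemediation(&indefinite, 0, confirmed))
}
```

Using a pointer for the setting matches the commit note about pointers for new API fields, and lets the sketch tell "not configured" apart from an explicit value.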
Which issue(s) this PR fixes
RHWA-10
Test plan