-
Notifications
You must be signed in to change notification settings - Fork 13
Stop FAR Agent on NHC Timeout Annotation #168
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: razo7 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Create three function to reconcile logic into smaller usable functions: remove FAR CR finalizer on deletion, remove FAR taint on deletion, and stop agent execution when NHC time out annotation exist or FAR CR was deleted
d545c84 to
06547b1
Compare
|
/hold |
Delete taints and finalizer on the next reconilation
|
/retest |
| // Check NHC timeout annotation | ||
| if isTimedOutByNHC(far) { | ||
| // Check NHC timeout annotation and stop the agent, since remediation is no longer relevant (most likely because fixed by a different remediator) | ||
| if r.Executor.Exists(far.GetUID()) && isTimedOutByNHC(far) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you add a condition here, and few lines below you change to requeue.... I think that's a significant unwanted change in execution flow, what happens on next reconcile?
|
PR needs rebase. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
WalkthroughThe reconciliation logic in the fence agents remediation controller was refactored to extract repeated code for stopping the agent and logging CR status conditions into a new method. Logging was improved for clarity, and comments were updated to accurately describe taint removal. Control flow and error handling remain unchanged. Changes
Sequence Diagram(s)sequenceDiagram
participant Reconciler
participant FenceAgentExecutor
participant CR (Custom Resource)
participant Logger
Reconciler->>CR: Detects NHC timeout or deletion
alt Executor exists for CR UID
Reconciler->>Reconciler: stopAgentAndGetCrConditionsStatus(CR, nodeName)
Reconciler->>Logger: Log processing and status conditions
Reconciler->>FenceAgentExecutor: Remove executor by CR UID
end
Reconciler->>Logger: Log finalizer removal (using CR name)
Poem
✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
controllers/fenceagentsremediation_controller.go (1)
135-142: Consider returning even when no executor is running & rename helper for clarity
The early-exit currently requires
r.Executor.Exists(far.GetUID())to be true.
If the NHC timeout annotation is present but the executor has already finished, the controller will continue with the rest of the reconciliation logic although the remediation is obsolete. A simpler—and safer—rule is: if NHC timed out ⇒ stop processing.
stopAgentAndGetCrConditionsStatusneither “gets” nor returns any status; it only logs and stops the agent. The name is misleading and makes reviewers search for a returned value.Example refactor:
- if r.Executor.Exists(far.GetUID()) && isTimedOutByNHC(far) { + if isTimedOutByNHC(far) { + if r.Executor.Exists(far.GetUID()) { + r.stopAgentAndLogCrConditions(far, node.Name) + } utils.UpdateConditions(utils.RemediationInterruptedByNHC, far, r.Log) commonEvents.RemediationStoppedByNHC(r.Recorder, far) return requeueImmediately, nil }…and rename the helper accordingly.
| // stopAgentAndGetCrStatus log FAR CR status and stops the agent's parallel execution | ||
| func (r *FenceAgentsRemediationReconciler) stopAgentAndGetCrConditionsStatus(far *v1alpha1.FenceAgentsRemediation, nodeName string) { | ||
| processingCondition := meta.FindStatusCondition(far.Status.Conditions, commonConditions.ProcessingType).Status | ||
| fenceAgentActionSucceededCondition := meta.FindStatusCondition(far.Status.Conditions, utils.FenceAgentActionSucceededType).Status | ||
| succeededCondition := meta.FindStatusCondition(far.Status.Conditions, commonConditions.SucceededType).Status | ||
| r.Log.Info("FAR agent was stopped", "Node Name", nodeName, "Agent Name", far.Spec.Agent, "processing condition", processingCondition, | ||
| "fenceAgentActionSucceeded condition", fenceAgentActionSucceededCondition, "succeeded condition", succeededCondition) | ||
| r.Executor.Remove(far.GetUID()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nil-pointer panic risk when reading CR conditions
meta.FindStatusCondition returns *metav1.Condition which can be nil when the condition is absent.
Directly dereferencing .Status will panic and crash the controller on the very scenario it tries to handle (e.g., first reconcile where conditions aren’t set).
- processingCondition := meta.FindStatusCondition(far.Status.Conditions, commonConditions.ProcessingType).Status
- fenceAgentActionSucceededCondition := meta.FindStatusCondition(far.Status.Conditions, utils.FenceAgentActionSucceededType).Status
- succeededCondition := meta.FindStatusCondition(far.Status.Conditions, commonConditions.SucceededType).Status
+ var processingStatus, actionSucceededStatus, succeededStatus metav1.ConditionStatus
+
+ if c := meta.FindStatusCondition(far.Status.Conditions, commonConditions.ProcessingType); c != nil {
+ processingStatus = c.Status
+ }
+ if c := meta.FindStatusCondition(far.Status.Conditions, utils.FenceAgentActionSucceededType); c != nil {
+ actionSucceededStatus = c.Status
+ }
+ if c := meta.FindStatusCondition(far.Status.Conditions, commonConditions.SucceededType); c != nil {
+ succeededStatus = c.Status
+ }This guards against panics while preserving the intended log output.
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In controllers/fenceagentsremediation_controller.go around lines 281 to 288, the
code directly dereferences the Status field of the result from
meta.FindStatusCondition without checking for nil, which can cause a nil-pointer
panic if the condition is absent. To fix this, first check if the returned
condition pointer is nil before accessing its Status field; if nil, use a
default value (e.g., an empty string or a specific status) to safely log the
condition status without panicking.
|
@razo7: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Why we need this PR
FAR didn't stop the agent execution upon detecting NHC time-out annotation. Now it will do that, and on the next reconciliation fianlizer and taints can be removed (only) when far CR is deleted.
Changes made
Add stop agent execution (stopAgentAndGetCrStatus) function- Stop the agent when NHC time-out annotation (
remediation.medik8s.io/nhc-timed-out) exists or when FAR CR has a finalizer and is deleted. The timed-out annotation is set on escalating remediations.Which issue(s) this PR fixes
ECOPROJECT-2539
Test plan
Summary by CodeRabbit