-
Notifications
You must be signed in to change notification settings - Fork 27
Modify SNR Remediation Taint #277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: razo7 The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
Warning Rate limit exceeded@razo7 has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 24 minutes and 46 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (3)
📝 WalkthroughWalkthroughSwaps Kubernetes core imports from Changes
Sequence Diagram(s)(omitted — changes are type/taint migrations and test/manifest updates without new multi-component control flows) Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controllers/selfnoderemediation_controller.go (1)
814-835: Function name and messages don't match the taint effect.The function
removeNoExecuteTaintremovesNodeNoScheduleTaintwhich hasTaintEffectNoSchedule, notTaintEffectNoExecute. Additionally:
- Line 831: Log message says "NoExecute taint removed" but the effect is NoSchedule
- Line 832: Event message says "remove NoExecute taint from healthy remediated node" but the effect is NoSchedule
This creates significant confusion about what the code is actually doing.
🔎 Proposed fix
-func (r *SelfNodeRemediationReconciler) removeNoExecuteTaint(node *corev1.Node) error { +func (r *SelfNodeRemediationReconciler) removeRemediationTaint(node *corev1.Node) error { if !utils.TaintExists(node.Spec.Taints, NodeNoScheduleTaint) { return nil } patch := client.MergeFrom(node.DeepCopy()) if taints, deleted := utils.DeleteTaint(node.Spec.Taints, NodeNoScheduleTaint); !deleted { r.logger.Info("Failed to remove taint from node, taint not found", "node name", node.Name, "taint key", NodeNoScheduleTaint.Key, "taint effect", NodeNoScheduleTaint.Effect) return nil } else { node.Spec.Taints = taints } if err := r.Client.Patch(context.Background(), node, patch); err != nil { r.logger.Error(err, "Failed to remove taint from node,", "node name", node.Name, "taint key", NodeNoScheduleTaint.Key, "taint effect", NodeNoScheduleTaint.Effect) return err } - r.logger.Info("NoExecute taint removed", "new taints", node.Spec.Taints) - events.NormalEvent(r.Recorder, node, eventReasonRemoveNoExecute, "Remediation process - remove NoExecute taint from healthy remediated node") + r.logger.Info("Remediation taint removed", "new taints", node.Spec.Taints) + events.NormalEvent(r.Recorder, node, eventReasonRemoveNoExecute, "Remediation process - remove remediation taint from healthy remediated node") return nil }Note: You'll also need to update the call site at line 579 and consider renaming the event reason constant.
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
controllers/selfnoderemediation_controller.gocontrollers/tests/controller/selfnoderemediation_controller_test.goe2e/self_node_remediation_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
Repo: medik8s/self-node-remediation PR: 262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
controllers/tests/controller/selfnoderemediation_controller_test.go
🧬 Code graph analysis (2)
e2e/self_node_remediation_test.go (1)
controllers/selfnoderemediation_controller.go (1)
NodeNoScheduleTaint(74-78)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
controllers/selfnoderemediation_controller.go (1)
NodeNoScheduleTaint(74-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (1)
controllers/selfnoderemediation_controller.go (1)
74-78: Taint effect change looks correct.The change from
TaintEffectNoExecutetoTaintEffectNoSchedulealigns with the PR objective to prevent volume consistency issues. NoSchedule will prevent new pods from being scheduled without force-evicting existing pods.
controllers/tests/controller/selfnoderemediation_controller_test.go
Outdated
Show resolved
Hide resolved
9f9cd60 to
31dd039
Compare
|
/test 4.21-openshift-e2e |
|
|
||
| NodeNoExecuteTaint = &v1.Taint{ | ||
| NodeNoScheduleTaint = &corev1.Taint{ | ||
| Key: "medik8s.io/remediation", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please update the key as discussed
Use a distinctive, common, alias for k8s.io/api/core/v1 rather than 'v1'
Avoid manual mispelling of common oos tain key
The remediation taint is aimed to prevent scheduling of workloads on the unhealthy node before the node is considered to be healthy and the reboot took action. Using NoExecute effect when SNR failed to fence the node within those 6 minutes could then lead to a risk of volume inconsistency. In any situation where a pod deletion has not succeeded for 6 minutes, kubernetes will force detach volumes being unmounted if the node is unhealthy at that instant
45082a5 to
20792a8
Compare
|
/test 4.21-openshift-e2e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
365-369: Consider using the controller'sOutOfServiceTaintconstant for consistency.The hardcoded taint values should ideally reference
controllers.OutOfServiceTaintto avoid drift if the taint definition changes.♻️ Suggested refactor
Eventually(func() bool { err := k8sClient.Get(context.Background(), nodeKey, updatedNode) if err != nil { return false } - return !utils.TaintExists(updatedNode.Spec.Taints, &corev1.Taint{ - Key: corev1.TaintNodeOutOfService, - Value: "nodeshutdown", - Effect: corev1.TaintEffectNoExecute, - }) + return !utils.TaintExists(updatedNode.Spec.Taints, controllers.OutOfServiceTaint) }, 10*time.Second, 250*time.Millisecond).Should(BeTrue(), "out-of-service taint should be automatically removed after 3 second timeout")controllers/selfnoderemediation_controller.go (1)
55-60: Minor naming inconsistency in event reason constants.
eventReasonAddNoScheduleandeventReasonRemoveNoSchedulehave slightly inconsistent naming patterns (AddNoSchedulevsRemoveNoScheduleTaint). Consider aligning them for consistency.♻️ Suggested alignment
- eventReasonAddNoSchedule = "AddNoSchedule" + eventReasonAddNoSchedule = "AddNoScheduleTaint"Or alternatively:
- eventReasonRemoveNoSchedule = "RemoveNoScheduleTaint" + eventReasonRemoveNoSchedule = "RemoveNoSchedule"
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
api/v1alpha1/selfnoderemediationconfig_webhook_test.gocontrollers/selfnoderemediation_controller.gocontrollers/tests/config/selfnoderemediationconfig_controller_test.gocontrollers/tests/controller/selfnoderemediation_controller_test.goe2e/self_node_remediation_test.goe2e/utils/command.goinstall/self-node-remediation-deamonset.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
- e2e/self_node_remediation_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
Repo: medik8s/self-node-remediation PR: 262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.
Applied to files:
controllers/tests/controller/selfnoderemediation_controller_test.gocontrollers/tests/config/selfnoderemediationconfig_controller_test.go
🧬 Code graph analysis (1)
controllers/tests/controller/selfnoderemediation_controller_test.go (3)
controllers/tests/shared/shared.go (2)
UnhealthyNodeName(27-27)Namespace(26-26)pkg/utils/taints.go (1)
TaintExists(37-44)controllers/selfnoderemediation_controller.go (1)
NodeNoScheduleTaint(67-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: build
🔇 Additional comments (11)
controllers/tests/config/selfnoderemediationconfig_controller_test.go (1)
124-126: LGTM! Toleration updated to NoSchedule semantics.The test expectation correctly reflects the new
NoScheduletaint effect, aligning with the controller'sNodeNoScheduleTaintdefinition.e2e/utils/command.go (1)
212-221: LGTM! Toleration added for NoSchedule taint.The test utility pod now correctly tolerates both
NoExecuteandNoScheduletaints withOperator: Exists, ensuring the pod can be scheduled on nodes tainted by SNR remediation.api/v1alpha1/selfnoderemediationconfig_webhook_test.go (1)
278-278: LGTM! Valid CR test updated to use NoSchedule taint effect.The test correctly validates that
NoScheduletolerations are accepted, consistent with the new taint semantics.install/self-node-remediation-deamonset.yaml (1)
101-110: LGTM! DaemonSet tolerations updated for NoSchedule semantics.The tolerations are correctly configured:
- New
medik8s.io/self-node-remediationtoleration withNoScheduleeffect matches the controller'sNodeNoScheduleTaintdefinition- Master/control-plane tolerations changed to
Operator: Existsis appropriate for broader compatibilitycontrollers/tests/controller/selfnoderemediation_controller_test.go (3)
12-12: LGTM! Import alias updated consistently.The import alias change from
v1tocorev1aligns with best practices for distinguishing Kubernetes core types.
200-210: LGTM! Test assertions updated for NoSchedule semantics.The verification calls and event assertions correctly reflect the new taint behavior:
verifyNoScheduleTaintExist()/verifyNoScheduleTaintRemoved()- Event reasons:
AddNoSchedule,RemoveNoScheduleTaint
567-578: LGTM! Helper functions correctly renamed and updated.The verification helpers properly reference
controllers.NodeNoScheduleTaintand usecorev1.Tainttypes consistently.controllers/selfnoderemediation_controller.go (4)
29-29: LGTM! Import alias standardized tocorev1.The import alias change follows the convention used in Kubernetes client-go examples and improves code clarity.
66-76: LGTM! Taint definitions correctly implement the NoSchedule migration.The changes properly address the PR objectives:
NodeNoScheduleTaintuses a distinctive key (medik8s.io/self-node-remediation) withNoScheduleeffect and empty valueOutOfServiceTaintcorrectly usescorev1.TaintNodeOutOfServiceconstant withNoExecuteeffect for pod GC functionalityThis mitigates the volume inconsistency risk from NoExecute taints triggering force-detachment.
744-784: LGTM! Taint management functions correctly implemented.Both
addNoScheduleTaintandremoveNoScheduleTaintfollow best practices:
- Idempotency checks before mutations
- Atomic patches for concurrent safety
- Proper event emission and logging
403-416: LGTM! Function signatures consistently use*corev1.Node.The remediation strategy functions properly accept
*corev1.Nodeparameters, maintaining type consistency throughout the controller.
20792a8 to
2c7aef6
Compare
Use a distinctive taint key which includes the remediator name without the usage of value. Keeping it empty. Mimic FAR's taint for the same use case. Furthermore, Modify SNR deamonset and reboot pod taint tolerations.Use a new taint key, effect and verify tolerations with exist operator
2c7aef6 to
c9b72b7
Compare
Why we need this PR
corev1for k8s.io/api/core/v1 rather thanv1existsand add taint toleration for any NoSchedule taint for the pod (in e2e test) who is doing a rebootChanges made
NoExecutetoNoScheduleWhich issue(s) this PR fixes
RHWA-511
Test plan
Modify unit-test and e2e for the new taint key and effect
Summary by CodeRabbit
Refactor
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.