Member

@razo7 razo7 commented Dec 29, 2025

Why we need this PR

  • Use a distinctive, commonly used alias, corev1, for k8s.io/api/core/v1 rather than v1
  • Avoid manually misspelling common taint keys by taking them from k8s.io/api/core/v1
  • The remediation taint is meant to prevent workloads from being scheduled on the unhealthy node until the node has been rebooted and is considered healthy again. With the NoExecute effect, if SNR fails to fence the node within the 6 minutes quoted below, there is a risk of volume inconsistency. See https://kubernetes.io/docs/concepts/cluster-administration/node-shutdown/#storage-force-detach-on-timeout for more:

In any situation where a pod deletion has not succeeded for 6 minutes, Kubernetes will force detach volumes being unmounted if the node is unhealthy at that instant

  • Use a distinctive taint key that includes the remediator name and leave the value empty, mimicking FAR's taint for the same use case
  • Use the new taint key and effect, verify tolerations with the Exists operator, and add a toleration for any NoSchedule taint to the pod that performs the reboot (in the e2e test)
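
To illustrate the last two points, here is a minimal sketch of how a toleration with the Exists operator matches the new taint. The Taint and Toleration types and the tolerates helper are local stand-ins written for this example (assumptions, not the corev1 types or the SNR code); only the key medik8s.io/self-node-remediation and the NoSchedule effect come from this PR.

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes taint type (assumption for this sketch).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// Toleration is a local stand-in for the Kubernetes toleration type (assumption for this sketch).
type Toleration struct {
	Key      string // empty key with Exists matches every key
	Operator string // "Exists" or "Equal"; empty is treated as "Equal"
	Value    string
	Effect   string // empty effect matches every effect
}

// tolerates reports whether tol tolerates taint, following the usual
// Kubernetes matching rules: effect first, then key, then operator.
func tolerates(tol Toleration, taint Taint) bool {
	if tol.Effect != "" && tol.Effect != taint.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != taint.Key {
		return false
	}
	switch tol.Operator {
	case "Exists":
		return true // the value is ignored, so an empty taint value is fine
	case "", "Equal":
		return tol.Value == taint.Value
	default:
		return false
	}
}

func main() {
	remediation := Taint{Key: "medik8s.io/self-node-remediation", Effect: "NoSchedule"}

	byKey := Toleration{Key: "medik8s.io/self-node-remediation", Operator: "Exists", Effect: "NoSchedule"}
	anyNoSchedule := Toleration{Operator: "Exists", Effect: "NoSchedule"}
	onlyNoExecute := Toleration{Operator: "Exists", Effect: "NoExecute"}

	// prints: true true false
	fmt.Println(tolerates(byKey, remediation), tolerates(anyNoSchedule, remediation), tolerates(onlyNoExecute, remediation))
}
```

The third case shows why a pod that only tolerates NoExecute taints would no longer be schedulable on a remediated node once the effect changes to NoSchedule.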

Changes made

  • Refactor usages of v1 in favor of the corev1 alias for k8s.io/api/core/v1
  • Fetch common taint keys from corev1 (k8s.io/api/core/v1)
  • Modify SNR remediation taint effect from NoExecute to NoSchedule
  • Modify SNR remediation taint key and value
  • Modify SNR DaemonSet taint tolerations, and add a taint toleration for the reboot pod

Which issue(s) this PR fixes

RHWA-511

Test plan

Modify the unit tests and e2e tests for the new taint key and effect

Summary by CodeRabbit

  • Refactor

    • Standardized node/taint handling to use NoSchedule and updated remediation flow and messaging.
  • Tests

    • Updated unit and e2e tests to validate NoSchedule taint behavior and revised remediation verification.
  • Chores

    • Updated DaemonSet and test pod tolerations/manifest entries to align with the new NoSchedule/OutOfService taint semantics.


Contributor

openshift-ci bot commented Dec 29, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

Contributor

openshift-ci bot commented Dec 29, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: razo7

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


coderabbitai bot commented Dec 29, 2025


📥 Commits

Reviewing files that changed from the base of the PR and between 20792a8 and c9b72b7.

📒 Files selected for processing (3)
  • controllers/selfnoderemediation_controller.go
  • e2e/utils/command.go
  • install/self-node-remediation-deamonset.yaml
📝 Walkthrough


Swaps Kubernetes core imports from v1 to corev1 and converts remediation taint semantics from NoExecute to NoSchedule across controller code, tests, e2e, webhook/config tests, and the DaemonSet manifest.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Controller type & taint migration | controllers/selfnoderemediation_controller.go | Replaced v1 imports with corev1; changed many function signatures to use *corev1.Node/*corev1.Pod; migrated taint constants and handlers from NoExecute → NoSchedule (taint keys, add/remove helpers, events, logging). |
| Controller tests | controllers/tests/controller/selfnoderemediation_controller_test.go | Updated fixtures, helpers and assertions to corev1 types; replaced NoExecute taint checks/events with NoSchedule equivalents; adjusted helper signatures used by tests. |
| E2E test | e2e/self_node_remediation_test.go | Renamed and updated taint-check helper to verify NoSchedule taint removal; updated call sites and descriptions. |
| Webhook & config tests | api/v1alpha1/selfnoderemediationconfig_webhook_test.go, controllers/tests/config/selfnoderemediationconfig_controller_test.go | Adjusted expected toleration Effect from NoExecute to NoSchedule in valid-CR and controller config tests. |
| Install manifest | install/self-node-remediation-deamonset.yaml | Modified DaemonSet tolerations: added/changed toleration key to medik8s.io/self-node-remediation with Exists/NoSchedule; updated node-role toleration operators to Exists while keeping NoSchedule. |
| E2E test helpers | e2e/utils/command.go | Added Pod toleration entry for NoSchedule effect (Operator Exists) alongside existing NoExecute toleration. |

Sequence Diagram(s)

(omitted — changes are type/taint migrations and test/manifest updates without new multi-component control flows)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

lgtm

Suggested reviewers

  • slintes
  • clobrano
  • beekhof
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 3.70%, which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title 'Modify SNR Remediation Taint' directly and concisely summarizes the main change: updating the SNR remediation taint effect from NoExecute to NoSchedule and changing associated taint semantics across the codebase. |



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
controllers/selfnoderemediation_controller.go (1)

814-835: Function name and messages don't match the taint effect.

The function removeNoExecuteTaint removes NodeNoScheduleTaint which has TaintEffectNoSchedule, not TaintEffectNoExecute. Additionally:

  • Line 831: Log message says "NoExecute taint removed" but the effect is NoSchedule
  • Line 832: Event message says "remove NoExecute taint from healthy remediated node" but the effect is NoSchedule

This creates significant confusion about what the code is actually doing.

🔎 Proposed fix
-func (r *SelfNodeRemediationReconciler) removeNoExecuteTaint(node *corev1.Node) error {
+func (r *SelfNodeRemediationReconciler) removeRemediationTaint(node *corev1.Node) error {
 	if !utils.TaintExists(node.Spec.Taints, NodeNoScheduleTaint) {
 		return nil
 	}
 
 	patch := client.MergeFrom(node.DeepCopy())
 	if taints, deleted := utils.DeleteTaint(node.Spec.Taints, NodeNoScheduleTaint); !deleted {
 		r.logger.Info("Failed to remove taint from node, taint not found", "node name", node.Name, "taint key", NodeNoScheduleTaint.Key, "taint effect", NodeNoScheduleTaint.Effect)
 		return nil
 	} else {
 		node.Spec.Taints = taints
 	}
 
 	if err := r.Client.Patch(context.Background(), node, patch); err != nil {
 		r.logger.Error(err, "Failed to remove taint from node,", "node name", node.Name, "taint key", NodeNoScheduleTaint.Key, "taint effect", NodeNoScheduleTaint.Effect)
 		return err
 	}
-	r.logger.Info("NoExecute taint removed", "new taints", node.Spec.Taints)
-	events.NormalEvent(r.Recorder, node, eventReasonRemoveNoExecute, "Remediation process - remove NoExecute taint from healthy remediated node")
+	r.logger.Info("Remediation taint removed", "new taints", node.Spec.Taints)
+	events.NormalEvent(r.Recorder, node, eventReasonRemoveNoExecute, "Remediation process - remove remediation taint from healthy remediated node")
 
 	return nil
 }

Note: You'll also need to update the call site at line 579 and consider renaming the event reason constant.
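
For readers unfamiliar with the helpers used in the proposed fix, the following is a rough sketch of the "check, then delete" pattern it relies on. The Taint type and the taintExists and deleteTaint functions here are hypothetical stand-ins written for illustration; they are not the repository's pkg/utils implementation, and matching on key plus effect (ignoring value) is an assumption of this sketch.

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes taint type (assumption for this sketch).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// sameTaint treats two taints as identical when key and effect match.
// Ignoring the value is an assumption of this sketch.
func sameTaint(a, b Taint) bool {
	return a.Key == b.Key && a.Effect == b.Effect
}

// taintExists reports whether want is present in taints.
func taintExists(taints []Taint, want Taint) bool {
	for _, t := range taints {
		if sameTaint(t, want) {
			return true
		}
	}
	return false
}

// deleteTaint returns a copy of taints without target, plus a flag
// telling the caller whether anything was actually removed.
func deleteTaint(taints []Taint, target Taint) ([]Taint, bool) {
	kept := make([]Taint, 0, len(taints))
	deleted := false
	for _, t := range taints {
		if sameTaint(t, target) {
			deleted = true
			continue
		}
		kept = append(kept, t)
	}
	return kept, deleted
}

func main() {
	remediation := Taint{Key: "medik8s.io/self-node-remediation", Effect: "NoSchedule"}
	unreachable := Taint{Key: "node.kubernetes.io/unreachable", Effect: "NoExecute"}
	nodeTaints := []Taint{unreachable, remediation}

	if taintExists(nodeTaints, remediation) {
		updated, deleted := deleteTaint(nodeTaints, remediation)
		fmt.Println("deleted:", deleted, "remaining taints:", len(updated))
	}
}
```

Because the helper is keyed on the taint itself rather than on its effect, an effect-neutral name such as removeRemediationTaint stays accurate if the effect changes again.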

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8a73019 and baa47de.

📒 Files selected for processing (3)
  • controllers/selfnoderemediation_controller.go
  • controllers/tests/controller/selfnoderemediation_controller_test.go
  • e2e/self_node_remediation_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
Repo: medik8s/self-node-remediation PR: 262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.

Applied to files:

  • controllers/tests/controller/selfnoderemediation_controller_test.go
🧬 Code graph analysis (2)
e2e/self_node_remediation_test.go (1)
controllers/selfnoderemediation_controller.go (1)
  • NodeNoScheduleTaint (74-78)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)
controllers/selfnoderemediation_controller.go (1)
  • NodeNoScheduleTaint (74-78)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (1)
controllers/selfnoderemediation_controller.go (1)

74-78: Taint effect change looks correct.

The change from TaintEffectNoExecute to TaintEffectNoSchedule aligns with the PR objective to prevent volume consistency issues. NoSchedule will prevent new pods from being scheduled without force-evicting existing pods.
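
As a small illustration of the difference described above, the sketch below models what an untolerated taint means for a pod under each of the two effects discussed in this PR. The Effect type and the outcome function are hypothetical names for this example only, and PreferNoSchedule is deliberately not modeled.

```go
package main

import "fmt"

// Effect is a local stand-in for the Kubernetes taint effect (assumption for this sketch).
type Effect string

const (
	NoSchedule Effect = "NoSchedule"
	NoExecute  Effect = "NoExecute"
)

// outcome describes what happens to a pod that does NOT tolerate a taint
// with the given effect. Only NoSchedule and NoExecute are modeled.
func outcome(effect Effect, podAlreadyRunning bool) string {
	switch {
	case !podAlreadyRunning:
		// Both effects keep new pods off the node.
		return "not scheduled"
	case effect == NoExecute:
		// Only NoExecute also acts on pods that are already running.
		return "evicted"
	default:
		return "keeps running"
	}
}

func main() {
	for _, e := range []Effect{NoSchedule, NoExecute} {
		fmt.Printf("%s: new pod -> %s, running pod -> %s\n", e, outcome(e, false), outcome(e, true))
	}
}
```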

@razo7 razo7 force-pushed the timeadded-oos-taint branch 2 times, most recently from 9f9cd60 to 31dd039 Compare December 29, 2025 14:25
Member Author

razo7 commented Dec 31, 2025

/test 4.21-openshift-e2e


-	NodeNoExecuteTaint = &v1.Taint{
+	NodeNoScheduleTaint = &corev1.Taint{
 		Key: "medik8s.io/remediation",
Member


please update the key as discussed

@razo7 razo7 changed the title Change SNR Remediation Taint Effect to NoSchedule Modify SNR Remediation Taint Jan 12, 2026
razo7 added 3 commits January 12, 2026 13:22
Use a distinctive, common alias for k8s.io/api/core/v1 rather than 'v1'
Avoid manual misspelling of the common oos taint key
The remediation taint is meant to prevent workloads from being scheduled on the unhealthy node until the node has been rebooted and is considered healthy. Using the NoExecute effect when SNR fails to fence the node within those 6 minutes could lead to a risk of volume inconsistency. In any situation where a pod deletion has not succeeded for 6 minutes, Kubernetes will force detach volumes being unmounted if the node is unhealthy at that instant
@razo7 razo7 force-pushed the timeadded-oos-taint branch from 45082a5 to 20792a8 Compare January 12, 2026 11:32
Member Author

razo7 commented Jan 12, 2026

/test 4.21-openshift-e2e


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
controllers/tests/controller/selfnoderemediation_controller_test.go (1)

365-369: Consider using the controller's OutOfServiceTaint constant for consistency.

The hardcoded taint values should ideally reference controllers.OutOfServiceTaint to avoid drift if the taint definition changes.

♻️ Suggested refactor
 				Eventually(func() bool {
 					err := k8sClient.Get(context.Background(), nodeKey, updatedNode)
 					if err != nil {
 						return false
 					}
-					return !utils.TaintExists(updatedNode.Spec.Taints, &corev1.Taint{
-						Key:    corev1.TaintNodeOutOfService,
-						Value:  "nodeshutdown",
-						Effect: corev1.TaintEffectNoExecute,
-					})
+					return !utils.TaintExists(updatedNode.Spec.Taints, controllers.OutOfServiceTaint)
 				}, 10*time.Second, 250*time.Millisecond).Should(BeTrue(), "out-of-service taint should be automatically removed after 3 second timeout")
controllers/selfnoderemediation_controller.go (1)

55-60: Minor naming inconsistency in event reason constants.

eventReasonAddNoSchedule and eventReasonRemoveNoSchedule have slightly inconsistent naming patterns (AddNoSchedule vs RemoveNoScheduleTaint). Consider aligning them for consistency.

♻️ Suggested alignment
-	eventReasonAddNoSchedule                = "AddNoSchedule"
+	eventReasonAddNoSchedule                = "AddNoScheduleTaint"

Or alternatively:

-	eventReasonRemoveNoSchedule             = "RemoveNoScheduleTaint"
+	eventReasonRemoveNoSchedule             = "RemoveNoSchedule"
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 45082a5 and 20792a8.

📒 Files selected for processing (7)
  • api/v1alpha1/selfnoderemediationconfig_webhook_test.go
  • controllers/selfnoderemediation_controller.go
  • controllers/tests/config/selfnoderemediationconfig_controller_test.go
  • controllers/tests/controller/selfnoderemediation_controller_test.go
  • e2e/self_node_remediation_test.go
  • e2e/utils/command.go
  • install/self-node-remediation-deamonset.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • e2e/self_node_remediation_test.go
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-06-18T11:38:24.387Z
Learnt from: slintes
Repo: medik8s/self-node-remediation PR: 262
File: pkg/peerhealth/client_server_test.go:165-189
Timestamp: 2025-06-18T11:38:24.387Z
Learning: In the medik8s/self-node-remediation project, the Ginkgo tests in pkg/peerhealth/client_server_test.go don't run in parallel, making shared mutable state like the `reader` variable safe to use across test cases without race condition concerns.

Applied to files:

  • controllers/tests/controller/selfnoderemediation_controller_test.go
  • controllers/tests/config/selfnoderemediationconfig_controller_test.go
🧬 Code graph analysis (1)
controllers/tests/controller/selfnoderemediation_controller_test.go (3)
controllers/tests/shared/shared.go (2)
  • UnhealthyNodeName (27-27)
  • Namespace (26-26)
pkg/utils/taints.go (1)
  • TaintExists (37-44)
controllers/selfnoderemediation_controller.go (1)
  • NodeNoScheduleTaint (67-70)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: build
🔇 Additional comments (11)
controllers/tests/config/selfnoderemediationconfig_controller_test.go (1)

124-126: LGTM! Toleration updated to NoSchedule semantics.

The test expectation correctly reflects the new NoSchedule taint effect, aligning with the controller's NodeNoScheduleTaint definition.

e2e/utils/command.go (1)

212-221: LGTM! Toleration added for NoSchedule taint.

The test utility pod now correctly tolerates both NoExecute and NoSchedule taints with Operator: Exists, ensuring the pod can be scheduled on nodes tainted by SNR remediation.

api/v1alpha1/selfnoderemediationconfig_webhook_test.go (1)

278-278: LGTM! Valid CR test updated to use NoSchedule taint effect.

The test correctly validates that NoSchedule tolerations are accepted, consistent with the new taint semantics.

install/self-node-remediation-deamonset.yaml (1)

101-110: LGTM! DaemonSet tolerations updated for NoSchedule semantics.

The tolerations are correctly configured:

  • New medik8s.io/self-node-remediation toleration with NoSchedule effect matches the controller's NodeNoScheduleTaint definition
  • Master/control-plane tolerations changed to Operator: Exists is appropriate for broader compatibility
controllers/tests/controller/selfnoderemediation_controller_test.go (3)

12-12: LGTM! Import alias updated consistently.

The import alias change from v1 to corev1 aligns with best practices for distinguishing Kubernetes core types.


200-210: LGTM! Test assertions updated for NoSchedule semantics.

The verification calls and event assertions correctly reflect the new taint behavior:

  • verifyNoScheduleTaintExist() / verifyNoScheduleTaintRemoved()
  • Event reasons: AddNoSchedule, RemoveNoScheduleTaint

567-578: LGTM! Helper functions correctly renamed and updated.

The verification helpers properly reference controllers.NodeNoScheduleTaint and use corev1.Taint types consistently.

controllers/selfnoderemediation_controller.go (4)

29-29: LGTM! Import alias standardized to corev1.

The import alias change follows the convention used in Kubernetes client-go examples and improves code clarity.


66-76: LGTM! Taint definitions correctly implement the NoSchedule migration.

The changes properly address the PR objectives:

  • NodeNoScheduleTaint uses a distinctive key (medik8s.io/self-node-remediation) with NoSchedule effect and empty value
  • OutOfServiceTaint correctly uses corev1.TaintNodeOutOfService constant with NoExecute effect for pod GC functionality

This mitigates the volume inconsistency risk from NoExecute taints triggering force-detachment.


744-784: LGTM! Taint management functions correctly implemented.

Both addNoScheduleTaint and removeNoScheduleTaint follow best practices:

  • Idempotency checks before mutations
  • Atomic patches for concurrent safety
  • Proper event emission and logging
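
The idempotency check mentioned in the first bullet can be sketched roughly as follows. The Taint type and the addTaint function are hypothetical stand-ins for this example, not the controller's addNoScheduleTaint; the returned flag is an assumption about how a caller might decide whether a patch and an event are needed at all.

```go
package main

import "fmt"

// Taint is a local stand-in for the Kubernetes taint type (assumption for this sketch).
type Taint struct {
	Key    string
	Value  string
	Effect string
}

// addTaint appends t unless a taint with the same key and effect is
// already present. The boolean tells the caller whether the list changed,
// so a repeated reconcile can skip the API call and the event.
func addTaint(taints []Taint, t Taint) ([]Taint, bool) {
	for _, existing := range taints {
		if existing.Key == t.Key && existing.Effect == t.Effect {
			return taints, false
		}
	}
	// Copy before appending so the caller's slice is never mutated.
	updated := make([]Taint, len(taints), len(taints)+1)
	copy(updated, taints)
	return append(updated, t), true
}

func main() {
	remediation := Taint{Key: "medik8s.io/self-node-remediation", Effect: "NoSchedule"}

	taints, changed := addTaint(nil, remediation)
	fmt.Println("first call changed:", changed, "taints:", len(taints))

	taints, changed = addTaint(taints, remediation)
	fmt.Println("second call changed:", changed, "taints:", len(taints))
}
```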

403-416: LGTM! Function signatures consistently use *corev1.Node.

The remediation strategy functions properly accept *corev1.Node parameters, maintaining type consistency throughout the controller.

@razo7 razo7 force-pushed the timeadded-oos-taint branch from 20792a8 to 2c7aef6 Compare January 12, 2026 13:22
Use a distinctive taint key that includes the remediator name and leave the value empty, mimicking FAR's taint for the same use case. Furthermore, modify the SNR DaemonSet and reboot pod taint tolerations. Use the new taint key and effect, and verify tolerations with the Exists operator
@razo7 razo7 force-pushed the timeadded-oos-taint branch from 2c7aef6 to c9b72b7 Compare January 12, 2026 13:24