
Fix draining, job completion detection and reboot sentinel leftovers #13


Merged
merged 1 commit from fix-drain-completion-and-sentinels into main on Jun 6, 2025

Conversation


@jimmykarily jimmykarily commented Jun 6, 2025

  • Drain Pods in the kube-system namespace too (e.g. traefik). They should also be rescheduled somewhere else before we reboot.
  • Use the status field that Kubernetes itself populates to detect whether a Job reached a terminal state (see the sketch below the description). We were more or less re-implementing it, and incorrectly: a Job with backoffLimit > 1 might still get a chance to succeed even if it has a failed Pod.
  • Clean up matching sentinel files before we start waiting for one to appear. If a NodeOpUpgrade is created with the same name and, for some reason, the old sentinel was never deleted, every future reboot Pod (named after the NodeOpUpgrade) would match the existing sentinel file and reboot immediately, stopping the upgrade Pod before it completes.

Part of: kairos-io/kairos#769
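For reference, a minimal sketch of what the condition-based check on the Job status looks like with the batch/v1 API. The helper name and exact placement are illustrative assumptions, not necessarily what the controller uses:

```go
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobTerminal reports whether the Job reached a terminal state and, if so,
// whether it succeeded. Only the JobComplete/JobFailed conditions populated by
// Kubernetes count; a failed Pod alone is not terminal while backoffLimit
// retries remain. (Sketch for illustration, not the controller's exact code.)
func isJobTerminal(job *batchv1.Job) (terminal bool, succeeded bool) {
	for _, c := range job.Status.Conditions {
		if c.Status != corev1.ConditionTrue {
			continue
		}
		switch c.Type {
		case batchv1.JobComplete:
			return true, true
		case batchv1.JobFailed:
			return true, false
		}
	}
	return false, false
}
```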

@jimmykarily jimmykarily force-pushed the fix-drain-completion-and-sentinels branch from 0f4a25b to 376bf47 on June 6, 2025 08:22
sleep 10
done
`, rebootStatusCompleted, jobBaseName, rebootStatusCompleted),
`echo "=== Checking for existing reboot annotation ==="
jimmykarily (Contributor, Author) commented:
The changes here are:

  • interpolate the variables using + to make it clear which one goes where
  • clean up any leftover matching sentinels before we enter the while loop (see the sketch below)
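
Roughly the shape of that change, as a sketch only: jobBaseName and rebootStatusCompleted come from the snippet above, while the sentinel directory and file naming are made-up placeholders:

```go
package controller

// buildWaitScript sketches the two points above: values are spliced in with
// "+" so it is obvious which one lands where, and any leftover sentinel that
// matches this operation is removed before the wait loop starts.
// sentinelDir and the file-name pattern are illustrative assumptions.
func buildWaitScript(sentinelDir, jobBaseName, rebootStatusCompleted string) string {
	sentinel := sentinelDir + "/" + jobBaseName + "-" + rebootStatusCompleted

	return "" +
		"echo '=== Cleaning up leftover sentinels ==='\n" +
		"rm -f " + sentinelDir + "/" + jobBaseName + "-*\n" +
		"echo '=== Waiting for " + sentinel + " ==='\n" +
		"while [ ! -f " + sentinel + " ]; do\n" +
		"  sleep 10\n" +
		"done\n"
}
```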

A project Member commented:

maybe it's worth using templates for this?

https://pkg.go.dev/text/template
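
For reference, the kind of thing that suggestion points at; the template body and field names below are made up for illustration:

```go
package main

import (
	"os"
	"text/template"
)

// Illustrative only: rendering the reboot script with text/template instead of
// string concatenation. The placeholder names are not from the actual code.
var rebootScript = template.Must(template.New("reboot").Parse(`
rm -f {{ .SentinelDir }}/{{ .JobBaseName }}-*
while [ ! -f {{ .SentinelDir }}/{{ .JobBaseName }}-{{ .CompletedStatus }} ]; do
  sleep 10
done
`))

func main() {
	// Example rendering; the values are placeholders.
	_ = rebootScript.Execute(os.Stdout, struct {
		SentinelDir, JobBaseName, CompletedStatus string
	}{"/run/kairos/sentinels", "nodeop-upgrade", "reboot-completed"})
}
```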

jimmykarily (Contributor, Author) replied:

After we test it for a while, I think it might make more sense to implement it in Go and ship it in the operator image as a binary. That way we only need one image for all our operations and we don't have bash scripts lying around. For now, I'd say let's stick with this until we get a better idea of what we want to do next.

@@ -261,8 +261,7 @@ var _ = Describe("NodeOp Controller", func() {
 	Expect(nodeop.Status.NodeStatuses).ToNot(BeEmpty())

 	// Update Job status to simulate completion
-	job.Status.Succeeded = 1
-	Expect(k8sClient.Status().Update(ctx, job)).To(Succeed())
+	Expect(markJobAsCompleted(ctx, k8sClient, job)).To(Succeed())
jimmykarily (Contributor, Author) commented:

Setting job.Status directly is no longer enough to mark a Job as "completed", because we changed how completion is detected in the code. The tests now have helpers to mark Jobs as completed (or failed); a sketch of what such a helper has to do follows.
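The signature below comes from the diff above; the body is an assumption based on the condition-based detection described in this PR, not copied from suite_test.go:

```go
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markJobAsCompleted sketches what the test helper has to do now: besides
// bumping the Succeeded counter, it adds the JobComplete condition, because
// that is what the controller looks at to decide the Job is terminal.
func markJobAsCompleted(ctx context.Context, c client.Client, job *batchv1.Job) error {
	now := metav1.Now()
	if job.Status.StartTime == nil {
		job.Status.StartTime = &now
	}
	job.Status.CompletionTime = &now
	job.Status.Succeeded = 1
	job.Status.Conditions = append(job.Status.Conditions, batchv1.JobCondition{
		Type:               batchv1.JobComplete,
		Status:             corev1.ConditionTrue,
		LastTransitionTime: now,
	})
	return c.Status().Update(ctx, job)
}
```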

Comment on lines -278 to -280 (removed):
-	if pod.Namespace == "kube-system" {
-		continue
-	}
jimmykarily (Contributor, Author) commented:

Let them drain. We had traefik and other Pods that we never allowed to be rescheduled on other Nodes, and there is no reason for that.
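In effect, the drain loop now treats kube-system workloads like any other evictable Pod. A sketch of that shape, assuming eviction goes through the policy/v1 Eviction subresource; the DaemonSet filter is a typical detail added for illustration, not taken from this diff:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// evictPods drains the given Pods with no special case for the kube-system
// namespace, so Pods like traefik get rescheduled elsewhere before the reboot.
// (Illustrative sketch; the real controller code may differ.)
func evictPods(ctx context.Context, c client.Client, pods []corev1.Pod) error {
	for i := range pods {
		pod := &pods[i]
		if isDaemonSetPod(pod) { // DaemonSet Pods cannot be rescheduled anyway
			continue
		}
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := c.SubResource("eviction").Create(ctx, pod, eviction); err != nil {
			return err
		}
	}
	return nil
}

func isDaemonSetPod(pod *corev1.Pod) bool {
	for _, owner := range pod.OwnerReferences {
		if owner.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```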

@jimmykarily jimmykarily marked this pull request as ready for review June 6, 2025 08:25
@jimmykarily jimmykarily moved this to Under review 🔍 in 🧙Issue tracking board Jun 6, 2025
@jimmykarily jimmykarily self-assigned this Jun 6, 2025
@jimmykarily jimmykarily requested review from a team and Copilot June 6, 2025 08:26
@Copilot Copilot AI left a comment:

Pull Request Overview

This PR fixes issues related to Pod draining, Job completion detection, and cleanup of stale reboot sentinel files. Key changes include helper functions to update Job status for both success and failure scenarios, refactored tests that use these helpers, and modifications in the NodeOp controller for improved handling of Pod draining and the reboot process.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| internal/controller/suite_test.go | Added helper functions to mark Jobs as completed/failed. |
| internal/controller/nodeopupgrade_controller_test.go | Updated tests to use the new helper functions for job status updates. |
| internal/controller/nodeop_controller_test.go | Replaced direct job status mutations with helper function calls. |
| internal/controller/nodeop_controller.go | Removed pod namespace skipping for draining, refined job status logic, and improved sentinel file cleanup and reboot pod creation commands. |

@mauromorales mauromorales (Member) left a comment:

One observation from my side, which I think is worth doing but is not a blocker, especially given the timing.


jimmykarily commented Jun 6, 2025

I tested this with 3 master and 2 worker nodes. I removed the label from one of the masters to check the label targeting as well. The upgrade (concurrency=1) worked as expected: it first upgraded all master nodes one by one and then the worker nodes, skipping the master node that didn't have the matching label.

Status of the cluster after the upgrade:

[screenshot: cluster status after the upgrade]

(Finally, I upgraded the remaining master node by adding the label back and restarting the operator Pod.)

@jimmykarily jimmykarily merged commit 0ad7abd into main Jun 6, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from Under review 🔍 to Done ✅ in 🧙Issue tracking board Jun 6, 2025
@jimmykarily jimmykarily deleted the fix-drain-completion-and-sentinels branch June 6, 2025 13:04