
Fix draining, job completion detection and reboot sentinel leftovers #13


Merged
merged 1 commit from fix-drain-completion-and-sentinels into main on Jun 6, 2025

Conversation


@jimmykarily jimmykarily commented Jun 6, 2025

  • Drain Pods in the kube-system namespace too (e.g. traefik). They should also be rescheduled somewhere else before we reboot.
  • Use the status field that Kubernetes itself populates to detect whether a Job reached a terminal state (see the sketch below the description). We were more or less re-implementing it, and incorrectly: a Job with backoffLimit > 1 might still get a chance to succeed even if it has a failed Pod.
  • Clean up matching sentinel files before we start waiting for one to appear. If a NodeOpUpgrade is created with the same name and, for some reason, the old sentinel was never deleted, every future reboot Pod (named after the NodeOpUpgrade) would match the existing sentinel file and reboot immediately, stopping the upgrade Pod before it completes.

Part of: kairos-io/kairos#769
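For reference, a minimal sketch of what the condition-based check on the Job status looks like with the batch/v1 API. The helper name and exact placement are illustrative assumptions, not necessarily what the controller uses:

```go
package controller

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// isJobTerminal reports whether the Job reached a terminal state and, if so,
// whether it succeeded. Only the JobComplete/JobFailed conditions populated by
// Kubernetes count; a failed Pod alone is not terminal while backoffLimit
// retries remain. (Sketch for illustration, not the controller's exact code.)
func isJobTerminal(job *batchv1.Job) (terminal bool, succeeded bool) {
	for _, c := range job.Status.Conditions {
		if c.Status != corev1.ConditionTrue {
			continue
		}
		switch c.Type {
		case batchv1.JobComplete:
			return true, true
		case batchv1.JobFailed:
			return true, false
		}
	}
	return false, false
}
```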

@jimmykarily jimmykarily force-pushed the fix-drain-completion-and-sentinels branch from 0f4a25b to 376bf47 on June 6, 2025 08:22
sleep 10
done
`, rebootStatusCompleted, jobBaseName, rebootStatusCompleted),
`echo "=== Checking for existing reboot annotation ==="
jimmykarily (Contributor, Author) commented:
The changes here are:

  • interpolate the variables using + to make it clear which one goes where
  • clean up any leftover matching sentinels before we enter the while loop (see the sketch below)
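
Roughly the shape of that change, as a sketch only: jobBaseName and rebootStatusCompleted come from the snippet above, while the sentinel directory and file naming are made-up placeholders:

```go
package controller

// buildWaitScript sketches the two points above: values are spliced in with
// "+" so it is obvious which one lands where, and any leftover sentinel that
// matches this operation is removed before the wait loop starts.
// sentinelDir and the file-name pattern are illustrative assumptions.
func buildWaitScript(sentinelDir, jobBaseName, rebootStatusCompleted string) string {
	sentinel := sentinelDir + "/" + jobBaseName + "-" + rebootStatusCompleted

	return "" +
		"echo '=== Cleaning up leftover sentinels ==='\n" +
		"rm -f " + sentinelDir + "/" + jobBaseName + "-*\n" +
		"echo '=== Waiting for " + sentinel + " ==='\n" +
		"while [ ! -f " + sentinel + " ]; do\n" +
		"  sleep 10\n" +
		"done\n"
}
```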

A project Member commented:

maybe it's worth using templates for this?

https://pkg.go.dev/text/template
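
For reference, the kind of thing that suggestion points at; the template body and field names below are made up for illustration:

```go
package main

import (
	"os"
	"text/template"
)

// Illustrative only: rendering the reboot script with text/template instead of
// string concatenation. The placeholder names are not from the actual code.
var rebootScript = template.Must(template.New("reboot").Parse(`
rm -f {{ .SentinelDir }}/{{ .JobBaseName }}-*
while [ ! -f {{ .SentinelDir }}/{{ .JobBaseName }}-{{ .CompletedStatus }} ]; do
  sleep 10
done
`))

func main() {
	// Example rendering; the values are placeholders.
	_ = rebootScript.Execute(os.Stdout, struct {
		SentinelDir, JobBaseName, CompletedStatus string
	}{"/run/kairos/sentinels", "nodeop-upgrade", "reboot-completed"})
}
```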

jimmykarily (Contributor, Author) replied:

After we test it for a while, I think it might make more sense to implement it in Go and ship it in the operator image as a binary. That way we only need one image for all our operations and we don't have bash scripts lying around. For now, I'd say let's stick with this until we get a better idea of what we want to do next.

@@ -261,8 +261,7 @@ var _ = Describe("NodeOp Controller", func() {
 	Expect(nodeop.Status.NodeStatuses).ToNot(BeEmpty())

 	// Update Job status to simulate completion
-	job.Status.Succeeded = 1
-	Expect(k8sClient.Status().Update(ctx, job)).To(Succeed())
+	Expect(markJobAsCompleted(ctx, k8sClient, job)).To(Succeed())
jimmykarily (Contributor, Author) commented:

Setting job.Status directly is no longer enough to mark a Job as "completed", because we changed how completion is detected in the code. The tests now have helpers to mark Jobs as completed (or failed); a sketch of what such a helper has to do follows.
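The signature below comes from the diff above; the body is an assumption based on the condition-based detection described in this PR, not copied from suite_test.go:

```go
package controller

import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markJobAsCompleted sketches what the test helper has to do now: besides
// bumping the Succeeded counter, it adds the JobComplete condition, because
// that is what the controller looks at to decide the Job is terminal.
func markJobAsCompleted(ctx context.Context, c client.Client, job *batchv1.Job) error {
	now := metav1.Now()
	if job.Status.StartTime == nil {
		job.Status.StartTime = &now
	}
	job.Status.CompletionTime = &now
	job.Status.Succeeded = 1
	job.Status.Conditions = append(job.Status.Conditions, batchv1.JobCondition{
		Type:               batchv1.JobComplete,
		Status:             corev1.ConditionTrue,
		LastTransitionTime: now,
	})
	return c.Status().Update(ctx, job)
}
```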

Comment on lines -278 to -280 (removed):
-	if pod.Namespace == "kube-system" {
-		continue
-	}
jimmykarily (Contributor, Author) commented:

Let them drain. We had traefik and other Pods that we never allowed to be rescheduled on other Nodes, and there is no reason for that.
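In effect, the drain loop now treats kube-system workloads like any other evictable Pod. A sketch of that shape, assuming eviction goes through the policy/v1 Eviction subresource; the DaemonSet filter is a typical detail added for illustration, not taken from this diff:

```go
package controller

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// evictPods drains the given Pods with no special case for the kube-system
// namespace, so Pods like traefik get rescheduled elsewhere before the reboot.
// (Illustrative sketch; the real controller code may differ.)
func evictPods(ctx context.Context, c client.Client, pods []corev1.Pod) error {
	for i := range pods {
		pod := &pods[i]
		if isDaemonSetPod(pod) { // DaemonSet Pods cannot be rescheduled anyway
			continue
		}
		eviction := &policyv1.Eviction{
			ObjectMeta: metav1.ObjectMeta{Name: pod.Name, Namespace: pod.Namespace},
		}
		if err := c.SubResource("eviction").Create(ctx, pod, eviction); err != nil {
			return err
		}
	}
	return nil
}

func isDaemonSetPod(pod *corev1.Pod) bool {
	for _, owner := range pod.OwnerReferences {
		if owner.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```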

@jimmykarily jimmykarily marked this pull request as ready for review June 6, 2025 08:25
@jimmykarily jimmykarily moved this to Under review 🔍 in 🧙Issue tracking board Jun 6, 2025
@jimmykarily jimmykarily self-assigned this Jun 6, 2025
@jimmykarily jimmykarily requested review from a team and Copilot June 6, 2025 08:26
@Copilot Copilot AI left a comment:

Pull Request Overview

This PR fixes issues related to Pod draining, Job completion detection, and cleanup of stale reboot sentinel files. Key changes include helper functions to update Job status for both success and failure scenarios, refactored tests that use these helpers, and modifications in the NodeOp controller for improved handling of Pod draining and the reboot process.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| internal/controller/suite_test.go | Added helper functions to mark Jobs as completed/failed. |
| internal/controller/nodeopupgrade_controller_test.go | Updated tests to use the new helper functions for job status updates. |
| internal/controller/nodeop_controller_test.go | Replaced direct job status mutations with helper function calls. |
| internal/controller/nodeop_controller.go | Removed pod namespace skipping for draining, refined job status logic, and improved sentinel file cleanup and reboot pod creation commands. |

@mauromorales mauromorales (Member) left a comment:

One observation from my side, which I think is worth doing but is not a blocker, especially given the timing.


jimmykarily commented Jun 6, 2025

I tested this with 3 master and 2 worker nodes. I removed the label from one of the masters to check the label targeting as well. The upgrade (concurrency=1) worked as expected: it first upgraded all master nodes one by one and then the worker nodes, skipping the master node that didn't have the matching label.

Status of the cluster after the upgrade:

[screenshot: cluster status after the upgrade]

(Finally, I upgraded the remaining master node by adding the label back and restarting the operator Pod.)

@jimmykarily jimmykarily merged commit 0ad7abd into main Jun 6, 2025
7 checks passed
@github-project-automation github-project-automation bot moved this from Under review 🔍 to Done ✅ in 🧙Issue tracking board Jun 6, 2025
@jimmykarily jimmykarily deleted the fix-drain-completion-and-sentinels branch June 6, 2025 13:04