Fix draining, job completion detection and reboot sentinel leftovers #13
Conversation
- Drain Pods in the kube-system namespace too (e.g. traefik). They should also be scheduled somewhere else before we reboot.
- Use the Kubernetes-populated field to detect whether a Job reached a terminal state (see the sketch after this list). We were half re-implementing that logic, and in the wrong way: a Job with backoffLimit > 1 might still get a chance to succeed even if it has a failed Pod.
- Clean up matching sentinel files before we start waiting for one to appear. It happened that if we created a NodeOpUpgrade with the same name and, for some reason, the deletion of the sentinel didn't happen, all future reboot Pods (with the same name as the NodeOpUpgrade) would match the existing sentinel file and reboot immediately, stopping the upgrade Pod before it completes.

Signed-off-by: Dimitris Karakasilis <[email protected]>
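To illustrate the second bullet: terminal-state detection can rely on the `JobComplete`/`JobFailed` conditions that the Job controller itself populates, rather than on counting failed Pods. This is a minimal sketch under that assumption; the function name `jobIsTerminal` and its exact shape are illustrative, not the PR's actual code:

```go
import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

// jobIsTerminal (hypothetical name) reports whether the Job reached a
// terminal state, based on the conditions Kubernetes populates. A failed
// Pod alone is not enough: with backoffLimit > 1 the Job may still retry
// and eventually succeed.
func jobIsTerminal(job *batchv1.Job) (terminal bool, failed bool) {
	for _, cond := range job.Status.Conditions {
		if cond.Status != corev1.ConditionTrue {
			continue
		}
		switch cond.Type {
		case batchv1.JobComplete:
			return true, false
		case batchv1.JobFailed:
			return true, true
		}
	}
	return false, false
}
```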
force-pushed from 0f4a25b to 376bf47 (Compare)
	sleep 10
done
`, rebootStatusCompleted, jobBaseName, rebootStatusCompleted),
`echo "=== Checking for existing reboot annotation ==="
The changes here are:
- Interpolate the variables using `+`, to make it clear which one goes where (sketched below)
- Clean up any leftover matching sentinels before we enter the `while` loop
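A minimal sketch of that pattern, for context. `jobBaseName` comes from the surrounding code; the sentinel path and the function name are assumptions for illustration, not the controller's real values:

```go
// buildWaitScript (hypothetical) shows the pattern described above:
// concatenation with + makes it obvious which variable lands where, and
// any leftover sentinel is removed before the wait loop starts.
func buildWaitScript(jobBaseName string) string {
	// Assumed sentinel directory, for illustration only.
	sentinel := "/usr/local/.kairos/sentinels/" + jobBaseName
	return "echo '=== Waiting for reboot sentinel ==='\n" +
		// Remove leftovers from a previous run with the same name, so
		// the loop cannot match a stale file and reboot prematurely.
		"rm -f " + sentinel + "\n" +
		"while [ ! -f " + sentinel + " ]; do\n" +
		"  sleep 10\n" +
		"done\n"
}
```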
Maybe it's worth using templates for this?
After we test it for a while, I think it might make more sense to implement it in Go and put it in the operator image as a binary. This way we only need one image for all our operations and we don't have bash scripts lying around. For now, I'd say let's stick with this until we get a better idea of what we want to do next.
@@ -261,8 +261,7 @@ var _ = Describe("NodeOp Controller", func() {
 			Expect(nodeop.Status.NodeStatuses).ToNot(BeEmpty())
 
 			// Update Job status to simulate completion
-			job.Status.Succeeded = 1
-			Expect(k8sClient.Status().Update(ctx, job)).To(Succeed())
+			Expect(markJobAsCompleted(ctx, k8sClient, job)).To(Succeed())
Setting the job.Status is no longer enough to mark it as "completed" because we changed how we detect completion in the code. Now we have test helpers to mark Jobs as completed.
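For reference, such a helper presumably does something along these lines. This is a sketch of an assumed shape, not the exact code from suite_test.go: the key point is that it sets the Kubernetes-populated `JobComplete` condition in addition to the counter, since that is what the controller now checks:

```go
import (
	"context"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Sketch of a test helper that marks a Job as completed the way the
// controller now detects it: via the JobComplete condition, not just
// the Succeeded counter. The real helper may differ.
func markJobAsCompleted(ctx context.Context, c client.Client, job *batchv1.Job) error {
	job.Status.Succeeded = 1
	job.Status.Conditions = append(job.Status.Conditions, batchv1.JobCondition{
		Type:   batchv1.JobComplete,
		Status: corev1.ConditionTrue,
	})
	return c.Status().Update(ctx, job)
}
```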
-	if pod.Namespace == "kube-system" {
-		continue
-	}
Let them drain. We had traefik and other Pods that we never allowed to be rescheduled on other Nodes. There's no reason to do that.
Pull Request Overview
This PR fixes issues related to pod draining, job completion detection, and cleanup of stale reboot sentinel files. Key changes include the introduction of helper functions to update Job status for both success and failure scenarios, refactoring tests to use these helpers, and modifications in the NodeOp controller to improve handling of pod draining and the reboot process.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| internal/controller/suite_test.go | Added helper functions to mark Jobs as completed/failed. |
| internal/controller/nodeopupgrade_controller_test.go | Updated tests to use the new helper functions for job status updates. |
| internal/controller/nodeop_controller_test.go | Replaced direct job status mutations with helper function calls. |
| internal/controller/nodeop_controller.go | Removed pod namespace skipping for draining, refined job status logic, and improved sentinel file cleanup and reboot pod creation commands. |
One observation from my side which I think is worth addressing, but it's not a blocker, especially with the timing.
Part of: kairos-io/kairos#769