Controller shutdown doesn't wait for reconciles to finish #491
Comments
Hi, we have a single reconciliation context that's propagated throughout the reconciliation. If the context is cancelled after the reconciliation has started, any new operation that's part of the reconciliation will not succeed. I think it's better to terminate than to try to finish the operations by creating a new, independent context. As far as I know, that's how most Go programs handle context cancellation, unless there's a good reason to do otherwise.

I would have expected the notifications to fail as well, because sending one involves an HTTP request to notification-controller. But looking at the code, the notification event emitter doesn't take any context, hence it is able to send the notification request successfully; see the event code in https://github.com/fluxcd/pkg/blob/ce1f99f03d6756cde86b2fa9494fe13804296bac/runtime/events/recorder.go#L119. It doesn't take a context because it implements the upstream Kubernetes event recorder interface, see https://github.com/kubernetes/client-go/blob/v12.0.0/tools/record/event.go#L109, which also doesn't take a context. Our event recorder emits both a Kubernetes event and a Flux event that is sent to notification-controller, so if we decide not to emit events at all during context cancellation, we also stop emitting the Kubernetes event, which may be useful for some users.

So, although we can check whether the given context has been cancelled with:
```go
if context.Cause(ctx) == context.Canceled {
	return
}
```

I'm afraid that'll have the side-effect of also skipping logging and emitting Kubernetes events. One could argue that these logs and events aren't useful anyway, but I'd like to hear more feedback before we make such a change. At present I think it's okay to do, but I would like to hear others' opinions too.

For now, since the issue is the spam of failure notifications through notification-controller, one way to work around it is to exclude them at the notification-controller level by setting an event exclusion on the respective Alert, see https://fluxcd.io/flux/components/notification/alerts/#event-exclusion.
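To make that trade-off concrete, here is a minimal, self-contained Go sketch (placeholder names such as `emitFailureEvent` and `sendNotification`, not the Flux runtime API): a cancellation check placed at the very top would skip everything, whereas gating only the outbound notification keeps the local log line (standing in for the Kubernetes event) intact.

```go
package main

import (
	"context"
	"fmt"
)

// emitFailureEvent is an illustrative placeholder, not the Flux event recorder.
// It logs unconditionally (a stand-in for the Kubernetes event) and skips only
// the external notification when the reconcile context has been cancelled.
func emitFailureEvent(ctx context.Context, object, reason, msg string) {
	fmt.Printf("event: object=%s reason=%s message=%s\n", object, reason, msg)

	// Gate only the call to notification-controller; a check placed before the
	// whole function would silence the Kubernetes event and the log line too.
	if context.Cause(ctx) == context.Canceled {
		return
	}
	sendNotification(object, reason, msg)
}

// sendNotification stands in for the HTTP request to notification-controller.
func sendNotification(object, reason, msg string) {
	fmt.Printf("notification sent: object=%s reason=%s\n", object, reason)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate controller shutdown cancelling the reconcile context
	emitFailureEvent(ctx, "imagerepository/podinfo", "ReconciliationFailed", "context canceled")
}
```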
I spent some more time thinking about this and also checked whether the other Flux controllers show the same behavior. It seems they do: if a controller pod is terminated in the middle of a reconciliation, the same kind of failure notification is sent. For example, for source-controller I got the following alert:
When the controller starts again, if nothing new happens (no change in the result of reconciliation), no new notification is sent. We have a failure-recovery notification mechanism in place: it works by marking objects as failed, and when an object is reconciled the next time, a notification about the success is sent after comparing the object's current state with the previously known state. This doesn't work in this particular case because of the cancelled context: the object being marked as failed doesn't get persisted in Kubernetes, since requests to kube-apiserver fail when sent with a cancelled context. When the controller starts again and gets the object, it has the last Ready state, not Failed, so no recovery notification is sent. I'm in favor of implementing this across all the controllers. Thanks @rsafonseca for highlighting this issue.
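As a rough illustration of why the failure state is never persisted, here is a hedged controller-runtime-style sketch (assumed helper name and wiring, not the actual image-reflector-controller code): the status update issued with the already-cancelled reconcile context returns an error, so the object keeps its last Ready condition across the restart.

```go
package example

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markFailed is a hypothetical helper, not Flux code. When ctx has already
// been cancelled by controller shutdown, the status update below returns an
// error, so the "failed" state never reaches the API server and the recovery
// notification is never triggered on the next reconciliation.
func markFailed(ctx context.Context, c client.Client, obj client.Object) error {
	// Assumes obj's status/conditions have already been mutated in memory to
	// reflect the failure; only the persistence step is shown here.
	if err := c.Status().Update(ctx, obj); err != nil {
		return fmt.Errorf("persisting failure state: %w", err)
	}
	return nil
}
```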
Yes, I fully support this. These types of events are just noise, as all controllers will resume work after a restart.
Every time the image-reflector-controller pod goes down for any reason, it doesn't wait for running reconciliations to finish; it immediately cancels the contexts of the running reconciliations.
This causes the running reconciliations to fail and return an error, which in turn is propagated to all channels defined in notification-controller, generating a lot of noise.
Ideally, the running reconciliations should be allowed to finish, or at worst these expected shutdown context-cancellation errors should not be propagated to the notification channels.
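For illustration, here is a minimal, self-contained Go sketch of the failure mode described above (the names are made up; this is not the controller's actual code): once the parent context is cancelled at shutdown, any call made with that context fails, and the resulting error is what ends up forwarded to the alert channels.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// reconcile stands in for a reconciliation step that performs network calls
// (e.g. a registry scan or an API-server request) using the reconcile context.
func reconcile(ctx context.Context) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
	if err != nil {
		return err
	}
	if _, err := http.DefaultClient.Do(req); err != nil {
		return err // a cancelled ctx surfaces here as "context canceled"
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // shutdown cancels the context while the reconciliation is running
	if err := reconcile(ctx); err != nil {
		// This is the kind of error that ends up forwarded to the alert channels.
		fmt.Println("reconciliation failed:", err)
	}
}
```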
Here's an example log of the issue: