Controller shutdown doesn't wait for reconciles to finish #491
Comments
Hi, we have a single reconciliation context that's propagated throughout the reconciliation. If the context is cancelled after the reconciliation has started, any new operation that's part of the reconciliation will not succeed. I think it's better to terminate than to try to finish the operations by creating a new, independent context. As far as I know, that's how most Go programs handle context cancellation, unless there's a good reason to do otherwise.

I would have expected the notifications to fail as well, because sending one involves an HTTP request to notification-controller. But looking at the code, the notification event emitter doesn't take any context, hence it is able to send the notification request successfully; see the event code in https://github.com/fluxcd/pkg/blob/ce1f99f03d6756cde86b2fa9494fe13804296bac/runtime/events/recorder.go#L119. It doesn't take a context because it implements the upstream Kubernetes event recorder interface, see https://github.com/kubernetes/client-go/blob/v12.0.0/tools/record/event.go#L109, which also doesn't take a context. Our event recorder emits both a Kubernetes event and a Flux event that is sent to notification-controller, so if we decide not to emit events at all during context cancellation, we also stop emitting the Kubernetes event, which may be useful for some users.

So, although we can check whether the given context has been cancelled with:
```go
if context.Cause(ctx) == context.Canceled {
	return
}
```

I'm afraid that'll have the side-effect of also skipping logging and emitting Kubernetes events. One could argue that these logs and events aren't useful anyway, but I'd like to hear more feedback before we make such a change. At present I think it's okay to do, but I would like to hear others' opinions too.

For now, since the issue is the spam of failure notifications through notification-controller, one way to work around it is to exclude them at the notification-controller level by setting an event exclusion on the respective Alert, see https://fluxcd.io/flux/components/notification/alerts/#event-exclusion.
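To make that trade-off concrete, here is a minimal, self-contained Go sketch (placeholder names such as `emitFailureEvent` and `sendNotification`, not the Flux runtime API): a cancellation check placed at the very top would skip everything, whereas gating only the outbound notification keeps the local log line (standing in for the Kubernetes event) intact.

```go
package main

import (
	"context"
	"fmt"
)

// emitFailureEvent is an illustrative placeholder, not the Flux event recorder.
// It logs unconditionally (a stand-in for the Kubernetes event) and skips only
// the external notification when the reconcile context has been cancelled.
func emitFailureEvent(ctx context.Context, object, reason, msg string) {
	fmt.Printf("event: object=%s reason=%s message=%s\n", object, reason, msg)

	// Gate only the call to notification-controller; a check placed before the
	// whole function would silence the Kubernetes event and the log line too.
	if context.Cause(ctx) == context.Canceled {
		return
	}
	sendNotification(object, reason, msg)
}

// sendNotification stands in for the HTTP request to notification-controller.
func sendNotification(object, reason, msg string) {
	fmt.Printf("notification sent: object=%s reason=%s\n", object, reason)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate controller shutdown cancelling the reconcile context
	emitFailureEvent(ctx, "imagerepository/podinfo", "ReconciliationFailed", "context canceled")
}
```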
I spent some more time thinking about this and also checked whether the other Flux controllers show the same behavior. It seems they do: if a controller pod is terminated in the middle of a reconciliation, the same kind of failure notification is sent. For example, for source-controller I got the following alert:
When the controller starts again, if nothing new happens (no change in the result of reconciliation), no new notification is sent. We have a failure-recovery notification mechanism in place: it works by marking objects as failed, and when an object is reconciled the next time, a notification about the success is sent after comparing the object's current state with the previously known state. This doesn't work in this particular case because of the cancelled context: the object being marked as failed doesn't get persisted in Kubernetes, since requests to kube-apiserver fail when sent with a cancelled context. When the controller starts again and gets the object, it has the last Ready state, not Failed, so no recovery notification is sent. I'm in favor of implementing this across all the controllers. Thanks @rsafonseca for highlighting this issue.
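As a rough illustration of why the failure state is never persisted, here is a hedged controller-runtime-style sketch (assumed helper name and wiring, not the actual image-reflector-controller code): the status update issued with the already-cancelled reconcile context returns an error, so the object keeps its last Ready condition across the restart.

```go
package example

import (
	"context"
	"fmt"

	"sigs.k8s.io/controller-runtime/pkg/client"
)

// markFailed is a hypothetical helper, not Flux code. When ctx has already
// been cancelled by controller shutdown, the status update below returns an
// error, so the "failed" state never reaches the API server and the recovery
// notification is never triggered on the next reconciliation.
func markFailed(ctx context.Context, c client.Client, obj client.Object) error {
	// Assumes obj's status/conditions have already been mutated in memory to
	// reflect the failure; only the persistence step is shown here.
	if err := c.Status().Update(ctx, obj); err != nil {
		return fmt.Errorf("persisting failure state: %w", err)
	}
	return nil
}
```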
Yes, I fully support this. These types of events are just noise, as all controllers will resume work after a restart.
Every time the image-reflector-controller pod goes down for any reason, it doesn't wait for running reconciliations to finish; it immediately cancels the contexts of the running reconciliations.
This causes the running reconciliations to fail and return an error, which in turn is propagated to all channels defined in notification-controller, generating a lot of noise.
Ideally, the running reconciliations should be allowed to finish, or at worst these expected shutdown context-cancellation errors should not be propagated to the notification channels.
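For illustration, here is a minimal, self-contained Go sketch of the failure mode described above (the names are made up; this is not the controller's actual code): once the parent context is cancelled at shutdown, any call made with that context fails, and the resulting error is what ends up forwarded to the alert channels.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
)

// reconcile stands in for a reconciliation step that performs network calls
// (e.g. a registry scan or an API-server request) using the reconcile context.
func reconcile(ctx context.Context) error {
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, "https://example.com", nil)
	if err != nil {
		return err
	}
	if _, err := http.DefaultClient.Do(req); err != nil {
		return err // a cancelled ctx surfaces here as "context canceled"
	}
	return nil
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // shutdown cancels the context while the reconciliation is running
	if err := reconcile(ctx); err != nil {
		// This is the kind of error that ends up forwarded to the alert channels.
		fmt.Println("reconciliation failed:", err)
	}
}
```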
Here's an example log of the issue: