Retry initializing informers to allow for network instability on node restart #3688

Merged
dehaansa merged 5 commits into main from retry-get-informer on Jun 10, 2025

dehaansa
Contributor

PR Description

Immediately after a node restart, the Alloy pod cannot always reach the API server because the node's network is still initializing. There may be a more correct way to handle this, but as a quick solution I've added a retry mechanism. It solved the problem for me when testing on a local kind cluster, but I would like interested parties to test the fix in their environments as well.

Which issue(s) this PR fixes

Fixes #1853

Notes to the Reviewer

PR Checklist

  • CHANGELOG.md updated

@dehaansa dehaansa marked this pull request as ready for review June 10, 2025 13:46
@dehaansa dehaansa requested a review from a team as a code owner June 10, 2025 13:46
var informer cache.Informer
var err error
i := 0
for {
Contributor

Nit: I'd prefer an indexed for loop here instead of if i==2.
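
For illustration, the loop shape suggested above might look like the following. This is a minimal sketch, not the code from this PR: `getWithRetry`, `flaky`, the attempt count, and the delay are all hypothetical names and values, and a plain string stands in for the informer.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// flaky returns a hypothetical lookup that fails `failures` times and then
// succeeds. It stands in for an informer lookup that cannot reach the API
// server right after a node restart.
func flaky(failures int) func() (string, error) {
	calls := 0
	return func() (string, error) {
		calls++
		if calls <= failures {
			return "", errors.New("api server unreachable")
		}
		return "informer", nil
	}
}

// getWithRetry calls get at most maxAttempts times, sleeping for delay
// between attempts, and returns the first successful result.
func getWithRetry(maxAttempts int, delay time.Duration, get func() (string, error)) (string, error) {
	lastErr := errors.New("no attempts made")
	// The loop bound lives in the for statement, so no separate counter
	// check (such as `if i == 2`) is needed inside the body.
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if attempt > 0 {
			time.Sleep(delay)
		}
		v, err := get()
		if err == nil {
			return v, nil
		}
		lastErr = err
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", maxAttempts, lastErr)
}

func main() {
	v, err := getWithRetry(3, 10*time.Millisecond, flaky(2))
	fmt.Println(v, err)
}
```

Keeping the bound in the loop header makes the maximum number of attempts visible at a glance, which is presumably the point of the nit.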

Contributor Author

Implemented Kalle's recommendation to use the dskit backoff package instead.

var informer cache.Informer
var err error
i := 0
for {
Contributor

Nit: we could use github.com/grafana/dskit/backoff for retries, like we do in a lot of other places.

@dehaansa dehaansa requested a review from thampiotr June 10, 2025 14:14
@dehaansa dehaansa enabled auto-merge (squash) June 10, 2025 15:09
@dehaansa dehaansa merged commit 501a307 into main Jun 10, 2025
40 checks passed
@dehaansa dehaansa deleted the retry-get-informer branch June 10, 2025 15:10
Development

Successfully merging this pull request may close these issues.

Kubernetes node reboot prometheus operator CRDs not monitored on restart
3 participants