-
Notifications
You must be signed in to change notification settings - Fork 647
fix: server run race #7964
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
fix: server run race #7964
Conversation
✅ Deploy Preview for cerulean-figolla-1f9435 canceled.
|
|
failed as excepted: |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #7964 +/- ##
==========================================
+ Coverage 72.78% 73.45% +0.67%
==========================================
Files 237 237
Lines 35475 35487 +12
==========================================
+ Hits 25821 26068 +247
+ Misses 7812 7562 -250
- Partials 1842 1857 +15 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
nice! |
|
can we enable the -race flag before releasing this |
|
Hi @zirain is it easy to hit this? |
the testcase is reproducible. |
it's enabled, the problem is lack of test case. |
| time.Sleep(3 * time.Second) | ||
|
|
||
| r.runHook(ctx) | ||
| r.runHook(ctx, wg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that there is still a race: wg.Add(1) is called from the watcher goroutine while shutdown calls wg.Wait() in server.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show a reproducible test case for better understanding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand this correctly, this PR moves the initial WaitGroup.Add into the same goroutine as WaitGrop.Wait, But during config reload, WaitGroup.Add in runHook and WaitGroup.Wait in Server are still called from different goroutines, which could introduce a race condition.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch, let me try to work out another solution.
9699599 to
6483c24
Compare
internal/cmd/server.go
Outdated
| }, | ||
| ) | ||
| return server(cmd.Context(), cmd.OutOrStdout(), cmd.ErrOrStderr(), runnerErrors) | ||
| started := &atomic.Bool{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like this is being read or used anywhere, can this be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this's useful to allow us konw when the server is ready and when to trigger a config update.
we can update this to a callback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we structure the tests differently if that's the only place it's being used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only a minor nit so non-blocking
a5fad1d to
de06900
Compare
|
this is the 3rd fix in this area, can we take a step back and redesign what server/runner initialization, handling partial failures, and config restarts should look like ? |
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
Signed-off-by: zirain <[email protected]>
de06900 to
2ec46b5
Compare
the main reason of the regssion is lack of reason.
I think the new test case cover them. |
Signed-off-by: zirain <[email protected]>
|
can you elaborate on what the logic is / should be for
|
AIGW required #7880 fix the leak, but there's a race when using WaitGroup with multi goroutinues. This PR use Update above to PR description. |
I agree with the sentiment here, but I also think that given we're close to the desired date for 1.7 release, and the fact that this should be ideally cherry-picked to previous versions, it is convenient to merge the fix (it's safer to cherry-pick a small fix than a refactoring), unblock the projects that depend on this, and then plan and work for a proper redesign. |
xref: envoyproxy/ai-gateway#1770 (comment) #7880
AIGW required
Wait for the runners to close, see #5560.There's a problem is that
runnersDoneis blocked untilctx.Done(), which cause goroutine leak when config reloading.#7880 fix the leak, but there's a race when using WaitGroup with multi goroutinues.
This PR use
semaphorewith weight 2, to handle all of these cases. seeLoader.Wait().