
Conversation

@zirain
Member

@zirain zirain commented Jan 15, 2026

xref: envoyproxy/ai-gateway#1770 (comment) #7880

AIGW requires waiting for the runners to close; see #5560.
The problem is that runnersDone blocks until ctx.Done(), which causes a goroutine leak on config reload.

#7880 fixed the leak, but there's a race when using a WaitGroup across multiple goroutines.

This PR uses a semaphore with weight 2 to handle all of these cases; see Loader.Wait().
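
For illustration, a minimal sketch of the weighted-semaphore idea (assuming golang.org/x/sync/semaphore; the Loader and runHook names mirror the PR, but the bodies are simplified stand-ins, not the actual implementation):

package loader

import (
	"context"

	"golang.org/x/sync/semaphore"
)

type Loader struct {
	// weight 2: one unit for the watcher goroutine, one for the hook
	// goroutine spawned on each config reload.
	sem *semaphore.Weighted
}

func New() *Loader {
	return &Loader{sem: semaphore.NewWeighted(2)}
}

func (l *Loader) runHook(ctx context.Context) {
	// Acquire runs in the caller's goroutine, before spawning, so it
	// cannot race with Wait the way a concurrent WaitGroup.Add can.
	if err := l.sem.Acquire(ctx, 1); err != nil {
		return
	}
	go func() {
		defer l.sem.Release(1)
		<-ctx.Done() // stand-in for running the hook until shutdown
	}()
}

// Wait blocks until every in-flight goroutine has released its unit.
func (l *Loader) Wait(ctx context.Context) error {
	if err := l.sem.Acquire(ctx, 2); err != nil {
		return err
	}
	l.sem.Release(2)
	return nil
}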

@zirain zirain requested a review from a team as a code owner January 15, 2026 11:09
@netlify

netlify bot commented Jan 15, 2026

Deploy Preview for cerulean-figolla-1f9435 canceled.

Name | Link
🔨 Latest commit | 976bc3f
🔍 Latest deploy log | https://app.netlify.com/projects/cerulean-figolla-1f9435/deploys/696c3f8384c1e40008701acf

@zirain
Member Author

zirain commented Jan 15, 2026

Failed as expected:

WARNING: DATA RACE
Write at 0x00c00201e958 by goroutine 129:
  runtime.racewrite()
      <autogenerated>:1 +0x1e
  github.com/envoyproxy/gateway/internal/cmd.server()
      /home/runner/work/gateway/gateway/internal/cmd/server.go:113 +0x491
  github.com/envoyproxy/gateway/internal/cmd.TestServerRun.func1()
      /home/runner/work/gateway/gateway/internal/cmd/server_test.go:43 +0x92

Previous read at 0x00c00201e958 by goroutine 131:
  runtime.raceread()
      <autogenerated>:1 +0x1e
  github.com/envoyproxy/gateway/internal/cmd.server.func1()
      /home/runner/work/gateway/gateway/internal/cmd/server.go:83 +0x84
  github.com/envoyproxy/gateway/internal/envoygateway/config/loader.(*Loader).runHook.func1()
      /home/runner/work/gateway/gateway/internal/envoygateway/config/loader/configloader.go:126 +0xc3
  github.com/envoyproxy/gateway/internal/envoygateway/config/loader.(*Loader).runHook.gowrap1()
      /home/runner/work/gateway/gateway/internal/envoygateway/config/loader/configloader.go:132 +0x4f

Goroutine 129 (running) created at:
  github.com/envoyproxy/gateway/internal/cmd.TestServerRun()
      /home/runner/work/gateway/gateway/internal/cmd/server_test.go:42 +0x147
  testing.tRunner()
      /opt/hostedtoolcache/go/1.25.5/x64/src/testing/testing.go:1934 +0x21c
  testing.(*T).Run.gowrap1()
      /opt/hostedtoolcache/go/1.25.5/x64/src/testing/testing.go:1997 +0x44

Goroutine 131 (running) created at:
  github.com/envoyproxy/gateway/internal/envoygateway/config/loader.(*Loader).runHook()
      /home/runner/work/gateway/gateway/internal/envoygateway/config/loader/configloader.go:124 +0x309
  github.com/envoyproxy/gateway/internal/envoygateway/config/loader.(*Loader).Start()
      /home/runner/work/gateway/gateway/internal/envoygateway/config/loader/configloader.go:46 +0xc4
  github.com/envoyproxy/gateway/internal/cmd.server()
      /home/runner/work/gateway/gateway/internal/cmd/server.go:94 +0x2d7
  github.com/envoyproxy/gateway/internal/cmd.TestServerRun.func1()
      /home/runner/work/gateway/gateway/internal/cmd/server_test.go:43 +0x92

@codecov

codecov bot commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 61.76471% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.45%. Comparing base (30cabed) to head (976bc3f).

Files with missing lines | Patch % | Lines
internal/cmd/server.go | 46.66% | 7 Missing and 1 partial ⚠️
...nternal/envoygateway/config/loader/configloader.go | 73.68% | 3 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7964      +/-   ##
==========================================
+ Coverage   72.78%   73.45%   +0.67%     
==========================================
  Files         237      237              
  Lines       35475    35487      +12     
==========================================
+ Hits        25821    26068     +247     
+ Misses       7812     7562     -250     
- Partials     1842     1857      +15     

☔ View full report in Codecov by Sentry.

@mathetake
Member

nice!

@jukie jukie previously approved these changes Jan 16, 2026
@arkodg
Contributor

arkodg commented Jan 16, 2026

Can we enable the -race flag before releasing this?

@zhaohuabing
Member

Hi @zirain, is it easy to hit this?

@zirain
Member Author

zirain commented Jan 16, 2026

Hi @zirain, is it easy to hit this?

The race is reproducible with the test case.

@zirain
Member Author

zirain commented Jan 16, 2026

Can we enable the -race flag before releasing this?

go.test.coverage: go.test.cel ## Run go unit and integration tests in GitHub Actions
	@$(LOG_TARGET)
	KUBEBUILDER_ASSETS="$$($(GO_TOOL) setup-envtest use $(ENVTEST_K8S_VERSION) -p path)" \
		go test ./... --tags=integration -race -coverprofile=coverage.xml -covermode=atomic -coverpkg=./...

It's already enabled; the problem was the lack of a test case covering this path.
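
For reference, a hypothetical local reproduction with the race detector (the test name comes from the trace above; the exact invocation is an assumption, not a documented workflow):

go test -race -run TestServerRun ./internal/cmd/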

time.Sleep(3 * time.Second)

r.runHook(ctx)
r.runHook(ctx, wg)
Member

@zhaohuabing zhaohuabing Jan 16, 2026


It seems that there is still a race: wg.Add(1) is called from the watcher goroutine while shutdown calls wg.Wait() in the server.

Member Author


Can you show a reproducible test case for better understanding?

Member


If I understand this correctly, this PR moves the initial WaitGroup.Add into the same goroutine as WaitGroup.Wait. But during config reload, WaitGroup.Add in runHook and WaitGroup.Wait in Server are still called from different goroutines, which could introduce a race condition.
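
For illustration, a minimal standalone sketch of the racy pattern being described (not the gateway's actual code; the sync docs require that an Add with a positive delta starting from a zero counter happen before the Wait it could affect):

package main

import "sync"

func main() {
	var wg sync.WaitGroup

	// watcher goroutine: a config reload spawns a new hook
	go func() {
		wg.Add(1) // can run concurrently with wg.Wait() below
		go func() {
			defer wg.Done()
			// ... hook work ...
		}()
	}()

	// server shutdown path: Wait may see a zero counter before the
	// Add above happens, which the sync package documents as misuse
	wg.Wait()
}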

Member Author


Good catch, let me try to work out another solution.

@zirain zirain force-pushed the fix/waitgroup-race branch 2 times, most recently from 9699599 to 6483c24 on January 16, 2026 07:56
@zirain zirain changed the title from "fix: WaitGroup race" to "fix: server run race" on Jan 16, 2026
@nacx
Member

nacx commented Jan 16, 2026

We would love to have this merged, as there is an Envoy AI Gateway release waiting for this. @arkodg @jukie mind having another review?

},
)
return server(cmd.Context(), cmd.OutOrStdout(), cmd.ErrOrStderr(), runnerErrors)
started := &atomic.Bool{}
Contributor

@jukie jukie Jan 16, 2026


It doesn't look like this is being read or used anywhere; can this be removed?

Member Author


This is useful to let us know when the server is ready and when to trigger a config update.
We can change this to a callback.
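
As a hypothetical sketch of that readiness signal (the started flag mirrors the diff above; the polling call is illustrative and assumes github.com/stretchr/testify/require, not the PR's final API):

package cmd

import (
	"sync/atomic"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

func TestServerReady(t *testing.T) {
	started := &atomic.Bool{}

	go func() {
		// ... start the server and its runners ...
		started.Store(true) // flipped once startup completes
	}()

	// wait for readiness before triggering a config update
	require.Eventually(t, started.Load, 5*time.Second, 10*time.Millisecond)
}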

Contributor


Could we structure the tests differently if that's the only place it's being used?

Contributor


Only a minor nit, so non-blocking.

@zirain zirain force-pushed the fix/waitgroup-race branch from a5fad1d to de06900 on January 17, 2026 10:36
@arkodg
Contributor

arkodg commented Jan 17, 2026

This is the 3rd fix in this area; can we take a step back and redesign what server/runner initialization, handling partial failures, and config restarts should look like?

@zirain zirain force-pushed the fix/waitgroup-race branch from de06900 to 2ec46b5 on January 18, 2026 01:47
@zirain
Member Author

zirain commented Jan 18, 2026

This is the 3rd fix in this area; can we take a step back and redesign what server/runner initialization, handling partial failures, and config restarts should look like?

The main reason for the regression is the lack of test coverage.

I think the new test cases cover these scenarios.

@arkodg
Contributor

arkodg commented Jan 18, 2026

Can you elaborate on what the logic is / should be for:

  • dealing with partial runner termination
  • config reloads that impact runner restarts

@zirain
Member Author

zirain commented Jan 18, 2026

Can you elaborate on what the logic is / should be for:

  • dealing with partial runner termination
  • config reloads that impact runner restarts

AIGW requires waiting for the runners to close, see https://github.com/envoyproxy/gateway/pull/5560/changes.
The problem is that runnersDone blocks until ctx.Done(), which causes a goroutine leak on config reload.

#7880 fixed the leak, but there's a race when using a WaitGroup across multiple goroutines.

This PR uses a semaphore with weight 2 to handle all of these cases; see Loader.Wait().

Updated the PR description with the above.

@nacx
Member

nacx commented Jan 19, 2026

This is the 3rd fix in this area; can we take a step back and redesign what server/runner initialization, handling partial failures, and config restarts should look like?

I agree with the sentiment here, but given that we're close to the desired date for the 1.7 release, and that this should ideally be cherry-picked to previous versions, it is convenient to merge the fix (it's safer to cherry-pick a small fix than a refactoring), unblock the projects that depend on this, and then plan and work on a proper redesign.
