
Ensure websocket connections persist until done on queue-proxy drain #15759


Open

elijah-rou wants to merge 1 commit into main from fix/ensure-websockets-complete-on-drain

Conversation

elijah-rou

Fixes: Websockets closing abruptly when queue-proxy undergoes drain.

Due to hijacked connections in net/http not being respected when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which is incremented when a request arrives and decremented when its handler returns. During drain, the queue-proxy waits for this counter to reach zero (or for the revision timeout to elapse) before calling .Shutdown.
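A rough sketch of the idea, with simplified names (the real change lives in pkg/queue/sharedmain; `next`, `revisionTimeout`, and the deadline wiring here are illustrative, not the code in this PR):

var pendingRequests atomic.Int32

// Count every in-flight request, including those whose connections are later
// hijacked for websockets (net/http's Shutdown does not wait for hijacked conns).
counted := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	pendingRequests.Add(1)
	defer pendingRequests.Add(-1)
	next.ServeHTTP(w, r) // "next" stands in for the existing proxy handler chain
})
server := &http.Server{Handler: counted}

// During drain: poll until the counter hits zero or the revision timeout
// elapses, then call Shutdown as before.
deadline := time.After(revisionTimeout)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
wait:
for {
	select {
	case <-deadline:
		break wait
	case <-ticker.C:
		if pendingRequests.Load() <= 0 {
			break wait
		}
	}
}
server.Shutdown(context.Background())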

Release Note

Introduce a pending-requests counter which the queue-proxy can use to gracefully terminate all connections (including hijacked connections)


knative-prow bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


knative-prow bot commented Feb 7, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 7, 2025
@knative-prow knative-prow bot requested review from dprotaso and skonto February 7, 2025 20:01
@dprotaso
Member

dprotaso commented Feb 9, 2025

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 9, 2025

codecov bot commented Feb 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.

Project coverage is 74.95%. Comparing base (6265a8e) to head (2b581d6).
Report is 60 commits behind head on main.

Files with missing lines         | Patch % | Lines
pkg/queue/sharedmain/main.go     | 0.00%   | 15 Missing ⚠️
pkg/queue/sharedmain/handlers.go | 0.00%   | 8 Missing ⚠️

❌ Your project check has failed because the head coverage (74.95%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15759      +/-   ##
==========================================
- Coverage   80.84%   74.95%   -5.90%     
==========================================
  Files         222      222              
  Lines       18070    18095      +25     
==========================================
- Hits        14609    13563    -1046     
- Misses       3089     4181    +1092     
+ Partials      372      351      -21     


Member

@dprotaso dprotaso left a comment

Can we add an e2e test to validate that the failure is fixed by these changes?

@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from d6da0ea to 99dd71c Compare February 12, 2025 00:59
				break WaitOnPendingRequests
			}
		}
		time.Sleep(drainSleepDuration)
Contributor

@skonto skonto Feb 12, 2025

This adds unneeded overhead when there are no websocket connections, though. Could we avoid it for non-websocket workloads?

I suppose this is needed so that the app has enough time to notify the client by initiating a websocket close action? Note here that the QP does not do the actual draining of the connection (due to the known reasons).

Contributor

@skonto skonto Feb 12, 2025

Another idea is to have something like:

done := make(chan struct{})
Server.RegisterOnShutdown(func() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	logger.Infof("Drain: waiting for %d pending requests to complete", pendingRequests.Load())
WaitOnPendingRequests:
	for range ticker.C {
		if pendingRequests.Load() <= 0 {
			logger.Infof("Drain: all pending requests completed")
			break WaitOnPendingRequests
		}
	}

	// wait some configurable time, e.g. WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS

	done <- struct{}{}
})

Server.Shutdown(...)
<-done

WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS could be zero by default for regular workloads. Or use something similar for the sleep time above in this PR.

Another thing (a bit of a hack, thinking out loud): could we detect hijacked connections via the Upgrade and Connection headers, which are mandatory 🤔 (not sure about wss, but I don't think we support it, do we?). cc @dprotaso
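For illustration, a hypothetical helper along those lines (the name and exact header handling are assumptions, not part of this PR):

func isWebSocketUpgrade(r *http.Request) bool {
	// Websocket handshakes must send "Upgrade: websocket" and a Connection
	// header that includes "upgrade".
	return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") &&
		strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade")
}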

Author

I think I may have moved that sleep to the wrong place; I should probably just move the sleep to before the wait on pending requests, which will avoid the additional wait. drainSleepDuration should only be a minimum wait period.

As for avoiding the pending-request check entirely, we may be able to, but that would probably require checking specifics about the connection before incrementing the counter to make it websocket-specific (rather than just wrapping the reverse proxy). IMO it's likely not worth doing, since you're already expecting a wait from .Shutdown, and this would only spare .Shutdown the work of waiting for non-hijacked connections.
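A rough sketch of the reordering being described (variable names follow the diff above; this is not the final code):

// Sleep first so K8s endpoint removal propagates, then wait for in-flight
// requests (hijacked or not) to finish before Shutdown is called.
time.Sleep(drainSleepDuration)

ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
	if pendingRequests.Load() <= 0 {
		break
	}
}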

@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch 2 times, most recently from 54db92b to e8dec25 Compare February 12, 2025 16:49
@elijah-rou
Author

@dprotaso is the only thing missing from this PR an E2E test?

@@ -304,8 +307,24 @@ func Main(opts ...Option) error {
	case <-d.Ctx.Done():
		logger.Info("Received TERM signal, attempting to gracefully shutdown servers.")
		logger.Infof("Sleeping %v to allow K8s propagation of non-ready state", drainSleepDuration)
		time.Sleep(drainSleepDuration)
Member

we don't need this explicit sleep - this is what drainer.Drain already does; drainSleepDuration is just set to 30s or so

@@ -139,3 +144,11 @@ func withFullDuplex(h http.Handler, enableFullDuplex bool, logger *zap.SugaredLo
		h.ServeHTTP(w, r)
	})
}

func withRequestCounter(h http.Handler, pendingRequests *atomic.Int32) http.Handler {
Member

Can we create a struct that implements http.Handler and holds the atomic counter - that would make this reusable.

Your handler can hold the next handler so you can call h.ServeHTTP(w, r)
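Something like the following minimal sketch (hypothetical names, not the code in this PR):

// requestCounter wraps the next handler in the chain and counts in-flight requests.
type requestCounter struct {
	next    http.Handler
	pending *atomic.Int32
}

func (h *requestCounter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.pending.Add(1)        // request started
	defer h.pending.Add(-1) // request finished when the handler returns
	h.next.ServeHTTP(w, r)
}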

Comment on lines +173 to +174
pendingRequests := atomic.Int32{}
pendingRequests.Store(0)
Member

See the above comment - let's merge this into a special http.Handler struct that counts requests and is created in the mainHandler function.

@@ -322,6 +322,11 @@ func TestWebSocketWithTimeout(t *testing.T) {
	idleTimeoutSeconds: 10,
	delay:              "20",
	expectError:        true,
}, {
	name: "websocket does not drop after queue drain is called at 30s",
Member

I think we should make this a separate test - because this isn't really testing the WebSocket timeout; instead we're testing that draining long-running requests works as expected.

add e2e for ws beyond queue drain; move sleep to appropriate loc

add ref to go issue

separate drain test
@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from e8dec25 to 2b581d6 Compare March 16, 2025 15:33
@dprotaso
Member

/retest

1 similar comment
@dprotaso
Member

/retest

@dprotaso
Member

closing, re-opening to pick up latest actions

@dprotaso dprotaso closed this Apr 14, 2025
@dprotaso dprotaso reopened this Apr 14, 2025
@dprotaso
Member

ah there's a legit compile error in the e2e test


knative-prow bot commented Apr 14, 2025

@elijah-rou: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         | Commit  | Details | Required | Rerun command
istio-latest-no-mesh_serving_main | 2b581d6 | link    | true     | /test istio-latest-no-mesh

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants