
Ensure websocket connections persist until done on queue-proxy drain #15759


Open

elijah-rou wants to merge 1 commit into main from fix/ensure-websockets-complete-on-drain

Conversation

elijah-rou

Fixes: Websockets closing abruptly when queue-proxy undergoes drain.

Due to hijacked connections in net/http not being respected when server.Shutdown is called, any active websocket connections currently end as soon as the queue-proxy calls .Shutdown. See gorilla/websocket#448 and golang/go#17721 for details. This patch fixes the issue by introducing an atomic counter of active requests, which is incremented when a request arrives and decremented when its handler returns. During drain, the queue-proxy waits for this counter to reach zero (or for the revision timeout to elapse) before calling .Shutdown.
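A rough sketch of the idea, with simplified names (the real change lives in pkg/queue/sharedmain; `next`, `revisionTimeout`, and the deadline wiring here are illustrative, not the code in this PR):

var pendingRequests atomic.Int32

// Count every in-flight request, including those whose connections are later
// hijacked for websockets (net/http's Shutdown does not wait for hijacked conns).
counted := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	pendingRequests.Add(1)
	defer pendingRequests.Add(-1)
	next.ServeHTTP(w, r) // "next" stands in for the existing proxy handler chain
})
server := &http.Server{Handler: counted}

// During drain: poll until the counter hits zero or the revision timeout
// elapses, then call Shutdown as before.
deadline := time.After(revisionTimeout)
ticker := time.NewTicker(time.Second)
defer ticker.Stop()
wait:
for {
	select {
	case <-deadline:
		break wait
	case <-ticker.C:
		if pendingRequests.Load() <= 0 {
			break wait
		}
	}
}
server.Shutdown(context.Background())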

Release Note

Introduce a pending-requests counter which the queue-proxy can use to gracefully terminate all connections (including hijacked connections)


knative-prow bot commented Feb 7, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: elijah-rou
Once this PR has been reviewed and has the lgtm label, please assign dsimansk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


knative-prow bot commented Feb 7, 2025

Hi @elijah-rou. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 7, 2025
@knative-prow knative-prow bot requested review from dprotaso and skonto February 7, 2025 20:01
@dprotaso
Member

dprotaso commented Feb 9, 2025

/ok-to-test

@knative-prow knative-prow bot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Feb 9, 2025

codecov bot commented Feb 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 23 lines in your changes missing coverage. Please review.

Project coverage is 74.95%. Comparing base (6265a8e) to head (2b581d6).
Report is 60 commits behind head on main.

Files with missing lines         | Patch % | Lines
pkg/queue/sharedmain/main.go     | 0.00%   | 15 Missing ⚠️
pkg/queue/sharedmain/handlers.go | 0.00%   | 8 Missing ⚠️

❌ Your project check has failed because the head coverage (74.95%) is below the target coverage (80.00%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15759      +/-   ##
==========================================
- Coverage   80.84%   74.95%   -5.90%     
==========================================
  Files         222      222              
  Lines       18070    18095      +25     
==========================================
- Hits        14609    13563    -1046     
- Misses       3089     4181    +1092     
+ Partials      372      351      -21     


Member

@dprotaso dprotaso left a comment

Can we add an e2e test to validate that the failure is fixed by these changes?

@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from d6da0ea to 99dd71c Compare February 12, 2025 00:59
				break WaitOnPendingRequests
			}
		}
		time.Sleep(drainSleepDuration)
Contributor

@skonto skonto Feb 12, 2025

This adds unneeded overhead when there are no websocket connections, though. Could we avoid it for non-websocket workloads?

I suppose this is needed so that the app has enough time to notify the client by initiating a websocket close action? Note here that the QP does not do the actual draining of the connection (due to the known reasons).

Contributor

@skonto skonto Feb 12, 2025

Another idea is to have something like:

done := make(chan struct{})
Server.RegisterOnShutdown(func() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	logger.Infof("Drain: waiting for %d pending requests to complete", pendingRequests.Load())
WaitOnPendingRequests:
	for range ticker.C {
		if pendingRequests.Load() <= 0 {
			logger.Infof("Drain: all pending requests completed")
			break WaitOnPendingRequests
		}
	}

	// wait some configurable time, e.g. WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS

	done <- struct{}{}
})

Server.Shutdown(...)
<-done

WEBSOCKET_TERMINATION_DRAIN_DURATION_SECONDS could be zero by default for regular workloads. Or use something similar for the sleep time above in this PR.

Another thing (a bit of a hack, thinking out loud): could we detect hijacked connections via the Upgrade and Connection headers, which are mandatory 🤔 (not sure about wss, but I don't think we support it, do we?). cc @dprotaso
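For illustration, a hypothetical helper along those lines (the name and exact header handling are assumptions, not part of this PR):

func isWebSocketUpgrade(r *http.Request) bool {
	// Websocket handshakes must send "Upgrade: websocket" and a Connection
	// header that includes "upgrade".
	return strings.EqualFold(r.Header.Get("Upgrade"), "websocket") &&
		strings.Contains(strings.ToLower(r.Header.Get("Connection")), "upgrade")
}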

Author

I think I may have moved that sleep to the wrong place; I should probably just move the sleep to before the wait on pending requests, which will avoid the additional wait. drainSleepDuration should only be a minimum wait period.

As for avoiding the pending-request check entirely, we may be able to, but that would probably require checking specifics about the connection before incrementing the counter to make it websocket-specific (rather than just wrapping the reverse proxy). IMO it's likely not worth doing, since you're already expecting a wait from .Shutdown, and this would only spare .Shutdown the work of waiting for non-hijacked connections.
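A rough sketch of the reordering being described (variable names follow the diff above; this is not the final code):

// Sleep first so K8s endpoint removal propagates, then wait for in-flight
// requests (hijacked or not) to finish before Shutdown is called.
time.Sleep(drainSleepDuration)

ticker := time.NewTicker(1 * time.Second)
defer ticker.Stop()
for range ticker.C {
	if pendingRequests.Load() <= 0 {
		break
	}
}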

@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch 2 times, most recently from 54db92b to e8dec25 Compare February 12, 2025 16:49
@elijah-rou
Author

@dprotaso is the only thing missing from this PR an E2E test?

@@ -304,8 +307,24 @@ func Main(opts ...Option) error {
	case <-d.Ctx.Done():
		logger.Info("Received TERM signal, attempting to gracefully shutdown servers.")
		logger.Infof("Sleeping %v to allow K8s propagation of non-ready state", drainSleepDuration)
		time.Sleep(drainSleepDuration)
Member

we don't need this explicit sleep - this is what drainer.Drain already does; drainSleepDuration is just set to 30s or so

@@ -139,3 +144,11 @@ func withFullDuplex(h http.Handler, enableFullDuplex bool, logger *zap.SugaredLo
		h.ServeHTTP(w, r)
	})
}

func withRequestCounter(h http.Handler, pendingRequests *atomic.Int32) http.Handler {
Member

Can we create a struct that implements http.Handler and holds the atomic counter - that would make this reusable.

Your handler can hold the next handler so you can call h.ServeHTTP(w, r)
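Something like the following minimal sketch (hypothetical names, not the code in this PR):

// requestCounter wraps the next handler in the chain and counts in-flight requests.
type requestCounter struct {
	next    http.Handler
	pending *atomic.Int32
}

func (h *requestCounter) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.pending.Add(1)        // request started
	defer h.pending.Add(-1) // request finished when the handler returns
	h.next.ServeHTTP(w, r)
}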

Comment on lines +173 to +174
pendingRequests := atomic.Int32{}
pendingRequests.Store(0)
Member

See the above comment - let's merge this into a special http.Handler struct that counts requests and is created in the mainHandler function.

@@ -322,6 +322,11 @@ func TestWebSocketWithTimeout(t *testing.T) {
	idleTimeoutSeconds: 10,
	delay:              "20",
	expectError:        true,
}, {
	name: "websocket does not drop after queue drain is called at 30s",
Member

I think we should make this a separate test - because this isn't really testing the WebSocket timeout; instead we're testing that draining long-running requests works as expected.

add e2e for ws beyond queue drain; move sleep to appropriate loc

add ref to go issue

separate drain test
@elijah-rou elijah-rou force-pushed the fix/ensure-websockets-complete-on-drain branch from e8dec25 to 2b581d6 Compare March 16, 2025 15:33
@dprotaso
Member

/retest

1 similar comment
@dprotaso
Member

/retest

@dprotaso
Member

closing, re-opening to pick up latest actions

@dprotaso dprotaso closed this Apr 14, 2025
@dprotaso dprotaso reopened this Apr 14, 2025
@dprotaso
Member

ah there's a legit compile error in the e2e test


knative-prow bot commented Apr 14, 2025

@elijah-rou: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         | Commit  | Details | Required | Rerun command
istio-latest-no-mesh_serving_main | 2b581d6 | link    | true     | /test istio-latest-no-mesh

Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants