Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gracefully shutdown of the websocket client #213

Merged

Conversation

haoqixu
Copy link
Member

@haoqixu haoqixu commented Nov 21, 2023

Resolves #163

@haoqixu haoqixu force-pushed the f-163-wcclient-graceful-shutdown branch from 46b92c2 to b5666be Compare November 21, 2023 07:46
Copy link

codecov bot commented Nov 21, 2023

Codecov Report

Attention: Patch coverage is 85.41667% with 7 lines in your changes are missing coverage. Please review.

Project coverage is 73.78%. Comparing base (7e92da0) to head (0c4dab0).

Files Patch % Lines
client/wsclient.go 75.86% 5 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #213      +/-   ##
==========================================
- Coverage   73.80%   73.78%   -0.03%     
==========================================
  Files          25       25              
  Lines        1550     1579      +29     
==========================================
+ Hits         1144     1165      +21     
- Misses        296      301       +5     
- Partials      110      113       +3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@haoqixu haoqixu marked this pull request as ready for review November 21, 2023 15:30
@haoqixu haoqixu requested a review from a team November 21, 2023 15:30
@haoqixu
Copy link
Member Author

haoqixu commented Nov 29, 2023

Could you please take a look @evan-bradley @srikanthccv @tigrannajaryan

@haoqixu
Copy link
Member Author

haoqixu commented Nov 30, 2023

I have watched the meeting record and I think I can add some explanation about this PR.

The solution here is making wsclient.runOneCycle responsible for the entire lifecyle of the connection and controlling the running of sender and receiver. It also move the receiver to run in a separate goroutine. wsclient.runOneCycle will ensure that both sender and receiver are stopped and the connection is closed before it returns. So we don't need to care about the connection outside wsclient.runOneCycle.

When the client is shutting down, the sender goroutine will be cancelled by the ctx, send the close message to server and return the error to wsclient.runOneCycle via sender.Err(). wsclient.runOneCycle will wait for the close message from server until timeout and ensure the connection is closed.

If the server wants to close the connection, it should send the close message to initiate the close handshake. When a close message is received, the connection read methods will handle it by calling the close handler (the default handler will send a close message back to the peer. I think this is what we want.) and return a CloseError. The receiver loop will stop on any error. wsclient.runOneCycle will stop the sender and close the connection when receiver is stopped.

Comment on lines 265 to 266
case <-r.IsStopped():
c.common.Logger.Debugf("shutdown handshake complete.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider adding websocket.IsCloseError(r.Err(), websocket.CloseNormalClosure) as this may not always be handshake complete. Less likely but there is a chance b/w some other error could occur before the close message is received from the server.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the receiver loop has handled it and printed a log for it. However, the log message here is not very accurate. The "complete" here actually means that the receiver loop has also exited. 🤔

select {
case <-c.sender.IsStopped():
// sender will send close message to initiate the close handshake
if err := c.sender.Err(); err != nil && err != websocket.ErrCloseSent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this err != websocket.ErrCloseSent to handle the case when the server initiates the close?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. The sender and receiver run concurrently. There is a possibility that the server and client initiate the closing handshake at the same time. In this case, the close message may be sent by CloseHandler, which is called in the receiver.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change sender.Err() to return nil if the error was websocket.ErrCloseSent? Let's consider it an internal implementation detail of the sender and avoid leaking it here. That we only need to check err for nil here. Sender seeing a websocket.ErrCloseSent response is a success, we don't need to consider it an error.

@@ -56,13 +66,22 @@ out:
s.sendNextMessage()

case <-ctx.Done():
s.err = s.sendCloseMessage()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means stopping the sender also closes the connection which might be unexpected because run should simply run until the context is cancelled but it also closes the connection. I think sending Close message should be part of client Stop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is by design. Stopping the sender would always try to send the close message for the close handshake and runOneCycle is responsible for the subsequent steps of connection closing. I updated the comment of the runOneCycle().

Copy link
Member

@srikanthccv srikanthccv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, One main difference from this PR #205 is that the sender initiates the close handshake which makes the implementation a little simpler but also the connection management is now done b/w wsclient and wssender. @evan-bradley what do you think?

client/wsclient_test.go Outdated Show resolved Hide resolved
Copy link
Contributor

@evan-bradley evan-bradley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this, overall it looks good to me. I think this approach is an improvement over #205. There may additional improvements we can make in the future, but for now this will resolve the errors we are seeing and will cleanly close the connection.

client/wsclient.go Outdated Show resolved Hide resolved
client/wsclient.go Show resolved Hide resolved
@tigrannajaryan tigrannajaryan changed the title graceful shutdown of the websocket client Gracefully shutdown of the websocket client Jan 9, 2024
client/internal/wsreceiver.go Outdated Show resolved Hide resolved
client/internal/wsreceiver.go Show resolved Hide resolved
client/internal/wssender.go Outdated Show resolved Hide resolved
client/internal/wssender.go Show resolved Hide resolved
select {
case <-c.sender.IsStopped():
// sender will send close message to initiate the close handshake
if err := c.sender.Err(); err != nil && err != websocket.ErrCloseSent {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we change sender.Err() to return nil if the error was websocket.ErrCloseSent? Let's consider it an internal implementation detail of the sender and avoid leaking it here. That we only need to check err for nil here. Sender seeing a websocket.ErrCloseSent response is a success, we don't need to consider it an error.

client/wsclient.go Outdated Show resolved Hide resolved
client/wsclient.go Outdated Show resolved Hide resolved
client/wsclient.go Outdated Show resolved Hide resolved
client/wsclient.go Show resolved Hide resolved
client/wsclient.go Show resolved Hide resolved
Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your patience @haoqixu

Please bear with me while I am trying to understand all the concurrent flows.

client/internal/wssender.go Show resolved Hide resolved
client/wsclient.go Show resolved Hide resolved
client/internal/wsreceiver.go Outdated Show resolved Hide resolved
client/internal/wssender.go Outdated Show resolved Hide resolved
client/internal/wssender.go Outdated Show resolved Hide resolved
client/wsclient.go Outdated Show resolved Hide resolved
// sender will send close message to initiate the close handshake
if err := c.sender.Err(); err != nil {
c.common.Logger.Debugf("Error stopping the sender: %v", err)
break
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we break here? If the sender stopping failed shouldn't we still try to stop the receiver and close the connection?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this break was introduced during the updates of the PR. The initial code does not wait for the receiver to stop, leaving the closing of the connection to the defer. I update the branch to close the conneciton explicitly and wait for the receiver to stop before break.

@tigrannajaryan
Copy link
Member

Thanks for updates @haoqixu
The PR mostly looks good to me, the only remaining question is https://github.com/open-telemetry/opamp-go/pull/213/files#r1462064174

@haoqixu
Copy link
Member Author

haoqixu commented Jan 25, 2024

The PR mostly looks good to me, the only remaining question is https://github.com/open-telemetry/opamp-go/pull/213/files#r1462064174

I think this break was introduced during the updates of the PR. The initial code does not wait for the receiver to stop, leaving the closing of the connection to the defer. I update the branch to close the conneciton explicitly and wait for the receiver to stop before break.

I have updated the PR with 5fa47a3. Another option is to remove the break directly, but this will cause the sender to unnecessarily wait until timeout.

@tigrannajaryan
Copy link
Member

@haoqixu this looks good to me. Can you please resolve the conflicts?

@haoqixu haoqixu force-pushed the f-163-wcclient-graceful-shutdown branch from 165735a to fcbb3c7 Compare February 6, 2024 10:42
@haoqixu
Copy link
Member Author

haoqixu commented Feb 6, 2024

By design, we need to canceling the context and closing the connection to stop the ReceiverLoop. And the gracefully shutdown flow of wsclient will do this for us. Considering ReceiverLoop as an internal implementation detail of the wsclient, we could just gracefully shut down the wsclient.

gorilla/websocket doesn't provide a method to unblock the reading of connection and we always need to close the connection to unblock the reading.

#240 puts the reading of connections into a separate goroutine, which is unnecessary and may cause goroutine leaks. And the update here actually reverts #240

@haoqixu
Copy link
Member Author

haoqixu commented Feb 6, 2024

please take a look. @tigrannajaryan

@tigrannajaryan
Copy link
Member

By design, we need to canceling the context and closing the connection to stop the ReceiverLoop. And the gracefully shutdown flow of wsclient will do this for us. Considering ReceiverLoop as an internal implementation detail of the wsclient, we could just gracefully shut down the wsclient.

gorilla/websocket doesn't provide a method to unblock the reading of connection and we always need to close the connection to unblock the reading.

#240 puts the reading of connections into a separate goroutine, which is unnecessary and may cause goroutine leaks. And the update here actually reverts #240

@srikanthccv please take another look at this PR since it reverts your earlier PR #240.

@srikanthccv
Copy link
Member

srikanthccv commented Feb 6, 2024

We don't make it clear anywhere that ReceiverLoop is an internal implementation detail. By making both connection closing and context cancellation (I don't see much use of received parent context in this case) mandatory for exiting the receiver loop we may prematurely close the connection. There would be leaks if wsReceiver were to work in isolation but given we will close the connection followed by the receiver loop exit, there shouldn't be goroutine leaks?

We will need to send a final message(s) such as agent disconnect (not implemented yet) before stopping the client. I'd prefer to see wsclient have final say in when to close for such kind of reasons. I want to hear your thoughts on this.

@srikanthccv
Copy link
Member

I recall we had discussed this already once. The implementation can always evolve to satisfy the needs but We should explicitly call out what we consider as internal implementation details.

@haoqixu haoqixu force-pushed the f-163-wcclient-graceful-shutdown branch from 316a4d4 to d5c6e3a Compare March 7, 2024 03:56
@haoqixu haoqixu marked this pull request as draft March 7, 2024 04:09
@haoqixu haoqixu marked this pull request as ready for review March 7, 2024 07:00
@haoqixu
Copy link
Member Author

haoqixu commented Mar 7, 2024

I have updated the code. Please take another look. @srikanthccv @tigrannajaryan

Copy link
Member

@srikanthccv srikanthccv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay, LGTM.

@jaronoff97
Copy link
Contributor

@tigrannajaryan are we good to merge this? I just ran into an instance of this in the opampextension

@tigrannajaryan tigrannajaryan merged commit c7fc585 into open-telemetry:main Mar 26, 2024
7 checks passed
@tigrannajaryan
Copy link
Member

Thank you @haoqixu

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

wsclient: Unexpected error while receiving:read ... use of closed network connection
5 participants