Disconnect slow client when control plane fills up #261
Conversation
PR Overview
This PR introduces logic to disconnect slow clients when the message backlog fills up, ensuring that the control plane queue does not block or consume excess resources. Key changes include:
- Adding a new asynchronous test (test_slow_client) to verify client disconnection when the backlog is exceeded.
- Refactoring the client disconnect workflow by making on_disconnect asynchronous and using a CancellationToken for disconnecting slow clients.
- Adjusting the control plane backlog to use the same configurable backlog size as the data plane.
Reviewed Changes
| File | Description |
|---|---|
| rust/foxglove/src/websocket/tests.rs | Added a test that verifies the proper disconnection of a slow client and checks the error message. |
| rust/foxglove/src/websocket.rs | Updated on_disconnect to be asynchronous, replaced the fixed control plane backlog size with the configurable value, and enhanced the disconnection logic using a CancellationToken. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Fix typo Co-authored-by: Copilot <[email protected]>
Fix typo Co-authored-by: Copilot <[email protected]>
…-of-blocking-on-slow-client
I can imagine test/sim use cases where there is only a single client, and where waiting is better than disconnecting entirely. So I'm not convinced this is always the right behavior.
However, I don't think anyone is very likely to run into the default limits with these 'control' messages, and we do provide a workaround. I'll let others weigh in and approve; I think this is a reasonable step for now.
rust/foxglove/src/websocket.rs
let mut sender = self.sender.lock().await;
let status = Status::new(
    StatusLevel::Error,
    "disconnected because message backlog is full, consider increasing it".to_string(),
"consider increasing it" sounds like it belongs more in the server log. I can't configure backlog size in the Foxglove app.
I'd also use sentence case here — "Disconnected".
We can clarify that, but I really want to point the user at the knob they need to tune. They might have access to it (or if not, they know who to talk to).
I considered that, but it's very tricky to implement. I went down that mental rabbit hole for a bit and it looks hairy. You have to be able to block everything that calls send_control_msg, which is a lot of functions: some are called when processing messages from the client, and others are invoked by SDK methods. Those may be in sync or async contexts. I think you'd need to provide both an async version of send_control_msg that blocks via await and a sync version that blocks the thread (see the sketch below), and then switch to the disconnect behavior when a second client connects (and maybe switch back if the second-to-last client disconnects). Not impossible, but maybe an opportunity for future improvement.
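A minimal sketch of those two blocking variants, assuming a bounded tokio mpsc channel; the names here (`ControlMsg`, `ControlPlane`, `send_control_msg_async`, `send_control_msg_blocking`) are illustrative, not actual SDK API:

```rust
use tokio::sync::mpsc;

struct ControlMsg(Vec<u8>);

struct ControlPlane {
    tx: mpsc::Sender<ControlMsg>,
}

impl ControlPlane {
    /// Async callers park on `.await` until the backlog drains.
    async fn send_control_msg_async(
        &self,
        msg: ControlMsg,
    ) -> Result<(), mpsc::error::SendError<ControlMsg>> {
        self.tx.send(msg).await
    }

    /// Sync callers block the current thread instead. This must not be
    /// called from an async runtime worker thread, or it will panic.
    fn send_control_msg_blocking(
        &self,
        msg: ControlMsg,
    ) -> Result<(), mpsc::error::SendError<ControlMsg>> {
        self.tx.blocking_send(msg)
    }
}
```

Even with both variants, you'd still need the connect/disconnect bookkeeping to switch between blocking and disconnecting, which is where it gets hairy.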
LG
@@ -322,6 +322,7 @@ pub(crate) struct ConnectedClient {
    /// Optional callback handler for a server implementation
    server_listener: Option<Arc<dyn ServerListener>>,
    server: Weak<Server>,
+   cancellation_token: CancellationToken,
Add a description of how this is used. Maybe also give this a more self-descriptive name (`client_unresponsive`?).
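For example, as a sketch with the suggested rename (not the PR's actual code):

```rust
use tokio_util::sync::CancellationToken;

pub(crate) struct ConnectedClient {
    /// Cancelled to tear down this client's send/receive tasks when its
    /// control-plane backlog fills up and the client is deemed unresponsive.
    client_unresponsive: CancellationToken,
}
```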
There are two queues per connected client. One is for data (log messages, sent with channel.log) and is lossy: when it's full, we drop the oldest queued message. But this doesn't work well for "control" messages where missing a message would be bad, like channel advertisements/unadvertisements, RPC responses, etc. So we have a second queue for these messages that isn't lossy.
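As a rough sketch of the two policies, using a plain `VecDeque` stand-in (the real implementation uses async channels, and `Backlog` is a made-up name):

```rust
use std::collections::VecDeque;

struct Backlog<T> {
    queue: VecDeque<T>,
    capacity: usize,
}

impl<T> Backlog<T> {
    /// Data plane: lossy. When the queue is full, drop the oldest message
    /// to make room for the new one.
    fn push_lossy(&mut self, msg: T) {
        if self.queue.len() == self.capacity {
            self.queue.pop_front();
        }
        self.queue.push_back(msg);
    }

    /// Control plane: lossless. When the queue is full, refuse the message
    /// so the caller can react, instead of silently dropping anything.
    fn try_push(&mut self, msg: T) -> Result<(), T> {
        if self.queue.len() == self.capacity {
            return Err(msg);
        }
        self.queue.push_back(msg);
        Ok(())
    }
}
```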
But we still have to do something when the control plane queue fills up. If it's unbounded, we consume resources until we crash the process. If it's bounded but blocking, then slow clients can block other clients from making progress.
So we need to do something without blocking and without increasing the queue size. The only thing that really makes sense is to disconnect the slow client. It's obviously not a good experience for that client, but we have to protect the process and the other clients. So we make the message backlog size configurable: if a user is getting disconnected and has resources to burn, they can increase it; conversely, if the process is crashing with OOM, they can decrease it.
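In sketch form, with a bounded tokio channel and tokio_util's `CancellationToken` (names illustrative, not the exact code):

```rust
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

struct Client {
    control_tx: mpsc::Sender<Vec<u8>>,
    cancellation_token: CancellationToken,
}

impl Client {
    /// Non-blocking send on the control plane. If the bounded backlog is
    /// full, give up on this client rather than block or grow the queue.
    fn send_control_msg(&self, msg: Vec<u8>) {
        match self.control_tx.try_send(msg) {
            Ok(()) => {}
            Err(mpsc::error::TrySendError::Full(_)) => {
                // Signals the client's tasks (which select on
                // `cancellation_token.cancelled()`) to shut down.
                self.cancellation_token.cancel();
            }
            Err(mpsc::error::TrySendError::Closed(_)) => {
                // Client is already shutting down; nothing to do.
            }
        }
    }
}
```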
To keep things simple there's just one configurable backlog size, and we use the same value for the data plane and control plane queues. They don't need to be the same size, but I'm not sure we need to expose two adjustable backlog sizes for the user. By default we use a backlog of 1024 messages.
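Something like the following hypothetical options struct (the actual configuration surface may differ):

```rust
struct ServerOptions {
    /// Capacity shared by both the data-plane and control-plane queues.
    message_backlog_size: usize,
}

impl Default for ServerOptions {
    fn default() -> Self {
        Self {
            message_backlog_size: 1024,
        }
    }
}
```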
When we disconnect the client, we tell them why with an error status message, and point them at the configurable backlog option.