Skip to content

Disconnect slow client when control plane fills up #261

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

eloff
Copy link
Contributor

@eloff eloff commented Feb 28, 2025

There are two queues per connected client. One for data (log messages, sent with channel.log) with is lossy. When it's full, we drop the oldest queued message. But this doesn't work well for "control" messages where missing a message would be bad, like channel advertisements/unadvertisements, RPC responses, etc. So we have a second queue for these messages that isn't lossy.

But we still have to do something when the control plane queue fills up. If it's unbounded we consume resources until we crash the process. If it's bounded, but blocking, then slow clients can block other clients from making progress.

So we need to do something without blocking and without increasing the queue size. The only thing that really makes any sense is disconnect the slow client. It's obviously not a good experience for that client, but we have to protect the process and the other clients. So we make this a configurable message backlog size, and the if the user is getting disconnected, and they have the resources to burn, they can increase it. Conversely if the process is crashing with OOM they can decrease it.

To keep things simple there's just one configurable backlog size, and we use the same value for the data plane and control plane queues. They don't need to be the same size, but I'm not sure we need to expose two adjustable backlog sizes for the user. By default we use a backlog of 1024 messages.

When we disconnect the client, we tell them why with an error status message, and point them at the configurable backlog option.

Copy link

linear bot commented Feb 28, 2025

Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Overview

This PR introduces logic to disconnect slow clients when the message backlog fills up, ensuring that the control plane queue does not block or consume excess resources. Key changes include:

  • Adding a new asynchronous test (test_slow_client) to verify client disconnection when the backlog is exceeded.
  • Refactoring the client disconnect workflow by making on_disconnect asynchronous and using a CancellationToken for disconnecting slow clients.
  • Adjusting the control plane backlog to use the same configurable backlog size as the data plane.

Reviewed Changes

File Description
rust/foxglove/src/websocket/tests.rs Added a test that verifies the proper disconnection of a slow client and checks the error message.
rust/foxglove/src/websocket.rs Updated on_disconnect to be asynchronous, replaced the fixed control plane backlog size with the configurable value, and enhanced the disconnection logic using a CancellationToken.

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Copy link
Contributor

@bryfox bryfox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine test/sim use cases where there is only a single client, and where waiting is better than disconnecting entirely. So I'm not convinced this is always the right behavior.

However, I don't think anyone is very likely to run into the default limits with these 'control' messages, and we do provide a workaround. I'll let others weigh in and approve; I think this is a reasonable step for now.

let mut sender = self.sender.lock().await;
let status = Status::new(
StatusLevel::Error,
"disconnected because message backlog is full, consider increasing it".to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"consider increasing it" sounds like it belongs more in the server log. I can't configure backlog size in the Foxglove app.

I'd also use sentence case here — "Disconnected".

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can clarify that, but I really want to point the user at what knob they need to tune. They might have access to it (or if not they know who to talk to)

@eloff
Copy link
Contributor Author

eloff commented Feb 28, 2025

I can imagine test/sim use cases where there is only a single client, and where waiting is better than disconnecting entirely. So I'm not convinced this is always the right behavior.

However, I don't think anyone is very likely to run into the default limits with these 'control' messages, and we do provide a workaround. I'll let others weigh in and approve; I think this is a reasonable step for now.

I considered that, it is very tricky to implement though. I went down that mental rabbit hole for a bit and it looks hairy.

You have to be able to block everything that calls send_control_msg, which is a lot of functions, some called when processing messages from the client, and others invoked by sdk methods. Those may be in sync or async contexts. I think you'd need to provide both an async version of send_control_msg that blocks via await, and a sync version that blocks the thread, and then switch to the disconnect behavior when a second client connects (and maybe also switch back if the second last client disconnects.)

Not impossible, but maybe an opportunity for future improvement.

Copy link
Contributor

@gasmith gasmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG

@@ -322,6 +322,7 @@ pub(crate) struct ConnectedClient {
/// Optional callback handler for a server implementation
server_listener: Option<Arc<dyn ServerListener>>,
server: Weak<Server>,
cancellation_token: CancellationToken,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a description about how this is used. Maybe also give this a more self-descriptive name (client_unresponsive?).

@eloff eloff merged commit 353e1b2 into main Feb 28, 2025
35 checks passed
@eloff eloff deleted the dan/fg-10441-close-client-connection-instead-of-blocking-on-slow-client branch February 28, 2025 23:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants