Disconnect slow client when control plane fills up #261
Conversation
PR Overview
This PR introduces logic to disconnect slow clients when the message backlog fills up, ensuring that the control plane queue does not block or consume excess resources. Key changes include:
- Adding a new asynchronous test (test_slow_client) to verify client disconnection when the backlog is exceeded.
- Refactoring the client disconnect workflow by making on_disconnect asynchronous and using a CancellationToken for disconnecting slow clients.
- Adjusting the control plane backlog to use the same configurable backlog size as the data plane.
Reviewed Changes
| File | Description |
|---|---|
| rust/foxglove/src/websocket/tests.rs | Added a test that verifies the proper disconnection of a slow client and checks the error message. |
| rust/foxglove/src/websocket.rs | Updated on_disconnect to be asynchronous, replaced the fixed control plane backlog size with the configurable value, and enhanced the disconnection logic using a CancellationToken. |
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
Fix typo Co-authored-by: Copilot <[email protected]>
Fix typo Co-authored-by: Copilot <[email protected]>
…-of-blocking-on-slow-client
I can imagine test/sim use cases where there is only a single client, and where waiting is better than disconnecting entirely. So I'm not convinced this is always the right behavior.
However, I don't think anyone is very likely to run into the default limits with these 'control' messages, and we do provide a workaround. I'll let others weigh in and approve; I think this is a reasonable step for now.
rust/foxglove/src/websocket.rs
let mut sender = self.sender.lock().await;
let status = Status::new(
    StatusLevel::Error,
    "disconnected because message backlog is full, consider increasing it".to_string(),
"consider increasing it" sounds like it belongs more in the server log. I can't configure backlog size in the Foxglove app.
I'd also use sentence case here — "Disconnected".
We can clarify that, but I really want to point the user at the knob they need to tune. They might have access to it (or if not, they know who to talk to).
I considered that, but it's very tricky to implement. I went down that mental rabbit hole for a bit and it looks hairy. You have to be able to block everything that calls send_control_msg, which is a lot of functions: some are called when processing messages from the client, and others are invoked by SDK methods. Those may be in sync or async contexts. I think you'd need to provide both an async version of send_control_msg that blocks via await and a sync version that blocks the thread (see the sketch below), and then switch to the disconnect behavior when a second client connects (and maybe switch back if the second-to-last client disconnects). Not impossible, but maybe an opportunity for future improvement.
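A minimal sketch of those two blocking variants, assuming a bounded tokio mpsc channel; the names here (`ControlMsg`, `ControlPlane`, `send_control_msg_async`, `send_control_msg_blocking`) are illustrative, not actual SDK API:

```rust
use tokio::sync::mpsc;

struct ControlMsg(Vec<u8>);

struct ControlPlane {
    tx: mpsc::Sender<ControlMsg>,
}

impl ControlPlane {
    /// Async callers park on `.await` until the backlog drains.
    async fn send_control_msg_async(
        &self,
        msg: ControlMsg,
    ) -> Result<(), mpsc::error::SendError<ControlMsg>> {
        self.tx.send(msg).await
    }

    /// Sync callers block the current thread instead. This must not be
    /// called from an async runtime worker thread, or it will panic.
    fn send_control_msg_blocking(
        &self,
        msg: ControlMsg,
    ) -> Result<(), mpsc::error::SendError<ControlMsg>> {
        self.tx.blocking_send(msg)
    }
}
```

Even with both variants, you'd still need the connect/disconnect bookkeeping to switch between blocking and disconnecting, which is where it gets hairy.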
LG
@@ -322,6 +322,7 @@ pub(crate) struct ConnectedClient {
    /// Optional callback handler for a server implementation
    server_listener: Option<Arc<dyn ServerListener>>,
    server: Weak<Server>,
+   cancellation_token: CancellationToken,
Add a description of how this is used. Maybe also give this a more self-descriptive name (`client_unresponsive`?).
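For example, as a sketch with the suggested rename (not the PR's actual code):

```rust
use tokio_util::sync::CancellationToken;

pub(crate) struct ConnectedClient {
    /// Cancelled to tear down this client's send/receive tasks when its
    /// control-plane backlog fills up and the client is deemed unresponsive.
    client_unresponsive: CancellationToken,
}
```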
There are two queues per connected client. One is for data (log messages, sent with channel.log) and is lossy: when it's full, we drop the oldest queued message. But this doesn't work well for "control" messages where missing a message would be bad, like channel advertisements/unadvertisements, RPC responses, etc. So we have a second queue for these messages that isn't lossy.
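As a rough sketch of the two policies, using a plain `VecDeque` stand-in (the real implementation uses async channels, and `Backlog` is a made-up name):

```rust
use std::collections::VecDeque;

struct Backlog<T> {
    queue: VecDeque<T>,
    capacity: usize,
}

impl<T> Backlog<T> {
    /// Data plane: lossy. When the queue is full, drop the oldest message
    /// to make room for the new one.
    fn push_lossy(&mut self, msg: T) {
        if self.queue.len() == self.capacity {
            self.queue.pop_front();
        }
        self.queue.push_back(msg);
    }

    /// Control plane: lossless. When the queue is full, refuse the message
    /// so the caller can react, instead of silently dropping anything.
    fn try_push(&mut self, msg: T) -> Result<(), T> {
        if self.queue.len() == self.capacity {
            return Err(msg);
        }
        self.queue.push_back(msg);
        Ok(())
    }
}
```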
But we still have to do something when the control plane queue fills up. If it's unbounded, we consume resources until we crash the process. If it's bounded but blocking, then slow clients can block other clients from making progress.
So we need to do something without blocking and without increasing the queue size. The only thing that really makes sense is to disconnect the slow client. It's obviously not a good experience for that client, but we have to protect the process and the other clients. So we make the message backlog size configurable: if a user is getting disconnected and has resources to burn, they can increase it; conversely, if the process is crashing with OOM, they can decrease it.
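In sketch form, with a bounded tokio channel and tokio_util's `CancellationToken` (names illustrative, not the exact code):

```rust
use tokio::sync::mpsc;
use tokio_util::sync::CancellationToken;

struct Client {
    control_tx: mpsc::Sender<Vec<u8>>,
    cancellation_token: CancellationToken,
}

impl Client {
    /// Non-blocking send on the control plane. If the bounded backlog is
    /// full, give up on this client rather than block or grow the queue.
    fn send_control_msg(&self, msg: Vec<u8>) {
        match self.control_tx.try_send(msg) {
            Ok(()) => {}
            Err(mpsc::error::TrySendError::Full(_)) => {
                // Signals the client's tasks (which select on
                // `cancellation_token.cancelled()`) to shut down.
                self.cancellation_token.cancel();
            }
            Err(mpsc::error::TrySendError::Closed(_)) => {
                // Client is already shutting down; nothing to do.
            }
        }
    }
}
```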
To keep things simple there's just one configurable backlog size, and we use the same value for the data plane and control plane queues. They don't need to be the same size, but I'm not sure we need to expose two adjustable backlog sizes for the user. By default we use a backlog of 1024 messages.
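Something like the following hypothetical options struct (the actual configuration surface may differ):

```rust
struct ServerOptions {
    /// Capacity shared by both the data-plane and control-plane queues.
    message_backlog_size: usize,
}

impl Default for ServerOptions {
    fn default() -> Self {
        Self {
            message_backlog_size: 1024,
        }
    }
}
```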
When we disconnect the client, we tell them why with an error status message, and point them at the configurable backlog option.