Add some basic benchmarks for mcap and websocketserver #277
Conversation
Pull Request Overview
This pull request adds a suite of benchmarks measuring performance of both MCAP writing and WebSocket server roundtrips, and updates Cargo.toml to include the benchmark definitions and a dependency on divan.
- Added benchmark scenarios for MCAP writing (JSON messages, large protobuf, multiple threads)
- Added benchmark scenarios for WebSocket server roundtrip tests (JSON messages, scene update, multi-threaded)
- Exposed a new channel method (`id`) for retrieving a channel's unique identifier
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
| --- | --- |
| rust/foxglove/benches/mcap_bench.rs | Introduces new MCAP benchmarks with parameterized scenarios. |
| rust/foxglove/benches/websocketserver_bench.rs | Adds WebSocket server benchmarks for roundtrip messaging and threading. |
| rust/foxglove/Cargo.toml | Updates to include bench definitions and the divan dependency. |
| rust/foxglove/src/encode.rs | Adds a new method to retrieve the channel ID to support the new benchmarks. |
Comments suppressed due to low confidence (2)
rust/foxglove/benches/mcap_bench.rs:107
- The function name 'mutlithreaded_log' appears to be misspelled; consider renaming it to 'multithreaded_log' for clarity.

```rust
fn mutlithreaded_log(bencher: divan::Bencher, num_threads: usize) {
```

rust/foxglove/benches/websocketserver_bench.rs:201
- The function name 'roundtrip_mutlithreaded' seems to have a typo; consider renaming it to 'roundtrip_multithreaded' for consistency.

```rust
fn roundtrip_mutlithreaded(bencher: divan::Bencher, num_threads: usize) {
```
Looks good to me in general.
Locally, between runs, I get wildly different `slowest` values for the multithreaded log case, even with a single thread. Any idea why?
```rust
_ = ws_client.next().await.expect("No serverInfo sent");

// FG-10395 replace this with something more precise
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
```
Not a big deal, but why do we need to sleep before adding each client?
Good question, I think this can be moved outside the loop
```diff
@@ -80,6 +80,11 @@ impl<T: Encode> TypedChannel<T> {
     pub fn topic(&self) -> &str {
         &self.inner.topic
     }
+
+    /// Returns the channel ID.
+    pub fn id(&self) -> ChannelId {
```
If this is only needed for clients, do we want it as part of the public API? Or can this be behind some feature?
We can hide it if we don't want to expose it; I'll do that for now, until we have a reason to expose it.
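For what it's worth, one way to hide it, sketched below with simplified stand-in types (the real `TypedChannel<T>` and `ChannelId` live in encode.rs): bench targets compile as separate crates, so `pub(crate)` would not reach them, but `#[doc(hidden)]` keeps the method callable from benchmarks while dropping it from the documented API. A feature gate (`#[cfg(feature = "...")]`) would be another option.

```rust
/// Stand-in types; the real definitions live in foxglove's encode.rs.
#[derive(Clone, Copy, Debug)]
pub struct ChannelId(pub u64);

pub struct TypedChannel {
    id: ChannelId,
}

impl TypedChannel {
    /// Returns the channel ID.
    ///
    /// Hidden from rustdoc, but still `pub` so that bench targets
    /// (which are separate crates) can call it.
    #[doc(hidden)]
    pub fn id(&self) -> ChannelId {
        self.id
    }
}
```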
The end-to-end measurements are nice to have, but I think we might want to target smaller chunks of functionality.
For example, when we're benchmarking the websocket server, I'm not really interested in different kinds of serialization. I want to see how many fixed-size binary messages we can push per second, for various message sizes.
I'd like to set up 0..N "null" sinks that just drop messages, and measure how much latency we have on the Channel.log() path. It might also be nice to measure allocations on that path. Maybe we also measure allocations for the protobuf and json serializers, for certain schema types.
For particular high-throughput schema types (e.g., compressed video, point cloud), it might be nice to measure how long it takes to shove data into the protobuf struct, and then separately measure how long it takes to serialize.
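A minimal divan sketch of the fixed-size throughput idea. The `log_to_null_sink` stand-in below is hypothetical, not the SDK's sink API; in the real benchmark the closure body would be a `Channel::log()` call against 0..N registered null sinks.

```rust
use divan::counter::BytesCount;

// Hypothetical stand-in for logging through a "null" sink: consume the
// payload without doing serialization or I/O.
fn log_to_null_sink(payload: &[u8]) {
    divan::black_box(payload);
}

// One benchmark per message size; divan reports each arg separately.
#[divan::bench(args = [64, 1024, 64 * 1024])]
fn log_fixed_size(bencher: divan::Bencher, size: usize) {
    let payload = vec![0u8; size];
    bencher
        .counter(BytesCount::new(size)) // also report throughput in bytes/sec
        .bench(|| log_to_null_sink(divan::black_box(&payload)));
}

fn main() {
    divan::main();
}
```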
```rust
let tmpdir = tempfile::tempdir().unwrap();
let path = tmpdir.path().join("test.mcap");
let _writer = McapWriter::new()
    .create_new_buffered_file(&path)
    .expect("Failed to start mcap writer");
```
Maybe /dev/null, so we're not measuring filesystem latency?
Oh, nice idea!
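A sketch of the idea using only std (Unix path assumed); whether it plugs into McapWriter depends on whether the writer accepts an arbitrary `Write` implementation rather than only a path:

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn main() -> std::io::Result<()> {
    // Writes land in the bit bucket, so the benchmark measures encoding and
    // buffering work rather than disk latency. std::io::sink() is a portable
    // in-process alternative if any `impl Write` is accepted.
    let devnull = File::create("/dev/null")?;
    let mut out = BufWriter::new(devnull);
    out.write_all(b"mcap bytes would go here")?;
    out.flush()
}
```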
```rust
MSG_CHANNEL.log(&message);

bencher.bench(|| {
    for _ in 0..100 {
```
When I first saw the numbers I was a bit dismayed. Now it makes more sense.
Why 100 iterations inside the loop? To amortize the cost of compressing/flushing chunks to the underlying file?
I'm wondering whether we want to measure these costs as part of the SDK benchmark suite, or whether they would belong better in the mcap library itself. There's a wide range of configuration options for the mcap writer, which will have various impacts on its efficiency.
It's worthwhile doing both.
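Whichever suite they land in, if the 100-iteration loop stays, divan's item counter can make the per-message cost legible in the output. A sketch, with `make_message()` standing in for the bench file's message setup and `MSG_CHANNEL` as in the snippet above:

```rust
use divan::counter::ItemsCount;

#[divan::bench]
fn log_batch(bencher: divan::Bencher) {
    const BATCH: u32 = 100;
    let message = make_message(); // stand-in for the PR's message setup
    bencher
        .counter(ItemsCount::new(BATCH)) // report per-message throughput
        .bench(|| {
            for _ in 0..BATCH {
                MSG_CHANNEL.log(divan::black_box(&message));
            }
        });
}
```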
```rust
let mut entities = Vec::with_capacity(num_entities);
for i in 0..num_entities {
    entities.push(SceneEntity {
```
Do we want to move these allocations out of the bench() closure?
Nope, this is how the log function is used for scene updates, so we want to measure it.
If that's the case, let's measure it separately? Otherwise I don't know what part of the measurement is data-prep, vs serialization, vs mcap writing.
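One way to split it, sketched with a hypothetical `build_scene_update` helper and `SCENE_CHANNEL` in place of the PR's setup: one bench times only the data-prep, the other logs a message built outside the timed closure.

```rust
// Times only data-prep: constructing the SceneUpdate message.
#[divan::bench(args = [10, 100, 1000])]
fn build_scene_update_bench(bencher: divan::Bencher, num_entities: usize) {
    bencher.bench(|| divan::black_box(build_scene_update(num_entities)));
}

// Times only serialization + logging: the message is prepared once,
// outside the measured closure.
#[divan::bench(args = [10, 100, 1000])]
fn log_scene_update_bench(bencher: divan::Bencher, num_entities: usize) {
    let update = build_scene_update(num_entities);
    bencher.bench(|| SCENE_CHANNEL.log(divan::black_box(&update)));
}
```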
```rust
#[divan::bench(args = [1, 2, 4, 8, 16, 32])]
fn mutlithreaded_log(bencher: divan::Bencher, num_threads: usize) {
```
Divan has native support for threads which might be useful here.
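For reference, a sketch of that support: with the `threads` option, divan spawns the configured thread counts itself and runs the closure concurrently on each, replacing the hand-rolled spawning (`MSG_CHANNEL` and `make_message()` are stand-ins as above).

```rust
#[divan::bench(threads = [1, 2, 4, 8, 16, 32])]
fn multithreaded_log(bencher: divan::Bencher) {
    // The closure below is shared with every configured thread count,
    // so whatever it captures must be safe to use concurrently.
    let message = make_message();
    bencher.bench(|| MSG_CHANNEL.log(divan::black_box(&message)));
}
```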
```rust
.with_inputs(|| {
    runtime.block_on(async move {
        let mut ws_clients = Vec::with_capacity(num_clients);
```
I think we're invoking this closure for every iteration of the benchmark, which ends up being extraordinarily time-consuming. Let's set up the clients in advance when we set up the server. You can package them up under a tokio `Mutex` for interior mutability.
That's a good point
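A sketch of that shape, with `connect_client`, `runtime`, `num_clients`, and `bencher` standing in for the bench file's setup: clients connect once up front, and each iteration locks the shared vector instead of reconnecting.

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Connect all clients once, before any measurement starts.
let ws_clients = Arc::new(Mutex::new(runtime.block_on(async {
    let mut clients = Vec::with_capacity(num_clients);
    for _ in 0..num_clients {
        clients.push(connect_client().await);
    }
    clients
})));

bencher.bench(|| {
    let clients = Arc::clone(&ws_clients);
    runtime.block_on(async move {
        // Interior mutability: lock, then drive the roundtrip against the
        // already-connected clients.
        let mut clients = clients.lock().await;
        divan::black_box(&mut *clients);
    });
});
```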
```rust
    })
})
.bench_values(|mut ws_clients| {
    for _ in 0..50 {
```
Why 50 messages instead of just one?
So that if the cost of the first message is higher, which is the case with mcap, the extra cost gets amortized out.
But this is the websocket benchmark.
The slowest case is likely to be dominated by poor scheduling in the kernel; it's basically noise to be discarded.
This adds 3 benchmark scenarios each for mcap writing and the websocket server:
one for JSON messages, one for a large protobuf message (SceneUpdate), and one for multiple threads. Each scenario has 4+ parameterized benchmarks to show how it scales along the dimension of interest.
Multi-threaded runtime follows a U-shape, with the best performance in the middle: adding threads helps, up to a point where contention starts to dominate.
We can use this as a baseline when profiling and optimizing performance.
Here are the results on my machine: