
Add some basic benchmarks for mcap and websocketserver #277


Closed

Conversation

@eloff (Contributor) commented Mar 11, 2025

This adds three benchmark scenarios each for MCAP writing and the websocket server: one for JSON messages, one for a large protobuf message (SceneUpdate), and one for multiple threads. Each scenario has four or more parameterized benchmarks to show how it scales along the dimension of interest.

Multi-threaded runtime follows a U-shape, with the best performance in the middle: adding threads helps up to a point, after which contention starts to dominate the runtime.

We can use this to get a baseline when profiling and optimizing performance.
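
For orientation, each scenario follows divan's parameterized-benchmark shape, roughly like this (a simplified sketch, not the PR's exact code; `make_json_payload` and the argument values are stand-ins, and `MSG_CHANNEL` refers to the channel set up elsewhere in the bench file):

```rust
use divan::Bencher;

fn main() {
    // Collects and runs every #[divan::bench] in this file.
    divan::main();
}

// divan runs the benchmark once per value in `args`, so the output table
// shows how logging scales along the dimension of interest.
#[divan::bench(args = [50, 100, 200, 400])]
fn log_json_message(bencher: Bencher, payload_size: usize) {
    let message = make_json_payload(payload_size); // hypothetical payload helper
    bencher.bench(|| {
        // Log through the channel under test; black_box prevents the
        // compiler from optimizing the call away.
        MSG_CHANNEL.log(divan::black_box(&message));
    });
}
```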

Here are the results on my machine:

> cargo bench -p foxglove

     Running benches/mcap_bench.rs (target/release/deps/mcap_bench-8552836f6882eaf7)
Timer precision: 10 ns
mcap_bench            fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ log_json_message                 │               │               │               │         │
│  ├─ 50              173.5 µs      │ 3.171 ms      │ 197.8 µs      │ 239.9 µs      │ 100     │ 100
│  ├─ 100             183.6 µs      │ 2.472 ms      │ 204.4 µs      │ 243 µs        │ 100     │ 100
│  ├─ 200             178.4 µs      │ 1.944 ms      │ 196.8 µs      │ 244.5 µs      │ 100     │ 100
│  ╰─ 400             189.6 µs      │ 2.177 ms      │ 209 µs        │ 293.4 µs      │ 100     │ 100
├─ log_scene_update                 │               │               │               │         │
│  ├─ 1               187.8 µs      │ 2.696 ms      │ 209.3 µs      │ 258 µs        │ 100     │ 100
│  ├─ 2               200.7 µs      │ 2.057 ms      │ 244.2 µs      │ 321 µs        │ 100     │ 100
│  ├─ 4               248.6 µs      │ 2.734 ms      │ 284.8 µs      │ 428.8 µs      │ 100     │ 100
│  ╰─ 8               330.6 µs      │ 2.018 ms      │ 374.2 µs      │ 536.2 µs      │ 100     │ 100
╰─ mutlithreaded_log                │               │               │               │         │
   ├─ 1               8.342 ms      │ 34.25 ms      │ 17.8 ms       │ 18.58 ms      │ 100     │ 100
   ├─ 2               4.804 ms      │ 27.94 ms      │ 14.74 ms      │ 14.96 ms      │ 100     │ 100
   ├─ 4               4.777 ms      │ 11.83 ms      │ 6.694 ms      │ 6.653 ms      │ 100     │ 100
   ├─ 8               5.128 ms      │ 32.91 ms      │ 15.99 ms      │ 15.75 ms      │ 100     │ 100
   ├─ 16              8.569 ms      │ 32.62 ms      │ 29.33 ms      │ 28 ms         │ 100     │ 100
   ╰─ 32              15.31 ms      │ 34.91 ms      │ 32.27 ms      │ 31.58 ms      │ 100     │ 100

     Running benches/websocketserver_bench.rs (target/release/deps/websocketserver_bench-aadf4b8711f23b7c)
Timer precision: 46 ns
websocketserver_bench       fastest       │ slowest       │ median        │ mean          │ samples │ iters
├─ roundtrip_json_message                 │               │               │               │         │
│  ├─ 1                     1.478 ms      │ 42.46 ms      │ 41.36 ms      │ 41 ms         │ 100     │ 100
│  ├─ 2                     40.56 ms      │ 42.86 ms      │ 41.72 ms      │ 41.78 ms      │ 100     │ 100
│  ├─ 4                     40.89 ms      │ 43.99 ms      │ 42.22 ms      │ 42.28 ms      │ 100     │ 100
│  ╰─ 8                     41.86 ms      │ 44.72 ms      │ 43.23 ms      │ 43.35 ms      │ 100     │ 100
├─ roundtrip_mutlithreaded                │               │               │               │         │
│  ├─ 1                     12.52 ms      │ 65.63 ms      │ 37.81 ms      │ 41.03 ms      │ 100     │ 100
│  ├─ 2                     10.31 ms      │ 21.69 ms      │ 16.12 ms      │ 16.01 ms      │ 100     │ 100
│  ├─ 4                     5.695 ms      │ 22.45 ms      │ 12.38 ms      │ 13.62 ms      │ 100     │ 100
│  ├─ 8                     6.038 ms      │ 21.16 ms      │ 12.14 ms      │ 13.29 ms      │ 100     │ 100
│  ├─ 16                    4.911 ms      │ 23.31 ms      │ 16.28 ms      │ 14.7 ms       │ 100     │ 100
│  ╰─ 32                    13.49 ms      │ 26.26 ms      │ 20.29 ms      │ 20 ms         │ 100     │ 100
╰─ roundtrip_scene_update                 │               │               │               │         │
   ├─ 1                     1.399 ms      │ 43.29 ms      │ 41.39 ms      │ 41.28 ms      │ 100     │ 100
   ├─ 2                     41.23 ms      │ 44.5 ms       │ 42.26 ms      │ 42.27 ms      │ 100     │ 100
   ├─ 4                     41.51 ms      │ 44.68 ms      │ 42.75 ms      │ 42.92 ms      │ 100     │ 100
   ╰─ 8                     41.73 ms      │ 45.51 ms      │ 43.97 ms      │ 44.08 ms      │ 100     │ 100


@Copilot (Copilot AI) left a comment

Pull Request Overview

This pull request adds a suite of benchmarks to measure performance for both MCAP writing and WebSocket server roundtrips, and updates Cargo.toml to include the benchmark definitions and a dependency on divan.

  • Added benchmark scenarios for MCAP writing (json messages, large protobuf, multiple threads)
  • Added benchmark scenarios for WebSocket server roundtrip tests (json messages, scene update, multi-threaded)
  • Exposed a new channel method (“id”) for retrieving a channel’s unique identifier

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Changed files:

  • rust/foxglove/benches/mcap_bench.rs: Introduces new MCAP benchmarks with parameterized scenarios.
  • rust/foxglove/benches/websocketserver_bench.rs: Adds WebSocket server benchmarks for roundtrip messaging and threading.
  • rust/foxglove/Cargo.toml: Adds the bench definitions and the divan dependency.
  • rust/foxglove/src/encode.rs: Adds a new method to retrieve the channel ID, in support of the new benchmarks.
Comments suppressed due to low confidence (2)

rust/foxglove/benches/mcap_bench.rs:107

  • The function name 'mutlithreaded_log' appears to be misspelled; consider renaming it to 'multithreaded_log' for clarity.
fn mutlithreaded_log(bencher: divan::Bencher, num_threads: usize) {

rust/foxglove/benches/websocketserver_bench.rs:201

  • The function name 'roundtrip_mutlithreaded' seems to have a typo; consider renaming it to 'roundtrip_multithreaded' for consistency.
fn roundtrip_mutlithreaded(bencher: divan::Bencher, num_threads: usize) {

@bryfox (Contributor) left a comment

Looks good to me in general.

Locally, between runs, I get wildly different slowest values for the multithreaded log case, even with a single thread. Any idea why?

_ = ws_client.next().await.expect("No serverInfo sent");

// FG-10395 replace this with something more precise
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
Contributor: Not a big deal, but why do we need to sleep before adding each client?

Contributor (author): Good question, I think this can be moved outside the loop.
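
Something like that could look as follows: connect all the clients first, then sleep once for subscriptions to settle (`connect_client` is a hypothetical stand-in for the existing connection setup):

```rust
// Connect every client up front...
for _ in 0..num_clients {
    let mut ws_client = connect_client(&addr).await; // hypothetical helper
    _ = ws_client.next().await.expect("No serverInfo sent");
    ws_clients.push(ws_client);
}
// ...then wait once, not once per client, for subscriptions to settle.
// FG-10395 replace this with something more precise
tokio::time::sleep(std::time::Duration::from_millis(50)).await;
```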

@@ -80,6 +80,11 @@ impl<T: Encode> TypedChannel<T> {
pub fn topic(&self) -> &str {
&self.inner.topic
}

/// Returns the channel ID.
pub fn id(&self) -> ChannelId {
    self.inner.id() // body truncated in the diff view; delegation to the inner channel is an assumption
}
Contributor: If this is only needed for clients, do we want it as part of the public API? Or can this be behind some feature?

Contributor (author): We can hide it if we don't want to expose it. I'll do that for now, until we have a reason to expose it.
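
One lightweight way to do that, sketched under the assumption that `id()` delegates to the inner channel: keep the method callable but hide it from the generated docs.

```rust
/// Returns the channel ID.
///
/// Kept out of the public docs until there is a reason to expose it.
#[doc(hidden)]
pub fn id(&self) -> ChannelId {
    self.inner.id() // assumed delegation, as in the diff above
}
```

A `#[cfg(feature = "...")]` gate would go further and remove the method from default builds entirely.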

@gasmith (Contributor) left a comment

The end-to-end measurements are nice to have, but I think we might want to target smaller chunks of functionality.

For example, when we're benchmarking the websocket server, I'm not really interested in different kinds of serialization. I want to see how many fixed-sized binary messages we can push per second, for various message sizes.

I'd like to set up 0..N "null" sinks that just drop messages, and measure how much latency we have on the Channel.log() path. It might also be nice to measure allocations on that path. Maybe we also measure allocations for the protobuf and json serializers, for certain schema types.

For particular high-throughput schema types (e.g., compressed video, point cloud), it might be nice to measure how long it takes to shove data into the protobuf struct, and then separately measure how long it takes to serialize.
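
A hedged sketch of that last idea, assuming the generated schema types implement prost::Message (`build_scene_update` is a hypothetical helper that fills a SceneUpdate with `n` entities):

```rust
use divan::{black_box, Bencher};
use prost::Message;

// Measure data prep and serialization separately, so neither cost hides
// the other.
#[divan::bench(args = [1, 2, 4, 8])]
fn fill_scene_update(bencher: Bencher, n: usize) {
    // Time only the construction of the protobuf struct.
    bencher.bench(|| black_box(build_scene_update(black_box(n))));
}

#[divan::bench(args = [1, 2, 4, 8])]
fn encode_scene_update(bencher: Bencher, n: usize) {
    let msg = build_scene_update(n); // built once, outside the timed closure
    // Time only the protobuf encoding.
    bencher.bench(|| black_box(msg.encode_to_vec()));
}
```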

Comment on lines +27 to +31
let tmpdir = tempfile::tempdir().unwrap();
let path = tmpdir.path().join("test.mcap");
let _writer = McapWriter::new()
.create_new_buffered_file(&path)
.expect("Failed to start mcap writer");
Contributor: Maybe /dev/null, so we're not measuring filesystem latency?

Contributor (author): Oh, nice idea!
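
A sketch of that change (a constructor accepting an arbitrary Write + Seek handle is an assumption; only create_new_buffered_file is shown above):

```rust
use std::{fs::File, io::BufWriter};

// Unix-only: write the MCAP stream to /dev/null so the benchmark measures
// serialization and chunk compression rather than filesystem latency.
let devnull = File::create("/dev/null").expect("Failed to open /dev/null");
let _writer = McapWriter::new()
    .create(BufWriter::new(devnull)) // hypothetical constructor over Write + Seek
    .expect("Failed to start mcap writer");
```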

MSG_CHANNEL.log(&message);

bencher.bench(|| {
for _ in 0..100 {
Contributor:
When I first saw the numbers I was a bit dismayed. Now it makes more sense.

Why 100 iterations inside the loop? To amortize the cost of compressing/flushing chunks to the underlying file?

I'm wondering whether we want to measure these costs as part of the SDK benchmark suite, or whether they would belong better in the mcap library itself. There's a wide range of configuration options for the mcap writer, which will have various impacts on its efficiency.

Contributor (author): It's worthwhile doing both.

Comment on lines +61 to +63
let mut entities = Vec::with_capacity(num_entities);
for i in 0..num_entities {
entities.push(SceneEntity {
Contributor: Do we want to move these allocations out of the bench() closure?

Contributor (author): Nope, this is the usage of the log function for scene update, so we want to measure it.

Contributor: If that's the case, let's measure it separately? Otherwise I don't know what part of the measurement is data prep vs. serialization vs. mcap writing.
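
One way to split it, using divan's with_inputs so entity construction happens outside the timed closure (`build_scene_update` and `SCENE_CHANNEL` are hypothetical names):

```rust
#[divan::bench(args = [1, 2, 4, 8])]
fn log_prepared_scene_update(bencher: divan::Bencher, num_entities: usize) {
    bencher
        // Build the SceneUpdate outside the timed region...
        .with_inputs(|| build_scene_update(num_entities))
        // ...so only serialization plus mcap writing is measured here.
        .bench_values(|update| SCENE_CHANNEL.log(&update));
}
```

Paired with the existing benchmark, the difference between the two isolates the data-prep cost.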

Comment on lines +106 to +107
#[divan::bench(args = [1, 2, 4, 8, 16, 32])]
fn mutlithreaded_log(bencher: divan::Bencher, num_threads: usize) {
Contributor: Divan has native support for threads, which might be useful here.
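
For reference, a minimal sketch of that support: the threads option replaces hand-rolled thread spawning, and divan reports each thread count separately.

```rust
// divan runs the benchmark body concurrently at each listed thread count;
// make_message() is a hypothetical stand-in for payload setup.
#[divan::bench(threads = [1, 2, 4, 8, 16, 32])]
fn multithreaded_log() {
    let message = make_message();
    MSG_CHANNEL.log(divan::black_box(&message));
}
```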

Comment on lines +47 to +49
.with_inputs(|| {
runtime.block_on(async move {
let mut ws_clients = Vec::with_capacity(num_clients);
@gasmith (Contributor) commented Mar 12, 2025
I think we're invoking this closure for every iteration of the benchmark, which ends up being extraordinarily time-consuming. Let's set up the clients in advance when we set up the server. You can package them up under a tokio Mutex for interior mutability.

Contributor (author): That's a good point.
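
A sketch of that restructuring (`connect_clients` is a hypothetical setup helper; the point is that connections are made once, not per iteration):

```rust
// Connect the clients once during setup and share them with the benchmark
// closure through a tokio Mutex for interior mutability.
let ws_clients = tokio::sync::Mutex::new(
    runtime.block_on(connect_clients(num_clients)), // hypothetical helper
);
bencher.bench(|| {
    runtime.block_on(async {
        let mut clients = ws_clients.lock().await;
        // ...drive one round of messages through the pre-connected clients.
        divan::black_box(&mut *clients);
    });
});
```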

})
})
.bench_values(|mut ws_clients| {
for _ in 0..50 {
Contributor: Why 50 messages instead of just one?

Contributor (author): So that if the cost of the first message is higher (which is the case with mcap), it gets amortized out.

Contributor: But this is the websocket benchmark.

@eloff (Contributor, author) commented Mar 12, 2025

> Looks good to me in general.
>
> Locally, between runs, I get wildly different slowest values for the multithreaded log case, even with a single thread. Any idea why?

The slowest case is likely dominated by poor scheduling in the kernel; it's basically noise to be discarded.

@eloff closed this May 8, 2025