Existing microbenchmarks measure only the `outval.clone()` and not an actual computation

Hi! While investigating a different issue (https://github.com/rust-lang/rust/issues/146497), I found out that compiler optimizes a result computation out of an inner benchmarking loop of `criterion` for many `nalgebra` benchmarks.

For example, here is how assembly looks like for the `mat2_mul_v` benchmark:

<img width="1168" height="2001" alt="Image" src="https://github.com/user-attachments/assets/8c528376-df0d-49b3-8a7e-35fe5a09d6ad" />

It is possible to disable this optimization by applying `black_hole()` on all function arguments inside the loop:

```diff
diff --git a/benches/common/macros.rs b/benches/common/macros.rs
index 2f64f83a1986..e65bac51fd2d 100644
--- a/benches/common/macros.rs
+++ b/benches/common/macros.rs
@@ -4,12 +4,14 @@ macro_rules! bench_binop(
     ($name: ident, $t1: ty, $t2: ty, $binop: ident) => {
         fn $name(bh: &mut criterion::Criterion) {
             use rand::SeedableRng;
+            use std::hint::black_box;
+
             let mut rng = IsaacRng::seed_from_u64(0);
             let a = rng.random::<$t1>();
             let b = rng.random::<$t2>();

             bh.bench_function(stringify!($name), move |bh| bh.iter(|| {
-                a.$binop(b)
+                black_box(a).$binop(black_box(b))
             }));
         }
     }
```

Then the same benchmark looks different:

<img width="1135" height="1700" alt="Image" src="https://github.com/user-attachments/assets/40de7ab6-0a25-41a6-b811-fe87211d3f71" />

Now it measures an actual computation and `criterion` reports the difference:

```text
mat2_mul_v              time:   [1.2087 ns 1.2094 ns 1.2105 ns]
                        change: [+279.66% +279.89% +280.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
```

## How to check this yourself

First, run the benchmark binary under `lldb` with a single benchmark name as a parameter:

```bash
lldb target/release/deps/nalgebra_bench-5cfdacb5106e683c -- --bench mat2_mul_v
```

Then find an address of the `Timespec::now()`:

```text
(lldb) image lookup -r -n ".*::Timespec::now::.*"
1 match found in /home/im/projects/patches/nalgebra/target/release/deps/nalgebra_bench-5cfdacb5106e683c:
        Address: nalgebra_bench-5cfdacb5106e683c[0x000000000042f9a0] (nalgebra_bench-5cfdacb5106e683c.PT_LOAD[0]..text + 4388320)
        Summary: nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb
```

Then set breakpoint on it:

```text
(lldb) breakpoint set -s nalgebra_bench-5cfdacb5106e683c -a 0x000000000042f9a0
Breakpoint 1: where = nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb, address = 0x000000000042f9a0
```

Then run the binary:

```text
(lldb) run
Process 384495 launched: '/home/im/projects/patches/nalgebra/target/release/deps/nalgebra_bench-5cfdacb5106e683c' (x86_64)
Process 384495 stopped and restarted: thread 1 received signal: SIGCHLD
Gnuplot not found, using plotters backend
Benchmarking mat2_mul_v: Warming up for 3.0000 sProcess 384495 stopped
* thread #1, name = 'nalgebra_bench-', stop reason = breakpoint 1.1
    frame #0: 0x00005555559839a0 nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb
nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb:
->  0x5555559839a0 <+0>: pushq  %rbp
    0x5555559839a1 <+1>: movq   %rsp, %rbp
    0x5555559839a4 <+4>: pushq  %rbx
    0x5555559839a5 <+5>: subq   $0x18, %rsp
```

Now you are inside the `Timespec::now()`, need to step outside:

```text
(lldb) finish
Process 384495 stopped
* thread #1, name = 'nalgebra_bench-', stop reason = step out
    frame #0: 0x0000555555775095 nalgebra_bench-5cfdacb5106e683c`criterion::bencher::Bencher$LT$M$GT$::iter::h66e8fc942edf1939 + 37
nalgebra_bench-5cfdacb5106e683c`criterion::bencher::Bencher$LT$M$GT$::iter::h66e8fc942edf1939:
->  0x555555775095 <+37>: movq   %rax, 0x40(%rsp)
    0x55555577509a <+42>: movl   %edx, %ebp
    0x55555577509c <+44>: movl   $0x1, %edi
    0x5555557750a1 <+49>: callq  0x5555559839a0 ; std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb
```

Now you are inside the relevant instance of `criterion::bencher::Bencher::iter()` and can check the assembly:

```text
(lldb) disassemble
...
```

Note that the file name, addresses and full function names will be different in your case.

P.S. I will create a PR to fix this.
P.P.S. Title lies a bit: actually, `criterion` measures one computation and a multiple iterations of `outval.clone()`. Final benchmark result still does not make sense though.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Existing microbenchmarks measure only the `outval.clone()` and not an actual computation #1547

How to check this yourself

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Existing microbenchmarks measure only the outval.clone() and not an actual computation #1547

Description

How to check this yourself

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

Existing microbenchmarks measure only the `outval.clone()` and not an actual computation #1547