Skip to content

Existing microbenchmarks measure only the outval.clone() and not an actual computation #1547

@im-0

Description

@im-0

Hi! While investigating a different issue (rust-lang/rust#146497), I found out that compiler optimizes a result computation out of an inner benchmarking loop of criterion for many nalgebra benchmarks.

For example, here is how assembly looks like for the mat2_mul_v benchmark:

Image

It is possible to disable this optimization by applying black_hole() on all function arguments inside the loop:

diff --git a/benches/common/macros.rs b/benches/common/macros.rs
index 2f64f83a1986..e65bac51fd2d 100644
--- a/benches/common/macros.rs
+++ b/benches/common/macros.rs
@@ -4,12 +4,14 @@ macro_rules! bench_binop(
     ($name: ident, $t1: ty, $t2: ty, $binop: ident) => {
         fn $name(bh: &mut criterion::Criterion) {
             use rand::SeedableRng;
+            use std::hint::black_box;
+
             let mut rng = IsaacRng::seed_from_u64(0);
             let a = rng.random::<$t1>();
             let b = rng.random::<$t2>();

             bh.bench_function(stringify!($name), move |bh| bh.iter(|| {
-                a.$binop(b)
+                black_box(a).$binop(black_box(b))
             }));
         }
     }

Then the same benchmark looks different:

Image

Now it measures an actual computation and criterion reports the difference:

mat2_mul_v              time:   [1.2087 ns 1.2094 ns 1.2105 ns]
                        change: [+279.66% +279.89% +280.13%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe

How to check this yourself

First, run the benchmark binary under lldb with a single benchmark name as a parameter:

lldb target/release/deps/nalgebra_bench-5cfdacb5106e683c -- --bench mat2_mul_v

Then find an address of the Timespec::now():

(lldb) image lookup -r -n ".*::Timespec::now::.*"
1 match found in /home/im/projects/patches/nalgebra/target/release/deps/nalgebra_bench-5cfdacb5106e683c:
        Address: nalgebra_bench-5cfdacb5106e683c[0x000000000042f9a0] (nalgebra_bench-5cfdacb5106e683c.PT_LOAD[0]..text + 4388320)
        Summary: nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb

Then set breakpoint on it:

(lldb) breakpoint set -s nalgebra_bench-5cfdacb5106e683c -a 0x000000000042f9a0
Breakpoint 1: where = nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb, address = 0x000000000042f9a0

Then run the binary:

(lldb) run
Process 384495 launched: '/home/im/projects/patches/nalgebra/target/release/deps/nalgebra_bench-5cfdacb5106e683c' (x86_64)
Process 384495 stopped and restarted: thread 1 received signal: SIGCHLD
Gnuplot not found, using plotters backend
Benchmarking mat2_mul_v: Warming up for 3.0000 sProcess 384495 stopped
* thread #1, name = 'nalgebra_bench-', stop reason = breakpoint 1.1
    frame #0: 0x00005555559839a0 nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb
nalgebra_bench-5cfdacb5106e683c`std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb:
->  0x5555559839a0 <+0>: pushq  %rbp
    0x5555559839a1 <+1>: movq   %rsp, %rbp
    0x5555559839a4 <+4>: pushq  %rbx
    0x5555559839a5 <+5>: subq   $0x18, %rsp

Now you are inside the Timespec::now(), need to step outside:

(lldb) finish
Process 384495 stopped
* thread #1, name = 'nalgebra_bench-', stop reason = step out
    frame #0: 0x0000555555775095 nalgebra_bench-5cfdacb5106e683c`criterion::bencher::Bencher$LT$M$GT$::iter::h66e8fc942edf1939 + 37
nalgebra_bench-5cfdacb5106e683c`criterion::bencher::Bencher$LT$M$GT$::iter::h66e8fc942edf1939:
->  0x555555775095 <+37>: movq   %rax, 0x40(%rsp)
    0x55555577509a <+42>: movl   %edx, %ebp
    0x55555577509c <+44>: movl   $0x1, %edi
    0x5555557750a1 <+49>: callq  0x5555559839a0 ; std::sys::pal::unix::time::Timespec::now::h5ef5d0c6e88433fb

Now you are inside the relevant instance of criterion::bencher::Bencher::iter() and can check the assembly:

(lldb) disassemble
...

Note that the file name, addresses and full function names will be different in your case.

P.S. I will create a PR to fix this.
P.P.S. Title lies a bit: actually, criterion measures one computation and a multiple iterations of outval.clone(). Final benchmark result still does not make sense though.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions