
Sneaky branch-predictor remembering inputs over benchmark loops #241

@chethega

We recently discovered in JuliaLang/julia#29888 a somewhat surprising fact about modern CPUs: the branch predictor is capable of remembering astoundingly long periodic patterns. If we run a benchmark loop, then each evaluation (and possibly each sample) sees an identical branching pattern, which introduces a period. In other words, the sneaky little CPU learns our sample set, and we end up with an emergent defeat device for some of our benchmarks.
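To make the mechanism concrete, here is a minimal sketch (mine, not from the linked issue) of what the predictor sees: `findall` takes one conditional branch per element, taken exactly when the element is `true`, so re-running the kernel on the same input replays the same taken/not-taken sequence.

```julia
# Minimal sketch: a benchmark loop over a fixed input exposes a periodic
# branch stream to the CPU, with period equal to the input length.
list = rand(Bool, 8)                      # one fixed benchmark input
stream = repeat(list, 3)                  # branch outcomes over 3 evaluations
# the predictor sees the same window over and over:
@assert stream[1:8] == stream[9:16] == stream[17:24]
```

Once the period fits in the branch history table, the "random" input stops being random from the CPU's point of view.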

At least the findall benchmarks are broken, and probably have been forever. I suspect that the logical-indexing benchmarks are broken as well, but we should really go over all our benchmarks and figure out which ones are affected. This is also interesting and something to keep in mind when writing new benchmarks; indirect branch prediction (the BTB) and the D-cache can cause the same kind of cross-evaluation learning.

The likely fix is to increase the size of the test sets. Long term, it would be cool to parametrize all benchmarks and occasionally (rarely) run regression tests on our regression tests: check that everything has the expected scaling behavior, and explain or fix surprises. Alternatively, we could regenerate new random data between runs, but AFAIK BenchmarkTools has no built-in support for that (it would need a new feature, with evals/sample = 1 so that each timed run sees fresh data).
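For the regenerate-between-runs route, a hedged sketch of what I have in mind, using BenchmarkTools' existing `setup` keyword together with `evals=1` (whether this fully breaks the period across samples, and how much noise single-eval samples add for fast kernels, is exactly what would need checking):

```julia
using BenchmarkTools

# Sketch: with evals=1, each sample runs setup once and times the kernel once,
# so every timed call sees freshly generated data and the branch pattern
# should not repeat across samples. Setup time is excluded from the timing.
N = 30_000
b = @benchmark findall(x) setup=(x = rand(Bool, $N)) evals=1
```

The trade-off is that `evals=1` makes each sample a single noisy measurement, which is why BenchmarkTools normally uses many evals per sample for sub-microsecond kernels.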

Demo:

julia> using Printf, BenchmarkTools

julia> function cpu_speed_ghz()
           # if available, use the reported CPU speed instead of the current
           # speed (which may be inaccurate if the CPU is idling)
           cpu_info = Sys.cpu_info()[1]
           m = match(r"([\d\.]+)GHz", cpu_info.model)
           m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64, m.captures[1])
       end;

julia> const CPU_SPEED_GHZ = cpu_speed_ghz();

julia> const cpu_model = Sys.cpu_info()[1].model;

julia> begin
           N = 30_000
           list = fill(false, N); list[1:2:end] .= true
           bt0 = @belapsed findall($list)
           list .= rand(Bool, N)
           btL = @belapsed findall($list)
           time_to_cycle = 10^9/N * CPU_SPEED_GHZ
           penalty = 2*(btL-bt0)*time_to_cycle
           @printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n", 
           cpu_model, penalty/CPU_SPEED_GHZ , penalty)

           bt = bt0
           @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
               pat = rand(Bool, n)
               for i=1:n:N list[i:(i+n-1)].=pat end
               bt = @belapsed findall($list)
               @printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n", 
                   n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
           end
       end;

yielding:

Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty:  9.9 ns = 19.8 cycles

Period     2:   44.81 us =    2.99 cycles per idx. Miss-rate  0.00%
Period   100:   53.22 us =    3.55 cycles per idx. Miss-rate  2.83%
Period   500:   51.52 us =    3.43 cycles per idx. Miss-rate  2.26%
Period  1000:   51.37 us =    3.42 cycles per idx. Miss-rate  2.21%
Period  2000:   57.85 us =    3.86 cycles per idx. Miss-rate  4.39%
Period  2500:   88.66 us =    5.91 cycles per idx. Miss-rate 14.77%
Period  3000:  121.78 us =    8.12 cycles per idx. Miss-rate 25.93%
Period  5000:  159.28 us =   10.62 cycles per idx. Miss-rate 38.56%
Period 10000:  182.87 us =   12.19 cycles per idx. Miss-rate 46.51%
Period 30000:  192.51 us =   12.83 cycles per idx. Miss-rate 49.75%

This is consistent with Agner Fog's tables. And it is absolutely mind-boggling that the CPU manages to completely predict patterns of length 2000. If you have a different CPU arch available (Ryzen? Skylake? POWER?), then please post similar figures. We should increase test-set sizes above the BHT limits of all realistic current and near-future CPUs.

JuliaBox's CPU gives a similar cutoff:

Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz; branch-miss penalty:  6.3 ns = 14.6 cycles

Period     2:   28.40 us =    2.18 cycles per idx. Miss-rate  0.00%
Period   100:   32.50 us =    2.49 cycles per idx. Miss-rate  2.16%
Period   500:   32.30 us =    2.48 cycles per idx. Miss-rate  2.05%
Period  1000:   32.70 us =    2.51 cycles per idx. Miss-rate  2.26%
Period  2000:   33.40 us =    2.56 cycles per idx. Miss-rate  2.63%
Period  2500:   42.70 us =    3.27 cycles per idx. Miss-rate  7.52%
Period  3000:   71.90 us =    5.51 cycles per idx. Miss-rate 22.87%
Period  5000:  102.20 us =    7.84 cycles per idx. Miss-rate 38.80%
Period 10000:  116.70 us =    8.95 cycles per idx. Miss-rate 46.42%
Period 30000:  123.40 us =    9.46 cycles per idx. Miss-rate 49.95%
