Description
We recently discovered in JuliaLang/julia#29888 a somewhat surprising fact about modern CPU: The branch-predictor is capable of remembering astoundingly long periodic patterns. If we run a benchmark loop, then each evaluation and possibly each sample will have identical branching patterns, thus introducing a period. In other words, the sneaky little CPU learns our sample set and we have an emergent defeat device for some of our benchmarks.
At least the findall
benchmarks are broken, and probably have been broken forever. I suspect that the logical indexing benchmarks are broken as well. But we should really go over all our benchmarks and figure out which ones are affected. Also, this is interesting and something to keep in mind for all our benchmarks. Indirect branch prediction (the BTB) and D-cache are something to keep in mind as well.
The likely fix is to increase the size of testsets. Long-term it would be cool to parametrize all benchmarks, and occasionally (rarely) run regression tests on our regression tests: Check that everything has the expected scaling behavior, and explain or fix surprises. Alternatively, we could regenerate new random data between runs. But afaik BenchmarkTools has no support for that (would need new feature to fix evals/sample = 1
).
Demo:
julia> using Printf, BenchmarkTools
julia> function cpu_speed_ghz()
# if available, use reported CPU speed instead of current speed (which
# may be inaccurate if the CPU is idling)
cpu_info = Sys.cpu_info()[1]
m = match(r"([\d\.]+)GHz", cpu_info.model)
ghz = m ≡ nothing ? cpu_info.speed / 1000 : parse(Float64, m.captures[1])
end;
julia> const CPU_SPEED_GHZ = cpu_speed_ghz();
julia> const cpu_model = Sys.cpu_info()[1].model;
julia> begin
N=30_000
list = fill(false, N); list[1:2:end].=true;
bt0 = @belapsed findall($list)
list .= rand(Bool, N)
btL = @belapsed findall($list)
time_to_cycle = 10^9/N * CPU_SPEED_GHZ
penalty = 2*(btL-bt0)*time_to_cycle
@printf("\n\n%s; branch-miss penalty: %4.1f ns = %4.1f cycles\n\n",
cpu_model, penalty/CPU_SPEED_GHZ , penalty)
bt = bt0
@printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
2, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
for n=[100, 500, 1000, 2000, 2500, 3000, 5000, 10_000, 30_000]
pat = rand(Bool, n)
for i=1:n:N list[i:(i+n-1)].=pat end
bt = @belapsed findall($list)
@printf("Period %5d: %7.2f us = %7.2f cycles per idx. Miss-rate %5.2f%%\n",
n, bt*10^6, bt*time_to_cycle, 100*(bt - bt0) *time_to_cycle / penalty )
end
end;
yielding:
Intel(R) Core(TM) i5-5###U CPU @ 2.00GHz; branch-miss penalty: 9.9 ns = 19.8 cycles
Period 2: 44.81 us = 2.99 cycles per idx. Miss-rate 0.00%
Period 100: 53.22 us = 3.55 cycles per idx. Miss-rate 2.83%
Period 500: 51.52 us = 3.43 cycles per idx. Miss-rate 2.26%
Period 1000: 51.37 us = 3.42 cycles per idx. Miss-rate 2.21%
Period 2000: 57.85 us = 3.86 cycles per idx. Miss-rate 4.39%
Period 2500: 88.66 us = 5.91 cycles per idx. Miss-rate 14.77%
Period 3000: 121.78 us = 8.12 cycles per idx. Miss-rate 25.93%
Period 5000: 159.28 us = 10.62 cycles per idx. Miss-rate 38.56%
Period 10000: 182.87 us = 12.19 cycles per idx. Miss-rate 46.51%
Period 30000: 192.51 us = 12.83 cycles per idx. Miss-rate 49.75%
This is compatible with Agner Fog's tables. And it is absolutely mindboggling that the CPU manages to completely defeat patterns of length 2000. If you have a different CPU-Arch available (Ryzen? Skylake? Power?), then please post similar figures. We should increase testset sizes above the BHT limits for all realistic current and near-future CPUs.
JuliaBox's CPU gives a similar cutoff:
Intel(R) Xeon(R) CPU E5-2673 v4 @ 2.30GHz; branch-miss penalty: 6.3 ns = 14.6 cycles
Period 2: 28.40 us = 2.18 cycles per idx. Miss-rate 0.00%
Period 100: 32.50 us = 2.49 cycles per idx. Miss-rate 2.16%
Period 500: 32.30 us = 2.48 cycles per idx. Miss-rate 2.05%
Period 1000: 32.70 us = 2.51 cycles per idx. Miss-rate 2.26%
Period 2000: 33.40 us = 2.56 cycles per idx. Miss-rate 2.63%
Period 2500: 42.70 us = 3.27 cycles per idx. Miss-rate 7.52%
Period 3000: 71.90 us = 5.51 cycles per idx. Miss-rate 22.87%
Period 5000: 102.20 us = 7.84 cycles per idx. Miss-rate 38.80%
Period 10000: 116.70 us = 8.95 cycles per idx. Miss-rate 46.42%
Period 30000: 123.40 us = 9.46 cycles per idx. Miss-rate 49.95%