STREAMBenchmark


Getting a realistic estimate of the achievable (maximal) memory bandwidth

Note: This package implements a simple variant of the original STREAM benchmark. There is also BandwidthBenchmark.jl, which is a variant of TheBandwidthBenchmark.

memory_bandwidth()

The function memory_bandwidth() estimates the memory bandwidth in megabytes per second (MB/s). It returns a named tuple containing the median, minimum, and maximum of the four kernel measurements (COPY, SCALE, ADD, TRIAD).
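For orientation, a streaming kernel of the kind measured here can be sketched as follows. This is a minimal illustration of the TRIAD kernel from the original STREAM benchmark, not the package's actual implementation; the names triad! and best_time are hypothetical, and the vector length is deliberately small.

```julia
using Base.Threads: @threads

# TRIAD kernel: a[i] = b[i] + s * c[i], with iterations split across Julia threads.
function triad!(a, b, c, s)
    @threads for i in eachindex(a, b, c)
        @inbounds a[i] = b[i] + s * c[i]
    end
    return a
end

# Smallest wall-clock time over `reps` runs (hypothetical helper).
function best_time(f; reps = 5)
    f()  # warm-up run, so that compilation is not timed
    best = Inf
    for _ in 1:reps
        elapsed = @elapsed f()
        best = min(best, elapsed)
    end
    return best
end

N = 1_000_000  # far too small for a real measurement; chosen to keep the sketch fast
a, b, c = zeros(N), fill(2.0, N), fill(0.5, N)
s = 3.0
t = best_time(() -> triad!(a, b, c, s))

# Three Float64 arrays are touched per element (write-allocates not counted here).
println("TRIAD: ", round(3 * sizeof(Float64) * N / t / 1e6; digits = 1), " MB/s")
```

With vectors this short the data may fit into cache, so the printed number is not a memory bandwidth estimate; it only shows the shape of the measurement.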

A few important remarks upfront:

  • To obtain a reasonable estimate, start Julia with enough threads (e.g. as many as you have physical cores).
  • Experiment with the length of the vectors used in the streaming kernels via the keyword argument N. Make it large enough (e.g. the number of NUMA nodes times four times the size of the outermost cache), in particular if you get unreasonably high bandwidths.
  • If possible, pin the Julia threads to separate cores. The simplest ways to pin N Julia threads to the first N cores (compact pinning) are 1) setting JULIA_EXCLUSIVE=1 or 2) using ThreadPinning.jl's pinthreads(:compact). We will use the latter below.
julia> using ThreadPinning

julia> pinthreads(:compact)

julia> using STREAMBenchmark

julia> memory_bandwidth(verbose=true)
╔══╡ Multi-threaded:
╠══╡ (10 threads)
╟─ COPY:  100205.2 MB/s
╟─ SCALE: 100218.7 MB/s
╟─ ADD:   100364.7 MB/s
╟─ TRIAD: 100293.1 MB/s
╟─────────────────────
║ Median: 100255.9 MB/s
╚═════════════════════
(median = 100255.9, minimum = 100205.2, maximum = 100364.7)
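The returned named tuple can be reproduced from the four kernel results printed above. The sketch below assumes the summary is a plain median, minimum, and maximum over the four rates; summarize is a hypothetical name, not a function of the package.

```julia
using Statistics: median

# Hypothetical helper: collapse the four kernel rates into a summary tuple.
summarize(rates) = (median = median(rates),
                    minimum = minimum(rates),
                    maximum = maximum(rates))

# COPY, SCALE, ADD, TRIAD rates (MB/s) from the output above.
rates = [100205.2, 100218.7, 100364.7, 100293.1]
result = summarize(rates)
println(result)
```

With an even number of measurements, the median is the mean of the two middle values, here (100218.7 + 100293.1) / 2.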

Keyword arguments

  • N (default STREAMBenchmark.default_vector_length()): length of the vectors used in the streaming kernels
  • nthreads (default Threads.nthreads()): number of threads to use for the benchmark. Must satisfy 1 ≤ nthreads ≤ Threads.nthreads().
  • write_allocate (default: true): assume that write-allocates occur and count them.
  • verbose (default: false): verbose output, including the individual results of the streaming kernels.

benchmark()

If you want to run both the single- and multi-threaded benchmarks at once, you can call benchmark(), which produces output like this:

julia> benchmark()
╔══╡ Single-threaded:
╟─ COPY:  18880.8 MB/s
╟─ SCALE: 18537.2 MB/s
╟─ ADD:   17380.2 MB/s
╟─ TRIAD: 17359.9 MB/s
╟─────────────────────
║ Median: 17958.7 MB/s
╚═════════════════════

╔══╡ Multi-threaded:
╠══╡ (10 threads)
╟─ COPY:  100358.1 MB/s
╟─ SCALE: 100218.2 MB/s
╟─ ADD:   99508.0 MB/s
╟─ TRIAD: 99582.4 MB/s
╟─────────────────────
║ Median: 99900.3 MB/s
╚═════════════════════

(single = (median = 17958.7, minimum = 17359.9, maximum = 18880.8), multi = (median = 99900.3, minimum = 99508.0, maximum = 100358.1))

Scaling

Number of threads

To assess the scaling of the maximal memory bandwidth with the number of threads, we provide the function scaling_benchmark():

julia> y = scaling_benchmark()
# Threads: 1	Max. memory bandwidth: 19058.7
# Threads: 2	Max. memory bandwidth: 37511.2
# Threads: 3	Max. memory bandwidth: 55204.6
# Threads: 4	Max. memory bandwidth: 68706.6
# Threads: 5	Max. memory bandwidth: 76869.9
# Threads: 6	Max. memory bandwidth: 83669.9
# Threads: 7	Max. memory bandwidth: 88656.0
# Threads: 8	Max. memory bandwidth: 93701.0
# Threads: 9	Max. memory bandwidth: 97093.6
# Threads: 10	Max. memory bandwidth: 101293.9
10-element Vector{Float64}:
  19058.7
  37511.2
  55204.6
  68706.6
  76869.9
  83669.9
  88656.0
  93701.0
  97093.6
 101293.9
 
julia> using UnicodePlots

julia> lineplot(1:length(y), y, title = "Bandwidth Scaling", xlabel = "# cores", ylabel = "MB/s", border = :ascii, canvas = AsciiCanvas)

                            Bandwidth Scaling
               +----------------------------------------+
        110000 |                                        |
               |                                   __r-*|
               |                            __--"""     |
               |                      __-*""            |
               |                 ._-*"                  |
               |              .r*"                      |
               |           .r"`                         |
   MB/s        |         .*'                            |
               |       ./`                              |
               |      .'                                |
               |    ./                                  |
               |  .r`                                   |
               | ./                                     |
               |*`                                      |
         10000 |                                        |
               +----------------------------------------+
                1                                     10
                                 # cores
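The saturation visible in the plot can be quantified by turning the measured bandwidths into a speedup and a parallel efficiency. The following sketch only post-processes the numbers printed above; the variable names are chosen for illustration.

```julia
# Max. memory bandwidth (MB/s) for 1 to 10 threads, copied from the output above.
y = [19058.7, 37511.2, 55204.6, 68706.6, 76869.9,
     83669.9, 88656.0, 93701.0, 97093.6, 101293.9]

speedup = y ./ y[1]                       # relative to the single-threaded result
efficiency = speedup ./ (1:length(y))     # 1.0 would be perfect linear scaling

for n in eachindex(y)
    println(n, " threads: speedup ", round(speedup[n]; digits = 2),
            ", efficiency ", round(efficiency[n]; digits = 2))
end
```

In this data set the efficiency drops to roughly one half at 10 threads, which is the expected picture once the memory interface rather than the cores becomes the limit.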

Vector length

By default, a vector length of four times the size of the outermost cache is used (a rule of thumb "laid down by Dr. Bandwidth"). To measure the memory bandwidth for a few other factors as well, you can use STREAMBenchmark.vector_length_dependence():

julia> STREAMBenchmark.vector_length_dependence()
1: 3604480 => 121692.2
2: 7208960 => 99755.5
3: 10813440 => 98705.5
4: 14417920 => 98660.5
Dict{Int64, Float64} with 4 entries:
  10813440 => 98705.5
  7208960  => 99755.5
  3604480  => 1.21692e5
  14417920 => 98660.5
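The vector lengths above are consistent with the rule of thumb applied to an outermost cache of 27.5 MiB holding Float64 elements. The sketch below makes that arithmetic explicit; the cache size is inferred from the printed lengths, and vector_length is a hypothetical helper rather than the package's default_vector_length().

```julia
# Hypothetical helper: number of elements so that one vector spans
# `factor` times the outermost cache (per NUMA node).
vector_length(cache_bytes; factor = 4, numa_nodes = 1, T = Float64) =
    numa_nodes * factor * cache_bytes ÷ sizeof(T)

cache_bytes = 28_835_840  # 27.5 MiB, assumed outermost cache size

for factor in 1:4
    println(factor, ": ", vector_length(cache_bytes; factor = factor))
end
```

The numa_nodes keyword follows the earlier remark that N should scale with the number of NUMA nodes; on a single-node machine it stays at 1.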

Comparison with original STREAM benchmark

We can download and compile the C source code of the original STREAM benchmark via STREAMBenchmark.jl:

julia> using STREAMBenchmark

julia> STREAMBenchmark.download_original_STREAM()
- Creating folder "stream"
- Downloading C STREAM benchmark
- Done.

julia> STREAMBenchmark.compile_original_STREAM(compiler=:gcc, multithreading=false)
- Trying to compile "stream.c" using gcc
  Using options: -O3 -DSTREAM_ARRAY_SIZE=14417920
- Done.

julia> STREAMBenchmark.execute_original_STREAM()
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 14417920 (elements), Offset = 0 (elements)
Memory per array = 110.0 MiB (= 0.1 GiB).
Total memory required = 330.0 MiB (= 0.3 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 11047 microseconds.
   (= 11047 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           11039.8     0.020987     0.020896     0.021092
Scale:          12491.1     0.018509     0.018468     0.018537
Add:            13370.0     0.025934     0.025881     0.026183
Triad:          13396.9     0.025903     0.025829     0.026223
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

julia> memory_bandwidth(verbose=true, nthreads=1, write_allocate=false) # the original benchmark assumes the absence of write-allocates and doesn't count them
╔══╡ Single-threaded:
╠══╡ (1 threads)
╟─ COPY:  12749.1 MB/s
╟─ SCALE: 12468.2 MB/s
╟─ ADD:   13095.3 MB/s
╟─ TRIAD: 13131.2 MB/s
╟─────────────────────
║ Median: 12922.2 MB/s
╚═════════════════════
(median = 12922.2, minimum = 12468.2, maximum = 13131.2)
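To see why write_allocate=false is needed for a fair comparison, it helps to look at how a rate follows from the vector length and the kernel time. The sketch below assumes the usual STREAM accounting (COPY and SCALE touch two arrays per element, ADD and TRIAD three, and a write-allocate adds one extra read of the target array) with 1 MB = 10^6 bytes; bandwidth_mbs is a hypothetical helper, and the package's internal accounting may differ in detail.

```julia
# Hypothetical helper: bandwidth in MB/s from vector length and kernel time.
function bandwidth_mbs(kernel::Symbol, N::Integer, seconds::Real;
                       write_allocate::Bool = true, T = Float64)
    # Arrays read or written per element when write-allocates are not counted.
    streams = (copy = 2, scale = 2, add = 3, triad = 3)
    nstreams = streams[kernel] + (write_allocate ? 1 : 0)
    return nstreams * sizeof(T) * N / seconds / 1e6
end

N = 14417920          # array size reported by the C benchmark above
t_copy = 0.020896     # "Min time" of the Copy kernel above

println(bandwidth_mbs(:copy, N, t_copy; write_allocate = false))
println(bandwidth_mbs(:copy, N, t_copy; write_allocate = true))
```

Without write-allocates, the first line reproduces the Copy rate of about 11039.8 MB/s reported by the C benchmark; counting them would report 1.5 times that value for the same run time.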

Further Options and Comments

LoopVectorization

You can make STREAMBenchmark.jl use LoopVectorization's @avxt instead of @threads by setting STREAMBenchmark.avxt() = true. Note, however, that this only works if nthreads=1 (a single thread is used) or nthreads=Threads.nthreads() (all threads are used). This is because @avxt isn't compatible with the way we restrict the benchmark to a subset of the available Julia threads.

Thread pinning

It is recommended to either set the environment variable JULIA_EXCLUSIVE=1 or use pinthreads(:compact) from ThreadPinning.jl to pin the Julia threads to the first nthreads cores.

See https://discourse.julialang.org/t/thread-affinitization-pinning-julia-threads-to-cores/58069 for a discussion of other options like numactl (with caveats).
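Before running the benchmark, the session setup can be checked with a few lines of plain Julia. The sketch below is a hypothetical pre-flight helper (preflight is not part of STREAMBenchmark.jl or ThreadPinning.jl); it only inspects the thread count and the JULIA_EXCLUSIVE variable, and cannot detect pinning done via pinthreads.

```julia
# Hypothetical helper: collect notes about the current session setup.
function preflight(; nthreads = Threads.nthreads(),
                     cpu_threads = Sys.CPU_THREADS,
                     env = ENV)
    notes = String[]
    nthreads == 1 &&
        push!(notes, "only one Julia thread; start Julia with `julia -t N`")
    nthreads > cpu_threads &&
        push!(notes, "more Julia threads than CPU threads")
    get(env, "JULIA_EXCLUSIVE", "0") == "1" ||
        push!(notes, "JULIA_EXCLUSIVE is not set; pin threads, e.g. with pinthreads(:compact)")
    return notes
end

foreach(println, preflight())
```

Note that Sys.CPU_THREADS counts logical CPU threads, so on machines with simultaneous multithreading it is larger than the number of physical cores.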

Resources

Acknowledgements