This repository benchmarks single-threaded matrix multiply (SGEMM) performance of hand-tuned libraries and code-generation stacks on one CPU core. The focus is on machine learning workloads, i.e. FP32 or lower precision and irregular (non-square) matrix sizes. The goal is to expose high-performance atomic kernels that can then be used to build highly efficient higher-level implementations spanning multiple cores or distributed across systems.
Note: the 8GB Mac Mini runs roughly 25% slower than the 16GB version in other tests.
Clone the repo along with its submodules:

```
git clone --recurse-submodules https://github.com/mmperf/mmperf.git
```
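If you already cloned without `--recurse-submodules`, the standard git command below fetches the submodules after the fact:

```
git submodule update --init --recursive
```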
Create a virtual environment and install the requirements. Python 3.8 is required:

```
cd mmperf
python3 -m venv ./mmperf_env
source mmperf_env/bin/activate
pip install -r requirements.txt
pip install -r ./external/llvm-project/mlir/python/requirements.txt
```
Build the project, specifying the backend(s) to benchmark. Below is a command to build mmperf with the MLIR backend:

```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DUSE_MLIR=ON \
-B build .
cmake --build build
```
Another example builds with all available backends. This assumes you have MKL, OpenBLAS, and Halide installed (see below for installation details):

```
HALIDE_DIR=/home/foo/lokal/halide/ MKL_DIR=/opt/intel/oneapi/mkl/latest/ cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DMKL_DIR=/opt/intel/oneapi/mkl/latest/ \
-DUSE_MLIR=ON \
-DUSE_MKL=ON \
-DUSE_RUY=ON \
-DUSE_HALIDE=ON \
-DUSE_OPENBLAS=ON \
-DUSE_IREE=ON \
-DIREE_DYLIB=ON \
-B build .
cmake --build build
```
For any GPU backend, NVIDIA CUDA 11.4 must be pre-installed on your system; to install the NVIDIA CUDA 11.4 toolkit, please refer to this link. Make sure the `$PATH` and `$LD_LIBRARY_PATH` environment variables are configured correctly (see the sketch after the build command below), and set the CUDA compiler to `nvcc` on the command line. For example, to compile the MLIR-CUDA backend run:
```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DCMAKE_CUDA_COMPILER=nvcc \
-DUSE_MLIR_CUDA=ON \
-B build .
cmake --build build
```
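If CMake cannot locate the toolkit, exporting the CUDA paths first usually helps. A minimal sketch, assuming a default CUDA 11.4 installation under `/usr/local/cuda-11.4` (adjust the paths to your system):

```
# Hypothetical install prefix; adjust to where CUDA 11.4 actually lives
export PATH=/usr/local/cuda-11.4/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.4/lib64:$LD_LIBRARY_PATH
```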
Another example compares mmperf results among cuBLAS, IREE-CUDA, and TVM-CUDA (TVM auto-scheduler, a.k.a. Ansor) with this command line:

```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DCMAKE_CUDA_COMPILER=nvcc \
-DUSE_IREE=ON \
-DIREE_CUDA=ON \
-DUSE_CUBLAS=ON \
-DUSE_TVM_CUDA=ON \
-DTVM_ENABLE_CUDA=ON \
-DUSE_TVM_TUNED=ON \
-DTVM_LIB_DIR=/path/to/tvm-tuner \
-DSIZE_FILE=benchmark_sizes/bert_large_matmul.txt \
-B build .
cmake --build build
```
Note: `-DTVM_LIB_DIR` should be set to the absolute path where the TVM binaries are located. For how to run the TVM auto-scheduler, please refer to this README.
Building the `external/llvm-project` submodule can be space- and time-consuming. If you already have a standalone `llvm` and don't want to fetch and compile this submodule, you can point at the `llvm` on your system with the `PREBUILT_LLVM_PATH` compilation flag:
```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DPREBUILT_LLVM_PATH=$HOME/opt/llvm \
-DUSE_MLIR=ON \
-B build .
cmake --build build
```
To compile `llvm` from scratch, you might want all of these as well (replace DISTRO_NAME with your distribution's codename, and run as root or with sudo):

```
echo "deb http://apt.llvm.org/DISTRO_NAME/ llvm-toolchain-DISTRO_NAME main" >> /etc/apt/sources.list
wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add -
apt-get update && apt-get upgrade -y
apt-get install -y clang-11 clang-tools-11 libc++1-11 libc++-11-dev \
libc++abi1-11 libc++abi-11-dev libclang1-11 libclang-11-dev \
libclang-common-11-dev libclang-cpp11 libclang-cpp11-dev liblld-11 \
liblld-11-dev liblldb-11 liblldb-11-dev libllvm11 libomp-11-dev \
libomp5-11 lld-11 lldb-11 llvm-11 llvm-11-dev llvm-11-runtime \
llvm-11-tools libfuzzer-11-dev
```
We use AOT compilation to generate matrix-multiplication binaries for the specified backends, then run them to produce the benchmark numbers. To generate performance numbers and a comparison plot, run:

```
python3 mmperf.py ./build/matmul/ results
```
A `results` folder will be created in the mmperf top-level directory containing the achieved GFLOPS for every matmul size and every backend. A plot comparing backend performance will also be generated in `matmul.png`.
Each generated binary can also be executed individually. To run a specific matrix size (say 24x64x512) for a backend, run:

```
./build/matmul/matmul_<LIBRARY>_24x64x512
```
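For instance, with the MLIR backend built, the invocation would look like the line below (the `mlir` tag is an illustrative guess at the naming; list `build/matmul/` to see the binaries actually generated):

```
./build/matmul/matmul_mlir_24x64x512
```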
Matrix sizes: the `benchmark_sizes` folder contains text files listing the matrix sizes that mmperf runs on. You can change the matrix-size input file by editing the `SIZE_FILE` option in `cmake/common.cmake`. The default is `benchmark_all_sizes.txt`.
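`SIZE_FILE` can also be set at configure time, as in the CUDA comparison example above. For example, a sketch selecting the BERT sizes for an MLIR build:

```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DUSE_MLIR=ON \
-DSIZE_FILE=benchmark_sizes/bert_large_matmul.txt \
-B build .
```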
Precision: the default precision for all backends is FP32. FP16 benchmarks have been added to the cuBLAS, IREE, and Triton backends. To enable FP16, use the flag `-DUSE_FP16=ON`.
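A sketch of an FP16 cuBLAS build, combining flags already shown above:

```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DCMAKE_CUDA_COMPILER=nvcc \
-DUSE_CUBLAS=ON \
-DUSE_FP16=ON \
-B build .
cmake --build build
```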
To build MLIR with iree-llvm-sandbox, enable the flag `-DUSE_IREE_LLVM_SANDBOX=ON`.
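A minimal sketch, assuming the flag combines with the MLIR build flags shown earlier:

```
cmake -GNinja \
-DCMAKE_CXX_COMPILER=clang++-11 \
-DCMAKE_C_COMPILER=clang-11 \
-DUSE_MLIR=ON \
-DUSE_IREE_LLVM_SANDBOX=ON \
-B build .
cmake --build build
```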
To run iree-llvm-sandbox and get the best performance among the different experts, run:

```
cmake --build build/matmul --target sandbox_search
```
Note: this command performs a search over all expert configurations and outputs the best configs to `mlir_sandbox_configs` for each matmul size.
To compare results between the original mlir-sandbox and nodai-sandbox, run:

```
python mmperf.py ./build/matmul results -sandbox -sandbox_configs=/path/to/mlir_sandbox_configs -nodai_configs=/path/to/nodai_sandbox_configs
```
Note: if you just want to plot results for mlir-sandbox, pass `-sandbox_configs` (to load configs from a previous run) or `-benchmark_path` (to rerun the search). Optional flag: `-num_iters` changes the number of iterations for the matmul test.
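For instance, a sketch that reruns the search on the BERT sizes with an illustrative iteration count (the `-num_iters` syntax is assumed to follow the `flag=value` form of the other options):

```
python mmperf.py ./build/matmul results -sandbox -benchmark_path=./benchmark_sizes/bert_large_matmul.txt -num_iters=100
```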
To run the Triton backend in FP16 on the BERT sizes, for example:

```
python mmperf.py ./build/matmul results -triton -dtype f16 -benchmark_path=./benchmark_sizes/bert_large_matmul.txt
```
Halide: to build Halide from source and point mmperf at it:

```
git clone https://github.com/halide/Halide.git --recurse-submodules
cd Halide/
sudo apt install libclang-11-dev clang-11 liblld-11-dev
LLD_DIR=/usr/lib/llvm-11/lib/cmake/lld cmake . -GNinja \
-DCMAKE_BUILD_TYPE=Release \
-DTARGET_WEBASSEMBLY=OFF \
-DCMAKE_INSTALL_PREFIX=/home/<foo>/lokal/
ninja
ninja install
export HALIDE_DIR=/home/<foo>/lokal/halide
```
OpenBLAS: install from the system package manager:

```
sudo apt install libopenblas-dev
```
BLIS: build and install from source:

```
git clone https://github.com/flame/blis
cd blis
./configure --prefix=/home/foo/lokal/ --enable-cblas -c amd64
make -j 16
make install
```
MKL: download and install from https://software.intel.com/content/www/us/en/develop/articles/installation-guide-for-intel-oneapi-toolkits.html
This benchmark was run on an Intel Xeon CPU running at 3.1 GHz. The machine has 256 KB of L1 cache, 8 MB of L2 cache, and 24.8 MB of L3 cache, and it supports AVX-512 instructions. The peak performance of the machine is 3.1 GHz x 8 doubles per AVX-512 vector x 2 FMA units x 2 FLOPs per FMA = 99.2 GFLOPS for double precision, and twice that, 198.4 GFLOPS, for single precision.
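As a quick sanity check of that peak-FLOPS arithmetic:

```
# peak DP GFLOPS = GHz * doubles per AVX-512 vector * FMA units * FLOPs per FMA
echo "3.1 * 8 * 2 * 2" | bc   # 99.2
```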