This repo provides a microbenchmark for GEMM kernels on NVIDIA Ampere-architecture GPUs (sm_80). It includes both a CUDA kernel benchmark and a PyTorch extension benchmark.
Requirements:
- NVIDIA GPU with Ampere architecture (sm_80)
- CUDA 12.2
- Build the project:
$ make
- Run a benchmark with specific parameters:
$ ./csrc/bench/main --groups=16 --m=64 --n=64 --k=768 --iterations=3
Where:
- --groups: Number of GEMM problems (groups) to run
- --m, --n, --k: GEMM problem dimensions for each group
- --iterations: Number of benchmark iterations
- For more information on available options:
$ ./csrc/bench/main --help
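The benchmark's timings can be turned into throughput numbers with the standard GEMM operation count (2*m*n*k multiply-adds per problem, times the number of groups). A minimal sketch, using the CLI parameter names from above; the helper names are illustrative, not part of the repo:

```python
def grouped_gemm_flops(groups: int, m: int, n: int, k: int) -> int:
    """Total floating-point operations for `groups` independent m x n x k GEMMs."""
    # Each GEMM performs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs.
    return groups * 2 * m * n * k

def tflops(groups: int, m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for a measured runtime in seconds."""
    return grouped_gemm_flops(groups, m, n, k) / seconds / 1e12

# The parameters from the example command above:
# 16 groups of 64 x 64 x 768 GEMMs.
print(grouped_gemm_flops(16, 64, 64, 768))  # 100663296
```

Dividing this count by the per-iteration runtime the benchmark reports gives a rough achieved-throughput figure to compare against the GPU's peak.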
- Export the CUDA kernel as a Python extension:
$ python ./python/testbed/lib.py
$ cd out && TORCH_CUDA_ARCH_LIST="8.0" python setup.py install --user
- Run the Python benchmark and save the results:
$ python ./python/testbed/multi_gemm.py > perf.txt
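For reference, a grouped GEMM computes `groups` independent matrix products C_i = A_i * B_i in a single batched launch. A stdlib-only sketch of those semantics (for checking correctness, not performance; the function names are illustrative and not part of the repo):

```python
def matmul(a, b):
    """Naive m x k by k x n matrix product on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def grouped_gemm(a_list, b_list):
    """One GEMM per group; each group carries its own operands."""
    return [matmul(a, b) for a, b in zip(a_list, b_list)]

# Two tiny groups as a sanity check.
a_list = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
b_list = [[[5, 6], [7, 8]], [[2, 3], [4, 5]]]
print(grouped_gemm(a_list, b_list))  # [[[19, 22], [43, 50]], [[4, 5], [2, 3]]]
```

The CUDA benchmark fuses all groups into one kernel launch, which is where the performance interest lies; this sketch only pins down what result that launch must produce.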
- CUTLASS Examples
  - "02_pytorch_extension_grouped_gemm" notebook: A guide to implementing grouped GEMM operations as PyTorch extensions.
  - "gemm_grouped" CUDA example: Example code and documentation for grouped GEMM operations in CUDA.