This repo provides a microbenchmark for GEMM kernels on NVIDIA Ampere-architecture GPUs (sm_80). It includes both a CUDA kernel benchmark and a PyTorch extension benchmark.
Requirements:
- NVIDIA GPU with Ampere architecture (sm_80)
- CUDA 12.2
- Build the project:
$ make
- Run a benchmark with specific parameters:
$ ./csrc/bench/main --groups=16 --m=64 --n=64 --k=768 --iterations=3
Where:
- --groups: Number of GEMM problems (groups) to run
- --m, --n, --k: GEMM problem dimensions for each group
- --iterations: Number of benchmark iterations
- For more information on available options:
$ ./csrc/bench/main --help
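The benchmark's timings can be turned into throughput numbers with the standard GEMM operation count (2*m*n*k multiply-adds per problem, times the number of groups). A minimal sketch, using the CLI parameter names from above; the helper names are illustrative, not part of the repo:

```python
def grouped_gemm_flops(groups: int, m: int, n: int, k: int) -> int:
    """Total floating-point operations for `groups` independent m x n x k GEMMs."""
    # Each GEMM performs m*n*k multiply-adds, i.e. 2*m*n*k FLOPs.
    return groups * 2 * m * n * k

def tflops(groups: int, m: int, n: int, k: int, seconds: float) -> float:
    """Achieved TFLOP/s for a measured runtime in seconds."""
    return grouped_gemm_flops(groups, m, n, k) / seconds / 1e12

# The parameters from the example command above:
# 16 groups of 64 x 64 x 768 GEMMs.
print(grouped_gemm_flops(16, 64, 64, 768))  # 100663296
```

Dividing this count by the per-iteration runtime the benchmark reports gives a rough achieved-throughput figure to compare against the GPU's peak.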
- Export the CUDA kernel as a Python extension:
$ python ./python/testbed/lib.py
$ cd out && TORCH_CUDA_ARCH_LIST="8.0" python setup.py install --user
- Run the Python benchmark and save the results:
$ python ./python/testbed/multi_gemm.py > perf.txt
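For reference, a grouped GEMM computes `groups` independent matrix products C_i = A_i * B_i in a single batched launch. A stdlib-only sketch of those semantics (for checking correctness, not performance; the function names are illustrative and not part of the repo):

```python
def matmul(a, b):
    """Naive m x k by k x n matrix product on nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]

def grouped_gemm(a_list, b_list):
    """One GEMM per group; each group carries its own operands."""
    return [matmul(a, b) for a, b in zip(a_list, b_list)]

# Two tiny groups as a sanity check.
a_list = [[[1, 2], [3, 4]], [[0, 1], [1, 0]]]
b_list = [[[5, 6], [7, 8]], [[2, 3], [4, 5]]]
print(grouped_gemm(a_list, b_list))  # [[[19, 22], [43, 50]], [[4, 5], [2, 3]]]
```

The CUDA benchmark fuses all groups into one kernel launch, which is where the performance interest lies; this sketch only pins down what result that launch must produce.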
- CUTLASS Examples
  - "02_pytorch_extension_grouped_gemm" notebook: A guide to implementing grouped GEMM operations as PyTorch extensions.
  - "gemm_grouped" CUDA example: Example code and documentation for grouped GEMM operations in CUDA.