A set of unit tests that demonstrate the usage of sparse and block-scaled sparse Blackwell SM100 GEMMs.
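These sparse kernels consume NVIDIA's 2:4 structured-sparsity format, in which each group of four consecutive values along the K dimension keeps at most two nonzeros plus 2-bit position metadata. A minimal pure-Python sketch of the compress/decompress round trip (illustrative only; the real kernels pack values and metadata into hardware-specific layouts):

```python
def compress_2to4(row):
    """Compress a row with 2:4 structured sparsity.

    For each group of 4 values, keep the 2 largest-magnitude entries
    (pruning the rest to zero) and record their positions as metadata.
    """
    assert len(row) % 4 == 0
    values, metadata = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Positions of the two largest-magnitude elements, in ascending order.
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        values.extend(group[i] for i in keep)
        metadata.append(tuple(keep))
    return values, metadata


def decompress_2to4(values, metadata, width):
    """Expand (values, metadata) back to a dense row (pruned slots are zero)."""
    row = [0.0] * width
    it = iter(values)
    for g, keep in enumerate(metadata):
        for i in keep:
            row[4 * g + i] = next(it)
    return row


row = [0.0, 3.0, 0.0, -1.0, 2.0, 0.0, 0.0, 5.0]
vals, meta = compress_2to4(row)
dense = decompress_2to4(vals, meta, len(row))
print(vals)   # [3.0, -1.0, 2.0, 5.0]
print(dense)  # round-trips exactly because the row is already 2:4 sparse
```

The compressed operand stores only half the values plus small metadata, which is what lets the sparse tensor cores skip the pruned multiplications.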
A new Multi-head Latent Attention (MLA) CUTLASS example for the SM100 Blackwell architecture covers the FlashMLA-like weight-absorbed decoding use case.
A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS FMHA example to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
Added support for enhanced kernel performance search (auto-tuning) in the CUTLASS profiler:
- Sorting performance results by GFLOPs/second: users can now sort the final performance report by GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for the best kernel performance in GFLOPs/second: the profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.

More detailed documentation and examples for leveraging this feature can be found in profiler.md.
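The GFLOPs/second figure the profiler sorts on follows from the standard GEMM operation count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term). A small sketch of computing and ranking by that metric, using hypothetical timings and configuration names (not actual profiler output):

```python
def gemm_gflops_per_sec(m, n, k, runtime_ms):
    """A GEMM performs 2*M*N*K flops; divide by runtime to get GFLOP/s."""
    return 2.0 * m * n * k / (runtime_ms * 1e-3) / 1e9


# Hypothetical timings for one problem size under different swizzle /
# rasterization / cluster settings (names and numbers are illustrative).
results = [
    {"config": "swizzle=1, raster=AlongM, cluster=1x1", "runtime_ms": 2.10},
    {"config": "swizzle=4, raster=AlongN, cluster=2x1", "runtime_ms": 1.45},
    {"config": "swizzle=2, raster=AlongM, cluster=2x2", "runtime_ms": 1.62},
]
m = n = k = 4096
for r in results:
    r["gflops"] = gemm_gflops_per_sec(m, n, k, r["runtime_ms"])

# Sort descending by GFLOP/s, as the profiler's sorted report does.
best_first = sorted(results, key=lambda r: r["gflops"], reverse=True)
for r in best_first:
    print(f'{r["gflops"]:9.1f} GFLOP/s  {r["config"]}')
```

Because the flop count is fixed for a given M, N, K, ranking by GFLOP/s is equivalent to ranking by inverse runtime within one problem size; across different problem sizes it normalizes for the work performed.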
Support for void as the D element in SM100 kernel epilogues.
This discussion was created from the release CUTLASS 3.9.0.