Release CUTLASS 4.1.0 · NVIDIA/cutlass

CuTe DSL

Add aarch64 support: you can now pip install nvidia-cutlass-dsl on GB200 systems!
Add more examples demonstrating how to use CuTe DSL to write peak-performance kernels:
- Blackwell Mamba2 SSD
- Blackwell SM100 persistent dense blockscaled GEMM with static scheduling
API updates
- Please refer to FUNCTIONALITY.md for details
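
After running pip install nvidia-cutlass-dsl, a quick sanity check can confirm the package is visible to the interpreter and show the host architecture. The sketch below uses only the Python standard library; the helper name cutlass_dsl_status is hypothetical, and the expectation that a GB200 (Linux, Arm) host reports "aarch64" is an assumption about the platform rather than something stated in these notes.

```python
import platform
from importlib import metadata


def cutlass_dsl_status(dist_name: str = "nvidia-cutlass-dsl") -> dict:
    """Report the CPU architecture and the installed version of dist_name, if any."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        # The distribution is not installed in this environment.
        version = None
    return {"machine": platform.machine(), "version": version}


if __name__ == "__main__":
    # On an Arm Linux host, "machine" is typically "aarch64" (assumption).
    print(cutlass_dsl_status())
```

A version of None means the distribution is not installed in the active environment.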

CUTLASS C++

Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- Support empty batch sequences.
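
To make the variable-length and empty-sequence items above concrete: packed variable-length batches are commonly described by cumulative token offsets, where an empty sequence appears as two equal adjacent offsets. The sketch below illustrates that general convention in plain Python. It is an assumption for illustration only, not the actual interface of example 77, and both function names are hypothetical.

```python
from itertools import accumulate


def cumulative_offsets(seq_lens):
    """Return offsets [0, l0, l0+l1, ...] for sequences packed back to back."""
    if any(n < 0 for n in seq_lens):
        raise ValueError("sequence lengths must be non-negative")
    return [0] + list(accumulate(seq_lens))


def batch_slices(offsets):
    """Return (start, end) token ranges per batch entry; empty entries have start == end."""
    return list(zip(offsets[:-1], offsets[1:]))


offsets = cumulative_offsets([3, 0, 5])  # the middle sequence is empty
print(offsets)                # [0, 3, 3, 8]
print(batch_slices(offsets))  # [(0, 3), (3, 3), (3, 8)]
```

A kernel or test runner that consumes such offsets has to tolerate start == end, which is the case the empty-sequence item covers.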
Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays.
CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
- Factor out print_latex and friends and rewrite.
- Factor out print_svg and friends and rewrite.
Support Blackwell SM100 SIMT packed fp32x2 kernels.
Support residual add for implicit gemm kernels.
Various fixes for CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for SM90 that reports invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
- Replace the NotImplementedError with a fallback that packs all nodes into a single topological visitor node.
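
The parallel-edge fix above can be pictured with a toy graph: when the same (source, destination) edge is requested twice, the second connection is routed through a fresh identity node so the graph never holds two identical edges. This is a minimal sketch of the idea only. ToyGraph and its methods are hypothetical and do not reflect the EVT tracer's real classes.

```python
class ToyGraph:
    """A tiny directed graph that never stores two parallel edges."""

    def __init__(self):
        self.edges = set()         # (src, dst) pairs
        self.ops = {}              # node name -> op label
        self._identity_count = 0   # used to name inserted identity nodes

    def add_node(self, name, op):
        self.ops[name] = op

    def add_edge(self, src, dst):
        if (src, dst) not in self.edges:
            self.edges.add((src, dst))
            return
        # The edge already exists: route this connection through a new
        # identity node instead of creating a second parallel edge.
        self._identity_count += 1
        ident = f"identity_{self._identity_count}"
        self.add_node(ident, "identity")
        self.edges.add((src, ident))
        self.edges.add((ident, dst))


g = ToyGraph()
g.add_node("x", "load")
g.add_node("mul", "multiply")
g.add_edge("x", "mul")
g.add_edge("x", "mul")  # x * x consumes the same input twice
print(sorted(g.edges))
# [('identity_1', 'mul'), ('x', 'identity_1'), ('x', 'mul')]
```

The identity node computes nothing new; it only gives the second use of the input a distinct path into the consumer.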
Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
Fix other profiler issues.
- Complete the reference for Blackwell blockwise GEMM kernels.
- Fix incorrect regex logic for L1 test.
