CuTe DSL
- Add aarch64 support; you can now pip install `nvidia-cutlass-dsl` on GB200 systems!
- More examples demonstrating how to use CuTe DSL to write peak-performance kernels
- API updates
- Please refer to FUNCTIONALITY.md for details
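
As a quick sanity check after installing on an aarch64 host, the snippet below reports the machine architecture and whether the `nvidia-cutlass-dsl` distribution is visible to the current interpreter. It is a minimal sketch using only the Python standard library; the helper name `describe_install` is hypothetical and not part of CuTe DSL.

```python
import platform
from importlib import metadata


def describe_install(dist_name: str = "nvidia-cutlass-dsl") -> str:
    """Summarize the host architecture and whether dist_name is installed."""
    # platform.machine() typically reports "aarch64" on Arm Linux hosts
    # and "x86_64" on x86 hosts.
    arch = platform.machine()
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return f"{dist_name} is not installed (arch: {arch})"
    return f"{dist_name} {version} is installed (arch: {arch})"


print(describe_install())
```

The check only inspects installed package metadata, so it does not import the package or require a GPU.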
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
- Add varlen test support to Backward runner.
- Support empty batch sequences in the code.
- Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
- CuTe changes:
- Rewrite ArithTuple and ScaledBasis for robustness and clarity.
- Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
- Factor out `print_latex` and friends and rewrite.
- Factor out `print_svg` and friends and rewrite.
- Support Blackwell SM100 SIMT packed fp32x2 kernels.
- Support residual add for implicit gemm kernels.
- Various fixes for CUTLASS C++ Python interface's EVT tracer:
- Add a verifier for SM90 that reports invalid input.
- When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
- Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
- Replace the NotImplementedError with a fallback that packs all nodes into a single topological visitor node.
- Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
- Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
- Fix some profiler issues.
- Complete the reference for Blackwell blockwise gemm kernels.
- Fix incorrect regex logic for L1 test.
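
The EVT tracer fix above that avoids parallel edges can be illustrated with a toy graph. This is a minimal sketch of the idea only, not the tracer's actual implementation: the `Dag` class, its method names, and the identity-node naming scheme are all hypothetical. When an edge between the same two nodes is requested a second time (for example, `x * x` feeds `x` into both operands of a multiply), the duplicate is routed through a fresh identity node so that every (source, destination) pair carries at most one edge.

```python
class Dag:
    """Toy DAG that permits at most one edge per (src, dst) pair (hypothetical)."""

    def __init__(self):
        self.edges = set()     # set of (src, dst) pairs
        self.ops = {}          # node name -> operation label
        self._n_identity = 0   # counter used to name inserted identity nodes

    def add_node(self, name, op):
        self.ops[name] = op

    def add_edge(self, src, dst):
        if (src, dst) not in self.edges:
            self.edges.add((src, dst))
            return
        # The edge already exists: insert an identity compute node and
        # route the duplicate through it (src -> identity -> dst).
        ident = f"{src}_identity_{self._n_identity}"
        self._n_identity += 1
        self.add_node(ident, "identity")
        self.edges.add((src, ident))
        self.edges.add((ident, dst))


# x * x: the multiply consumes the same producer twice.
g = Dag()
g.add_node("x", "load")
g.add_node("mul", "multiply")
g.add_edge("x", "mul")
g.add_edge("x", "mul")  # duplicate edge triggers the identity insertion
```

After the second `add_edge` call the graph holds three distinct edges and one extra node labeled `identity`, so it stays a simple graph while preserving both data dependencies.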