CuTe DSL
CuTe DSL is a Python DSL centered around CuTe's abstractions
- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs
- Core DSL implementation files
- DSL quick start
- DSL Overview
- Educational notebooks for getting started with CuTe DSL
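As a rough illustration of the layout idea behind CuTe's abstractions, the sketch below maps a logical coordinate to a linear memory offset by taking the inner product of the coordinate with a stride tuple. It is plain Python and does not use the CuTe DSL package or its API. The helper name `layout_index` and the 4x8 tile are assumptions made up for this example.

```python
# Plain-Python illustration only; this does NOT call the CuTe DSL.
# `layout_index` is a hypothetical helper introduced for this sketch.

def layout_index(coord, stride):
    """Map a logical coordinate to a linear offset: sum of coord[i] * stride[i]."""
    if len(coord) != len(stride):
        raise ValueError("coord and stride must have the same rank")
    return sum(c * d for c, d in zip(coord, stride))

# A 4x8 tile described by a shape and two alternative stride tuples.
shape = (4, 8)
col_major = (1, 4)  # consecutive rows are adjacent in memory
row_major = (8, 1)  # consecutive columns are adjacent in memory

# The same logical coordinate lands at different offsets per layout.
offset_col = layout_index((2, 3), col_major)
offset_row = layout_index((2, 3), row_major)

# Enumerate every coordinate of the tile under the column-major strides.
all_offsets = sorted(
    layout_index((i, j), col_major)
    for i in range(shape[0])
    for j in range(shape[1])
)
```

Both stride choices address the same 32 elements; only the order in memory differs, which is why one algorithm can be written against logical coordinates and reused across memory layouts.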
CUTLASS C++
- Support Family Specific Architecture Features, which were introduced in CUDA 12.9
- Further improve Blockwise and Groupwise GEMMs on Hopper and Blackwell
- Enhance Blackwell SM100 Attention kernels in example 77
- Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests
- New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA
- CuTe enhancements: CuTe C++ reduce op
- Other functional and performance enhancements