Releases: NVIDIA/cutlass
CUTLASS 4.4.1
CUTLASS 4.4.0
CuTe DSL
- New features
- CuTe DSL now supports CUDA toolkit 13.1!
- Set up with cutlass/python/CuTeDSL/setup.sh --cu13
- Refer to https://docs.nvidia.com/cutlass/latest/media/docs/pythonDSL/quick_start.html for more details
- GB300 is now supported in CuTe DSL with CTK 13.1
- Refer to SM103 batched 3xFP4 blockscaled GEMM kernel for example kernel
- cute.experimental: introduce a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction), which can be mixed with existing CuTe DSL building blocks.
- Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
- Automatic TMA descriptor generation and update insertion.
- Automatic vectorization and predication for SIMT copies.
- New pipeline abstraction with convenience wrappers
- New Partition ops to simplify partitioning logic.
- Device-side TMA descriptor allocation, initialization, and management
- These examples can be found here https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental
- Ahead of Time (AoT) compilation is now available!
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/cute/export for example usage
- JAX support - you can now use CuTeDSL along with JAX
- Refer to files under https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/jax for example usage
- Introduced versioning support in DSL:
- cutlass.__version__ for a string representation of DSL version
- cutlass.CUDA_VERSION for a version class to tell the CUDA version used for DSL
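A minimal sketch of how a version string could be used to gate features. The helpers `parse_version` and `at_least` are illustrative names, not part of the DSL, and the attribute is read here as `__version__` on the assumption that it follows the usual Python convention. The import is guarded so the snippet also runs where the DSL is not installed.

```python
import re


def parse_version(text):
    """Return the leading numeric components of a version string as ints."""
    match = re.match(r"(\d+(?:\.\d+)*)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {text!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def at_least(version_text, minimum):
    """True if version_text >= minimum, padding the shorter tuple with zeros."""
    have = parse_version(version_text)
    want = tuple(minimum)
    width = max(len(have), len(want))
    have += (0,) * (width - len(have))
    want += (0,) * (width - len(want))
    return have >= want


try:
    import cutlass  # only present where the CuTe DSL wheel is installed

    # Assumption: the version string is exposed as __version__.
    dsl_version = getattr(cutlass, "__version__", None)
except Exception:
    dsl_version = None

if dsl_version is not None:
    print("CuTe DSL >= 4.4:", at_least(dsl_version, (4, 4)))
```

Comparing numeric tuples rather than raw strings avoids ordering surprises such as "4.10" sorting before "4.4".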
- Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
- Grouped GEMM example now supports device-only problem shapes.
- We allow grid carve-out without problem shapes being available on host.
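The API changes in this release describe a `--host_problem_shape_available` flag: when it is given, the grid is carved out from the host problem shapes, otherwise the maximum possible number of SMs is launched. The sketch below illustrates that decision only. `pick_grid_size`, the tile shape, the SM count, and the boolean form of the flag are assumptions for illustration, not the example's actual code.

```python
import argparse


def ceil_div(a, b):
    return (a + b - 1) // b


def pick_grid_size(num_sms, tile_mn, host_problem_shapes=None):
    """Number of persistent CTAs to launch (hypothetical helper).

    With host shapes: one CTA per output tile, capped at the SM count.
    Without host shapes: launch on every SM and leave work discovery
    to the device side.
    """
    if host_problem_shapes is None:
        return num_sms
    tile_m, tile_n = tile_mn
    total_tiles = sum(
        ceil_div(m, tile_m) * ceil_div(n, tile_n)
        for m, n, _k in host_problem_shapes
    )
    return max(1, min(num_sms, total_tiles))


parser = argparse.ArgumentParser()
# Assumption: the flag is a simple boolean switch.
parser.add_argument("--host_problem_shape_available", action="store_true")
args = parser.parse_args(["--host_problem_shape_available"])

shapes = [(256, 512, 64), (128, 128, 64)]  # (M, N, K) per group, made up
grid = pick_grid_size(
    num_sms=148,  # illustrative SM count
    tile_mn=(128, 128),
    host_problem_shapes=shapes if args.host_problem_shape_available else None,
)
print(grid)
```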
- Tma+LdMatrix features for loading+unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
- It is possible now to have customized epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. It also provides a PyTorch evaluator to compare the results.
- More examples of authoring peak-performance kernels
- SM103 batched 3xFP4 blockscaled GEMM kernel
- Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
- New acc_scale grouped mixed input gemm kernel variant is introduced to deliver better performance for decoding cases.
- All mixed_input_gemm examples are moved into a separate folder mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.
- Bug fixing and improvements
- Fixed an issue where both branches of an if statement were executed
- Fixed cute.printf with f-string
- Fixed an indexing issue of scalar tensor
- Fixed small K reference check error for cta_tile_n = 256 case with overlapping accumulator optimization in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- API changes
- Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
- Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
- LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling __init__, to avoid ambiguity in data transposition.
- LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
- Grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, grid is carved out based upon the host problem shapes, otherwise, we launch maximum possible SMs.
- hardware_info.get_max_active_cluster now supports passing in a specific stream to query. This is useful for green-context-based SM partitioning.
- group_bulk_copy_modes in async bulk copy example is now deprecated; use group_modes directly instead.
- Deprecate passing nvvm enums to the nvvm wrapper; use str instead.
- cute.arch.calc_packed_f32x2_op now disables ftz by default (it was previously enabled by default).
- In CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum as an argument:
- fence_proxy
- fence_view_async_tmem_op
- calc_packed_f32x2_op
- warp_redux_sync
- atomic_add
- atomic_and
- atomic_or
- atomic_xor
- atomic_max
- atomic_min
- atomic_exch
- atomic_cas
- store
- load
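One way to ease a migration from enum arguments to string literals is a small normalizing shim at call sites. The sketch below is generic Python and does not call the DSL. `MemOrder`, its members, and the lowercase-name mapping are hypothetical stand-ins. The strings each API actually accepts should be taken from the DSL documentation.

```python
from enum import Enum


class MemOrder(Enum):
    """Hypothetical legacy enum, used only to illustrate the shim."""

    RELAXED = 0
    ACQUIRE = 1


def as_str_literal(value):
    """Normalize an enum member or a string to a plain string.

    Assumption: the string form is the lowercase member name. Verify
    this against the real API before relying on it.
    """
    if isinstance(value, Enum):
        return value.name.lower()
    if isinstance(value, str):
        return value
    raise TypeError(f"expected str or Enum, got {type(value).__name__}")


print(as_str_literal(MemOrder.RELAXED))
print(as_str_literal("acquire"))
```

Call sites can then be moved to plain strings one at a time, and the shim removed once no enum members are passed.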
- Use 'Advanced control file' for mixed input gemm examples for better performance.
- Advanced control file is an experimental feature of the CUDA compiler. The controls file contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit to get better GPU kernel code. More details and documentation on how to create these controls files will be provided in a future CUDA toolkit release. Note: The advanced compiler control file is not expected to work for kernels that it was not tuned for. There is no compatibility guarantee, and the controls file will not work with a different version of the CUDA toolkit.
CUTLASS C++
- Add example 93 for Blackwell low latency generation phase GQA kernel.
- Flash Decoding with cluster reduction.
- For kernel design details, please check the Readme.
- Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
- Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
- Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
- Generate additional device/kernel/threadblock files in CUTLASS include directory that add functionality to carry the scaling tensors + use them in MMA loop.
- Add gemm_blockwise to include files in default_mma_core_sm80
- Add Hopper e2m1 to fp32 optimized conversion and e2m1 * TF32 tensor core GEMM.
- Set MmaType to tfloat32_t for FP32 mode.
- TF32 provides FP32 inputs with reduced precision (19-bit vs 32-bit)
- Set TileShapeK=64 for TF32 (K must be multiple of 8)
- Shuffle optimization enabled via compute_memory_reordering_atom<tfloat32_t>()
- E2M1 -> FP32 -> TF32 TC path for mixed-precision GEMM
- Enable example 55 with TF32 support
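To make the E2M1 -> FP32 -> TF32 path above concrete, here is a small host-side sketch in plain Python. It decodes a 4-bit E2M1 code (1 sign bit, 2 exponent bits with bias 1, 1 mantissa bit) and then emulates TF32's 19-bit precision (1 sign, 8 exponent, 10 mantissa bits) by clearing the low 13 mantissa bits of the FP32 encoding. Truncation is used for simplicity, and hardware conversion may round differently. Both function names are illustrative, and none of this is the kernel's optimized conversion.

```python
import struct


def e2m1_to_float(code):
    """Decode a 4-bit E2M1 code (0..15) to a Python float."""
    if not 0 <= code <= 0xF:
        raise ValueError("E2M1 codes are 4 bits wide")
    sign = -1.0 if code & 0x8 else 1.0
    exponent = (code >> 1) & 0x3
    mantissa = code & 0x1
    if exponent == 0:
        magnitude = 0.5 * mantissa  # subnormal: 0 or 0.5
    else:
        magnitude = (1.0 + 0.5 * mantissa) * 2.0 ** (exponent - 1)
    return sign * magnitude


def truncate_to_tf32(x):
    """Keep sign, 8 exponent bits and the top 10 mantissa bits of FP32."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits &= 0xFFFFE000  # clear the low 13 of the 23 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


positive_values = [e2m1_to_float(code) for code in range(8)]
print(positive_values)
print(truncate_to_tf32(1.0 + 2.0 ** -11))
```

Every E2M1 value needs at most one mantissa bit, so it survives the TF32 truncation unchanged. The precision loss in this path applies only to operands that actually use more than 10 mantissa bits.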
- Add support for arbitrary application-provided strides for block-scale tensors.
- Users and applications now must pass valid block-scale strides in all cases, even when the tensor is packed.
- Support 4x blockscaled public ptx for CUDA 13.1.
- Allow non-static TmaGbasis in AuxTmaParams.
- Some cases in attention kernel may require non-static tma_gbasis.
- Relax the restriction on TmaGbasis parameter of AuxTmaParams and users are allowed to manually construct a dynamic gbasis.
- Fix some kernel issues:
- Fix MSVC preprocessing issue.
- Fix a self-assignment issue in GEMV kernel.
- Fix a TMA descriptor bug where the CUDA driver does not set the OOB address gen mode correctly.
- Fix memory fence for clc scheduler in Blackwell SM120 pingpong kernel.
- Fix missing SMEM alignment in Blackwell SM120 scale factors.
- Fix a PDL issue for grouped gemm.
- Fix divide-by-zero issue in canimplement for sm100 implicit gemm kernels.
- Fix cluster swizzle for Grouped GEMMs.
- Move host-side swizzling heuristics to device.
- Apply swizzle per group based on problem shape and max swizzle size.
- Improve examples and unit tests.
- Fix some profiler issues:
- Fix a core dump issue for nvfp4 grouped GEMM kernel.
- Fix inconsistent GEMM verification logic.
- Rework grouped gemm verification logic for different types.
- Fix an API-breaking change in the use of nvMatmulHeuristics.
- Fix some broken links under media/docs.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit versions 13.1.
CUTLASS 4.3.5
CuTe DSL
- Bug fixing and improvements
- Fixed the unexpected CPU overhead issue introduced by 4.3.4
- Update copyright to 2026.
CUTLASS C++
- Update copyright to 2026.
- Use the CUDA runtime API rather than the driver API to get the driver version.
CUTLASS 4.3.4
CuTe DSL
- New features
- Added PDL support along with an example: Kernel launch with Programmatic Dependent Launch
- Bug fixing and improvements
- Fixed a frame refcnt issue with cuda graph
- Enhancement for tvm-ffi AoT case for earlier module unload
- Fixed order issue in make_smem_layout_a in utils/hopper_helpers.py
CUTLASS C++
- Work around a driver TMA descriptor related bug which occasionally causes errors on Blackwell when the tensor's backing memory allocation is less than 128KB and it is not a dense non-overlapping tensor.
CUTLASS 4.3.3
CuTe DSL
- New features
- Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
- Supported variadic tuples for JIT function argument in tvm-ffi
- Bug fixing and improvements
- Fixed an issue with JIT function arguments that have a union type annotation in tvm-ffi
- Clearer error message for the case of runtime error cudaErrorInsufficientDriver
CUTLASS 4.3.2
CuTe DSL
- New features
- New env var CUTE_DSL_CACHE_DIR to specify the path for dumping caches
- Bug fixing and improvements
- Fixed an issue of CUDA JitExecutor when unloading kernels
- Fixed an issue of allocating max smem when there's statically allocated smem
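A minimal sketch of pointing the CUTE_DSL_CACHE_DIR variable introduced in this release at a chosen directory from Python. The directory name is a placeholder, and setting the variable before the DSL is imported is an assumption about when it is read. Exporting it in the shell before launching Python is the more conservative choice.

```python
import os
import tempfile

# Hypothetical location; any writable directory should do.
cache_dir = os.path.join(tempfile.gettempdir(), "cute_dsl_cache")
os.makedirs(cache_dir, exist_ok=True)

# Assumption: the variable must be set before the DSL is imported or
# before its first compilation for the setting to take effect.
os.environ["CUTE_DSL_CACHE_DIR"] = cache_dir

print("CuTe DSL cache directory:", os.environ["CUTE_DSL_CACHE_DIR"])
# import cutlass  # import the DSL only after the variable is set
```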
CUTLASS 4.3.1
CuTe DSL
- New features
- Added Blackwell SM103 support
- Multiple dependent DSOs in the wheel have been merged into one single DSO
- Bug fixing and improvements
- Fixed device reset issue with tvm-ffi
- Fixed tvm-ffi export compiled function
CUTLASS C++
- Support blockscaled variant of ragged contiguous grouped gemm with the new simplified MoE API in example 92.
- The new example works for all microscaling types.
CUTLASS 4.3.0
CuTe DSL
- New features:
- Supported Apache TVM-FFI for further reduced host runtime overhead for JIT functions and better interoperability with PyTorch and other ML frameworks
- Added fake tensor and stream to decouple JIT function compilation from the "from_dlpack" flow. Users no longer need a real tensor to compile a JIT function.
- Added FastDivmodDivisor with Python operator overloads, new APIs, Cute dialect integration, and optimized static tile scheduler performance for faster index mapping.
- Added l2 cache evict priority for tma related ops. Users could do fine-grain l2 cache control.
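The idea behind a fast divmod divisor can be sketched in plain Python: precompute a "magic" multiplier and shift once per divisor, then replace each division with a multiply and a shift. The class below uses the classic round-up multiply-shift method for unsigned 32-bit dividends. `MagicDivisor` is a hypothetical name, and this is not the FastDivmodDivisor implementation or its API, only an illustration of the technique with reflected operators so that `n // d`, `n % d` and `divmod(n, d)` work.

```python
class MagicDivisor:
    """Multiply-shift division for dividends in [0, 2**32)."""

    BITS = 32

    def __init__(self, divisor):
        if not 1 <= divisor < (1 << self.BITS):
            raise ValueError("divisor must be in [1, 2**32)")
        self.divisor = divisor
        # shift = ceil(log2(divisor))
        shift = (divisor - 1).bit_length()
        self.total_shift = self.BITS + shift
        # magic = ceil(2**total_shift / divisor)
        self.magic = -(-(1 << self.total_shift) // divisor)

    def divmod(self, n):
        if not 0 <= n < (1 << self.BITS):
            raise ValueError("dividend must be in [0, 2**32)")
        quotient = (n * self.magic) >> self.total_shift
        return quotient, n - quotient * self.divisor

    # Reflected operators: Python falls back to these when the left
    # operand is a plain int.
    def __rfloordiv__(self, n):
        return self.divmod(n)[0]

    def __rmod__(self, n):
        return self.divmod(n)[1]

    def __rdivmod__(self, n):
        return self.divmod(n)


tiles_per_row = MagicDivisor(7)
print(divmod(100, tiles_per_row))
```

A tile scheduler divides many linear indices by the same few extents, so the one-time cost of computing the constants is paid back across every index mapping.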
- Debuggability improvements:
- Supported source location tracking for DSL APIs (Allow tools like nsight profiling to correlate perf metrics with Python source code)
- Supported dumping PTX and CUBIN code: Hello World Example
- More examples and notebooks to get started with CuTe DSL:
- Improved performance of elementwise example:
- Generalize code to handle list of input tensors
- Generalize TV layout computation to handle different data types
- Improved Blackwell SM100 persistent dense GEMM with static scheduling:
- To demonstrate usage of new Pipeline APIs PipelineProducer and PipelineConsumer to simplify code without explicit pipeline state management (Existing APIs are still maintained)
- Separated epilogue code for non-TMA and TMA implementation
- Tutorial for Blackwell GEMM: Basic Blackwell SM100 GEMM
- Baseline Blackwell GEMM achieves 84% SOL performance with MNK 8K
- More examples are coming to demonstrate optimizations: Baseline + X
- Tutorial for Async Pipeline API
- Reworked elementwise add notebook with more details and detailed explanation about TV layout
- Updated implementation to handle general data type and multiple inputs
- Updated explanation for TV layout in simpler language
- Added visualization of TV Layout with 3rd party utils
- Benchmark and autotune demonstration
- More examples of authoring peak-performance kernels:
- Blackwell SM100 mixed-input GEMM
- Blackwell SM100 persistent blockwise dense GEMM
- Blackwell SM100 persistent blockwise contiguous grouped dense GEMM
- Blackwell SM100 persistent blockwise masked grouped dense GEMM
- Blackwell SM100 fmha bwd
- Blackwell SM100 mla
- Hopper SM90 persistent dense GEMM with static scheduling
- Blackwell GeForce batched dense GEMM
- Ampere HSTU Attention
- API updates:
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Add mma_tiler_n=64 and mma_tiler_n=192 support in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.reduce to support static value as initial value
- Updated docstring for following APIs to be more concise and easier to understand:
- make_layout_tv
- is_static
- PipelineAsync
- SmemAllocator
- Fixed documentation for pipeline, utils and cute.math
- Added overlapping accumulator optimization for block tile N = 256 case for better epilogue latency hiding in Blackwell SM100 persistent dense blockscaled GEMM with static scheduling.
- Fixed TensorSSA.__getitem__ indexing to match CuTe's indexing convention
- Fixed an issue with cutlass.max and cutlass.min
- Fixed an issue with mark_compact_shape_dynamic
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add softmax skip correction.
- Fix a shared memory allocation bug: the kernel must explicitly opt in to the maximum dynamic shared memory once it exceeds 48KB.
- Fix a hang caused by an early-returning warp.
- Add support through cmdline argument lists for batch, no_verif, cluster_shape and cluster_shape_fallback in example 89.
- Add Ragged Contiguous Grouped gemm kernel in example 92.
- This kernel uses a TMA 3D load to load the weights matrix and uses the tensormap update method to load activations.
- Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
- Performance is optimized to align with Deepseek implementation.
- Simplification of API for MoE gemms.
- Instead of requiring users to call several cute utilities to set up the stride, API moe_stride_utils is introduced to help set up strides in the kernel.
- Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_hosts, a new problem shape struct called MoEProblemShape is introduced which takes in max_m, max_n, max_k and counts vector as input and deduces problem shapes internally whenever required.
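The idea of deducing per-group problem shapes from max_m, max_n, max_k and a counts vector can be sketched in Python. This is a conceptual model, not the C++ struct: `MoEProblemShapeSketch` is a hypothetical name, and treating the counts entry as the per-group N extent is an assumption made for illustration. Which dimension actually varies is defined by the real struct.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MoEProblemShapeSketch:
    """Illustrative stand-in: fixed maxima plus one count per group."""

    max_m: int
    max_n: int
    max_k: int
    counts: Tuple[int, ...]

    def __post_init__(self):
        for count in self.counts:
            if not 0 <= count <= self.max_n:
                raise ValueError("each count must lie in [0, max_n]")

    @property
    def groups(self):
        return len(self.counts)

    def problem_shape(self, group):
        # Assumption: only N varies per group; M and K stay at their maxima.
        return (self.max_m, self.counts[group], self.max_k)


shape = MoEProblemShapeSketch(max_m=4096, max_n=512, max_k=7168, counts=(17, 0, 512))
print([shape.problem_shape(g) for g in range(shape.groups)])
```

The host then only needs the three maxima and one count per group, instead of a full vector of (M, N, K) triples on both host and device.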
- Enable GEMM_K = 0 in grouped gemm.
- Optimize group gemm kernels by enabling async TMA desc update.
- Support Blackwell SM100 convolution stream-K kernel.
- Unit tests: fprop_streamK, dgrad_streamK, wgrad_streamK.
- Add Blackwell SM100 sparse gemm compressor unit tests.
- Unit tests: compressor_fp16.
- Add sub-bytes and runtime data type support in compressor unit test testbed.
- Add profiler support for:
- Blackwell SM100 and SM120 blockscaled sparse kernels.
- New MoE grouped gemm API.
- Blackwell SM100 cpasync kernel.
- Fix some kernel issues:
- Fix a race check issue of Blackwell SM103 kernels by adding missing elect one for prefetch barrier initialization.
- Allow user to directly specify the number of stages for Hopper sm90 mixed input gemm.
- Remove warnings caused by cuda vector type alignment setting in CUDA 13.
- Remove problematic cutlass::int8_t and replace it with int8_t.
- Fix a few bugs in distributed gemm API and examples.
- Fix handling negative zero in sparse compressor.
- Add missing wait_on_dependent_grids for PDL use case.
- Fix some profiler issues:
- Add some missing reference kernels.
- Support VoidC reference kernels.
- Add calculation of scale factor A and B in function bytes_with_problem_shape of block scaled profiler.
- Fix an issue when epilogue tile N is not divisible by default subtile N.
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit versions 13.0U1.
CUTLASS 4.2.1
CuTe DSL
- Bug fixes and improvements
- Fixed an issue when running DSL codes with cuda-python 13.0
- Fixed an issue when running inductor with DSL codes
- Fixed an issue with unexpected logging when running DSL codes in FlashInfer
- Fixed the issue reported in #2647
- Fixed an issue with conditional definition of variables outside of dynamic control flow
CUTLASS C++
- Bypass EVT for nosmem blockwise kernels on Blackwell.
- Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64, including
- Python 3.10, 3.11, 3.12, and 3.13
- Added new example and updated notebook to get started with CuTe DSL
- Call kernels with dlpack bypassed
- Updates on TensorSSA demonstration
- Added a section for introducing the broadcast
- API updates
- Please refer to DSL API changelog for details
- Bug fixes and improvements
- Fixed cute.print_tensor for coordinate tensor
- Fixed cute.print for tuple of layouts
- Fixed frozen object not being properly updated after being fully assigned in dynamic control flow
- Fixed a compilation failure that could occur when assigning a tuple/list element in dynamic control flow
- Improved error message when CUDA context is not initialized
- Improved docstring of congruent and weakly_congruent
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
- Collective mainloop codes: Blockscaled datatypes with support for dense GEMM mainloop
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Kernel codes: Blockscaled datatypes with support for dense GEMM kernel.
- Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
- Unit test files with prefix name of sm103_ under GEMM device unit tests.
- Support for Blackwell SM121 kernels for DGX Spark GPUs.
- Share the major codes with Blackwell SM120 kernels.
- Add support for heuristics-based kernel filtering and autotuning using nvidia-matmul-heuristics to find the best kernels for a given scenario.
- For details, please refer to the heuristics doc.
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add fused reduction kernel support for cutlass MLA.
- Add softmax skip correction.
- Support for GQA in FMHA backward kernel.
- Fix an issue where get_unmasked_trip_count may return a negative value.
- Fix an issue where mbarriers are initialized with a zero arrival count.
- Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
- Remove tma padding for forward kernel inputs.
- Add Blackwell SM100 kernels for MoEs (focusing on Low-Latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allow only one problem dimension to vary across groups/experts, unlike general Grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on API is welcome.
- Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
- On Blackwell SM120, a blockwise gemm kernel is added: example 87.
- On Hopper, add K major scale factor support for SM90 blockwise kernels.
- On Hopper, relax the restriction that the k dimension of the problem size has to be a multiple of the k dimension of the tile size.
- On Hopper, grouped version supports the case when k = 0.
- Support for Blackwell SM100 fp4 gemv kernels.
- Kernel codes: Gemv kernel.
- Example codes: example 91
- Support for Blackwell SM100 legacy mixed input GEMM kernels.
- Collective mainloop codes: Mixed input mainloop.
- Kernel codes: Mixed input kernel.
- Example codes: example 86.
- Support for Blackwell SM100 cpasync kernel.
- Collective mainloop codes: cpasync mainloop.
- Kernel codes: cpasync kernel.
- Support Blackwell SM120 mixed input blockscaled grouped GEMM.
- Instantiating more Blackwell kernels in profiler.
- Blackwell SM100 and SM103 kernels support CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate all possible combinations.
- To use this feature, CUTLASS_LIBRARY_KERNELS must be non-empty. Profiler will combine CUTLASS_LIBRARY_KERNELS and CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate specific kernels.
- For details, please check the Profiler Doc.
- Fix some profiler issues:
- Modify default cluster callback values to be non-zero to avoid profiler failure when these values are not set on the command line.
- Fix some no output and timeout issues.
- Fix Pingpong Blockwise Hopper library generation.
- From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
- For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
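The renaming rule above amounts to a single version comparison. A minimal sketch, where `thor_sm_number` is a hypothetical helper name and the CUDA toolkit version is passed as a (major, minor) tuple:

```python
def thor_sm_number(cuda_version):
    """Return the SM number used for Thor GPUs under a given CUDA toolkit.

    cuda_version is a (major, minor) tuple such as (13, 0).
    """
    major, minor = cuda_version
    return 110 if (major, minor) >= (13, 0) else 101


for version in [(12, 9), (13, 0), (13, 1)]:
    print(version, "->", f"SM{thor_sm_number(version)}")
```

Build scripts that must support toolkits on both sides of the rename can use a check like this to pick the architecture number instead of hard-coding one.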
- Rename legacy Python API package from cutlass to cutlass_cppgen and add Blackwell EVT support to legacy Python interface.
- Restructured the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's EpilogueDescriptors.
- Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
- Added some support for running SM100 kernels via the Python interface.
- CuTe changes:
- Fix inaccurate GridDim calculation under CuTe tutorial.
- Add movmatrix support.
- Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
- Support fp16 accumulator for sm89 fp8 mma.
- Shorten nullspace implementation.
- Isolate and comment on cosize hacks.
- Important documentation correction: E<0,1> == 1@0@1.
- Fix some kernel issues:
- Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
- Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
- Add following unit tests:
- Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
- Optimal code generation with CUDA toolkit versions 13.0U1.