
Releases: NVIDIA/cutlass

CUTLASS 4.4.1

28 Feb 03:30
4370102

CuTe DSL

  • Bug fixes and improvements
    • Fixed a segfault issue with tvm-ffi on aarch64

CUTLASS 4.4.0

26 Feb 04:01
c213bfd

CuTe DSL

  • New features

    • CuTe DSL now supports CUDA toolkit 13.1!
    • GB300 is now supported in CuTe DSL with CTK 13.1
    • cute.experimental: introduces a higher-level, composable layer on top of existing CuTe DSL APIs (not a separate abstraction) that can be mixed with existing CuTe DSL building blocks.
      • Fragment-free programming model: copy/dot APIs take memrefs directly instead of descriptors/fragments.
      • Automatic TMA descriptor generation and update insertion.
      • Automatic vectorization and predication for SIMT copies.
      • New pipeline abstraction with convenience wrappers.
      • New Partition ops to simplify partitioning logic.
      • Device-side TMA descriptor allocation, initialization, and management.
      • Examples can be found at https://github.com/NVIDIA/cutlass/tree/main/examples/python/CuTeDSL/experimental
    • Ahead of Time (AoT) compilation is now available!
    • JAX support: you can now use CuTe DSL alongside JAX
    • Introduced versioning support in DSL:
      • cutlass.version for a string representation of DSL version
      • cutlass.CUDA_VERSION for a version class reporting the CUDA version used by the DSL
    • Added CopyDsmemStoreOp to store data to distributed shared memory with explicit synchronization.
    • Grouped GEMM example now supports device-only problem shapes.
    • Grid carve-out is now allowed without problem shapes being available on the host.
    • TMA+LdMatrix features for loading and unpacking narrow-width types (refer to mixed_input_fmha_decode.py for example usage).
    • It is now possible to have customized epilogue fusion for persistent dense GEMM through a Python Epilogue Fusion Configuration (EFC) function, somewhat similar to CUTLASS C++ EVT. A PyTorch evaluator is also provided to compare the results.
  • More examples of authoring peak-performance kernels

    • SM103 batched 3xFP4 blockscaled GEMM kernel
    • Mixed input FMHA decode example with support for int4 KV (int8 KV supported in 4.3)
    • A new acc_scale grouped mixed input GEMM kernel variant is introduced to deliver better performance for decoding cases.
    • All mixed_input_gemm examples are moved into a separate folder mixed_input_gemm. Common utility functions are also extracted into mixed_input_host_utils.py under the same folder.
  • Bug fixing and improvements

  • API changes

    • Deprecate get_num_tmem_alloc_cols from blackwell_helpers.py. Use the one from tmem_allocator.py instead.
    • Deprecate SM100_TMEM_CAPACITY_COLUMNS and SM100_TMEM_MIN_ALLOC_COLUMNS.
    • LdMatrix16x16x8bOp and StMatrix16x8x8bOp now require explicit transpose=True when calling init, to avoid ambiguity in data transposition.
    • LdMatrix16x16x8bOp copy traits updated to be faithful to PTX without permutations. Permuted variant is renamed to LdMatrix16x8x8bOp.
    • The grouped GEMM example takes the argument --host_problem_shape_available. If the argument is provided, the grid is carved out based upon the host problem shapes; otherwise, the maximum possible number of SMs is launched.
    • hardware_info.get_max_active_cluster now supports passing in a specific stream to query, which is useful for green-context-based SM partitioning.
    • group_bulk_copy_modes in the async bulk copy example is now deprecated; use group_modes directly instead.
    • Deprecate nvvm enum arguments in nvvm wrappers; use str instead.
    • cute.arch.calc_packed_f32x2_op now disables ftz by default (it was previously enabled by default).
    • In CuTe DSL with CTK 13.1, the following APIs in cutlass.cute.arch now require a string literal instead of an enum as the argument:
      • fence_proxy
      • fence_view_async_tmem_op
      • calc_packed_f32x2_op
      • warp_redux_sync
      • atomic_add
      • atomic_and
      • atomic_or
      • atomic_xor
      • atomic_max
      • atomic_min
      • atomic_exch
      • atomic_cas
      • store
      • load
  • Use an 'Advanced control file' for mixed input GEMM examples for better performance.

    • The advanced control file is an experimental feature of the CUDA compiler. The control file contains internal compiler settings tuned for specific kernels with a specific version of the CUDA toolkit, to produce better GPU kernel code. More details and documentation on how to create these control files will be provided in a future CUDA toolkit release. Note: the advanced compiler control file is not expected to work for kernels it was not tuned for. There is no compatibility guarantee, and the control file will not work with a different CUDA toolkit version.
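
Several items above concern narrow-width types such as e2m1 (FP4), e.g. the TMA+LdMatrix loading/unpacking features and the int4 KV decode example. As background, an e2m1 value packs a sign bit, 2 exponent bits, and 1 mantissa bit into 4 bits. A stdlib-only reference decoder, purely illustrative and not a CuTe DSL API:

```python
def decode_e2m1(nibble: int) -> float:
    """Decode a 4-bit e2m1 (FP4) value: 1 sign, 2 exponent, 1 mantissa bit."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:                      # subnormal range: 0.0 or 0.5
        return sign * man * 0.5
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)

def unpack_byte(b: int) -> tuple:
    """Unpack two e2m1 values from one byte (low nibble first)."""
    return decode_e2m1(b & 0xF), decode_e2m1(b >> 4)

# The eight non-negative e2m1 values:
print([decode_e2m1(i) for i in range(8)])  # [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

The small code range (only 16 values) is why e2m1 types are paired with block scale factors in the blockscaled kernels above.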

CUTLASS C++

  • Add example 93 for the Blackwell low-latency generation-phase GQA kernel.
    • Flash Decoding with cluster reduction.
    • For kernel design details, please check the README.
  • Add Blackwell SM100 State Space Decomposition (SSD) kernel in example 112.
  • Add Hopper SM90 State Space Decomposition (SSD) kernel in example 111.
  • Add example 94 for Ada FP8xFP8 -> BF16 GEMM with blockwise dequantization of input matrices in the MMA loop with FP32 accumulation.
    • Generate additional device/kernel/threadblock files in the CUTLASS include directory that add functionality to carry the scaling tensors and use them in the MMA loop.
    • Add gemm_blockwise to the include files in default_mma_core_sm80.
  • Add Hopper e2m1 to fp32 optimized conversion and e2m1 * TF32 tensor core GEMM.
    • Set MmaType to tfloat32_t for FP32 mode.
    • TF32 provides FP32 inputs with reduced precision (19-bit vs 32-bit)
    • Set TileShapeK=64 for TF32 (K must be a multiple of 8)
    • Shuffle optimization enabled via compute_memory_reordering_atom<tfloat32_t>()
    • E2M1 -> FP32 -> TF32 TC path for mixed-precision GEMM
    • Enable example 55 with TF32 support
  • Add support for arbitrary application-provided strides for block-scale tensors.
    • Users and applications must now pass valid block-scale strides in all cases, even when the tensor is packed.
  • Support 4x blockscaled public PTX for CUDA 13.1.
  • Allow non-static TmaGbasis in AuxTmaParams.
    • Some cases in attention kernels may require a non-static tma_gbasis.
    • Relax the restriction on the TmaGbasis parameter of AuxTmaParams; users may now manually construct a dynamic gbasis.
  • Fix some kernel issues:
    • Fix an MSVC preprocessor issue.
    • Fix a self-assignment issue in the GEMV kernel.
    • Fix a TMA descriptor bug where the CUDA driver does not set the OOB address generation mode correctly.
    • Fix memory fence for clc scheduler in Blackwell SM120 pingpong kernel.
    • Fix missing SMEM alignment in Blackwell SM120 scale factors.
    • Fix a PDL issue for grouped GEMM.
    • Fix a divide-by-zero issue in canimplement for SM100 implicit GEMM kernels.
    • Fix cluster swizzle for grouped GEMMs.
      • Move host-side swizzling heuristics to device.
      • Apply swizzle per group based on problem shape and max swizzle size.
      • Improve examples and unit tests.
  • Fix some profiler issues:
    • Fix a core dump issue for nvfp4 grouped GEMM kernel.
    • Fix inconsistent GEMM verification logic.
    • Rework grouped GEMM verification logic for different types.
    • Fix an API-breaking change when using nvMatmulHeuristics.
  • Fix some broken links under media/docs.
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA toolkit version 13.1.
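
The TF32 bullets above ("19-bit vs 32-bit", FP32 inputs with reduced precision) can be made concrete: TF32 keeps FP32's sign and 8-bit exponent but only 10 explicit mantissa bits. A stdlib-only sketch that truncates (rounds toward zero) an FP32 value to TF32 precision; this illustrates the format only, and is not CUTLASS's conversion code (hardware conversion may round to nearest):

```python
import struct

def to_tf32(x: float) -> float:
    """Truncate an FP32 value to TF32 precision by zeroing the low 13 mantissa bits."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return struct.unpack("<f", struct.pack("<I", bits & ~0x1FFF))[0]

# 1 + 2**-10 fits in TF32's 10 mantissa bits; 1 + 2**-12 does not.
print(to_tf32(1.0 + 2**-10) == 1.0 + 2**-10)  # True
print(to_tf32(1.0 + 2**-12) == 1.0)           # True
```

This is why the E2M1 -> FP32 -> TF32 tensor-core path above is lossless for e2m1 inputs: every e2m1 value is exactly representable in TF32's 10 mantissa bits.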

CUTLASS 4.3.5

09 Jan 06:08
4faf1a1

CuTe DSL

  • Bug fixes and improvements
    • Fixed the unexpected CPU overhead issue introduced by 4.3.4
  • Update copyright to 2026.

CUTLASS C++

  • Update copyright to 2026.
  • Use the CUDA Runtime API to query the driver version rather than the Driver API.

CUTLASS 4.3.4

24 Dec 05:49
1810164

CuTe DSL

  • Bug fixes and improvements

    • Fixed a frame refcount issue with CUDA graphs
    • Enhanced the tvm-ffi AoT case to unload modules earlier
    • Fixed order issue in make_smem_layout_a in utils/hopper_helpers.py

CUTLASS C++

  • Work around a driver TMA descriptor related bug which occasionally causes errors on Blackwell when the tensor's backing memory allocation is less than 128KB and the tensor is not a dense, non-overlapping tensor.

CUTLASS 4.3.3

12 Dec 05:12
d55f6be

CuTe DSL

  • New features

    • Supported namedtuple and kwargs for JIT function arguments in tvm-ffi
    • Supported variadic tuples for JIT function arguments in tvm-ffi
  • Bug fixes and improvements

    • Fixed an issue with JIT function arguments that have union type annotations for tvm-ffi
    • Clearer error message for the cudaErrorInsufficientDriver runtime error

CUTLASS 4.3.2

05 Dec 18:51
5c149f5

CuTe DSL

  • New features

    • New env var CUTE_DSL_CACHE_DIR to specify the path for dumping caches
  • Bug fixes and improvements

    • Fixed an issue in the CUDA JitExecutor when unloading kernels
    • Fixed an issue with allocating maximum smem when there is statically allocated smem
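
The CUTE_DSL_CACHE_DIR variable introduced above would typically be set before the DSL is imported. A minimal sketch; only the variable name comes from the notes, while the chosen path and the set-before-import assumption are illustrative:

```python
import os
import tempfile

# Point the DSL's compilation cache at a writable location.
# Assumed usage: set this before importing cutlass so the DSL picks it up.
cache_dir = os.path.join(tempfile.gettempdir(), "cute_dsl_cache")
os.makedirs(cache_dir, exist_ok=True)
os.environ["CUTE_DSL_CACHE_DIR"] = cache_dir
```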

CUTLASS 4.3.1

02 Dec 03:22

CuTe DSL

  • New features
    • Added Blackwell SM103 support
    • Multiple dependent DSOs in the wheel have been merged into one single DSO
  • Bug fixes and improvements
    • Fixed a device reset issue with tvm-ffi
    • Fixed tvm-ffi export of compiled functions

CUTLASS C++

  • Support the blockscaled variant of ragged contiguous grouped GEMM with the new simplified MoE API in example 92.
    • The new example works for all microscaling types.

CUTLASS 4.3.0

24 Nov 22:24
e67e63c

CuTe DSL

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add softmax skip correction.
    • Fix a shared memory allocation bug: maximum dynamic shared memory must be explicitly opted into once it exceeds 48KB.
    • Fix a deadlock issue caused by an early-returning warp.
  • Add support, through command-line argument lists, for batch, no_verif, cluster_shape and cluster_shape_fallback in example 89.
  • Add a ragged contiguous grouped GEMM kernel in example 92.
    • This kernel uses a TMA 3D load to load the weights matrix and uses the tensormap update method to load activations.
  • Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
    • Performance is optimized to align with Deepseek implementation.
  • Simplification of the API for MoE GEMMs.
    • Instead of requiring users to call several cute utilities to set up the stride, the moe_stride_utils API is introduced to help set up strides in the kernel.
    • Instead of requiring users to set vectors like problem_shapes_device and problem_shapes_hosts, a new problem shape struct called MoEProblemShape is introduced, which takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
  • Enable GEMM_K = 0 in grouped GEMM.
  • Optimize grouped GEMM kernels by enabling async TMA descriptor updates.
  • Support Blackwell SM100 convolution stream-K kernel.
  • Add Blackwell SM100 sparse gemm compressor unit tests.
    • Unit tests: compressor_fp16.
    • Add sub-byte and runtime data type support in the compressor unit test testbed.
  • Add profiler support for:
    • Blackwell SM100 and SM120 blockscaled sparse kernels.
    • New MoE grouped gemm API.
    • Blackwell SM100 cpasync kernel.
  • Fix some kernel issues:
    • Fix a race check issue in Blackwell SM103 kernels by adding a missing elect-one for prefetch barrier initialization.
    • Allow users to directly specify the number of stages for the Hopper SM90 mixed input GEMM.
    • Remove warnings caused by the CUDA vector type alignment setting in CUDA 13.
    • Remove problematic cutlass::int8_t and replace it with int8_t.
    • Fix a few bugs in the distributed GEMM API and examples.
    • Fix handling of negative zero in the sparse compressor.
    • Add missing wait_on_dependent_grids for the PDL use case.
  • Fix some profiler issues:
    • Add some missing reference kernels.
    • Support VoidC reference kernels.
    • Add calculation of scale factors A and B in the function bytes_with_problem_shape of the block scaled profiler.
    • Fix an issue when the epilogue tile N is not divisible by the default subtile N.
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA toolkit version 13.0U1.
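
The MoEProblemShape simplification described above can be pictured with a toy host-side model: from max_m, max_n, max_k and a per-expert counts vector, per-group shapes are deduced, with only one problem dimension varying across experts (assumed here to be M, the token count per expert). The field names mirror the notes, but this dataclass is an illustration, not the CUTLASS struct:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MoEProblemShapeSketch:
    max_m: int
    max_n: int
    max_k: int
    counts: List[int]  # e.g. tokens routed to each expert

    def group_shapes(self) -> List[Tuple[int, int, int]]:
        """Deduce per-expert (m, n, k); only M varies across groups."""
        return [(min(c, self.max_m), self.max_n, self.max_k) for c in self.counts]

ps = MoEProblemShapeSketch(max_m=4096, max_n=2048, max_k=1024, counts=[128, 0, 3000])
print(ps.group_shapes())  # [(128, 2048, 1024), (0, 2048, 1024), (3000, 2048, 1024)]
```

Note the zero-count group: this mirrors the grouped GEMM k = 0 / empty-group support called out elsewhere in these notes.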

CUTLASS 4.2.1

24 Sep 05:23
f3fde58

CuTe DSL

  • Bug fixes and improvements
    • Fixed an issue when running DSL code with cuda-python 13.0
    • Fixed an issue when running inductor with DSL code
    • Fixed an issue with unexpected logging when running DSL code in FlashInfer
    • Fixed the issue reported in #2647
    • Fixed an issue with conditional definition of variables outside of dynamic control flow

CUTLASS C++

  • Bypass EVT for nosmem blockwise kernels on Blackwell.
  • Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.

CUTLASS 4.2.0

18 Sep 03:32

CuTe DSL

  • More Python versions are now supported for both x86-64 and aarch64, including
    • Python 3.10, 3.11, 3.12, and 3.13
  • Added new example and updated notebook to get started with CuTe DSL
  • Bug fixes and improvements
    • Fixed cute.print_tensor for coordinate tensor
    • Fixed cute.print for tuple of layouts
    • Fixed an issue where a frozen object was not properly updated after being fully assigned in dynamic control flow
    • Fixed an issue where assigning a tuple/list element in dynamic control flow could cause a compilation failure
    • Improved error message when CUDA context is not initialized
    • Improved docstring of congruent and weakly_congruent

CUTLASS C++

  • Support for Blackwell SM103 kernels for B300 GPUs.
  • Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM103 architecture.
  • Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
  • Support for Blackwell SM121 kernels for DGX Spark GPUs.
    • Share the major codes with Blackwell SM120 kernels.
  • Add support for heuristics-based kernel filtering and autotuning using nvidia-matmul-heuristics to find the best kernels for a given scenario.
  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add fused reduction kernel support for cutlass MLA.
    • Add softmax skip correction.
    • Support for GQA in FMHA backward kernel.
    • Fix an issue where get_unmasked_trip_count may return a negative value.
    • Fix an issue where mbarriers are initialized with a zero arrival count.
    • Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
    • Remove tma padding for forward kernel inputs.
  • Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
  • Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
    • On Blackwell SM120, a blockwise gemm kernel is added: example 87.
    • On Hopper, add K major scale factor support for SM90 blockwise kernels.
    • On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.
    • On Hopper, grouped version supports the case when k = 0.
  • Support for Blackwell SM100 fp4 gemv kernels.
  • Support for Blackwell SM100 legacy mixed input GEMM kernels.
  • Support for Blackwell SM100 cpasync kernel.
  • Support Blackwell SM120 mixed input blockscaled grouped GEMM.
  • Instantiating more Blackwell kernels in profiler.
    • Blackwell SM100 and SM103 kernels support CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate all possible combinations.
    • To use this feature, CUTLASS_LIBRARY_KERNELS must be non-empty. Profiler will combine CUTLASS_LIBRARY_KERNELS and CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate specific kernels.
    • For details, please check the Profiler documentation.
  • Fix some profiler issues:
    • Modify default cluster fallback values to be non-zero to avoid profiler failure when these values are not set on the command line.
    • Fix some no output and timeout issues.
    • Fix Pingpong Blockwise Hopper library generation.
  • From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
    • For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
    • For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
  • Rename legacy Python API package from cutlass to cutlass_cppgen and add Blackwell EVT support to legacy Python interface.
    • Restructuring the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's EpilogueDescriptors.
    • Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
    • Added some support for running SM100 kernels via the Python interface.
  • CuTe changes:
    • Fix an inaccurate GridDim calculation in the CuTe tutorial.
    • Add movmatrix support.
    • Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
    • Support fp16 accumulator for SM89 fp8 MMA.
    • Shorten nullspace implementation.
    • Isolate and comment on cosize hacks.
    • Important documentation correction: E<0,1> == 1@0@1.
  • Fix some kernel issues:
    • Fix the Hopper SM90 grouped GEMM kernel to only use the commit group and wait group instead of also waiting on mbarriers.
    • Fix a tiny bug when K is large for the Blackwell SM103 fp4 grouped GEMM kernel.
  • Add following unit tests:
  • Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
  • Optimal code generation with CUDA toolkit version 13.0U1.
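
The Thor SM101 to SM110 renaming noted above depends only on the CUDA toolkit version. As a tiny sketch (the helper name is hypothetical; the 13.0 cutoff is from the notes):

```python
def thor_arch_name(ctk_major: int, ctk_minor: int = 0) -> str:
    """Per the notes: Thor GPUs are SM101 for CUDA toolkit < 13.0, SM110 for >= 13.0."""
    return "SM110" if (ctk_major, ctk_minor) >= (13, 0) else "SM101"

print(thor_arch_name(12, 8))  # SM101
print(thor_arch_name(13, 0))  # SM110
```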