
Releases: NVIDIA/cutlass

CUTLASS 4.1.0

22 Jul 02:14
e51efbf

CuTe DSL

CUTLASS C++

  • Further enhance the Blackwell SM100 Attention kernels in example 77.
    • Add variable sequence length support to the FMHA backward kernel.
    • Add varlen test support to the backward runner.
    • Support empty batch sequences.
  • Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays (see the sketch after this list).
  • CuTe changes:
    • Rewrite ArithTuple and ScaledBasis for robustness and clarity.
    • Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
    • Factor out and rewrite print_latex and friends.
    • Factor out and rewrite print_svg and friends.
  • Support Blackwell SM100 SIMT packed fp32x2 kernels.
  • Support residual add for implicit GEMM kernels.
  • Various fixes for the CUTLASS C++ Python interface's EVT tracer:
    • Add a verifier for SM90 that reports invalid input.
    • When adding an edge that already exists in the graph, insert an identity compute node to avoid multiple parallel edges.
    • Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
    • Instead of raising NotImplementedError, fall back to packing all nodes into a single topological visitor node.
  • Fix profiler bugs in exhaustive perf search.
    • Fix incorrect cluster shape output during exhaustive search.
    • Fix a bug in profiler grouped GEMM when setting tile scheduler swizzles, cluster shapes, and raster orders.
  • Fix other profiler issues.
    • Complete the reference for Blackwell blockwise GEMM kernels.
    • Fix incorrect regex logic for the L1 test.
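
As a rough illustration of the cute::recast_ptr change above: instead of constructing a subbyte_iterator directly, a raw pointer is recast to the subbyte value type and handed straight to make_tensor. A minimal sketch, assuming an int4 element type and an arbitrary 8x16 shape (both illustrative, not taken from the release):

```cpp
#include <cstdint>
#include <cute/tensor.hpp>
#include <cutlass/numeric_types.h>

using namespace cute;

// Sketch only: view raw byte storage as a logical tensor of int4 values.
// recast_ptr<T> yields a logical pointer of value type T (a subbyte pointer
// when T is narrower than a byte), which make_tensor wraps directly -- the
// pattern that replaces the old subbyte_iterator-based construction.
void view_as_int4(uint8_t* raw_storage) {
  auto p = recast_ptr<cutlass::int4b_t>(raw_storage);        // logical int4 pointer
  Tensor t = make_tensor(p, make_layout(make_shape(8, 16))); // 8x16 logical view
  print(t.layout());                                         // inspect the layout
}
```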

CUTLASS 4.0.0

27 Jun 14:17
b995f93

CuTe DSL

CuTe DSL is a Python DSL centered around CuTe's abstractions.

CUTLASS C++

CUTLASS 3.9.2

04 May 04:25
ad7b2f5
  • Fixed a hang in Blockwise and Groupwise GEMM when the problem size K is 128.
  • Optimal code generation with CUDA toolkit version 12.9.

CUTLASS 3.9.1

01 May 04:29
f535c33
  • Fixed a Grouped GEMM hang issue in CUTLASS 3.x.
  • Improved Hopper Blockwise and Groupwise GEMM performance.

CUTLASS 3.9.0

25 Apr 01:53
e94e888

CUTLASS 3.8.0

21 Feb 05:32
afa1772

CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.

Note: CUTLASS 3.x builds are known to be broken on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.

CUTLASS 3.7.0

18 Jan 15:07
b78588d
  • A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensors are staged via shared memory.
  • Distributed GEMM, an experimental pipelined Tensor Parallelism implementation that uses existing CUTLASS kernels and CUDA runtime features and can hide most of the communication behind computation.
  • Improved persistent grid launch for Hopper kernels with large cluster sizes (size >= 4) using the new make_kernel_hardware_info API, as shown in example 48 and sketched after this list.
  • Enabled high precision accumulation for Hopper FP8 Sparse GEMM.
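
A minimal sketch of what the hardware-info query feeds the launcher: the SM count (and, in 3.7, cluster occupancy) lets persistent kernels size their launch grids correctly. The helper below hand-populates cutlass::KernelHardwareInfo using the long-standing multiprocessor query; the exact signature of the new make_kernel_hardware_info API is not reproduced here, so consult example 48 for it.

```cpp
#include <cutlass/kernel_hardware_info.h>

// Sketch: populate KernelHardwareInfo the classic way. CUTLASS 3.7's
// make_kernel_hardware_info (see example 48) extends this by also accounting
// for the maximum number of active clusters, which matters once the cluster
// size reaches 4 or more.
cutlass::KernelHardwareInfo query_hw_info(int device_id) {
  cutlass::KernelHardwareInfo hw_info;
  hw_info.device_id = device_id;
  hw_info.sm_count =
      cutlass::KernelHardwareInfo::query_device_multiprocessor_count(device_id);
  return hw_info;
}
```

The resulting hw_info is then passed through the kernel's Arguments struct so the persistent tile scheduler can size its grid for the device it actually runs on.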

CUTLASS 3.6.0

25 Dec 22:19
bf9da7b

CUTLASS 3.5.1

29 Aug 20:15
f7b19de

CUTLASS 3.5.0

12 Apr 01:40
7d49e6c
  • Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
    • Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
    • Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
    • Support for Fprop, Dgrad, and Wgrad algorithms.
    • CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
    • NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback is welcome on the design!
  • Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
  • Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
    • Showcases how custom kernels can be written and optimized using CUTLASS 3.x and CuTe, and the general strategy for implementing convolutions as specializations of GETTs.
    • Implementation of a coarse-grained sparse gather/scatter kernel that achieves peak performance on Ampere-class tensor cores.
  • 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
  • Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series (see the sketch after this list).
  • Extensions to CuTe to support L2 prefetching and TMA store+reductions.
  • Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
  • Fixes to greatly reduce build warnings.
  • Updates and bugfixes from the community (thanks!)
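
In the spirit of the overhauled cute::Tensor<> documentation mentioned above, a minimal host-side sketch of the core idea: compose a shape and a stride into a layout, then view plain memory through it. Shapes, strides, and values here are illustrative only.

```cpp
#include <vector>
#include <cute/tensor.hpp>

using namespace cute;

int main() {
  // A 4x6 column-major view over plain host memory: element (m,n) lives at
  // linear offset m*1 + n*4, which is exactly what the (1,4) stride encodes.
  std::vector<float> data(4 * 6, 0.0f);
  Tensor A = make_tensor(data.data(),
                         make_layout(make_shape(4, 6), make_stride(1, 4)));

  A(2, 3) = 1.0f;     // logical coordinate (2,3) -> linear offset 2 + 3*4 = 14
  print(A.layout());  // prints (4,6):(1,4)
  return 0;
}
```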