CuTe DSL
New features:
Added support for Apache TVM-FFI, further reducing host runtime overhead for JIT functions and improving interoperability with PyTorch and other ML frameworks.
Added fake tensors and streams to decouple JIT function compilation from the "from_dlpack" flow; users no longer need real tensors to compile a JIT function.
Added FastDivmodDivisor with Python operator overloads, new APIs, CuTe dialect integration, and optimized static tile scheduler performance for faster index mapping.
Added L2 cache eviction priority for TMA-related ops, giving users fine-grained L2 cache control.
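The fast-divmod technique behind FastDivmodDivisor can be sketched in plain Python. This is a conceptual illustration of the standard multiply-and-shift trick for dividing by a runtime-invariant divisor (precompute a magic multiplier once, then replace each division with a multiply and a shift); it is not the DSL API itself, and `make_fast_divmod` is a hypothetical name:

```python
def make_fast_divmod(d):
    """Precompute a multiply-and-shift replacement for divmod(x, d).

    Correct for all 0 <= x < 2**32 and divisors d >= 1; tile schedulers
    use this pattern so index mapping avoids hardware integer division.
    """
    assert d > 0
    if d == 1:
        return lambda x: (x, 0)
    shift = (d - 1).bit_length()                  # smallest s with 2**s >= d
    mult = ((1 << (32 + shift)) + d - 1) // d     # ceil(2**(32+s) / d)

    def fast_divmod(x):
        q = (x * mult) >> (32 + shift)            # quotient via multiply+shift
        return q, x - q * d                       # remainder from the quotient
    return fast_divmod
```

For example, `make_fast_divmod(7)(100)` returns `(14, 2)`, matching `divmod(100, 7)`.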
Debuggability improvements:
Added source location tracking for DSL APIs, allowing tools such as Nsight profilers to correlate performance metrics with Python source code.
Demonstrated usage of the new pipeline APIs PipelineProducer and PipelineConsumer, which simplify code by removing explicit pipeline state management (existing APIs are still maintained).
Separated epilogue code for the non-TMA and TMA implementations.
Fixed TensorSSA.__getitem__ indexing to match CuTe's indexing convention.
Fixed an issue with cutlass.max and cutlass.min
Fixed an issue with mark_compact_shape_dynamic
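The producer/consumer pipeline pattern mentioned above can be illustrated conceptually in plain Python. This is not the CuTe DSL API: a bounded queue plays the role that PipelineProducer/PipelineConsumer play on-device, so neither side tracks stage indices or phase bits explicitly:

```python
import queue
import threading

def run_pipeline(data, num_stages=4):
    """Conceptual sketch: the bounded queue hides all pipeline state.

    The producer blocks when all stages are full (acquire), the consumer
    blocks when no stage is ready (wait) -- no explicit stage/phase
    bookkeeping on either side.
    """
    buf = queue.Queue(maxsize=num_stages)  # bounded buffer = pipeline stages
    results = []

    def producer():
        for item in data:
            buf.put(item)                  # blocks until a stage is free
        buf.put(None)                      # sentinel: no more work

    def consumer():
        while (item := buf.get()) is not None:  # blocks until a stage is ready
            results.append(item * 2)       # stand-in for the compute stage

    t = threading.Thread(target=producer)
    t.start()
    consumer()
    t.join()
    return results
```

For example, `run_pipeline([1, 2, 3])` returns `[2, 4, 6]`.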
CUTLASS C++
Further enhance Blackwell SM100 Attention kernels in example 77.
Add softmax skip correction.
Fix a shared memory allocation bug: maximum dynamic shared memory must be opted into explicitly once usage exceeds 48 KB.
Fix a hang caused by a warp returning early.
Add command-line arguments for batch, no_verif, cluster_shape, and cluster_shape_fallback in example 89.
Add a Ragged Contiguous Grouped GEMM kernel in example 92.
This kernel uses a TMA 3D load for the weights matrix and the tensormap update method to load activations.
Add 256x128 tile size support for Hopper SM90 deepgemm in example 67.
Performance is optimized to align with the DeepSeek implementation.
Simplify the API for MoE GEMMs.
Instead of requiring users to call several CuTe utilities to set up strides, a new moe_stride_utils API helps set up strides in the kernel.
Instead of requiring users to set vectors such as problem_shapes_device and problem_shapes_host, a new problem shape struct called MoEProblemShape takes max_m, max_n, max_k and a counts vector as input and deduces problem shapes internally whenever required.
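The shape deduction idea can be sketched in Python. This is a hypothetical mirror of the concept, not the actual C++ struct: N and K are shared across expert groups, per-group M comes from the counts vector, and shapes are deduced on demand instead of being materialized into device/host vectors:

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MoEProblemShape:
    """Hypothetical Python sketch of the MoEProblemShape concept.

    max_m bounds per-group M (e.g. for workspace sizing); max_n and
    max_k are the shared N and K; counts[g] is the number of tokens
    routed to expert group g.
    """
    max_m: int
    max_n: int
    max_k: int
    counts: List[int]

    def problem_shape(self, group: int) -> Tuple[int, int, int]:
        # Deduce the GEMM shape for one group on demand, rather than
        # requiring precomputed problem-shape vectors from the user.
        return (self.counts[group], self.max_n, self.max_k)
```

For example, with counts `[128, 0, 300]`, group 1 deduces an M of 0 (an expert that received no tokens), which pairs naturally with degenerate-shape support in grouped GEMM.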
Enable GEMM_K = 0 in grouped GEMM.
Optimize grouped GEMM kernels by enabling async TMA descriptor updates.
Support Blackwell SM100 convolution stream-K kernel.