A set of unit tests that demonstrate the usage of sparse and block-scaled sparse Blackwell SM100 GEMMs.
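These sparse kernels consume NVIDIA's 2:4 structured-sparsity format, in which each group of four consecutive values along the K dimension keeps at most two nonzeros plus 2-bit position metadata. A minimal pure-Python sketch of the compress/decompress round trip (illustrative only; the real kernels pack values and metadata into hardware-specific layouts):

```python
def compress_2to4(row):
    """Compress a row with 2:4 structured sparsity.

    For each group of 4 values, keep the 2 largest-magnitude entries
    (pruning the rest to zero) and record their positions as metadata.
    """
    assert len(row) % 4 == 0
    values, metadata = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        # Positions of the two largest-magnitude elements, in ascending order.
        keep = sorted(sorted(range(4), key=lambda i: -abs(group[i]))[:2])
        values.extend(group[i] for i in keep)
        metadata.append(tuple(keep))
    return values, metadata


def decompress_2to4(values, metadata, width):
    """Expand (values, metadata) back to a dense row (pruned slots are zero)."""
    row = [0.0] * width
    it = iter(values)
    for g, keep in enumerate(metadata):
        for i in keep:
            row[4 * g + i] = next(it)
    return row


row = [0.0, 3.0, 0.0, -1.0, 2.0, 0.0, 0.0, 5.0]
vals, meta = compress_2to4(row)
dense = decompress_2to4(vals, meta, len(row))
print(vals)   # [3.0, -1.0, 2.0, 5.0]
print(dense)  # round-trips exactly because the row is already 2:4 sparse
```

The compressed operand stores only half the values plus small metadata, which is what lets the sparse tensor cores skip the pruned multiplications.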
A new Multi-head Latent Attention (MLA) CUTLASS example for the SM100 Blackwell architecture covers the FlashMLA-like weight-absorbed decoding use case.
A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS FMHA example to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
Added support for enhanced kernel performance search (auto-tuning) in the CUTLASS profiler:
- Sorting performance results by GFLOPs/second: users can now sort the final performance report by GFLOPs/second, making it easier to identify the most efficient kernels.
- Exhaustive search for the best kernel performance in GFLOPs/second: the profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
- Performance search under a fixed GEMM shape: enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.

More detailed documentation and examples for leveraging this feature can be found in profiler.md.
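The GFLOPs/second figure the profiler sorts on follows from the standard GEMM operation count of 2·M·N·K floating-point operations (one multiply and one add per inner-product term). A small sketch of computing and ranking by that metric, using hypothetical timings and configuration names (not actual profiler output):

```python
def gemm_gflops_per_sec(m, n, k, runtime_ms):
    """A GEMM performs 2*M*N*K flops; divide by runtime to get GFLOP/s."""
    return 2.0 * m * n * k / (runtime_ms * 1e-3) / 1e9


# Hypothetical timings for one problem size under different swizzle /
# rasterization / cluster settings (names and numbers are illustrative).
results = [
    {"config": "swizzle=1, raster=AlongM, cluster=1x1", "runtime_ms": 2.10},
    {"config": "swizzle=4, raster=AlongN, cluster=2x1", "runtime_ms": 1.45},
    {"config": "swizzle=2, raster=AlongM, cluster=2x2", "runtime_ms": 1.62},
]
m = n = k = 4096
for r in results:
    r["gflops"] = gemm_gflops_per_sec(m, n, k, r["runtime_ms"])

# Sort descending by GFLOP/s, as the profiler's sorted report does.
best_first = sorted(results, key=lambda r: r["gflops"], reverse=True)
for r in best_first:
    print(f'{r["gflops"]:9.1f} GFLOP/s  {r["config"]}')
```

Because the flop count is fixed for a given M, N, K, ranking by GFLOP/s is equivalent to ranking by inverse runtime within one problem size; across different problem sizes it normalizes for the work performed.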
Support for void as the D element in SM100 kernel epilogues.
This discussion was created from the release CUTLASS 3.9.0.