CUTLASS 4.1.0

@hwu36 hwu36 released this 28 Jul 03:57
e51efbf

CuTe DSL

CUTLASS C++

  • Further enhance Blackwell SM100 Attention kernels in example 77.
    • Add variable sequence length support for FMHA Backward kernel.
    • Add varlen test support to Backward runner.
    • Support empty batch sequences.
  • Replace subbyte_iterator with cute::recast_ptr when constructing logical iterators/arrays.
  • CuTe changes:
    • Rewrite ArithTuple and ScaledBasis for robustness and clarity.
    • Remove buggy and kludgy get_layoutA|B|C_MN and friends from Atoms/TiledX.
    • Factor out print_latex and friends and rewrite.
    • Factor out print_svg and friends and rewrite.
  • Support Blackwell SM100 SIMT packed fp32x2 kernels.
  • Support residual add for implicit gemm kernels.
  • Various fixes for CUTLASS C++ Python interface's EVT tracer:
    • Add a verifier for SM90 that reports invalid input.
    • When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
    • Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
    • Replace the NotImplemented error with a fallback that packs all nodes into a single topological visitor node.
  • Fix profiler bugs in exhaustive perf search.
    • Fix incorrect cluster shape output during exhaustive search.
    • Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
  • Fix other profiler issues.
    • Complete the reference for Blackwell blockwise gemm kernels.
    • Fix incorrect regex logic for L1 test.
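
The EVT tracer fix for duplicate edges (insert an identity compute node rather than add a second parallel edge) can be illustrated with a minimal sketch. This is not the CUTLASS implementation: `TinyDag`, its methods, and the `identity_N` naming are hypothetical, and the graph is assumed to be a simple digraph that stores at most one edge per (source, destination) pair.

```python
class TinyDag:
    """Hypothetical simple digraph: at most one edge per (src, dst) pair."""

    def __init__(self):
        self.nodes = {}            # node name -> operation label
        self.edges = set()         # set of (src, dst) pairs
        self._identity_count = 0   # used to generate unique identity names

    def add_node(self, name, op):
        self.nodes[name] = op

    def add_edge(self, src, dst):
        """Add src -> dst; return the name of an inserted identity node, if any."""
        if (src, dst) not in self.edges:
            self.edges.add((src, dst))
            return None
        # The edge already exists, so route this second use through a fresh
        # identity node: src -> identity_N -> dst. No pair is stored twice.
        self._identity_count += 1
        ident = f"identity_{self._identity_count}"
        self.add_node(ident, "identity")
        self.edges.add((src, ident))
        self.edges.add((ident, dst))
        return ident


# Example: x * x uses the same source for both operands of one compute node.
dag = TinyDag()
dag.add_node("x", "load")
dag.add_node("mul", "multiply")
dag.add_edge("x", "mul")             # first operand
inserted = dag.add_edge("x", "mul")  # second operand would be a parallel edge
print(inserted)                      # identity_1
print(sorted(dag.edges))
```

Both operands still reach `mul`, one directly and one through the identity node, so the graph keeps a single edge per node pair while preserving the data flow.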