[BUG] [Blackwell] Enable FP4/tcgen05 support for sm_121 (DGX Spark) in CuTe DSL #2947

@Teora

Which component has the problem?

CuTe DSL

Bug Report

Describe the bug
The CuTe DSL (within CUTLASS) hard-restricts tcgen05 (Blackwell Tensor Core) operations to specific architectures such as sm_100a and sm_103a. This prevents NVIDIA DGX Spark users, whose hardware identifies as sm_121, from using FP4 Tensor Core features. Although the hardware is officially advertised at 1 PFLOPS of FP4 performance, the DSL rejects these operations because sm_121 is missing from its per-operation architecture allowlists.

Steps/Code to reproduce bug
Attempt to compile or run any CuTe-based kernel utilizing tcgen05 operations on an NVIDIA DGX Spark (GB10) system. The following operations are confirmed to be blocked:

  1. MmaMXF4NVF4Op (FP4 MMA):
    • File: cutlass/cute/nvgpu/tcgen05/mma.py
    • Error: expects arch to be one of [Arch.sm_100a, Arch.sm_103a], but got Arch.sm_121
  2. _S2TCopyBase (SMEM → TMEM Copy):
    • File: cutlass/cute/nvgpu/tcgen05/copy.py
  3. Ld32x32bOp & other TMEM operations:
    • File: cutlass/cute/nvgpu/tcgen05/tmem.py

The current "whack-a-mole" workaround requires manually patching each file:

# cutlass/cute/nvgpu/tcgen05/mma.py (example patch)
admissible_archs = [
    Arch.sm_100a,
    Arch.sm_103a,
    # Manually added entries for DGX Spark:
    Arch.sm_121,
    Arch.sm_121a,
    Arch.sm_121f,
]
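
One way to avoid patching each file separately would be a single shared allowlist that every tcgen05 operation checks against. The sketch below is illustrative only: the `Arch` enum, `TCGEN05_ADMISSIBLE_ARCHS`, and `check_arch` are hypothetical stand-ins, not the actual CuTe DSL classes or helpers, and the error text merely mirrors the message quoted above. It shows how the software check could be consolidated; it does not by itself establish that the hardware executes these instructions.

```python
from enum import Enum


class Arch(Enum):
    """Hypothetical stand-in for the DSL's architecture enum."""
    sm_100a = "sm_100a"
    sm_103a = "sm_103a"
    sm_121 = "sm_121"
    sm_121a = "sm_121a"
    sm_121f = "sm_121f"


# Hypothetical single source of truth, shared by mma.py, copy.py and tmem.py.
TCGEN05_ADMISSIBLE_ARCHS = (
    Arch.sm_100a,
    Arch.sm_103a,
    Arch.sm_121,
    Arch.sm_121a,
    Arch.sm_121f,
)


def check_arch(op_name, arch, admissible=TCGEN05_ADMISSIBLE_ARCHS):
    """Raise ValueError if `arch` is not in the allowlist used by `op_name`."""
    if arch not in admissible:
        names = ", ".join(str(a) for a in admissible)
        raise ValueError(
            f"{op_name} expects arch to be one of [{names}], but got {arch}"
        )


# Every operation named in the report consults the same list.
for op in ("MmaMXF4NVF4Op", "_S2TCopyBase", "Ld32x32bOp"):
    check_arch(op, Arch.sm_121)
```

With a shared list, supporting a new architecture is a one-line change rather than one edit per operation file.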

Expected behavior
The CuTe DSL should recognize sm_121 (and its variants sm_121a, sm_121f) as valid Blackwell architectures capable of executing tcgen05 and FP4 operations, consistent with NVIDIA's official hardware specifications and marketing for DGX Spark.

Environment details (please complete the following information):

  • Environment location: Bare-metal (NVIDIA DGX Spark)
  • GPU: NVIDIA GB10 (Blackwell)
  • Compute Capability: 12.1 (sm_121)
  • CUDA Version: 13.0
  • CUTLASS/CuTe Version: v4.3.5 (or latest)
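
To confirm which architecture name a given device reports, the compute capability can be mapped to the `sm_XY` form. This is a minimal sketch: `arch_name` and `detect_arch` are hypothetical helpers, and the device probe assumes PyTorch is installed, which this report does not state.

```python
def arch_name(major: int, minor: int) -> str:
    """Map a compute capability (major, minor) to its sm_XY base name."""
    return f"sm_{major}{minor}"


def detect_arch(device_index: int = 0) -> str:
    """Query the device's compute capability and return its sm_XY name.

    Assumption: PyTorch with CUDA support is available. The import is
    deferred so that arch_name stays usable without it.
    """
    import torch

    major, minor = torch.cuda.get_device_capability(device_index)
    return arch_name(major, minor)
```

On a system with compute capability 12.1, `detect_arch()` would return the base name `sm_121`; the `a` and `f` suffixes are compilation targets and are not reported by this query.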

Additional context
This is a critical blocker for FP4 adoption on the DGX Spark platform. NVIDIA's official documentation highlights "1 PFLOPS @ FP4" as a key selling point for this system, but the feature is currently inaccessible via CuTe due to this software-level restriction.
