Description
Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
The CuTe DSL (within CUTLASS) hard-restricts tcgen05 (Blackwell Tensor Core) operations to specific architectures such as sm_100a and sm_103a. This prevents NVIDIA DGX Spark users, whose hardware identifies as sm_121, from using FP4 Tensor Core features. Despite the hardware's official support for 1 PFLOPS of FP4 performance, the software library blocks these operations due to a missing architecture whitelist entry in the DSL.
Steps/Code to reproduce bug
Attempt to compile or run any CuTe-based kernel utilizing tcgen05 operations on an NVIDIA DGX Spark (GB10) system. The following operations are confirmed to be blocked:
- MmaMXF4NVF4Op (FP4 MMA)
  - File: cutlass/cute/nvgpu/tcgen05/mma.py
  - Error: expects arch to be one of [Arch.sm_100a, Arch.sm_103a], but got Arch.sm_121
- _S2TCopyBase (SMEM → TMEM copy)
  - File: cutlass/cute/nvgpu/tcgen05/copy.py
- Ld32x32bOp and other TMEM operations
  - File: cutlass/cute/nvgpu/tcgen05/tmem.py
The current "whack-a-mole" workaround requires manually patching each affected file:
```python
# cutlass/cute/nvgpu/tcgen05/mma.py (example patch)
admissible_archs = [
    Arch.sm_100a,
    Arch.sm_103a,
    Arch.sm_121,   # manually adding this
    Arch.sm_121a,
    Arch.sm_121f,
]
```
Expected behavior
The CuTe DSL should recognize sm_121 (and its variants sm_121a, sm_121f) as valid Blackwell architectures capable of executing tcgen05 and FP4 operations, consistent with NVIDIA's official hardware specifications and marketing for DGX Spark.
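To make the failure mode concrete, the whitelist gate and the effect of the requested change can be sketched in isolation. This is a self-contained illustration with stand-in names: the `Arch` enum, `OpError`, and `check_arch` below are simplified hypotheticals, not the actual CuTe DSL types (the real check is embedded in the `tcgen05` op definitions listed above).

```python
from enum import Enum

class Arch(Enum):
    # Stand-in for the DSL's architecture enum; values abbreviated.
    sm_100a = "sm_100a"
    sm_103a = "sm_103a"
    sm_121 = "sm_121"

class OpError(Exception):
    """Stand-in for cutlass.cute.nvgpu.common.OpError."""

def check_arch(arch, admissible_archs):
    # Mimics the DSL's gate: reject any arch not on the whitelist.
    if arch not in admissible_archs:
        names = [a.value for a in admissible_archs]
        raise OpError(f"expects arch to be one of {names}, but got {arch.value}")

admissible = [Arch.sm_100a, Arch.sm_103a]

try:
    check_arch(Arch.sm_121, admissible)   # rejected today on DGX Spark
except OpError as e:
    print(e)

admissible.append(Arch.sm_121)            # the requested fix, in effect
check_arch(Arch.sm_121, admissible)       # now passes without error
print("sm_121 accepted")
```

The point of the sketch: nothing about the hardware changes between the two calls; only the whitelist does, which is why recognizing sm_121 (and sm_121a/sm_121f) in the DSL is sufficient to unblock these operations.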
Environment details (please complete the following information):
- Environment location: Bare-metal (NVIDIA DGX Spark)
- GPU: NVIDIA GB10 (Blackwell)
- Compute Capability: sm_121
- CUDA Version: 13.0
- CUTLASS/CuTe Version: v4.3.5 (or latest)
Additional context
This is a critical blocker for FP4 adoption on the DGX Spark platform. NVIDIA's official documentation highlights "1 PFLOPS @ FP4" as a key selling point for this system, but the feature is currently inaccessible via CuTe due to this software-level restriction.
References:
- NVIDIA Blog: How DGX Spark's Performance Enables Intensive AI Tasks
- NVIDIA DGX Spark Datasheet (PDF)
- Related Issues:
  - [BUG] cutlass.cute.nvgpu.common.OpError: OpError: expects arch to be one of ['sm_100a', 'sm_100f'], but got sm_121a #2802
  - [BUG] [Python DSL] BlockScaledMmaOp restricts FP4 operations to sm_100a only, blocks sm_120/sm_121 #2800
  - [FEA] [DSL] Request to support sm_121 #2614