[BUG] [Blackwell] Enable FP4/tcgen05 support for sm_121 (DGX Spark) in CuTe DSL

### Which component has the problem?

CuTe DSL

### Bug Report

**Describe the bug**
The CuTe DSL (part of CUTLASS) hard-restricts `tcgen05` (Blackwell Tensor Core) operations to an allowlist of architectures such as `sm_100a` and `sm_103a`. As a result, **NVIDIA DGX Spark** users, whose hardware identifies as `sm_121`, cannot use FP4 Tensor Core features through the DSL. NVIDIA advertises **1 PFLOPS of FP4 performance** for this hardware, yet the library rejects these operations because `sm_121` is missing from the DSL's architecture allowlist.

**Steps/Code to reproduce bug**
Compile or run any CuTe DSL kernel that uses `tcgen05` operations on an NVIDIA DGX Spark (GB10) system. The following operations are confirmed to be blocked:

1. **`MmaMXF4NVF4Op` (FP4 MMA):** 
   - File: `cutlass/cute/nvgpu/tcgen05/mma.py`
   - Error: `expects arch to be one of [Arch.sm_100a, Arch.sm_103a], but got Arch.sm_121`
2. **`_S2TCopyBase` (SMEM → TMEM Copy):** 
   - File: `cutlass/cute/nvgpu/tcgen05/copy.py`
3. **`Ld32x32bOp` & other TMEM operations:** 
   - File: `cutlass/cute/nvgpu/tcgen05/tmem.py`
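
The error text above is consistent with a per-operation architecture allowlist check. Below is a minimal, self-contained sketch of that pattern, not the actual CUTLASS code: the `Arch` enum, `ADMISSIBLE_ARCHS`, and `check_arch` are hypothetical stand-ins, and only the error wording is taken from this report.

```python
from enum import Enum


class Arch(Enum):
    """Hypothetical stand-in for the DSL's architecture enum."""
    sm_100a = "sm_100a"
    sm_103a = "sm_103a"
    sm_121 = "sm_121"

    def __repr__(self) -> str:
        # Render members as "Arch.sm_121" to match the reported message.
        return f"Arch.{self.name}"


# Hypothetical per-operation allowlist, mirroring the one in the error.
ADMISSIBLE_ARCHS = [Arch.sm_100a, Arch.sm_103a]


def check_arch(op_name: str, arch: Arch) -> None:
    """Raise ValueError if `arch` is not in the operation's allowlist."""
    if arch not in ADMISSIBLE_ARCHS:
        raise ValueError(
            f"{op_name} expects arch to be one of {ADMISSIBLE_ARCHS}, "
            f"but got {arch!r}"
        )
```

Under this assumption, any operation that carries its own copy of the list fails in the same way on `sm_121`, which would explain why the failure reappears in `mma.py`, `copy.py`, and `tmem.py`.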

The current "whack-a-mole" workaround requires manually patching the allowlist in each affected file:
```python
# cutlass/cute/nvgpu/tcgen05/mma.py (Example patch)
admissible_archs = [
    Arch.sm_100a,
    Arch.sm_103a,
    Arch.sm_121,   # Manually added, together with the two variants below
    Arch.sm_121a,
    Arch.sm_121f,
]
```
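
To find every place that needs this patch without trial and error, a small scan of the installed package can help. The sketch below assumes the other files use the same `admissible_archs` name as the `mma.py` patch above, which may not hold everywhere; `find_arch_allowlists` is a hypothetical helper, not part of CUTLASS.

```python
from pathlib import Path


def find_arch_allowlists(root: str, needle: str = "admissible_archs"):
    """Return (path, line_number, line_text) for each line containing `needle`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # Unreadable file: skip it rather than abort the scan.
        for lineno, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Call it with the directory that contains the `tcgen05` modules in your installation and print the returned tuples to get a checklist of files and line numbers to review.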

**Expected behavior**
The CuTe DSL should recognize `sm_121` (and its variants `sm_121a`, `sm_121f`) as valid Blackwell architectures capable of executing `tcgen05` and FP4 operations, consistent with NVIDIA's official hardware specifications and marketing for DGX Spark.
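
One way to get this behavior without per-file patches is a single shared allowlist that every `tcgen05` operation consults. The sketch below is illustrative only: `Arch`, `TCGEN05_ARCHS`, and `require_tcgen05` are hypothetical names, and listing the `sm_121` variants reflects this report's request rather than a confirmed statement about which instructions the hardware implements.

```python
from enum import Enum


class Arch(Enum):
    """Hypothetical stand-in for the DSL's architecture enum."""
    sm_90a = "sm_90a"
    sm_100a = "sm_100a"
    sm_103a = "sm_103a"
    sm_121 = "sm_121"
    sm_121a = "sm_121a"
    sm_121f = "sm_121f"


# One shared allowlist (hypothetical). The sm_121 entries are the addition
# requested in this report, not a verified capability list.
TCGEN05_ARCHS = frozenset({
    Arch.sm_100a,
    Arch.sm_103a,
    Arch.sm_121,
    Arch.sm_121a,
    Arch.sm_121f,
})


def require_tcgen05(op_name: str, arch: Arch) -> None:
    """Single gate that every tcgen05 operation would call."""
    if arch not in TCGEN05_ARCHS:
        allowed = sorted(a.name for a in TCGEN05_ARCHS)
        raise ValueError(
            f"{op_name} is not supported on {arch.name}; allowed: {allowed}"
        )


# The three operations named in this report share the same gate.
for op in ("MmaMXF4NVF4Op", "_S2TCopyBase", "Ld32x32bOp"):
    require_tcgen05(op, Arch.sm_121)
```

With one list, enabling or disabling an architecture is a one-line change that applies to MMA, copy, and TMEM operations alike.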

**Environment details (please complete the following information):**
- **Environment location:** Bare-metal (NVIDIA DGX Spark)
- **GPU:** NVIDIA GB10 (Blackwell)
- **Compute Capability:** `sm_121`
- **CUDA Version:** 13.0
- **CUTLASS/CuTe Version:** v4.3.5 (or latest)

**Additional context**
This is a critical blocker for FP4 adoption on the DGX Spark platform. NVIDIA's official documentation highlights **"1 PFLOPS @ FP4"** as a key selling point for this system, but the feature is currently inaccessible via CuTe due to this software-level restriction.

**References:**
- [NVIDIA Blog: How DGX Spark's Performance Enables Intensive AI Tasks](https://developer.nvidia.com/blog/how-nvidia-dgx-sparks-performance-enables-intensive-ai-tasks/)
- [NVIDIA DGX Spark Datasheet (PDF)](https://www.connection.com/media/5woftgoz/nvidia-dgx-spark-datasheet.pdf)
- **Related Issues:** #2802, #2800, #2614 

[BUG] [Blackwell] Enable FP4/tcgen05 support for sm_121 (DGX Spark) in CuTe DSL #2947