-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Open
Labels
Description
Which component has the problem?
CuTe DSL
Bug Report
Describe the bug
Out-of-bounds TMA memory access in the CuTe tutorial example wgmma_tma_sm90.cu during pipeline drain phase. The main loop issues new TMA copies even when k_tile_count <= 0, potentially reading beyond the valid tile range.
Steps/Code to reproduce bug
File: cutlass/examples/cute/tutorial/hopper/wgmma_tma_sm90.cu
Lines 239-249:
if ((warp_idx == 0) && lane_predicate)
{
int pipe = write_state.index();
ConsumerBarType::wait(&consumer_mbar[pipe], write_state.phase());
ProducerBarType::arrive_and_expect_tx(&producer_mbar[pipe], tma_transaction_bytes);
copy(tma_a.with(producer_mbar[pipe]), tAgA(_,k_tile), tAsA(_,pipe)); // ← Out-of-bounds when k_tile >= size<1>(tAgA)
copy(tma_b.with(producer_mbar[pipe]), tBgB(_,k_tile), tBsB(_,pipe)); // ← Out-of-bounds when k_tile >= size<1>(tBgB)
++write_state;
}
--k_tile_count;
++k_tile;
Issue Flow:
1. Line 145: k_tile_count = size<1>(tAgA) (e.g., 10 tiles)
2. Lines 168-180: Prefetch loop decrements k_tile_count by K_PIPE_MAX (e.g., k_tile_count becomes 10 - 3 = 7)
3. Line 221: Main loop continues while k_tile_count > -K_PIPE_MAX (continues for drain phase)
4. Bug: When k_tile_count <= 0, the code is draining the pipeline but still issues new TMA copies at lines 246-247
5. Line 251: ++k_tile continues incrementing, causing k_tile >= size<1>(tAgA)
6. Result: tAgA(_,k_tile) and tBgB(_,k_tile) access out-of-bounds memory
To reproduce:
cd /path/to/cutlass/build
cmake .. -DCUTLASS_NVCC_ARCHS_ENABLED="90a" -DCUTLASS_ENABLE_EXAMPLES=ON
make cute_tutorial_wgmma_tma_sm90
./examples/cute/tutorial/hopper/cute_tutorial_wgmma_tma_sm90
The program may appear to run successfully due to lucky memory access, but it's reading invalid data during the last iterations.
Expected behavior
The TMA copy operations should only be issued when there are valid tiles remaining to fetch. During the pipeline drain phase (k_tile_count <= 0), no new copies should be initiated.
Proposed fix
Add a guard condition to check k_tile_count > 0 before issuing TMA copies:
// Only issue new TMA copies if there are more tiles to fetch
if ((warp_idx == 0) && lane_predicate && (k_tile_count > 0))
{
int pipe = write_state.index();
ConsumerBarType::wait(&consumer_mbar[pipe], write_state.phase());
ProducerBarType::arrive_and_expect_tx(&producer_mbar[pipe], tma_transaction_bytes);
copy(tma_a.with(producer_mbar[pipe]), tAgA(_,k_tile), tAsA(_,pipe));
copy(tma_b.with(producer_mbar[pipe]), tBgB(_,k_tile), tBsB(_,pipe));
++write_state;
}
Environment details
- Environment location: Bare-metal
- GPU: NVIDIA H200 (SM90a)
- CUDA Version: 12.9.86
- CUTLASS Version: 4.3.5
- OS: Linux (Ubuntu)
- Compiler: GCC 13.3.0
Additional context