[BUG] [CuTe] Out-of-bounds TMA access in wgmma_tma_sm90.cu during pipeline drain

### Which component has the problem?

CuTe DSL

### Bug Report

## Describe the bug

  Out-of-bounds TMA memory access in the CuTe tutorial example `wgmma_tma_sm90.cu` during pipeline drain phase. The main loop issues new TMA copies even when `k_tile_count <= 0`, potentially reading beyond the valid tile range.

  ## Steps/Code to reproduce bug

  **File:** `cutlass/examples/cute/tutorial/hopper/wgmma_tma_sm90.cu`

  **Lines 239-249:**
  ```cuda
  if ((warp_idx == 0) && lane_predicate)
  {
    int pipe = write_state.index();
    ConsumerBarType::wait(&consumer_mbar[pipe], write_state.phase());
    ProducerBarType::arrive_and_expect_tx(&producer_mbar[pipe], tma_transaction_bytes);
    copy(tma_a.with(producer_mbar[pipe]), tAgA(_,k_tile), tAsA(_,pipe));  // ← Out-of-bounds when k_tile >= size<1>(tAgA)
    copy(tma_b.with(producer_mbar[pipe]), tBgB(_,k_tile), tBsB(_,pipe));  // ← Out-of-bounds when k_tile >= size<1>(tBgB)
    ++write_state;
  }
  --k_tile_count;
  ++k_tile;

  Issue Flow:
  1. Line 145: k_tile_count = size<1>(tAgA) (e.g., 10 tiles)
  2. Lines 168-180: Prefetch loop decrements k_tile_count by K_PIPE_MAX (e.g., k_tile_count becomes 10 - 3 = 7)
  3. Line 221: Main loop continues while k_tile_count > -K_PIPE_MAX (continues for drain phase)
  4. Bug: When k_tile_count <= 0, the code is draining the pipeline but still issues new TMA copies at lines 246-247
  5. Line 251: ++k_tile continues incrementing, causing k_tile >= size<1>(tAgA)
  6. Result: tAgA(_,k_tile) and tBgB(_,k_tile) access out-of-bounds memory

  To reproduce:
  cd /path/to/cutlass/build
  cmake .. -DCUTLASS_NVCC_ARCHS_ENABLED="90a" -DCUTLASS_ENABLE_EXAMPLES=ON
  make cute_tutorial_wgmma_tma_sm90
  ./examples/cute/tutorial/hopper/cute_tutorial_wgmma_tma_sm90

  The program may appear to run successfully due to lucky memory access, but it's reading invalid data during the last iterations.

  Expected behavior

  The TMA copy operations should only be issued when there are valid tiles remaining to fetch. During the pipeline drain phase (k_tile_count <= 0), no new copies should be initiated.

  Proposed fix

  Add a guard condition to check k_tile_count > 0 before issuing TMA copies:

  // Only issue new TMA copies if there are more tiles to fetch
  if ((warp_idx == 0) && lane_predicate && (k_tile_count > 0))
  {
    int pipe = write_state.index();
    ConsumerBarType::wait(&consumer_mbar[pipe], write_state.phase());
    ProducerBarType::arrive_and_expect_tx(&producer_mbar[pipe], tma_transaction_bytes);
    copy(tma_a.with(producer_mbar[pipe]), tAgA(_,k_tile), tAsA(_,pipe));
    copy(tma_b.with(producer_mbar[pipe]), tBgB(_,k_tile), tBsB(_,pipe));
    ++write_state;
  }

  Environment details

  - Environment location: Bare-metal
  - GPU: NVIDIA H200 (SM90a)
  - CUDA Version: 12.9.86
  - CUTLASS Version: 4.3.5
  - OS: Linux (Ubuntu)
  - Compiler: GCC 13.3.0

  Additional context


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[BUG] [CuTe] Out-of-bounds TMA access in wgmma_tma_sm90.cu during pipeline drain #2944

Which component has the problem?

Bug Report

Describe the bug

Steps/Code to reproduce bug

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[BUG] [CuTe] Out-of-bounds TMA access in wgmma_tma_sm90.cu during pipeline drain #2944

Description

Which component has the problem?

Bug Report

Describe the bug

Steps/Code to reproduce bug

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions