Skip to content

Conversation

@Johnsonms
Copy link
Contributor

@Johnsonms Johnsonms commented Jan 10, 2026

Add k_tile_count guard to prevent TMA copies during pipeline drain #2944


Summary

This PR fixes an out-of-bounds memory access bug in wgmma_tma_sm90.cu
that occurred during the pipeline drain phase.

The main loop continued issuing TMA copy operations even after
k_tile_count <= 0, leading to invalid accesses to the tAgA and tBgB
tensors once k_tile advanced beyond the valid tile range.

The fix adds a guard to ensure that new TMA copies are only issued when
valid tiles remain. During the drain phase, the loop now correctly
consumes only pre-fetched data already present in the pipeline.


Problem

  • During the pipeline drain phase, k_tile continues to increment

  • TMA copy operations were still issued when k_tile_count <= 0

  • This resulted in out-of-bounds memory accesses to: tAgA and tBgB


Solution

  • Add a k_tile_count > 0 guard before issuing TMA copy operations

  • During the drain phase (k_tile_count <= 0):

    • No new TMA copies are issued

    • The loop consumes only previously fetched pipeline data


Changes

  • Add k_tile_count > 0 guard before TMA copy (line 240)

  • Add an explanatory comment clarifying drain-phase behavior


Impact

  • Prevents potential memory corruption

  • Ensures correct and safe TMA pipeline usage

  • Makes the tutorial code more robust and semantically correct


Performance Results


Performance | 11,845.5 GFLOP/s | 12,503.9 GFLOP/s
Exec Time | 0.0227 ms | 0.0215 ms

Improvement

  • +658.4 GFLOP/s (~5.6% faster)

  • −0.0012 ms execution time

image

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant