Fix out-of-bounds TMA access in wgmma_tma_sm90 tutorial #2945
Add `k_tile_count` guard to prevent TMA copies during pipeline drain #2944
Summary
This PR fixes an out-of-bounds memory access bug in `wgmma_tma_sm90.cu` that occurred during the pipeline drain phase. The main loop continued issuing TMA copy operations even after `k_tile_count <= 0`, leading to invalid accesses to the `tAgA` and `tBgB` tensors once `k_tile` advanced beyond the valid tile range.

The fix adds a guard to ensure that new TMA copies are only issued when valid tiles remain. During the drain phase, the loop now correctly consumes only pre-fetched data already present in the pipeline.
Problem
During the pipeline drain phase, `k_tile` continued to increment
TMA copy operations were still issued when `k_tile_count <= 0`
This resulted in out-of-bounds memory accesses to `tAgA` and `tBgB`
Solution
Add a `k_tile_count > 0` guard before issuing TMA copy operations
During the drain phase (`k_tile_count <= 0`):
  No new TMA copies are issued
  The loop consumes only previously fetched pipeline data
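The effect of the guard can be illustrated with a small host-only C++ sketch. It is not the tutorial's code: `simulate_mainloop`, `DrainStats`, and `k_pipe_max` are hypothetical names, and the loop shape (a prologue that prefetches `k_pipe_max` tiles, then a main loop that runs while `k_tile_count > -k_pipe_max`) is an assumption modeled on the description above. A "copy" here only records which tile index would have been read.

```cpp
#include <cassert>

// Hypothetical bookkeeping for the simulation; not part of the tutorial.
struct DrainStats {
    int consumed;       // tiles consumed by the compute stage
    int copies_issued;  // copies issued, prologue included
    int out_of_range;   // copies whose tile index was >= num_k_tiles
};

// Simulates a software-pipelined main loop on the host.
// Assumes num_k_tiles >= k_pipe_max, so the prologue stays in range.
inline DrainStats simulate_mainloop(int num_k_tiles, int k_pipe_max, bool guard_copies) {
    DrainStats stats{0, 0, 0};
    int k_tile_count = num_k_tiles;  // tiles not yet requested
    int k_tile = 0;                  // index of the next tile to request

    // Stand-in for a TMA copy: records the tile index instead of reading memory.
    auto issue_copy = [&](int tile) {
        ++stats.copies_issued;
        if (tile >= num_k_tiles) {
            ++stats.out_of_range;
        }
    };

    // Prologue: fill every pipeline stage before consuming anything.
    for (int pipe = 0; pipe < k_pipe_max; ++pipe) {
        issue_copy(k_tile);
        --k_tile_count;
        ++k_tile;
    }

    // Main loop: keeps running during the drain phase (k_tile_count <= 0)
    // so that the tiles prefetched earlier are still consumed.
    while (k_tile_count > -k_pipe_max) {
        ++stats.consumed;  // consume one previously fetched stage

        // With the guard, no new copy is issued once no valid tiles remain.
        if (!guard_copies || k_tile_count > 0) {
            issue_copy(k_tile);
        }
        --k_tile_count;
        ++k_tile;
    }
    return stats;
}
```

Under these assumptions, `simulate_mainloop(8, 3, false)` consumes 8 tiles but issues 3 copies past the last valid tile, one per pipeline stage during the drain. `simulate_mainloop(8, 3, true)` consumes the same 8 tiles and issues no out-of-range copies.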
Changes
Add `k_tile_count > 0` guard before TMA copy (line 240)
Add an explanatory comment clarifying drain-phase behavior
Impact
Prevents potential memory corruption
Ensures correct and safe TMA pipeline usage
Makes the tutorial code more robust and semantically correct
Performance Results
Metric | Before | After
Performance | 11,845.5 GFLOP/s | 12,503.9 GFLOP/s
Exec Time | 0.0227 ms | 0.0215 ms
Improvement
+658.4 GFLOP/s (~5.6% faster)
−0.0012 ms execution time