@tenpercent
Contributor

Summary

  • Fix 2-way LDS bank conflicts in CShuffleEpilogue by applying XOR swizzle on N1 (low bit of N)
  • With 2-byte FP16 elements and 4-byte LDS banks, two adjacent columns fall in the same bank, causing conflicts
  • XOR swizzle interleaves even/odd columns into different physical rows, eliminating conflicts

Changes

  • include/ck_tile/ops/epilogue/cshuffle_epilogue.hpp: XOR swizzle transform in MakeLdsBlockDescriptor()
  • test/ck_tile/epilogue/test_cshuffle_epilogue.cpp: Added FP16 and 7 FP8 E4M3 test cases
  • test/ck_tile/epilogue/test_cshuffle_epilogue_util.hpp: Fixed buffer initialization

Test coverage

  • FP16 tests with various warp tile configurations
  • FP8 E4M3 tests:
    • 16x16x16 (WMMA): 2x2x1, 1x4x1, 4x1x1 warp layouts
    • 32x32x16, 16x16x32, 32x32x32, 16x16x64 (MFMA)

Profiling results

Bank conflicts/access on MI350 (gfx950): 0.00-0.08 across all tested configurations

Test plan

  • Run test_ck_tile_cshuffle_epilogue to verify correctness
  • Profile with rocprof-compute to verify bank conflict reduction

Problem: CShuffleEpilogue exhibited 2-way LDS bank conflicts during
store operations. With 2-byte FP16 elements and 4-byte LDS banks,
two adjacent columns fall in the same bank, which causes conflicts when
the MFMA warp distribution assigns adjacent columns to adjacent threads.

Solution: Apply XOR swizzle on N1 (low bit of N) to interleave even/odd
columns into different physical rows:
- M' = M ^ (N & 1)
- Even columns (N=0,2,4...) stay in physical row M
- Odd columns (N=1,3,5...) go to physical row M^1

This spreads adjacent columns to different bank regions, eliminating
conflicts for all tile sizes including 16x16x16 WMMA configurations.
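
The index mapping above can be sketched in plain C++. This is only an illustration of M' = M ^ (N & 1), not the descriptor transform used in MakeLdsBlockDescriptor(). The function names, the row-major layout, and the 32-bank count are assumptions made for the example; only the 2-byte element size and 4-byte bank width come from the description.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative constants. kNumBanks is an assumption, not taken from the PR.
constexpr std::uint32_t kElemBytes = 2;  // FP16 element size
constexpr std::uint32_t kBankBytes = 4;  // LDS bank width
constexpr std::uint32_t kNumBanks  = 32; // assumed bank count

// Physical row for logical (m, n): M' = M ^ (N & 1).
constexpr std::uint32_t swizzle_row(std::uint32_t m, std::uint32_t n)
{
    return m ^ (n & 1u);
}

// Bank index of an element in an assumed row-major buffer with
// n_per_row elements per row.
constexpr std::uint32_t bank_of(std::uint32_t row, std::uint32_t n, std::uint32_t n_per_row)
{
    return ((row * n_per_row + n) * kElemBytes / kBankBytes) % kNumBanks;
}

// Bank without the swizzle: the logical row is used directly.
constexpr std::uint32_t bank_plain(std::uint32_t m, std::uint32_t n, std::uint32_t n_per_row)
{
    return bank_of(m, n, n_per_row);
}

// Bank with the swizzle: odd columns move to the neighbouring row.
constexpr std::uint32_t bank_swizzled(std::uint32_t m, std::uint32_t n, std::uint32_t n_per_row)
{
    return bank_of(swizzle_row(m, n), n, n_per_row);
}

// Columns 0 and 1 of row 0 share a bank before the swizzle and not after.
static_assert(bank_plain(0, 0, 16) == bank_plain(0, 1, 16), "same bank without swizzle");
static_assert(bank_swizzled(0, 0, 16) != bank_swizzled(0, 1, 16), "different banks with swizzle");
```

Because N is unchanged and XOR with the same bit is its own inverse, applying swizzle_row twice returns the original row, so the mapping is a permutation of the tile and no two logical elements collide.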

Test coverage:
- Added FP16 tests with various warp tile configurations
- Added 7 FP8 E4M3 test cases with various warp tile configurations:
  - 16x16x16 (WMMA): 2x2x1, 1x4x1, 4x1x1 warp layouts
  - 32x32x16, 16x16x32, 32x32x32, 16x16x64 (MFMA)
- Fixed test utility to initialize entire thread buffer for correct
  output regardless of tile distribution
- FP8 tests compare raw byte values (0x40 = 2.0) due to OCP vs FNUZ
  format differences between device and host (to be updated for gfx942)
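
To make the raw-byte comparison concrete, here is a minimal sketch of decoding a finite E4M3 byte under two exponent biases. The helper name decode_e4m3 is hypothetical, and the biases (7 for OCP, 8 for FNUZ) reflect my understanding of the two formats rather than anything stated in this PR. NaN and other special encodings are deliberately ignored.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode a finite E4M3 value (1 sign, 4 exponent, 3 mantissa bits)
// for a given exponent bias. Special encodings are not handled.
inline float decode_e4m3(std::uint8_t bits, int bias)
{
    const int sign = (bits >> 7) & 0x1;
    const int exp  = (bits >> 3) & 0xF;
    const int man  = bits & 0x7;

    // exp == 0 is subnormal: no implicit leading 1, exponent fixed at 1 - bias.
    const float mag = (exp == 0) ? std::ldexp(man / 8.0f, 1 - bias)
                                 : std::ldexp(1.0f + man / 8.0f, exp - bias);
    return sign ? -mag : mag;
}

constexpr int kBiasOcp  = 7; // assumed OCP E4M3 bias
constexpr int kBiasFnuz = 8; // assumed FNUZ E4M3 bias
```

Under these assumptions, decode_e4m3(0x40, kBiasOcp) gives 2.0 and decode_e4m3(0x40, kBiasFnuz) gives 1.0, which is why comparing decoded floats across the two formats would be misleading while comparing raw bytes is not.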

Profiling results show 0.00-0.08 bank conflicts/access across all
tested configurations on MI350 (gfx950).
@tenpercent tenpercent force-pushed the cshuffle-epilogue-bank-conflict-tests branch from 2090c5e to 6b7e5d3 Compare January 24, 2026 04:55