@big-yellow-duck big-yellow-duck commented Jan 13, 2026

Motivation

This PR adds preliminary gfx1201 support for the Triton gemm_a8w8_blockscale kernel, which is used by Qwen/Qwen3-0.6B-FP8.

Going forward, more Triton kernels can be tuned to improve performance on gfx1201.

Technical Details

  • Added a base tuning script that can be adapted to other operations.
  • Added a tuning script that tunes the Triton kernel parameters for gemm_a8w8_blockscale.
  • The tuning script benchmarks different kernel parameters, such as num_warps and waves_per_eu, to find the configuration with the lowest execution time for a set of operations.
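
As an illustration of the benchmarking loop described above, here is a minimal sketch of an exhaustive parameter search. It is not the tuning script from this PR: the names `tune`, `wall_clock_measure`, and `search_space` are hypothetical, and the timing helper uses plain wall-clock time. A real GPU benchmark would also need device synchronization (for example via a helper such as `triton.testing.do_bench`).

```python
import itertools
import time
from typing import Callable, Dict, List, Tuple

Config = Dict[str, int]


def wall_clock_measure(run_kernel: Callable[[Config], None],
                       repeats: int = 5) -> Callable[[Config], float]:
    """Build a measure function returning the best wall-clock time in seconds.

    Hypothetical helper: it times a Python callable only and does not
    synchronize with any GPU device.
    """
    def measure(config: Config) -> float:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            run_kernel(config)
            timings.append(time.perf_counter() - start)
        return min(timings)
    return measure


def tune(measure: Callable[[Config], float],
         search_space: Dict[str, List[int]]) -> Tuple[Config, float]:
    """Try every combination in search_space and keep the fastest one."""
    keys = sorted(search_space)
    best_config: Config = {}
    best_time = float("inf")
    for values in itertools.product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        elapsed = measure(config)
        if elapsed < best_time:
            best_config, best_time = config, elapsed
    return best_config, best_time


# Example search space; the candidate values are illustrative only.
search_space = {"num_warps": [2, 4, 8], "waves_per_eu": [0, 1, 2, 4]}
```

Passing the measure function in as a parameter keeps the search loop independent of how timing is done, so the same loop could be reused for other operations with a different benchmark.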

Test Plan

Test the tuned configs using aiter/op_tests/triton_tests/gemm/basic/test_gemm_a8w8_blockscale.py:

pytest op_tests/triton_tests/gemm/basic/test_gemm_a8w8_blockscale.py

Test Result

126 tests passed.
2 tests were skipped because N or K does not meet the preshuffle kernel constraints (N must be a multiple of 16, K must be a multiple of 32).
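
The skip condition can be expressed as a small shape check. This is a sketch based only on the constraints stated above; the function name `meets_preshuffle_constraints` is hypothetical and is not taken from the test file.

```python
def meets_preshuffle_constraints(n: int, k: int) -> bool:
    """Return True if the shape satisfies the stated preshuffle constraints.

    Assumed from the PR description: N must be a multiple of 16 and
    K must be a multiple of 32.
    """
    return n % 16 == 0 and k % 32 == 0
```

A test could call this check first and skip any shape for which it returns False.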

Submission Checklist

@big-yellow-duck big-yellow-duck changed the title Support gfx1201 min Support gfx1201 for triton gemm_a8w8_blockscale Jan 16, 2026
@big-yellow-duck big-yellow-duck marked this pull request as ready for review January 23, 2026 02:50
@big-yellow-duck big-yellow-duck requested a review from a team January 23, 2026 02:50
@azaidy azaidy changed the title Support gfx1201 for triton gemm_a8w8_blockscale [TRITON] Support gfx1201 for triton gemm_a8w8_blockscale Jan 23, 2026
@azaidy azaidy requested review from azaidy and vgokhale January 23, 2026 03:29