I have a question about memory-layout compatibility between the following PTX instruction:

```
mma.sync.aligned.m16n8k64.row.col
    .kind::mxf4nvf4.block_scale.scale_vec::4X
    .f32.e2m1.e2m1.f32.ue4m3
```

and the newer tcgen05 128×128×64 Tensor Core instructions.
Main Question
On the architecture I am currently targeting (SM101), only the tcgen05 Tensor Core instructions are supported; the mma.sync instruction above (SM120+) is not available there.
Therefore, my goal is to replace the existing mma.sync.m16n8k64 instruction sequence with tcgen05 instructions (for example, a single tcgen05 128×128×64 operation, or another supported MNK shape).
The key constraint: I would like to know whether this replacement can be done without changing the existing memory layout of A, B, and their block scales.
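For concreteness, here is my mental model of the block-scale semantics as a plain-Python CPU sketch (illustrative only — plain floats stand in for decoded e2m1 data and ue4m3 scales, and this says nothing about the per-thread fragment placement, which is exactly what my question is about). With scale_vec::4X and K=64, each row of A and each column of B carries 4 scale factors, one per 16-element K block:

```python
# CPU reference of the block-scaled MMA semantics:
#   D = (A * SF_A) @ (B * SF_B) + C
# scale_vec::4X with K=64 means 4 scale factors per row of A
# (and per column of B), each covering a 16-element K block.
M, N, K = 16, 8, 64
SCALE_VEC = 4              # scale_vec::4X
KBLOCK = K // SCALE_VEC    # 16 elements share one scale factor

def block_scaled_mma(A, B, C, sf_a, sf_b):
    """A: M x K, B: K x N, C: M x N,
    sf_a: M x SCALE_VEC, sf_b: N x SCALE_VEC (all plain floats)."""
    D = [[C[i][j] for j in range(N)] for i in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0
            for k in range(K):
                kb = k // KBLOCK  # which scale block this k falls in
                acc += (A[i][k] * sf_a[i][kb]) * (B[k][j] * sf_b[j][kb])
            D[i][j] += acc
    return D
```

The math itself is not in question; what I need to know is whether the thread-level placement of the A/B fragments and of the sf_a/sf_b vectors differs between the two instruction families.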
Is it possible to directly replace mma.sync.aligned.m16n8k64.mxf4nvf4.block_scale.scale_vec::4X with a tcgen05 instruction (e.g. 128×128×64 or another shape), while reusing the same memory layout for:
- A operand fragments
- B operand fragments
- Block-scale vectors and their indexing
In other words:
- Does tcgen05 use the same fragment ordering, row/col layout, and scale grouping as the m16n8k64 MMA instruction?
- Or does tcgen05 require a different packing / swizzling / scale layout, making a full repacking of the A/B matrices and scale vectors unavoidable?
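For scale, one reason I mention the 128×128×64 shape: a single tcgen05 MMA of that shape covers the same output footprint as an entire grid of m16n8k64 tiles (tile-count arithmetic only; this implies nothing about operand or scale layout compatibility):

```python
# How many m16n8k64 mma.sync tiles one 128x128x64 tcgen05 MMA covers.
MMA_M, MMA_N, MMA_K = 16, 8, 64
TC_M, TC_N, TC_K = 128, 128, 64
tiles = (TC_M // MMA_M) * (TC_N // MMA_N) * (TC_K // MMA_K)
print(tiles)  # 8 * 16 * 1 = 128 mma.sync tiles per tcgen05 op
```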