[QST] Question about memory layout compatibility between mma.sync.m16n8k64.mxf4nvf4 and tcgen05 128×128×64 instructions #2951

@KarlDe1

Description

I have a question regarding memory layout compatibility between the following PTX instruction:

mma.sync.aligned.m16n8k64.row.col
.kind::mxf4nvf4.block_scale.scale_vec::4X
.f32.e2m1.e2m1.f32.ue4m3

and the newer tcgen05 128×128×64 Tensor Core instructions.
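For concreteness, the existing call is issued roughly like this as inline PTX (a sketch only: per-thread register counts follow the PTX ISA operand shapes for this instruction, but the function name, variable names, and the zeroed byte-id/thread-id scale selectors are illustrative):

```cuda
// Illustrative sketch of the mma.sync.m16n8k64 block-scale instruction.
// Per thread: A = 4 x b32 (32 e2m1 values), B = 2 x b32, C/D = 4 x f32,
// plus one b32 of packed ue4m3 block scales per operand.
__device__ void mma_m16n8k64_mxf4nvf4(float d[4],
                                      const unsigned a[4],
                                      const unsigned b[2],
                                      const float c[4],
                                      unsigned scale_a,   // packed ue4m3 scales for A
                                      unsigned scale_b)   // packed ue4m3 scales for B
{
    // The {0, 0} pairs are the byte-id/thread-id scale selectors; see the
    // PTX ISA block-scaling description for their exact semantics.
    asm volatile(
        "mma.sync.aligned.m16n8k64.row.col.kind::mxf4nvf4.block_scale"
        ".scale_vec::4X.f32.e2m1.e2m1.f32.ue4m3 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13}, "
        "{%14}, {0, 0}, {%15}, {0, 0};"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]),
          "r"(scale_a), "r"(scale_b));
}
```

The register counts (4/2/4) follow from the 16×64 and 8×64 fp4 tiles being distributed across the 32 threads of a warp.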

Main Question

On the current GPU architecture (SM101), only the tcgen05 Tensor Core instructions are supported; the mma.sync instruction above is only available on SM120+.

Therefore, my goal is to replace the existing mma.sync.m16n8k64 instruction sequence with tcgen05 instructions (for example, a single tcgen05 128×128×64 operation, or another supported MNK shape).

The key constraint is that I would like to understand whether this replacement can be done without changing the existing memory layout of A, B, and their block scales.

Is it possible to directly replace mma.sync.aligned.m16n8k64.mxf4nvf4.block_scale.scale_vec::4X with a tcgen05 instruction (e.g. 128×128×64 or another shape), while reusing the same memory layout for:

  • A operand fragments
  • B operand fragments
  • Block-scale vectors and their indexing

In other words:

  • Does tcgen05 use the same fragment ordering, row/column layout, and scale grouping as the m16n8k64 MMA instruction?

  • Or does tcgen05 require a different packing / swizzling / scale layout, making a full repacking of A/B matrices and scale vectors unavoidable?
