Description
When using the mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 instruction on Hopper for FP8 GEMM, I found the performance to be slow. Profiling with NCU (Nsight Compute), I observed that this instruction first converts the FP8 data to FP16 and then performs an FP16 GEMM.
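For context, this is roughly how the instruction is issued (a minimal sketch with inline PTX, not my full kernel; the fragment register layouts follow the PTX ISA description for m16n8k32 with 8-bit operands):

```cuda
#include <cstdint>

// Sketch: one warp issuing the FP8 (e4m3) MMA with f32 accumulation.
// Per-thread fragment sizes for m16n8k32 with 8-bit A/B and f32 C/D:
//   A: 4 x .b32 regs (16 FP8 values), B: 2 x .b32 regs (8 FP8 values),
//   C/D: 4 x f32 regs. Fragments are assumed already loaded/packed.
__device__ void mma_fp8_e4m3(float d[4], const uint32_t a[4],
                             const uint32_t b[2], const float c[4]) {
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
        : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "r"(b[0]), "r"(b[1]),
          "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```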
Is this an issue with my usage, or is there a way to avoid the conversion? I also noticed that the wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 instruction does not perform this FP16 conversion, but it only supports reading matrix B from shared memory. My requirement is to perform FP8 GEMM with both matrices A and B read from registers, without sacrificing performance. Is there another way to achieve this? Thank you very much for your response.