When using the mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 instruction on Hopper for FP8 GEMM, I found the performance to be poor. Profiling with Nsight Compute (NCU) shows that this instruction first converts the FP8 operands to FP16 and then performs an FP16 MMA.
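For reference, here is a minimal repro sketch of how the instruction is issued through inline PTX. The kernel name and the naive fragment loads are illustrative only; a real kernel would stage A and B through shared memory with the proper ldmatrix layouts.

```cuda
// Minimal repro sketch (not a tuned kernel): one warp issues a single
// m16n8k32 FP8 MMA via inline PTX. Build with: nvcc -arch=sm_90a
#include <cstdint>

__global__ void fp8_mma_kernel(const uint32_t* A, const uint32_t* B, float* D) {
    const int lane = threadIdx.x & 31;

    // Per-thread fragment sizes for m16n8k32 with 8-bit operands:
    // A: 4 x .b32 (16 e4m3 values), B: 2 x .b32, accumulator: 4 x f32.
    uint32_t a0 = A[lane * 4 + 0], a1 = A[lane * 4 + 1],
             a2 = A[lane * 4 + 2], a3 = A[lane * 4 + 3];
    uint32_t b0 = B[lane * 2 + 0], b1 = B[lane * 2 + 1];
    float d0 = 0.f, d1 = 0.f, d2 = 0.f, d3 = 0.f;

    // On Hopper, NCU shows this lowering to an FP8->FP16 conversion
    // followed by an FP16 MMA, which is the slowdown described above.
    asm volatile(
        "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n"
        : "+f"(d0), "+f"(d1), "+f"(d2), "+f"(d3)
        : "r"(a0), "r"(a1), "r"(a2), "r"(a3), "r"(b0), "r"(b1));

    D[lane * 4 + 0] = d0; D[lane * 4 + 1] = d1;
    D[lane * 4 + 2] = d2; D[lane * 4 + 3] = d3;
}
```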
Is this an issue with my usage, or is there a way to avoid this conversion? I also noticed that the wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 instruction does not incur the FP16 conversion, but it only supports reading matrix B from shared memory (a sketch of that form follows below). My requirement is an FP8 GEMM with both matrices A and B read from registers, without sacrificing performance. Is there another way to achieve this? Thank you very much for your response.
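For concreteness, below is a sketch of the closest wgmma variant: A sourced from registers and B described in shared memory (the "RS" form, mirroring CUTLASS's SM90 RS MMA atoms). The helper name is hypothetical, and the construction of desc_b, the 64-bit shared-memory matrix descriptor for a properly swizzled B tile, is omitted; that descriptor is exactly what forces B through shared memory.

```cuda
// Hypothetical wrapper around the register-A / shared-memory-B FP8 wgmma.
// desc_b must be built elsewhere to match B's swizzled shared-memory layout.
#include <cstdint>

__device__ void wgmma_m64n8k32_e4m3_rs(float d[4],           // 4 f32 accumulators per thread
                                       const uint32_t a[4],  // A fragment: 4 x .b32 of e4m3
                                       uint64_t desc_b,      // B tile descriptor (shared memory)
                                       bool accumulate) {
    asm volatile("wgmma.fence.sync.aligned;\n" ::: "memory");
    asm volatile(
        "{\n"
        ".reg .pred p;\n"
        "setp.ne.b32 p, %9, 0;\n"  // scale-d: keep (1) or zero (0) the accumulator
        "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, %8, p, 1, 1;\n"
        "}\n"
        : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
        : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
          "l"(desc_b), "r"((int)accumulate));
    asm volatile("wgmma.commit_group.sync.aligned;\n" ::: "memory");
    asm volatile("wgmma.wait_group.sync.aligned 0;\n" ::: "memory");  // drain the async group
}
```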