
[QST] fp8 gemm #1986

Open
yangjianfengo1 opened this issue Dec 15, 2024 · 0 comments


When using the mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 instruction on a Hopper GPU for FP8 GEMM, I found the performance to be slow. Upon analyzing with NCU, I observed that this instruction first converts the data to FP16 and then performs an FP16 GEMM.
[Image: NCU profile showing the instruction sequence]
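For reference, this is roughly how the instruction is issued (a minimal sketch; the wrapper name is mine, and fragment loading and the surrounding kernel are omitted). The register counts follow the PTX ISA: for m16n8k32 with e4m3 inputs, each thread holds A as four .b32 registers, B as two .b32 registers, and C/D as four .f32 registers.

```cuda
#include <cstdint>

// Minimal sketch of the register-resident FP8 MMA (m16n8k32, e4m3 x e4m3).
// Each .b32 register of A/B packs four e4m3 values; C/D are FP32.
__device__ void mma_fp8_m16n8k32(float d[4], const uint32_t a[4],
                                 const uint32_t b[2], const float c[4]) {
  asm volatile(
      "mma.sync.aligned.m16n8k32.row.col.f32.e4m3.e4m3.f32 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
      : "=f"(d[0]), "=f"(d[1]), "=f"(d[2]), "=f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "r"(b[0]), "r"(b[1]),
        "f"(c[0]), "f"(c[1]), "f"(c[2]), "f"(c[3]));
}
```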
Is this an issue with my usage, or is there a way to avoid this conversion? Additionally, I noticed that the wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 instruction does not perform the FP16 conversion, but it only supports reading matrix B from shared memory. My requirement is to perform FP8 GEMM with both matrices A and B read from registers, without compromising performance. Is there any other way to achieve this? Thank you very much for your response.
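For comparison, a sketch of the wgmma form with A in registers and B supplied through a 64-bit shared-memory matrix descriptor (assumes sm_90a; the wrapper name is mine, and descriptor construction plus the required wgmma.fence / commit_group / wait_group sequence are omitted for brevity):

```cuda
#include <cstdint>

// Sketch of the register-A ("RS") wgmma form: A comes from registers,
// B from shared memory via a matrix descriptor. scale-d is passed as a
// predicate: nonzero accumulates into D, zero overwrites it. The trailing
// immediates are the A/B scale operands (1 = no negation).
__device__ void wgmma_fp8_m64n8k32_rs(float d[4], const uint32_t a[4],
                                      uint64_t desc_b, int scale_d) {
  asm volatile(
      "{\n"
      ".reg .pred p;\n"
      "setp.ne.b32 p, %9, 0;\n"
      "wgmma.mma_async.sync.aligned.m64n8k32.f32.e4m3.e4m3 "
      "{%0,%1,%2,%3}, {%4,%5,%6,%7}, %8, p, 1, 1;\n"
      "}\n"
      : "+f"(d[0]), "+f"(d[1]), "+f"(d[2]), "+f"(d[3])
      : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
        "l"(desc_b), "r"(scale_d));
}
```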
