
Split bf16 and fp16 out for CK fp8_rowwise #4419


Open · cthi wants to merge 1 commit into main from export-D77544503

Conversation

@cthi (Contributor) commented on Jun 30, 2025

Summary:
While integrating AMD fp8 rowwise into torch, I noticed the library size is a lot larger than in FBGEMM 1.2.0. The cause is that we recently added support for fp16 output (in addition to bf16) for fp8 rowwise in D74770197, which would roughly double the size. You can also see this in the nightly wheel.

This diff makes 2 changes:

  • Split the fp16-output kernels into their own files while sharing the common kernel template fp8_rowwise_common.h. This lets us control which kernels to include, e.g. only add the bf16-output kernels for torch (see the sketch after this list).
  • The fp16 output is for a future rec-sys related use case, so we can probably remove the llama-specific tuned/optimized kernels for now. We can add new tuning for the fp16-output use case for recsys later.
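For illustration, here is a minimal sketch of what the split could look like. Only `fp8_rowwise_common.h` and the `fp8fp8bf16`/`fp8fp8fp16` filename prefixes come from this diff; the launcher signature, tile sizes, and function names below are assumptions, not the actual FBGEMM code.

```cpp
// fp8_rowwise_common.h -- one shared CK kernel template; the output element
// type is a template parameter, so the kernel body is written only once.
// (Sketch: declaration only; names and parameters are hypothetical.)
#include <ATen/ATen.h>

template <typename OutDataType, int BlockSize, int MPerBlock, int NPerBlock, int KPerBlock>
at::Tensor f8f8_rowwise_impl(
    at::Tensor XQ,       // fp8 activations
    at::Tensor WQ,       // fp8 weights
    at::Tensor x_scale,  // rowwise activation scales
    at::Tensor w_scale,  // rowwise weight scales
    at::Tensor Y);       // preallocated output (bf16 or fp16)

// fp8fp8bf16_rowwise_256x128x128x128.hip -- bf16-out instance, in its own file.
at::Tensor fp8fp8bf16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  return f8f8_rowwise_impl<at::BFloat16, 256, 128, 128, 128>(XQ, WQ, x_scale, w_scale, Y);
}

// fp8fp8fp16_rowwise_256x128x128x128.hip -- fp16-out instance, in its own file,
// so a build that only needs bf16 output simply does not compile this one.
at::Tensor fp8fp8fp16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  return f8f8_rowwise_impl<at::Half, 256, 128, 128, 128>(XQ, WQ, x_scale, w_scale, Y);
}
```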

Differential Revision: D77544503


netlify bot commented Jun 30, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e7a2ff2 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/6862f82a5fb6fb0008ff6afe |
| 😎 Deploy Preview | https://deploy-preview-4419--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D77544503

Summary:
Pull Request resolved: pytorch#4419

While integrating AMD fp8 rowwise into torch, I noticed the library size is a lot larger than in FBGEMM 1.2.0. The cause is that we recently added support for fp16 output (in addition to bf16) for fp8 rowwise in D74770197, which would roughly double the size. You can also see this in the nightly wheel.

This diff makes 2 changes:
- Split the fp16-output kernels into their own files while sharing the common CK kernel implementation template `fp8_rowwise_common.h`. This lets us control which kernels to include, e.g. only add the bf16-output kernels for torch (see the dispatch sketch after this list).
  - For bf16 output we will use the `fp8fp8bf16` filename prefix
  - For fp16 output we will use the `fp8fp8fp16` filename prefix
- The fp16 output is for a future rec-sys related use case, so we can probably remove the llama-specific tuned/optimized kernels for now (also reducing the size of those kernels). We can add new tuning for the fp16-output use case for recsys later.
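As a hedged sketch of what "control which kernels to include" could look like on the dispatch side: with the per-dtype instances in separate files, a consumer such as torch could compile only the `fp8fp8bf16_*` sources and guard the fp16 path behind a build flag. The function names and the `FBGEMM_FP8_ROWWISE_FP16_OUT` flag below are assumptions for illustration only; only the bf16/fp16 file split itself comes from this diff.

```cpp
// Hypothetical dispatcher sketch -- not the actual FBGEMM dispatcher.
// FBGEMM_FP8_ROWWISE_FP16_OUT is an assumed flag name, for illustration only.
#include <ATen/ATen.h>
#include <stdexcept>

// Declarations of per-dtype kernel instances, each defined in its own file.
at::Tensor fp8fp8bf16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y);
#ifdef FBGEMM_FP8_ROWWISE_FP16_OUT
at::Tensor fp8fp8fp16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y);
#endif

at::Tensor f8f8_rowwise(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  if (Y.scalar_type() == at::kBFloat16) {
    return fp8fp8bf16_rowwise_256x128x128x128(XQ, WQ, x_scale, w_scale, Y);
  }
#ifdef FBGEMM_FP8_ROWWISE_FP16_OUT
  if (Y.scalar_type() == at::kHalf) {
    return fp8fp8fp16_rowwise_256x128x128x128(XQ, WQ, x_scale, w_scale, Y);
  }
#endif
  throw std::runtime_error("fp8 rowwise: unsupported output dtype");
}
```

Under this layout, dropping the llama-tuned fp16 instances amounts to not compiling the corresponding `fp8fp8fp16_*` files, without touching the shared template.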

Differential Revision: D77544503
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D77544503

@cthi force-pushed the export-D77544503 branch from 2716e7c to e7a2ff2 on June 30, 2025 20:48