
Split bf16 and fp16 out for CK fp8_rowwise #4419


Open · cthi wants to merge 1 commit into main from export-D77544503

Conversation

@cthi (Contributor) commented on Jun 30, 2025

Summary:
While integrating AMD fp8 rowwise into torch, I noticed the library size is a lot larger than in FBGEMM 1.2.0. The cause is that we recently added support for fp16 output (in addition to bf16) for fp8 rowwise in D74770197, which would roughly double the size. You can also see this in the nightly wheel.

This diff makes 2 changes:

  • Split the fp16-output kernels into their own files while sharing the common kernel template fp8_rowwise_common.h. This lets us control which kernels to include, e.g. only add the bf16-output kernels for torch (see the sketch after this list).
  • The fp16 output is for a future rec-sys related use case, so we can probably remove the llama-specific tuned/optimized kernels for now. We can add new tuning for the fp16-output use case for recsys later.
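For illustration, here is a minimal sketch of what the split could look like. Only `fp8_rowwise_common.h` and the `fp8fp8bf16`/`fp8fp8fp16` filename prefixes come from this diff; the launcher signature, tile sizes, and function names below are assumptions, not the actual FBGEMM code.

```cpp
// fp8_rowwise_common.h -- one shared CK kernel template; the output element
// type is a template parameter, so the kernel body is written only once.
// (Sketch: declaration only; names and parameters are hypothetical.)
#include <ATen/ATen.h>

template <typename OutDataType, int BlockSize, int MPerBlock, int NPerBlock, int KPerBlock>
at::Tensor f8f8_rowwise_impl(
    at::Tensor XQ,       // fp8 activations
    at::Tensor WQ,       // fp8 weights
    at::Tensor x_scale,  // rowwise activation scales
    at::Tensor w_scale,  // rowwise weight scales
    at::Tensor Y);       // preallocated output (bf16 or fp16)

// fp8fp8bf16_rowwise_256x128x128x128.hip -- bf16-out instance, in its own file.
at::Tensor fp8fp8bf16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  return f8f8_rowwise_impl<at::BFloat16, 256, 128, 128, 128>(XQ, WQ, x_scale, w_scale, Y);
}

// fp8fp8fp16_rowwise_256x128x128x128.hip -- fp16-out instance, in its own file,
// so a build that only needs bf16 output simply does not compile this one.
at::Tensor fp8fp8fp16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  return f8f8_rowwise_impl<at::Half, 256, 128, 128, 128>(XQ, WQ, x_scale, w_scale, Y);
}
```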

Differential Revision: D77544503


netlify bot commented Jun 30, 2025

Deploy Preview for pytorch-fbgemm-docs ready!

| Name | Link |
| --- | --- |
| 🔨 Latest commit | e7a2ff2 |
| 🔍 Latest deploy log | https://app.netlify.com/projects/pytorch-fbgemm-docs/deploys/6862f82a5fb6fb0008ff6afe |
| 😎 Deploy Preview | https://deploy-preview-4419--pytorch-fbgemm-docs.netlify.app |

@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D77544503

Summary:
Pull Request resolved: pytorch#4419

While integrating AMD fp8 rowwise into torch, I noticed the library size is a lot larger than in FBGEMM 1.2.0. The cause is that we recently added support for fp16 output (in addition to bf16) for fp8 rowwise in D74770197, which would roughly double the size. You can also see this in the nightly wheel.

This diff makes 2 changes:
- Split the fp16-output kernels into their own files while sharing the common CK kernel implementation template `fp8_rowwise_common.h`. This lets us control which kernels to include, e.g. only add the bf16-output kernels for torch (see the dispatch sketch after this list).
  - For bf16 output we will use the `fp8fp8bf16` filename prefix
  - For fp16 output we will use the `fp8fp8fp16` filename prefix
- The fp16 output is for a future rec-sys related use case, so we can probably remove the llama-specific tuned/optimized kernels for now (also reducing the size of those kernels). We can add new tuning for the fp16-output use case for recsys later.
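As a hedged sketch of what "control which kernels to include" could look like on the dispatch side: with the per-dtype instances in separate files, a consumer such as torch could compile only the `fp8fp8bf16_*` sources and guard the fp16 path behind a build flag. The function names and the `FBGEMM_FP8_ROWWISE_FP16_OUT` flag below are assumptions for illustration only; only the bf16/fp16 file split itself comes from this diff.

```cpp
// Hypothetical dispatcher sketch -- not the actual FBGEMM dispatcher.
// FBGEMM_FP8_ROWWISE_FP16_OUT is an assumed flag name, for illustration only.
#include <ATen/ATen.h>
#include <stdexcept>

// Declarations of per-dtype kernel instances, each defined in its own file.
at::Tensor fp8fp8bf16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y);
#ifdef FBGEMM_FP8_ROWWISE_FP16_OUT
at::Tensor fp8fp8fp16_rowwise_256x128x128x128(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y);
#endif

at::Tensor f8f8_rowwise(
    at::Tensor XQ, at::Tensor WQ, at::Tensor x_scale, at::Tensor w_scale, at::Tensor Y) {
  if (Y.scalar_type() == at::kBFloat16) {
    return fp8fp8bf16_rowwise_256x128x128x128(XQ, WQ, x_scale, w_scale, Y);
  }
#ifdef FBGEMM_FP8_ROWWISE_FP16_OUT
  if (Y.scalar_type() == at::kHalf) {
    return fp8fp8fp16_rowwise_256x128x128x128(XQ, WQ, x_scale, w_scale, Y);
  }
#endif
  throw std::runtime_error("fp8 rowwise: unsupported output dtype");
}
```

Under this layout, dropping the llama-tuned fp16 instances amounts to not compiling the corresponding `fp8fp8fp16_*` files, without touching the shared template.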

Differential Revision: D77544503
@facebook-github-bot (Contributor)

This pull request was exported from Phabricator. Differential Revision: D77544503

@cthi force-pushed the export-D77544503 branch from 2716e7c to e7a2ff2 on June 30, 2025 20:48