Support skip scaling for the input tensor in the Triton rowwise FP8 kernel #4362
Conversation
✅ Deploy Preview for pytorch-fbgemm-docs ready!
This pull request was exported from Phabricator. Differential Revision: D76759999
Force-pushed from 1d9715b to 3efacfc
Force-pushed from 3efacfc to 6bf05c9
Force-pushed from 6bf05c9 to 3f6f8c0
Force-pushed from 3f6f8c0 to 733c0b5
This pull request has been merged in 6152f34.
Summary:
X-link: https://github.com/facebookresearch/FBGEMM/pull/1431
The scaling process (generating the inputs to the FP8 GEMM) adds non-trivial cost to FP8 quantization and can offset the gain of the FP8 GEMM, especially in memory-bound cases.
By [re-designing the activation layer](https://fb.workplace.com/groups/1033540429995021/permalink/24472616132327454) of the model, most elements of the input tensor to the FP8 GEMM (more specifically, the activation, corresponding to input `a` in this revision) can fall within the FP8 range.
Note that even when the values are already within the FP8 range, FP8 scaling can still help accuracy, since the scaling process tries to use the full FP8 bits to encode as much information as possible. E2E model quality therefore needs to be evaluated on a per-model basis.
[A study](https://docs.google.com/document/d/1jEOVqOIn3cKe3PgFcHsxW9EKYi_-6QHnIuiPY8gLEwQ/edit?tab=t.0) has shown that, by replacing FP8 scaling with an FP8 clamp on an Ads 500x inference model, E2E throughput can improve by a further 7%, with the E2E QPS gain increasing from 21% (FP8) to 28% (FP8 with skip scaling).
This revision adds support for this case in the Triton row-wise kernel.
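For context, here is a minimal PyTorch sketch of the two activation-quantization paths (the function names and the e4m3 dtype are illustrative assumptions, not the FBGEMM API). The standard row-wise path pays for an extra amax reduction and rescale over `a`, while the skip-scaling path only clamps and casts, which is where the saving comes from in memory-bound cases.

```python
import torch

# Max representable magnitude for the FP8 dtype (e4m3fn here; an assumption for illustration).
FP8_MAX = torch.finfo(torch.float8_e4m3fn).max

def quantize_a_rowwise(a: torch.Tensor):
    # Standard row-wise scaling: read `a` to find each row's amax, rescale the row
    # into FP8 range, and return per-row inverse scales to apply after the GEMM.
    row_amax = a.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = FP8_MAX / row_amax
    a_fp8 = (a * scale).to(torch.float8_e4m3fn)
    a_inv_scale = (1.0 / scale).squeeze(1)  # shape [M], float32
    return a_fp8, a_inv_scale

def quantize_a_skip_scaling(a: torch.Tensor):
    # Skip-scaling path: the activation is assumed to already lie (mostly) within the
    # FP8 range, so a clamp + cast suffices; the per-row scale degenerates to ones.
    a_fp8 = a.clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn)
    a_inv_scale = torch.ones(a.shape[0], device=a.device, dtype=torch.float32)
    return a_fp8, a_inv_scale
```

With skip scaling, the row-wise GEMM can then consume `a_fp8` with unit scales, avoiding the extra pass over the activation that the amax reduction would otherwise require.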
Reviewed By: y-x-c
Differential Revision: D76759999