[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130

yuan-luo · 2025-07-18T02:51:05Z

Motivation

Optimize per_token_quant_fp8 kernel with warp reduce and cute tensor. Obtained 5% speedup in large seq len.

Main:
$python ./sgl-kernel/benchmark/bench_per_token_quant_fp8.py
INFO 07-18 13:09:53 [__init__.py:244] Automatically detected platform cuda.
✅ All implementations match
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    27.904000    26.303999
1         16.0    128.0    45.968000    42.112000
2         16.0    256.0    84.352002    75.903997
3         16.0    512.0   156.512007   136.511996
4         16.0   1024.0   299.791992   257.375985
5         16.0   2048.0   588.207990   499.392003
6         16.0   4096.0  1165.744007   985.823989
7         32.0     64.0    46.080001    42.080000
8         32.0    128.0    84.320001    74.368000
9         32.0    256.0   156.415999   136.480004
10        32.0    512.0   300.191998   257.759988
11        32.0   1024.0   587.856025   497.840002
12        32.0   2048.0  1164.160013   985.599995
13        32.0   4096.0  2314.687967  1953.472018
14        64.0     64.0    84.448002    75.007997
15        64.0    128.0   156.639993   134.976000
16        64.0    256.0   300.303996   257.647991
17        64.0    512.0   587.423980   497.983992
18        64.0   1024.0  1164.224029   985.648006
19        64.0   2048.0  2315.167904  1953.439951
20        64.0   4096.0  4616.159916  3888.000011
21       128.0     64.0   156.384006   136.447996
22       128.0    128.0   300.096005   257.887989
23       128.0    256.0   587.375998   497.855991
24       128.0    512.0  1164.127946   985.567987
25       128.0   1024.0  2315.007925  1953.536034
26       128.0   2048.0  4611.519814  3887.968063
27       128.0   4096.0  9215.872288  7764.287949

This PR:
$python ./sgl-kernel/benchmark/bench_per_token_quant_fp8.py
INFO 07-18 13:13:04 [__init__.py:244] Automatically detected platform cuda.
✅ All implementations match
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    27.872000    29.056000
1         16.0    128.0    46.112001    46.271998
2         16.0    256.0    84.288001    76.448001
3         16.0    512.0   156.287998   142.912000
4         16.0   1024.0   299.968004   261.312008
5         16.0   2048.0   588.127971   494.383991
6         16.0   4096.0  1165.727973   952.736020
7         32.0     64.0    46.048000    46.208002
8         32.0    128.0    84.224001    76.191999
9         32.0    256.0   156.544000   143.935993
10        32.0    512.0   299.968004   261.664003
11        32.0   1024.0   587.344021   493.600011
12        32.0   2048.0  1165.055990   952.863991
13        32.0   4096.0  2314.304113  1858.175993
14        64.0     64.0    84.256001    77.760004
15        64.0    128.0   156.287998   142.848000
16        64.0    256.0   300.384015   261.440009
17        64.0    512.0   587.711990   492.832005
18        64.0   1024.0  1165.312052   952.319980
19        64.0   2048.0  2314.975977  1858.399987
20        64.0   4096.0  4617.663860  3667.775989
21       128.0     64.0   156.351998   143.360004
22       128.0    128.0   300.319999   261.536002
23       128.0    256.0   587.423980   493.024006
24       128.0    512.0  1164.064050   952.319980
25       128.0   1024.0  2314.975977  1858.047962
26       128.0   2048.0  4616.447926  3667.215943
27       128.0   4096.0  9214.528084  7298.208237

Modifications

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling and Accuracy Results.
For reviewers: If you haven't made any contributions to this PR and are only assisting with merging the main branch, please remove yourself as a co-author when merging the PR.
Please feel free to join our Slack channel at https://slack.sglang.ai to discuss your PR.

gemini-code-assist

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a significant performance optimization for the per_token_quant_fp8 CUDA kernel, which is critical for efficient FP8 quantization in the SGLang framework. By refactoring the kernel to utilize advanced CUDA features like warp-level parallelism and the cute tensor library, it achieves notable speed improvements across various batch and sequence lengths.

Highlights

Performance Optimization: I've re-implemented the per_token_quant_fp8_kernel to leverage warp-level reductions and cute tensors for improved efficiency, resulting in up to 10% speedup as demonstrated by the provided benchmarks.
Kernel Architecture Refinement: The kernel now transitions from block-level reduction (cub::block::block_reduce) to warp-level reduction (warpReduceMax) for calculating the maximum value per token, enabling more granular and efficient parallelization.
CUDA cute Tensor Integration: I've adopted the cute tensor library for managing global memory access within the kernel, providing a more expressive and potentially optimized way to handle tensor operations.
Optimized Kernel Launch Configuration: The kernel launch parameters have been adjusted to align with the new warp-centric design, processing multiple tokens per CTA (specifically, 8 tokens per 256-thread CTA).

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist is currently in preview and may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request refactors the per_token_quant_fp8 CUDA kernel to improve performance by processing multiple tokens per thread block, with each token handled by a single warp. This is a solid optimization strategy. The review identified a critical race condition in the use of shared memory for the scaling factor, which would lead to incorrect quantization results. A medium-severity issue with an unnecessary const_cast was also found.

sgl-kernel/csrc/gemm/per_token_quant_fp8.cu

luoyuan.luo added 6 commits July 18, 2025 10:16

Refactor per_token_quant_fp8

3bd115f

Tiled processing two passes

2154768

Introduce cute TiledCopy for small & large kernel

fc6e6e7

Fix precision issues

cae33b8

Fix precision issues

c310190

Fix precision issues

33c1013

yuan-luo requested review from zhyncs, ispobock, HandH1998, BBuf, yizhang2077, merrymercy, yinfan98 and HaiShaw as code owners July 18, 2025 02:51

gemini-code-assist bot reviewed Jul 18, 2025

View reviewed changes

sgl-kernel/csrc/gemm/per_token_quant_fp8.cu Show resolved Hide resolved

sgl-kernel/csrc/gemm/per_token_quant_fp8.cu Outdated Show resolved Hide resolved

Introduce warp reduce in per_token_quant_fp8

8974703

yuan-luo force-pushed the per_token_quant_fp8 branch from 7d3e37f to 8974703 Compare July 18, 2025 06:44

ispobock assigned BBuf and HandH1998 Jul 18, 2025

yuan-luo added 3 commits July 19, 2025 09:47

Merge branch 'main' into per_token_quant_fp8

bdf3a3a

Merge branch 'main' into per_token_quant_fp8

2431b02

Merge branch 'main' into per_token_quant_fp8

6f4739c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130

[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130

yuan-luo commented Jul 18, 2025 •

edited

Loading

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130

Are you sure you want to change the base?

[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130

Conversation

yuan-luo commented Jul 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Modifications

Checklist

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yuan-luo commented Jul 18, 2025 •

edited

Loading