
[sgl-kernel] Opt per_token_quant_fp8 with warp reduce #8130


Open · wants to merge 10 commits into main

Conversation

@yuan-luo (Contributor) commented on Jul 18, 2025

Motivation

Optimize the per_token_quant_fp8 kernel with a warp-level reduction and cute tensors. This yields about a 5% speedup at large sequence lengths.
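For reference, the warp-level max reduction at the heart of this approach can be sketched as follows. This is a minimal illustration using `__shfl_xor_sync`; the helper name and exact form are assumptions and may differ from the code in this PR.

```cuda
#include <cuda_runtime.h>

// Minimal sketch of a warp-level max reduction (illustrative; the actual
// helper in sgl-kernel may differ). After the butterfly loop, every lane
// of the warp holds the maximum of the 32 lanes' inputs.
__device__ __forceinline__ float warpReduceMax(float val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val = fmaxf(val, __shfl_xor_sync(0xffffffffu, val, offset));
  }
  return val;
}
```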

Main:
$python ./sgl-kernel/benchmark/bench_per_token_quant_fp8.py
INFO 07-18 13:09:53 [__init__.py:244] Automatically detected platform cuda.
✅ All implementations match
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    27.904000    26.303999
1         16.0    128.0    45.968000    42.112000
2         16.0    256.0    84.352002    75.903997
3         16.0    512.0   156.512007   136.511996
4         16.0   1024.0   299.791992   257.375985
5         16.0   2048.0   588.207990   499.392003
6         16.0   4096.0  1165.744007   985.823989
7         32.0     64.0    46.080001    42.080000
8         32.0    128.0    84.320001    74.368000
9         32.0    256.0   156.415999   136.480004
10        32.0    512.0   300.191998   257.759988
11        32.0   1024.0   587.856025   497.840002
12        32.0   2048.0  1164.160013   985.599995
13        32.0   4096.0  2314.687967  1953.472018
14        64.0     64.0    84.448002    75.007997
15        64.0    128.0   156.639993   134.976000
16        64.0    256.0   300.303996   257.647991
17        64.0    512.0   587.423980   497.983992
18        64.0   1024.0  1164.224029   985.648006
19        64.0   2048.0  2315.167904  1953.439951
20        64.0   4096.0  4616.159916  3888.000011
21       128.0     64.0   156.384006   136.447996
22       128.0    128.0   300.096005   257.887989
23       128.0    256.0   587.375998   497.855991
24       128.0    512.0  1164.127946   985.567987
25       128.0   1024.0  2315.007925  1953.536034
26       128.0   2048.0  4611.519814  3887.968063
27       128.0   4096.0  9215.872288  7764.287949

This PR:
$python ./sgl-kernel/benchmark/bench_per_token_quant_fp8.py
INFO 07-18 13:13:04 [__init__.py:244] Automatically detected platform cuda.
✅ All implementations match
per-token-dynamic-quant-fp8-performance:
    batch_size  seq_len         VLLM   SGL Kernel
0         16.0     64.0    27.872000    29.056000
1         16.0    128.0    46.112001    46.271998
2         16.0    256.0    84.288001    76.448001
3         16.0    512.0   156.287998   142.912000
4         16.0   1024.0   299.968004   261.312008
5         16.0   2048.0   588.127971   494.383991
6         16.0   4096.0  1165.727973   952.736020
7         32.0     64.0    46.048000    46.208002
8         32.0    128.0    84.224001    76.191999
9         32.0    256.0   156.544000   143.935993
10        32.0    512.0   299.968004   261.664003
11        32.0   1024.0   587.344021   493.600011
12        32.0   2048.0  1165.055990   952.863991
13        32.0   4096.0  2314.304113  1858.175993
14        64.0     64.0    84.256001    77.760004
15        64.0    128.0   156.287998   142.848000
16        64.0    256.0   300.384015   261.440009
17        64.0    512.0   587.711990   492.832005
18        64.0   1024.0  1165.312052   952.319980
19        64.0   2048.0  2314.975977  1858.399987
20        64.0   4096.0  4617.663860  3667.775989
21       128.0     64.0   156.351998   143.360004
22       128.0    128.0   300.319999   261.536002
23       128.0    256.0   587.423980   493.024006
24       128.0    512.0  1164.064050   952.319980
25       128.0   1024.0  2314.975977  1858.047962
26       128.0   2048.0  4616.447926  3667.215943
27       128.0   4096.0  9214.528084  7298.208237

Modifications

Checklist


@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on a significant performance optimization for the per_token_quant_fp8 CUDA kernel, which is critical for efficient FP8 quantization in the SGLang framework. By refactoring the kernel to utilize advanced CUDA features like warp-level parallelism and the cute tensor library, it achieves notable speed improvements across various batch and sequence lengths.

Highlights

  • Performance Optimization: I've re-implemented the per_token_quant_fp8_kernel to leverage warp-level reductions and cute tensors for improved efficiency, resulting in up to 10% speedup as demonstrated by the provided benchmarks.
  • Kernel Architecture Refinement: The kernel now transitions from block-level reduction (cub::block::block_reduce) to warp-level reduction (warpReduceMax) for calculating the maximum value per token, enabling more granular and efficient parallelization.
  • CUDA cute Tensor Integration: I've adopted the cute tensor library for managing global memory access within the kernel, providing a more expressive and potentially optimized way to handle tensor operations.
  • Optimized Kernel Launch Configuration: The kernel launch parameters have been adjusted to align with the new warp-centric design, processing multiple tokens per CTA (specifically, 8 tokens per 256-thread CTA); a minimal sketch of this layout follows this list.
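
Putting the highlights together, a rough sketch of the warp-per-token layout is shown below. This is illustrative only: the kernel name, the use of raw pointers instead of cute tensors for global-memory access, the float input type, and the e4m3 range constant are assumptions, not the exact code in this PR.

```cuda
#include <cuda_fp8.h>
#include <cuda_runtime.h>

// Same butterfly warp reduction as the earlier sketch.
__device__ __forceinline__ float warpReduceMax(float val) {
  for (int offset = 16; offset > 0; offset >>= 1) {
    val = fmaxf(val, __shfl_xor_sync(0xffffffffu, val, offset));
  }
  return val;
}

// Illustrative warp-per-token layout: 256 threads per CTA = 8 warps,
// one token per warp.
__global__ void per_token_quant_fp8_sketch(const float* __restrict__ input,
                                           __nv_fp8_e4m3* __restrict__ output,
                                           float* __restrict__ scales,
                                           int hidden_dim, int num_tokens) {
  constexpr float kFp8Max = 448.0f;  // max finite value of e4m3
  const int warp_id = threadIdx.x / 32;
  const int lane_id = threadIdx.x % 32;
  const int token = blockIdx.x * (blockDim.x / 32) + warp_id;  // one token per warp
  if (token >= num_tokens) return;

  const float* row = input + (size_t)token * hidden_dim;

  // Each lane scans a strided slice of the token, then the warp reduces.
  float local_max = 0.0f;
  for (int i = lane_id; i < hidden_dim; i += 32) {
    local_max = fmaxf(local_max, fabsf(row[i]));
  }
  const float row_max = warpReduceMax(local_max);

  // The scale stays in registers per warp; no cross-warp shared memory is needed.
  const float scale = row_max / kFp8Max;
  const float inv_scale = (scale > 0.0f) ? 1.0f / scale : 0.0f;
  if (lane_id == 0) scales[token] = scale;

  for (int i = lane_id; i < hidden_dim; i += 32) {
    output[(size_t)token * hidden_dim + i] = __nv_fp8_e4m3(row[i] * inv_scale);
  }
}
```

Launching with blockDim = 256 and gridDim = ceil(num_tokens / 8) assigns eight tokens to each CTA, matching the configuration described above.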


@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the per_token_quant_fp8 CUDA kernel to improve performance by processing multiple tokens per thread block, with each token handled by a single warp. This is a solid optimization strategy. The review identified a critical race condition in the use of shared memory for the scaling factor, which would lead to incorrect quantization results. A medium-severity issue with an unnecessary const_cast was also found.
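To make the race-condition finding concrete: if the per-token scale is staged through shared memory in a warp-per-token layout, each of the CTA's eight warps needs its own slot (or the scale can simply stay in registers, as in the sketch above, since only the owning warp consumes it). The kernel below illustrates the per-warp-slot pattern only; it is not the actual fix adopted in this PR, and the names and 448.0f constant are assumptions.

```cuda
#include <cuda_runtime.h>

// Illustrative pattern only: per-warp shared-memory slots for the scale.
// A single shared scalar would be raced on by all 8 warps of the CTA.
__global__ void scale_broadcast_sketch(const float* __restrict__ row_max,
                                       float* __restrict__ scales,
                                       int num_tokens) {
  __shared__ float s_scale[8];                    // one slot per warp (256 / 32)
  const int warp_id = threadIdx.x / 32;
  const int lane_id = threadIdx.x % 32;
  const int token = blockIdx.x * 8 + warp_id;
  if (token >= num_tokens) return;

  if (lane_id == 0) {
    s_scale[warp_id] = row_max[token] / 448.0f;   // e4m3 max, as in the sketch above
  }
  __syncwarp();                                   // only this warp needs the value
  const float scale = s_scale[warp_id];           // every lane can now read it,
                                                  // e.g. for the quantization loop
  if (lane_id == 0) {
    scales[token] = scale;
  }
}
```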
