[Kernel] Enable fp8 support for pplx and BatchedTritonExperts. #18864

bnellnm · 2025-05-28T23:40:02Z

Enable full fp8 support for pplx and BatchedTritonExperts.

- Replace the world_size/dp_size arguments to the PrepareAndFinalize and Experts constructors with num_dispatchers.
- Reduce duplicated setup information: take parameters from the FusedMoEConfig where possible rather than from all2all_manager or ad hoc variables.
- Rewrite the pplx tests so that they run in a loop on the spawned process rather than spawning a process for each test point. The original slow test points can still be run with the --optional pytest flag.
- Add more quantization tests to cover all the combinations of per-token, per-tensor and blocked.
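For readers less familiar with the three quantization granularities named above, here is a minimal sketch of how the scales differ in shape. It is not the code from this PR: the function names are hypothetical, it uses plain Python lists instead of tensors, it only computes scales (no cast to fp8), and "blocked" is modelled as one scale per group of columns within each row, which is an assumption about the layout.

```python
# Illustrative only: hypothetical helpers, not vLLM APIs.
FP8_E4M3_MAX = 448.0  # largest finite value representable in float8 e4m3fn


def amax_scale(values):
    """Scale that maps the largest magnitude in `values` onto the fp8 range."""
    amax = max((abs(v) for v in values), default=0.0)
    return max(amax, 1e-12) / FP8_E4M3_MAX  # clamp to avoid a zero scale


def per_tensor_scale(x):
    """One scale shared by the whole [tokens, hidden] matrix."""
    return amax_scale([v for row in x for v in row])


def per_token_scales(x):
    """One scale per row, i.e. per token."""
    return [amax_scale(row) for row in x]


def blocked_scales(x, block):
    """One scale per contiguous group of `block` columns in each row."""
    return [
        [amax_scale(row[i:i + block]) for i in range(0, len(row), block)]
        for row in x
    ]


x = [
    [1.0, -448.0, 2.0, 4.0],
    [0.5, 0.25, -896.0, 1.0],
]
print(per_tensor_scale(x))   # 2.0
print(per_token_scales(x))   # [1.0, 2.0]
print(blocked_scales(x, 2))  # 2 rows x 2 column groups
```

The finer the granularity, the more scales have to travel with the quantized activations through dispatch and combine, which is why each combination is worth a separate test point.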

I've verified that all the combinations listed in this spreadsheet work properly with DP=2/TP=1, DP=2/TP=2 and DP=4/TP=1: dispatch_combine fp8 support matrix by branch + model.xlsx

lm-eval results for RedHatAI/Llama-4-Scout-17B-16E-Instruct-FP8-dynamic with pplx, DP=4, TP=1.

|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value|   |Stderr|
|-----|------:|----------------|-----:|-----------|---|----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  | 0.86|±  |0.0349|
|     |       |strict-match    |     5|exact_match|↑  | 0.81|±  |0.0394|

cc @ElizaWszola

github-actions · 2025-05-28T23:40:10Z

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

mergify · 2025-05-28T23:40:37Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @bnellnm.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

varun-sundar-rabindranath · 2025-07-02T21:20:38Z

LGTM! Really nice cleanups @bnellnm 🙌

bnellnm · 2025-07-02T21:21:44Z

> LGTM! Really nice cleanups @bnellnm 🙌

Thanks!

mergify bot added the v1 label May 28, 2025

mergify bot added the needs-rebase label May 28, 2025

bnellnm force-pushed the batch-fp8 branch from fa64b5a to d86e3f0 Compare May 28, 2025 23:41

mergify bot removed the needs-rebase label May 28, 2025

tlrmchlsmth mentioned this pull request Jun 2, 2025

[Bugfix][EP+DP] Use pplx-kernel internode instead of intranode #19034

Merged

varun-sundar-rabindranath reviewed Jun 2, 2025

mergify bot added the needs-rebase label Jun 3, 2025

bnellnm force-pushed the batch-fp8 branch from 20881d7 to 680de26 Compare June 13, 2025 02:25

mergify bot removed the needs-rebase label Jun 13, 2025

mergify bot added the needs-rebase and qwen labels Jun 13, 2025

bnellnm force-pushed the batch-fp8 branch 2 times, most recently from 911339b to f92734e Compare June 24, 2025 21:09

bnellnm mentioned this pull request Jun 25, 2025

[Kernels] MoE refactor #19636

Merged

bnellnm force-pushed the batch-fp8 branch from 3d226f5 to 347cda2 Compare June 26, 2025 21:20

bnellnm marked this pull request as ready for review June 26, 2025 21:20

bnellnm requested review from tlrmchlsmth, WoosukKwon, mgoin and robertgshaw2-redhat as code owners June 26, 2025 21:20

mergify bot removed the needs-rebase label Jun 26, 2025

bnellnm changed the title ~~[Kernel] Fix fp8 support for pplx and BatchedTritonExperts.~~ [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. Jun 26, 2025

mergify bot added the needs-rebase label Jun 26, 2025

bnellnm force-pushed the batch-fp8 branch from 347cda2 to 7219559 Compare June 27, 2025 03:13

mergify bot removed the needs-rebase label Jun 27, 2025

bnellnm added 6 commits July 2, 2025 13:23

fixup world_size/dp_size params

70fa1dd

Signed-off-by: Bill Nell <[email protected]>

fix tests

9c56206

Signed-off-by: Bill Nell <[email protected]>

more test fixes

653942f

Signed-off-by: Bill Nell <[email protected]>

fix merge

d2dd405

Signed-off-by: Bill Nell <[email protected]>

trim testcases

ae91a5e

Signed-off-by: Bill Nell <[email protected]>

fix lint

76c697a

Signed-off-by: Bill Nell <[email protected]>

bnellnm force-pushed the batch-fp8 branch from a3c1533 to 76c697a Compare July 2, 2025 13:24

bnellnm added 3 commits July 2, 2025 16:04

ping

a5c8e85

Signed-off-by: Bill Nell <[email protected]>

fix num_dispatchers for TP+DP

285b2bc

Signed-off-by: Bill Nell <[email protected]>

fix unit test

286d988

Signed-off-by: Bill Nell <[email protected]>