[Kernel][MoE] optimize moe_align_block_size
#29642
base: main
Conversation
Signed-off-by: Jinzhen Lin <[email protected]>
Code Review
This pull request introduces several well-implemented optimizations to moe_align_block_size, resulting in significant performance gains as demonstrated by the benchmark results. The optimizations include using a tighter memory allocation for small batches, parallelizing data initialization within the CUDA kernels, and filtering invalid experts earlier in the expert parallelism path. Additionally, this PR includes a critical correctness fix for expert parallelism mode by ensuring an intermediate buffer is zero-initialized, preventing potential errors from uninitialized memory. The changes are clean, well-reasoned, and thoroughly tested. Overall, this is an excellent contribution that improves both performance and correctness.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Introduction
This PR optimizes moe_align_block_size in three ways:

1. For small batch sizes, use the smallest possible value for max_num_tokens_padded.
2. In the CUDA kernel, use additional thread or threadblock resources to fill sorted_token_ids. The previous CUDA kernel used very few computational resources, and the filling of sorted_token_ids and the counting of experts ran sequentially. Since sorted_token_ids is only read again at the very end of the kernel, this filling is now parallelized to speed up kernel execution.
3. For EP, all invalid experts are filtered out directly when counting the number of tokens per expert. This accelerates moe_align_block_size and also leads to a cleaner and faster implementation for the subsequent MoE kernel.

Kernel Bench
On RTX 4090
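As context for the first optimization, the tighter max_num_tokens_padded bound can be sketched in Python. This is a minimal sketch under my own assumptions (the function name and exact formula are illustrative, not vLLM's actual code):

```python
# Hedged sketch (not vLLM's actual code): an upper bound on the padded
# length of sorted_token_ids. Each expert's token count is rounded up to
# a multiple of block_size, so each expert contributes at most
# (block_size - 1) padding slots -- but at most `numel` experts can be
# non-empty, which tightens the bound for small batches.
def max_num_tokens_padded(numel: int, num_experts: int, block_size: int) -> int:
    nonempty_experts = min(num_experts, numel)
    return numel + nonempty_experts * (block_size - 1)

# Example: 4 tokens * topk=2 = 8 routed pairs, 128 experts, block_size=16.
# Loose bound: 8 + 128*15 = 1928 slots; tighter bound: 8 + 8*15 = 128 slots.
```

For large batches `min(num_experts, numel)` equals `num_experts` and the bound matches the loose one, so the saving only applies where it matters: small batches.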
Kernel Accuracy Test
Tested with
All test cases passed.
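For illustration, the semantics such an accuracy test checks can be written as a pure-Python reference. This is a hedged sketch with assumed names and a simplified output layout, not the actual test code:

```python
# Hedged pure-Python reference for moe_align_block_size semantics
# (a sketch for checking the CUDA kernel, not vLLM's test code).
# Tokens routed to each expert are grouped, each group is padded up to a
# multiple of block_size with a sentinel (here `numel`), and expert_ids
# records which expert owns each block of sorted_token_ids.
def ref_moe_align_block_size(topk_ids, num_experts, block_size):
    numel = len(topk_ids)
    pad = numel  # sentinel value for padding slots
    buckets = [[] for _ in range(num_experts)]
    for tok, e in enumerate(topk_ids):
        buckets[e].append(tok)
    sorted_token_ids, expert_ids = [], []
    for e, toks in enumerate(buckets):
        if not toks:
            continue
        padded = -(-len(toks) // block_size) * block_size  # ceil to block
        sorted_token_ids += toks + [pad] * (padded - len(toks))
        expert_ids += [e] * (padded // block_size)
    return sorted_token_ids, expert_ids, len(sorted_token_ids)
```

A CUDA implementation can then be compared element-wise against this reference over randomized `topk_ids`.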
E2E Accuracy Test
GSM8K (2-shot)
With Triton Kernel
With Marlin Kernel
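The EP filtering idea from the introduction can be illustrated with a small Python sketch. Names here are assumptions for illustration; in the PR the filtering happens inside the CUDA kernel while counting:

```python
# Hedged sketch of the EP optimization: while counting tokens per expert,
# drop entries routed to experts this rank does not own (e.g. remote or
# masked-out experts), so later MoE kernels never see invalid slots.
def count_local_experts(topk_ids, num_local_experts):
    counts = [0] * num_local_experts
    for e in topk_ids:
        if 0 <= e < num_local_experts:  # invalid/remote experts are skipped
            counts[e] += 1
    return counts
```

Filtering at counting time means no invalid entries ever reach sorted_token_ids, which is what makes the downstream MoE kernel both simpler and faster.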