feat(cuda): Optimize MoE scatter_output with parallel CUDA schedule #3398

Chandan-Sugreevu · 2025-12-12T01:08:57Z

Optimizes the scatter_output operator used in Mixture-of-Experts (MoE) inference by replacing the default serial execution with a custom TVM schedule.

The new schedule (_schedule_scatter_output) fuses the token and hidden-dimension loops into a single loop, then splits it and binds the resulting loops to CUDA blocks and threads (1024 threads per block). Each output element is therefore written by its own GPU thread instead of by a sequential T.serial loop, which is intended to improve memory bandwidth utilization over the previous implementation.
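To make the index mapping concrete, here is a minimal pure-Python sketch of what the fuse, split, and bind steps amount to. It does not use TVM and is not the code from this PR. It assumes that scatter_output has the semantics out[indices[i], j] = x[i, j] and that indices is a permutation of the token positions. The function names scatter_serial and scatter_fused are hypothetical; only the 1024 threads per block figure comes from the description above.

```python
THREADS_PER_BLOCK = 1024  # block size stated in the PR description


def scatter_serial(x, indices, hidden):
    """Reference: nested serial loops over tokens and the hidden dimension.

    Assumed semantics (not confirmed by the PR text): out[indices[i]][j] = x[i][j].
    """
    out = [[0.0] * hidden for _ in range(len(indices))]
    for i in range(len(indices)):
        for j in range(hidden):
            out[indices[i]][j] = x[i][j]
    return out


def scatter_fused(x, indices, hidden, threads=THREADS_PER_BLOCK):
    """Emulates the fused schedule: one flat index per (token, hidden) pair,
    split into (block_idx, thread_idx) the way a CUDA launch would be."""
    num_tokens = len(indices)
    total = num_tokens * hidden
    num_blocks = (total + threads - 1) // threads  # ceiling division
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for block_idx in range(num_blocks):        # would map to blockIdx.x
        for thread_idx in range(threads):      # would map to threadIdx.x
            fused = block_idx * threads + thread_idx
            if fused >= total:                 # guard for the partial last block
                continue
            i, j = divmod(fused, hidden)       # recover (token, hidden) indices
            out[indices[i]][j] = x[i][j]
    return out, num_blocks


if __name__ == "__main__":
    hidden = 700
    x = [[float(t * hidden + h) for h in range(hidden)] for t in range(3)]
    indices = [2, 0, 1]
    fused_out, blocks = scatter_fused(x, indices, hidden)
    print(blocks, fused_out == scatter_serial(x, indices, hidden))
```

Under the permutation assumption, every output element has exactly one writer, so the iterations are independent and need no atomics or synchronization. Consecutive fused indices also map to consecutive hidden positions within a row, which is the access pattern that tends to coalesce well on GPUs.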

feat(cuda): Optimize MoE scatter_output with parallel CUDA schedule

bb087f4
