Conversation

@Chandan-Sugreevu
Optimizes the scatter_output operator used in Mixture-of-Experts (MoE) inference by replacing the default serial execution with a custom TVM schedule.

The new schedule (_schedule_scatter_output) fuses the token and hidden-dimension loops and explicitly binds the fused loop to CUDA blocks and threads (1024 threads per block). This lets the scatter operation run in parallel on the GPU, significantly improving memory-bandwidth utilization compared to the previous T.serial implementation.
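Since the PR diff isn't shown here, the following is a minimal plain-Python sketch of the index mapping such a schedule produces: the (token, hidden) loops are fused into one flat iteration space, tiled into blocks of 1024, and each "thread" handles one fused index. The function names, the `token_indices` routing array, and the serial reference are illustrative assumptions, not the PR's actual code.

```python
# Threads per CUDA block assumed by the schedule described in the PR.
THREADS_PER_BLOCK = 1024

def scatter_output_serial(expert_out, token_indices, num_tokens, hidden):
    """Serial reference (analogous to the old T.serial version):
    out[token_indices[t], h] = expert_out[t, h]."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    for t in range(num_tokens):
        for h in range(hidden):
            out[token_indices[t]][h] = expert_out[t][h]
    return out

def scatter_output_fused(expert_out, token_indices, num_tokens, hidden):
    """Fused version: one flat loop over num_tokens * hidden, split into
    blocks of THREADS_PER_BLOCK, mirroring blockIdx.x / threadIdx.x."""
    out = [[0.0] * hidden for _ in range(num_tokens)]
    total = num_tokens * hidden
    num_blocks = (total + THREADS_PER_BLOCK - 1) // THREADS_PER_BLOCK
    for block in range(num_blocks):              # would be blockIdx.x
        for thread in range(THREADS_PER_BLOCK):  # would be threadIdx.x
            i = block * THREADS_PER_BLOCK + thread
            if i < total:                        # guard the ragged last block
                t, h = divmod(i, hidden)         # un-fuse the flat index
                out[token_indices[t]][h] = expert_out[t][h]
    return out
```

On a GPU, each fused index becomes an independent thread doing one coalesced load and one store, which is why binding the fused loop to blocks/threads recovers memory bandwidth that a serial loop leaves on the table.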
