@pwilkin (Collaborator) commented Nov 29, 2025

This massively reduces the number of graph splits for Qwen3 Next by placing the initial gate tensor on the backend. Otherwise the tensor is put on the CPU, which recursively poisons all other layers and leads to splits.
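To illustrate why one CPU-placed tensor can multiply splits, here is a toy model in Python. It is only a sketch and not ggml's actual scheduler: the node kinds (`"weight"`, `"state"`), the `schedule` and `count_splits` helpers, and the layer count are all hypothetical. It assumes that ops touching weights are pinned to the GPU, that ops depending only on the gate tensor inherit its backend, and that a split is a maximal run of consecutive nodes on the same backend.

```python
# Toy model of graph splits; NOT the real ggml scheduler.
# Assumption: "weight" ops are pinned to the GPU, while "state" ops
# inherit whichever backend the initial gate tensor was placed on.

def schedule(node_kinds: list[str], gate_backend: str) -> list[str]:
    """Assign a backend to every node, in execution order."""
    return ["GPU" if kind == "weight" else gate_backend for kind in node_kinds]


def count_splits(backends: list[str]) -> int:
    """Count maximal runs of consecutive nodes that share a backend."""
    splits = 0
    prev = None
    for backend in backends:
        if backend != prev:
            splits += 1
            prev = backend
    return splits


# Hypothetical graph: 4 layers, each with one weight op and one state op.
layers = 4
node_kinds = ["weight", "state"] * layers

cpu_splits = count_splits(schedule(node_kinds, "CPU"))  # gate tensor on CPU
gpu_splits = count_splits(schedule(node_kinds, "GPU"))  # gate tensor on backend

print(f"gate on CPU: {cpu_splits} splits")
print(f"gate on GPU: {gpu_splits} splits")
```

In this toy setup the CPU placement alternates backends on every node, so the split count grows with the number of layers, while the backend placement keeps the whole graph in a single split.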

@pwilkin pwilkin requested a review from CISC as a code owner November 29, 2025 01:16
@pwilkin (Collaborator, Author) commented Nov 29, 2025

On the test server, this improves pp512 throughput from 900 t/s to 1300 t/s.
