Update: Transformers v5, which supports the fused MoE kernels, will be released soon. We will implement the LoRA support in PEFT. This repo only supports Transformers v4.
The Qwen3 MoE model (and every other MoE model) in HF Transformers is notoriously slow because it uses a for loop to iterate over the experts. The purpose of this repo is to fine-tune Qwen3-30B-A3B on a single GPU with 24GB VRAM at high throughput. The implementation is compatible with the HF Transformers ecosystem, including LoRA, bitsandbytes 4-bit quantization, and Unsloth. See example_train_30b_a3b_unsloth.py for usage.
The critical part is to implement the moe_fused_linear function:
output[b, o] = sum_i weight[selected_experts[b], o, i] * input[b, i]
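For reference, a straightforward (and slow) PyTorch equivalent of this formula looks like the following sketch; the tensor shapes are assumptions based on the formula above, and the actual kernel in this repo is a Triton grouped GEMM rather than this gather-plus-einsum.

```python
import torch

def moe_fused_linear_reference(
    input: torch.Tensor,             # (batch, in_features)
    weight: torch.Tensor,            # (num_experts, out_features, in_features)
    selected_experts: torch.Tensor,  # (batch,) int64, expert index per row
) -> torch.Tensor:                   # (batch, out_features)
    # Gather each row's expert weight, then apply it:
    # output[b] = weight[selected_experts[b]] @ input[b]
    w = weight[selected_experts]                 # (batch, out_features, in_features)
    return torch.einsum("boi,bi->bo", w, input)
```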
There are already several good implementations, such as triton-kernels, llama.cpp, vLLM, and fanshiqing/grouped_gemm; torch._grouped_mm is also being implemented. We need to sort input by expert index to improve the memory coalescing of the accesses to weight. More optimizations are explained in https://pytorch.org/blog/accelerating-moes-with-a-triton-persistent-cache-aware-grouped-gemm-kernel/
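As a rough illustration of the sorting step (a sketch, not this repo's actual code), tokens routed to the same expert are made contiguous so that each expert's weight block is read once per group:

```python
import torch

def sort_tokens_by_expert(input, selected_experts, num_experts):
    # Reorder rows so tokens routed to the same expert are contiguous in memory.
    order = torch.argsort(selected_experts)                    # (batch,)
    sorted_input = input[order]                                # (batch, in_features)
    sorted_experts = selected_experts[order]                   # (batch,)
    # Per-expert row counts give the group boundaries for the grouped GEMM.
    counts = torch.bincount(sorted_experts, minlength=num_experts)
    return sorted_input, sorted_experts, counts, order

# After the grouped GEMM over the sorted rows, scatter back to the original order:
#   output = torch.empty_like(sorted_output); output[order] = sorted_output
```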
The implementation in this repo is largely based on the MoE kernel in Unsloth, which is based on the Triton grouped GEMM. I've added strides, masks, and autotune configs for small or 'thin' matrices, which are needed for LoRA.
I aim to keep the code readable and easy to follow, so I only used the most mature Triton features, such as load and store, rather than things like TMA and swizzling. I've benchmarked it on an RTX 3080, and it's close to the theoretical fp16 and bf16 performance.
The LoRA for the fused linear layer is defined by first creating a LoRA for the linear layer of each expert, then stacking them along the expert dimension. For a weight tensor with shape (num_experts, out_features, in_features), the two LoRA weights have shapes lora_A: (num_experts, lora_rank, in_features) and lora_B: (num_experts, out_features, lora_rank). Therefore, we can losslessly convert between the fused and unfused formats, and a previously trained LoRA can continue to be trained.
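As a sketch of how the stacked LoRA is applied per token (hypothetical function name; the scaling factor follows the usual lora_alpha / lora_rank convention):

```python
import torch

def moe_fused_lora_delta(
    input: torch.Tensor,             # (batch, in_features)
    lora_A: torch.Tensor,            # (num_experts, lora_rank, in_features)
    lora_B: torch.Tensor,            # (num_experts, out_features, lora_rank)
    selected_experts: torch.Tensor,  # (batch,) int64, expert index per row
    scaling: float,                  # lora_alpha / lora_rank
) -> torch.Tensor:                   # (batch, out_features), added to the base output
    # Per-row LoRA update using that row's expert: scaling * B[e] @ (A[e] @ x)
    a = lora_A[selected_experts]                      # (batch, lora_rank, in_features)
    b = lora_B[selected_experts]                      # (batch, out_features, lora_rank)
    h = torch.einsum("bri,bi->br", a, input)          # (batch, lora_rank)
    return scaling * torch.einsum("bor,br->bo", b, h)
```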
The functions in qwen3_moe_fused/convert.py can convert a model or a LoRA between the fused and unfused formats. After you train a LoRA in the fused format, you can convert it to the unfused format and then to other formats such as GGUF. llama.cpp already supports this kind of LoRA; support in vLLM is being implemented, see vllm-project/vllm#21229
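Conceptually, the conversion is just stacking and unstacking the per-expert weights along the expert dimension. The sketch below uses hypothetical helper names to illustrate the relationship; the actual functions live in qwen3_moe_fused/convert.py and may differ in naming and details.

```python
import torch

def fuse_expert_weights(expert_weights: list[torch.Tensor]) -> torch.Tensor:
    # num_experts tensors of shape (out_features, in_features)
    # -> one tensor of shape (num_experts, out_features, in_features)
    return torch.stack(expert_weights, dim=0)

def unfuse_expert_weights(fused_weight: torch.Tensor) -> list[torch.Tensor]:
    # (num_experts, out_features, in_features) -> per-expert (out_features, in_features)
    return list(fused_weight.unbind(dim=0))

# The same stacking applies to lora_A and lora_B, which is why the conversion
# between the fused and unfused LoRA formats is lossless.
```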
- Convert GGUF Q4 with Unsloth Dynamic (UD) quantization (such as https://huggingface.co/unsloth/Qwen3-30B-A3B-Instruct-2507-GGUF ) to the fused bnb format, and see if it has higher intelligence than the regular bnb 4-bit.
- This should work with Qwen3-Next with minimal modification. I haven't tried this yet, but feel free to ask if you need it.
- Multi-GPU support. I don't have multiple GPUs at home, so I'm not focusing on this. It works with HF Accelerate, see #1 (comment) . It should be straightforward to add data parallelism and model/pipeline parallelism. If you use Unsloth, you can follow https://docs.unsloth.ai/basics/multi-gpu-training-with-unsloth . Feel free to ask if you run into any errors.
- Fuse the 4-bit dequantization and the MoE linear, see qwen3_moe_fused/quantize/layer.py . Currently I've written a kernel in qwen3_moe_fused/grouped_gemm/forward_4bit.py , but it's slower than the unfused version when the batch size is large.
The files in qwen3_moe_fused/grouped_gemm/ are modified from the Unsloth MoE kernels, so they are AGPLv3 licensed; see the explanation. For a more robust and performant integration, it's possible to use the MIT-licensed triton-kernels as an alternative.
The rest of this repo, including the files modified from Transformers, PEFT, and bitsandbytes, is Apache-2.0 licensed.