🚀 Best Practices for Training Qwen3/Qwen3-MoE #4030
Comments
Model Inference: Thinking Mode / Non-Thinking Mode
Model Quantization:
Qwen3-32B-AWQ: https://modelscope.cn/models/swift/Qwen3-32B-AWQ
Qwen3-30B-A3B-AWQ: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ
Which vLLM version should be used?
vllm==0.8.5
Converting HF-format weights to Megatron format failed. Errors:
It is still only on the main branch for now; version ms-swift==3.4.0 will be released tonight.
Please add a notebook for self-cognition fine-tuning of Qwen3-8B. In the PAI-DSW environment provided by ModelScope, I am using “
Could a full-parameter fine-tuning script be added?
You can refer to the example here and modify it: https://github.com/modelscope/ms-swift/blob/main/examples/train/full/qwen2_5_32b.sh
A self-cognition fine-tuning demo has been added.
If I currently have data without a reasoning process, but I want to use this data to fine-tune Qwen3, should I simply add /no_think after the prompt and prefix the response with
Perhaps you can refer to this for a solution: ms-swift/swift/llm/dataset/dataset/llm.py, line 835 (commit 51cafe5)
How can the fine-tuned model be exported to GGUF format?
@Jintao-Huang If reasoning (thinking mode) is not used, is it still possible to fine-tune the model with the Qwen2.5 template?
When using --packing true, please additionally use --attn_impl flash_attn. This was missed in the best practices. |
Chinese-version notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb
English Version
We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large-model training framework supported CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE from day one. It also supports a Megatron training (CPT/SFT) implementation for Qwen3/Qwen3-MoE, which is 10 times faster than training MoE models with transformers.
We will showcase a runnable fine-tuning demo and provide the format for custom datasets.
Before starting the fine-tuning process, please ensure that your environment is properly set up.
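As a minimal sketch of the environment setup (the exact package versions are assumptions, apart from `vllm==0.8.5`, which is the version recommended in the comments above):

```shell
# Minimal assumed setup; pin versions according to the Megatron-SWIFT / GRPO docs as needed.
pip install ms-swift -U
pip install transformers -U
# Only needed for vLLM-accelerated inference and GRPO sampling.
pip install vllm==0.8.5
```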
Qwen3-8B SFT
The script for training Qwen3-8B is as follows, which can be run on the free A10 computing resources provided by ModelScope: https://modelscope.cn/my/mynotebook
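A representative LoRA SFT command for this setting might look like the sketch below; the dataset path and hyperparameter values are illustrative assumptions, not the exact script from this post:

```shell
# Hypothetical single-GPU LoRA SFT run for Qwen3-8B; all values are illustrative.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset_path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 16 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --max_length 2048 \
    --output_dir output
```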
The format for a custom dataset is as follows (the `system` field is optional); simply specify `--dataset <dataset_path>`. For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html
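As a sketch, a JSONL file in ms-swift's `messages` format could look like the following (the file name and sample contents are illustrative):

```shell
# Write a tiny illustrative dataset; the system turn in the first sample is optional.
cat > train.jsonl <<'EOF'
{"messages": [{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "1 + 1 equals 2."}]}
{"messages": [{"role": "user", "content": "Introduce yourself."}, {"role": "assistant", "content": "I am an assistant fine-tuned with ms-swift."}]}
EOF
# Then train with: --dataset train.jsonl
```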
10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)
ref: ms-swift/swift/llm/dataset/dataset/llm.py, line 835 (commit 51cafe5)
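A sketch of the self-cognition fine-tuning run, assuming ms-swift's built-in `swift/self-cognition` dataset and its `--model_author`/`--model_name` placeholders (the values shown are illustrative):

```shell
# Hypothetical self-cognition LoRA run (~22GB GPU memory); author/name values are placeholders.
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/self-cognition#500' \
    --model_author swift \
    --model_name swift-robot \
    --num_train_epochs 1 \
    --learning_rate 1e-4 \
    --output_dir output
```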
Inference and test the fine-tuning results:
```shell
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048
```
Qwen3-8B GRPO
Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html
The AI-MO/NuminaMath-TIR dataset is used, and the accuracy function is employed to compute the model’s response accuracy reward. The following environment needs to be installed to calculate rewards:
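The accuracy reward depends on a math answer verification library; assuming the dependency named in the GRPO documentation, the installation is roughly:

```shell
# Assumed dependency for computing the accuracy reward on math answers.
pip install math_verify
```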
The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a solution column is required to compute the accuracy.
You can also train with custom reward functions or reward models. Columns in the dataset will be passed into the `**kwargs` of the reward function. An example of a custom reward function can be found here: swift/examples/train/grpo/plugin/plugin.py
During training, we use vLLM to accelerate the sampling process. With `num_infer_workers=8`, one vLLM engine is deployed on each device to speed up sampling.
The training script is as follows:
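A sketch of such a GRPO run with the built-in accuracy reward and vLLM-colocated sampling; the flag values below are illustrative assumptions rather than the exact script:

```shell
# Hypothetical 8-GPU GRPO run; hyperparameters are illustrative.
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --dataset AI-MO/NuminaMath-TIR \
    --reward_funcs accuracy \
    --use_vllm true \
    --num_infer_workers 8 \
    --vllm_gpu_memory_utilization 0.5 \
    --num_generations 8 \
    --max_completion_length 2048 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-6 \
    --output_dir output
```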
Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)
ms-swift introduces Megatron's parallelism techniques to accelerate large-model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the DeepSeek-R1 distillation series.
For environment preparation (image) and the conversion between HF and MCore model weights, please refer to the Megatron-SWIFT training documentation; it is not covered here: https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
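For orientation, the HF-to-MCore weight conversion described in that documentation looks roughly like the sketch below (flag names follow the Megatron-SWIFT docs; the output directory name is an assumption):

```shell
# Convert HF weights to MCore format before Megatron training (sketch; see the docs above).
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-mcore
```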
We use DLC to initiate the training command. The training environment consists of 2 machines with 8 * 80GiB A800:
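A sketch of what a two-node `megatron sft` launch could look like; the parallelism layout, batch sizes, and paths are assumptions, not the settings of the original run (DLC normally injects the node rank and master address automatically):

```shell
# Hypothetical launch on node 0 of 2; run the same command with NODE_RANK=1 on the second machine.
NNODES=2 \
NODE_RANK=0 \
MASTER_ADDR=<master_ip> \
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen3-30B-A3B-mcore \
    --dataset '<dataset_path>' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --lr 1e-5 \
    --save megatron_output/Qwen3-30B-A3B
```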
More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node
Training loss (partial):
The custom dataset format is the same as for `swift sft` and can be found above; specify `--dataset <dataset_path>`.
Below is a comparison of full-parameter training speed and GPU memory usage for the Qwen3-30B-A3B model using `megatron sft` versus `swift sft`: