🚀 Best Practices for Training Qwen3/Qwen3-MoE #4030

Jintao-Huang opened this issue Apr 28, 2025 · 14 comments
Labels: good first issue

Jintao-Huang (Collaborator) commented Apr 28, 2025

Chinese notebook: https://modelscope.cn/notebook/share/ipynb/d4d8765f/qwen3.ipynb

We are thrilled about the open-source release of Qwen3 and Qwen3-MoE. The ms-swift large model training framework supports CPT/SFT/DPO/GRPO for Qwen3/Qwen3-MoE from day one. It also provides a Megatron training (CPT/SFT) implementation for Qwen3/Qwen3-MoE, which is 10 times faster than training MoE models with transformers.

We will showcase a runnable fine-tuning demo and provide the format for custom datasets.

Before starting the fine-tuning process, please ensure that your environment is properly set up.

# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

pip install liger-kernel transformers -U

Qwen3-8B SFT

The script for training Qwen3-8B is as follows; it can be run on the free A10 compute provided by ModelScope: https://modelscope.cn/my/mynotebook

# Training GPU memory: 22GB
# You can specify `--dataset AI-ModelScope/alpaca-gpt4-data-zh` to run the experiment
CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 4 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --packing true \
    --use_liger_kernel true \
    --attn_impl flash_attn

The format for a custom dataset is as follows (the system field is optional). Simply specify --dataset <dataset_path>:

For more information, refer to the custom dataset documentation: https://swift.readthedocs.io/en/latest/Customization/Custom-dataset.html

{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang?"}, {"role": "assistant", "content": "<think>\nxxx\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}
{"messages": [{"role": "user", "content": "Where is the capital of Zhejiang? /no_think"}, {"role": "assistant", "content": "<think>\n\n</think>\n\nThe capital of Zhejiang is Hangzhou."}]}

10-Minute Quick Self-Cognition Fine-Tuning Demo (GPU Memory Usage: 22GB)

The self-cognition dataset is switched to the non-thinking style by appending ' /no_think' to each query (see swift/llm/dataset/dataset/llm.py):

row['query'] = row['query'] + ' /no_think'

CUDA_VISIBLE_DEVICES=0 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type lora \
    --dataset 'swift/Qwen3-SFT-Mixin#2000' \
              'swift/self-cognition:qwen3#600' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 1e-4 \
    --lora_rank 8 \
    --lora_alpha 32 \
    --target_modules all-linear \
    --gradient_accumulation_steps 16 \
    --eval_steps 50 \
    --save_steps 50 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --max_length 2048 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --use_liger_kernel true \
    --model_author swift \
    --model_name swift-robot

Run inference to test the fine-tuning results:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters output/vx-xxx/checkpoint-xxx \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048

Qwen3-8B GRPO

Taking Qwen3-8B as an example, the following uses the ms-swift framework to conduct GRPO training. For more details about GRPO, refer to the GRPO documentation: https://swift.readthedocs.io/en/latest/Instruction/GRPO.html

The AI-MO/NuminaMath-TIR dataset is used, and the accuracy reward function scores the correctness of the model's responses. Install the following package to compute rewards:

pip install math_verify==0.5.2
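
Conceptually, the accuracy reward checks each completion against the dataset's solution column for mathematical equivalence. A minimal sketch using math_verify's public parse/verify API (the values here are illustrative):

from math_verify import parse, verify

gold = parse('$\\frac{1}{2}$')
answer = parse('0.5')
print(verify(gold, answer))  # True: the two answers are mathematically equivalent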

The custom dataset format is similar to SFT, where the assistant part is optional. If using the accuracy reward, a solution column is required to compute the accuracy.

{"messages": [{"role": "system", "content": "You are a useful and harmless assistant"}, {"role": "user", "content": "Tell me tomorrow's weather"}]}
{"messages": [{"role": "system", "content": "You are a useful and harmless math calculator"}, {"role": "user", "content": "What is 1 + 1?"}, {"role": "assistant", "content": "It equals 2"}, {"role": "user", "content": "What about adding 1?"}]}
{"messages": [{"role": "user", "content": "What is your name?"}]}

You can also train with custom reward functions or reward models. Dataset columns are passed into the reward function's **kwargs. An example custom reward function can be found at swift/examples/train/grpo/plugin/plugin.py and is enabled with flags like:

    --external_plugins examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_math_acc external_math_format \
    --reward_model AI-ModelScope/Skywork-Reward-Llama-3.1-8B-v0.2
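
For reference, here is a minimal custom reward function in the style of that plugin file. The ORM base class and the orms registry are imported from swift.plugin as in the example; the reward logic itself is a toy sketch, not the repository's actual reward:

from typing import List
from swift.plugin import ORM, orms

class DummyLengthReward(ORM):
    # Toy reward: prefer completions shorter than 1024 characters.
    # Extra dataset columns (e.g. `solution`) arrive via **kwargs.
    def __call__(self, completions: List[str], **kwargs) -> List[float]:
        return [1.0 if len(c) < 1024 else 0.0 for c in completions]

# Register under a name that can then be passed as --reward_funcs dummy_length
orms['dummy_length'] = DummyLengthReward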

During training, we use vLLM to accelerate sampling. With num_infer_workers=8, one vLLM engine is deployed on each device.

The training script is as follows:

# GPU memory: 8 * 70GiB
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
NPROC_PER_NODE=8 \
swift rlhf \
    --rlhf_type grpo \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset AI-MO/NuminaMath-TIR \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --learning_rate 1e-6 \
    --save_total_limit 2 \
    --logging_steps 5 \
    --output_dir output \
    --gradient_accumulation_steps 1 \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 4096 \
    --vllm_max_model_len 8192 \
    --reward_funcs accuracy \
    --num_generations 16 \
    --use_vllm true \
    --vllm_gpu_memory_utilization 0.4 \
    --sleep_level 1 \
    --offload_model true \
    --offload_optimizer true \
    --gc_collect_after_offload true \
    --deepspeed zero3 \
    --num_infer_workers 8 \
    --tensor_parallel_size 1 \
    --temperature 1.0 \
    --top_p 0.85 \
    --report_to wandb \
    --log_completions true \
    --overlong_filter true 

Qwen3-30B-A3B MoE SFT (Megatron-SWIFT)

ms-swift introduces Megatron parallelism techniques to accelerate large model training, including data parallelism, tensor parallelism, pipeline parallelism, sequence parallelism, context parallelism, and expert parallelism. It supports pre-training and fine-tuning of models such as Qwen3, Qwen3-MoE, Qwen2.5, Llama3, and the Deepseek-R1 distillation series.

For environment preparation (container image) and for converting model weights between HF and MCore formats, refer to the Megatron-SWIFT training documentation (not covered here): https://swift.readthedocs.io/en/latest/Instruction/Megatron-SWIFT-Training.html
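
For convenience, weight conversion uses swift export with --to_mcore. The command below mirrors the export invocation that appears later in this thread; the model and output names here are illustrative:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B-Base \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen3-30B-A3B-Base-mcore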

We use DLC to launch the training command. The training environment consists of two nodes, each with 8 * 80GiB A800 GPUs:

More multi-node launch methods can be found here: https://github.com/modelscope/ms-swift/tree/main/examples/train/multi-node

# https://help.aliyun.com/zh/pai/user-guide/general-environment-variables
# Please ensure that the weight saving paths are the same for both nodes.
NNODES=$WORLD_SIZE \
NODE_RANK=$RANK \
megatron sft \
    --load Qwen3-30B-A3B-Base-mcore \
    --dataset 'liucong/Chinese-DeepSeek-R1-Distill-data-110k-SFT' \
    --tensor_model_parallel_size 2 \
    --expert_model_parallel_size 8 \
    --moe_grouped_gemm true \
    --moe_shared_expert_overlap true \
    --moe_aux_loss_coeff 0.01 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --packing true \
    --recompute_granularity full \
    --recompute_method uniform \
    --recompute_num_layers 1 \
    --train_iters 2000 \
    --eval_iters 50 \
    --finetune true \
    --cross_entropy_loss_fusion true \
    --lr 1e-5 \
    --lr_warmup_iters 100 \
    --min_lr 1e-6 \
    --save megatron_output/Qwen3-30B-A3B-Base \
    --eval_interval 200 \
    --save_interval 200 \
    --max_length 8192 \
    --num_workers 8 \
    --dataset_num_proc 8 \
    --no_save_optim true \
    --no_save_rng true \
    --sequence_parallel true \
    --use_flash_attn true

Training loss (partial):

[Image: training loss curve]

The custom dataset format is the same as for swift sft (shown above); simply specify --dataset <dataset_path>.

Below is a comparison of full-parameter training speed and GPU memory usage for the Qwen3-30B-A3B model using megatron sft versus swift sft:

                    Megatron-LM    DeepSpeed-ZeRO2    DeepSpeed-ZeRO3
Training Speed      9.6s/it        -                  91.2s/it
GPU Memory Usage    16 * 60GiB     OOM                16 * 80GiB

Jintao-Huang (Collaborator) commented Apr 28, 2025

Model Inference:

Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192
<<<  who are you?
<think>
Okay, the user is asking "who are you?" Let me start by introducing myself as Qwen, the large language model developed by Alibaba Cloud. I should mention my capabilities, like answering questions, creating content, and engaging in conversations. But I need to keep it concise. Also, the user might want to know how I can assist them. Maybe I should ask how I can help them today. Let me check if there's anything else important to include. Oh, I should make sure the tone is friendly and approachable. Alright, that should cover it.
</think>

Hello! I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, such as answering questions, creating content, writing stories, coding, and more. How can I help you today? 😊
<<< who are you? /no_think
<think>

</think>

I am Qwen, a large language model developed by Alibaba Cloud. I can assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I help you today?

Non-Thinking Mode:

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model Qwen/Qwen3-8B \
    --infer_backend vllm \
    --stream true \
    --max_new_tokens 2048 \
    --max_model_len 8192 \
    --response_prefix '<think>\n\n</think>\n\n'
<<< who are you?
<think>

</think>

I am Qwen, a large-scale language model developed by Alibaba Cloud. I am designed to assist with a wide range of tasks, including answering questions, creating content, and providing information. How can I assist you today?

Model Quantization:

Qwen3-32B-AWQ: https://modelscope.cn/models/swift/Qwen3-32B-AWQ

Qwen3-30B-A3B-AWQ: https://modelscope.cn/models/swift/Qwen3-30B-A3B-AWQ
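
AWQ checkpoints like these can be produced with swift export. A typical quantization command looks roughly like the following sketch; the flags and the calibration dataset are assumptions based on the ms-swift export documentation, not a confirmed recipe:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-32B \
    --quant_method awq \
    --quant_bits 4 \
    --dataset 'AI-ModelScope/alpaca-gpt4-data-zh#500' \
    --output_dir Qwen3-32B-AWQ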

EvilCalf commented

Which vLLM version should I use?

Jintao-Huang (Collaborator) commented Apr 29, 2025

vllm==0.8.5

sosofun commented Apr 29, 2025

Converting HF-format weights to Megatron format fails:

CUDA_VISIBLE_DEVICES=0 \
swift export \
    --model Qwen/Qwen3-30B-A3B \
    --to_mcore true \
    --torch_dtype bfloat16 \
    --output_dir Qwen/Qwen3-30B-A3B-mcore

Errors:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/cli/export.py", line 5, in <module>
[rank0]:     export_main()
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 50, in export_main
[rank0]:     return SwiftExport(args).main()
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/base.py", line 47, in main
[rank0]:     result = self.run()
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/llm/export/export.py", line 34, in run
[rank0]:     convert_hf2mcore(args)
[rank0]:   File "/usr/local/lib/python3.11/site-packages/swift/megatron/utils/convert.py", line 72, in convert_hf2mcore
[rank0]:     assert megatron_model_meta is not None, f'Model: {args.model} is not supported.'
[rank0]: AssertionError: Model: Qwen/Qwen3-30B-A3B is not supported.

Jintao-Huang (Collaborator) commented Apr 29, 2025

Qwen3-30B-A3B support is currently only on the main branch; ms-swift==3.4.0 will be released tonight.

NianBroken commented

Please add a notebook for self-cognition fine-tuning of Qwen3-8B.

While training Qwen3-8B with "self-cognition-sft.ipynb" in the PAI-DSW environment provided by ModelScope, I noticed that this notebook cannot train Qwen3 models.

yxk9810 commented Apr 29, 2025

Could you add a full-parameter fine-tuning script?

Jintao-Huang (Collaborator) commented

You can refer to the example here and modify the --model parameter accordingly.

https://github.com/modelscope/ms-swift/blob/main/examples/train/full/qwen2_5_32b.sh
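
Adapted from the LoRA script in the best practices above, a full-parameter variant could look like this sketch; the GPU count and hyperparameters are assumptions, not a tested recipe:

# Full-parameter SFT sketch; adjust GPU count and batch size to your hardware
NPROC_PER_NODE=4 \
CUDA_VISIBLE_DEVICES=0,1,2,3 \
swift sft \
    --model Qwen/Qwen3-8B \
    --train_type full \
    --dataset '<dataset-path>' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --learning_rate 1e-5 \
    --gradient_accumulation_steps 4 \
    --max_length 2048 \
    --deepspeed zero3 \
    --output_dir output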

Jintao-Huang (Collaborator) commented

> Please add a notebook for self-cognition fine-tuning of Qwen3-8B.
> While training Qwen3-8B with "self-cognition-sft.ipynb" in the PAI-DSW environment provided by ModelScope, I noticed that this notebook cannot train Qwen3 models.

The self-cognition fine-tuning demo has been added.

qingzhong1 commented Apr 29, 2025

If I currently have data without a reasoning process, but I want to use this data to fine-tune Qwen3, should I simply add /no_think after the prompt and prefix the response with <think>\n\n</think>\n\n?

Jintao-Huang (Collaborator) commented

Perhaps you can refer to this for a solution:

row['query'] = row['query'] + ' /no_think'
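
A minimal offline preprocessing sketch along these lines, producing the non-thinking dataset format shown in the best practices above (the file names and helper are hypothetical):

import json

def to_no_think(sample: dict) -> dict:
    # Append ' /no_think' to the last user turn and prefix the assistant
    # reply with an empty think block, matching the non-thinking format.
    messages = sample['messages']
    messages[-2]['content'] += ' /no_think'
    messages[-1]['content'] = '<think>\n\n</think>\n\n' + messages[-1]['content']
    return sample

with open('raw.jsonl') as fin, open('no_think.jsonl', 'w') as fout:
    for line in fin:
        fout.write(json.dumps(to_no_think(json.loads(line)), ensure_ascii=False) + '\n')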

NianBroken commented Apr 29, 2025

> The self-cognition fine-tuning demo has been added.

How can I export the fine-tuned model to GGUF format? Please add a notebook for converting a model fine-tuned with ms-swift into a GGUF file.

stephen-nju commented

> Perhaps you can refer to this for a solution:
> ms-swift/swift/llm/dataset/dataset/llm.py, line 835 in 51cafe5:
> row['query'] = row['query'] + ' /no_think'

@Jintao-Huang Without using reasoning, is it still possible to fine-tune the model with the Qwen2.5 template?

Jintao-Huang (Collaborator) commented

When using --packing true, please additionally use --attn_impl flash_attn. This was missed in the best practices.
