
[Megatron] support MoE (Qwen2-Moe & Qwen3-MoE) #4012


Merged — 15 commits, Apr 28, 2025
46 changes: 40 additions & 6 deletions docs/source/Instruction/Megatron-SWIFT训练.md
@@ -16,6 +9 @@ pip install git+https://github.com/NVIDIA/TransformerEngine.git@stable
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# megatron-core
pip install git+https://github.com/NVIDIA/Megatron-LM.git@core_r0.11.0
```

Alternatively, you can use the image:
@@ -24,7 +27,7 @@ modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubunt
modelscope-registry.us-west-1.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.4.0-py311-torch2.6.0-vllm0.8.3-modelscope1.25.0-swift3.3.0.post1
```

The dependency Megatron-LM will be git cloned and installed by swift; users do not need to install it manually. You can also point the environment variable `MEGATRON_LM_PATH` to an already-downloaded repo path (for offline environments, the [core_r0.11.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0)).
The training module of the dependency Megatron-LM will be git cloned and installed by swift. You can also point the environment variable `MEGATRON_LM_PATH` to an already-downloaded repo path (for offline environments, the [core_r0.11.0 branch](https://github.com/NVIDIA/Megatron-LM/tree/core_r0.11.0)).
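For an offline environment, a minimal sketch of the workflow (the clone destination `/path/to/Megatron-LM` is a placeholder for wherever you keep the repo):

```shell
# On a machine with network access, clone the required branch ahead of time:
git clone --branch core_r0.11.0 https://github.com/NVIDIA/Megatron-LM.git /path/to/Megatron-LM
# Then, on the offline machine, point swift at the local copy so it skips the git clone:
export MEGATRON_LM_PATH=/path/to/Megatron-LM
```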


## Quick Start Example
@@ -104,13 +107,22 @@ I am a language model developed by swift, you can call me swift-robot. How can I

## Benchmark

The speed comparison for full-parameter training of a 14B model using `megatron sft` versus `swift sft` on a single machine with eight A800 GPUs is shown below; the corresponding scripts are [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/benchmark).
The speed comparison for full-parameter training of Dense/MoE models using `megatron sft` versus `swift sft` on a single machine with eight A800 GPUs is shown below; the corresponding scripts are [here](https://github.com/modelscope/ms-swift/tree/main/examples/train/megatron/benchmark).

**Dense** Qwen2.5-14B:

|          | Megatron-LM | Deepspeed-ZeRO2 | Deepspeed-ZeRO3 |
| -------- | ----------- | ---------- | ---------- |
| Training speed | 9.04s/it | 10.32s/it | 10.56s/it |
| GPU memory | 8\*64GB | 8\*80GB | 8\*58GB |

**MoE** Qwen1.5-MoE-A2.7B:

|          | Megatron-LM | Deepspeed-ZeRO2 | Deepspeed-ZeRO3 |
| -------- | ----------- | ---------- | ---------- |
| Training speed | 2.93s/it | 6.02s/it | 24.30s/it |
| GPU memory | 8\*66GB | 8\*72GB | 8\*50GB |
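As a hedged illustration of how such a run might be launched (the model path, dataset, and parallelism values below are assumptions for the sketch, not the exact benchmark configuration — see the linked scripts for that):

```shell
# Illustrative sketch only: <your-dataset> and the mcore weight path are placeholders,
# and the parallelism/batch values are assumptions.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen2.5-14B-mcore \
    --dataset <your-dataset> \
    --tensor_model_parallel_size 4 \
    --micro_batch_size 1 \
    --global_batch_size 16 \
    --save megatron_output
```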


## Command-line Arguments

@@ -187,8 +199,8 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- overlap_param_gather: Enable overlapping of the parameter all-gather in the distributed optimizer (reduces DP communication time). Default is False.
- distributed_timeout_minutes: Timeout for torch.distributed (in minutes). Default is 60 minutes.

**Logging arguments**
- log_params_norm: Log the norm of the parameters. Default is True.
**Logging arguments**:
- log_params_norm: Log the norm of the parameters. Default is False.
- log_throughput: Log the throughput of each GPU. Default is True.
- Note: In the non-packing case, log_throughput is not accurate, because `seq_length` does not equal the real sequence length.
- tensorboard_log_interval: Interval (in steps) for logging to tensorboard. Default is 1.
@@ -199,11 +211,11 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- log_memory_to_tensorboard: Write memory logs to tensorboard. Default is True.
- logging_leval: Log level. Default is None.

**Evaluation arguments**
**Evaluation arguments**:
- 🔥eval_iters: Number of evaluation iterations. Default is 100.
- 🔥eval_interval: Evaluation interval (in steps). Default is None, i.e. set to save_interval.

**Mixed precision arguments**
**Mixed precision arguments**:
- fp16: fp16 mode. Default is None; set according to the model's torch_dtype. By default, torch_dtype is read from config.json.
- bf16: bf16 mode. Default is None; set according to the model's torch_dtype.
- apply_query_key_layer_scaling: Scale `Q * K^T` by `1 / layer-number` (e.g. divide by layer_num for the layer_num-th layer). This is helpful for fp16 training. Default is None, i.e. set to True when `--fp16` is used.
@@ -228,9 +240,29 @@ I am a language model developed by swift, you can call me swift-robot. How can I
- add_qkv_bias: Add bias only to the QKV linear layers. Default is True.
- attention_dropout: Default is 0.
- hidden_dropout: Default is 0.
- kv_channels: Default is None, set to `args.hidden_size // args.num_attention_heads`.
- qk_layernorm: Whether to apply layer normalization to Q and K.
- transformer_impl: Which transformer implementation to use; options are 'local' and 'transformer_engine'. Default is transformer_engine.
- padded_vocab_size: Full vocabulary size. Default is None.
- rope_scaling: rope_scaling-related arguments. Default is None. For the format, see [llama3.1 config.json](https://modelscope.cn/models/LLM-Research/Meta-Llama-3.1-8B-Instruct/file/view/master?fileName=config.json&status=1); pass a JSON string.
- model_type: The model_type from config.json in the Huggingface model weights.


**MoE arguments**:
- num_experts: Number of MoE experts. Default is None; read automatically from config.json.
- moe_ffn_hidden_size: Hidden size of each expert's feed-forward network (FFN). Default is None, set to ffn_hidden_size; read automatically from config.json.
- moe_shared_expert_intermediate_size: Total FFN hidden size of the shared experts. With multiple shared experts, it should equal `num_shared_experts * ffn_size_of_each_shared_expert`. Default is None; read automatically from config.json.
- moe_router_topk: Number of experts each token is routed to. Default is None; read automatically from config.json.
- moe_router_pre_softmax: Enable pre-softmax routing for MoE, meaning the softmax is applied before the top-k selection. Default is None; read automatically from config.json.
- moe_aux_loss_coeff: Scaling coefficient of the auxiliary loss; a recommended initial value is 1e-2. Default is None; read automatically from config.json.
- expert_model_parallel_size: Degree of expert parallelism. Default is 1.
- moe_token_dispatcher_type: Type of token dispatcher to use. Options are 'allgather', 'alltoall', and 'alltoall_seq'. Default is 'alltoall'.
- moe_grouped_gemm: When each rank holds multiple experts, improves utilization and performance by launching multiple local GEMM kernels across multiple streams, using GroupedLinear from TransformerEngine. Default is False.
- moe_router_load_balancing_type: Determines the router's load-balancing strategy. Options are "aux_loss", "seq_aux_loss", "sinkhorn", and "none". Default is "aux_loss".
- moe_z_loss_coeff: Scaling coefficient of the z-loss. Default is None.
- moe_expert_capacity_factor: Capacity factor of each expert; None means no tokens are dropped. Default is None.
- moe_shared_expert_overlap: Enable overlap between shared-expert computation and dispatcher communication. Without this option, the shared experts run after the routed experts. Only effective when `moe_shared_expert_intermediate_size` is set. Default is False.
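A minimal sketch of how the MoE arguments above might be combined in a launch command (the model path and dataset are placeholders; the flags are the ones listed above, with illustrative values):

```shell
# Hypothetical MoE run: 8 GPUs, experts sharded across 4 expert-parallel ranks,
# grouped GEMM enabled and an auxiliary-loss coefficient of 1e-2.
NPROC_PER_NODE=8 \
megatron sft \
    --load Qwen1.5-MoE-A2.7B-mcore \
    --dataset <your-dataset> \
    --expert_model_parallel_size 4 \
    --moe_grouped_gemm true \
    --moe_aux_loss_coeff 0.01
```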


### Megatron Training Arguments

@@ -240,3 +272,5 @@ Megatron training arguments inherit from the Megatron arguments and the basic arguments. The content of the basic arguments
- 🔥packing: Whether to use sequence packing. Default is False.
- 🔥streaming: Stream reading and processing of the dataset. Default is False. Usually set to True when handling large datasets. See the command-line arguments documentation for more streaming parameters.
- lazy_tokenize: Default is False. If set to False, all dataset samples are tokenized before training (which avoids errors during training); if set to True, the dataset is tokenized during training (which saves memory).
- dataloader_persistent_workers: Passed through to the dataloader. Default is True.
- dataloader_prefetch_factor: Passed through to the dataloader. Default is 10.
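For instance, a streaming run under the parameters above might look like the following sketch (all paths are placeholders; `--max_steps` is included because a streaming dataset has no known length):

```shell
# Streaming + packing sketch; <mcore-model> and <your-dataset> are placeholders.
megatron sft \
    --load <mcore-model> \
    --dataset <your-dataset> \
    --streaming true \
    --packing true \
    --max_steps 1000
```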
1 change: 1 addition & 0 deletions docs/source/Instruction/命令行参数.md
@@ -48,6 +48,7 @@
- Note: The shuffling in CPT/SFT consists of two parts: dataset shuffling, controlled by `dataset_shuffle`; and shuffling in the train_dataloader, controlled by `train_dataloader_shuffle`.
- val_dataset_shuffle: Whether to shuffle the val_dataset. Default is False.
- 🔥streaming: Stream reading and processing of the dataset. Default is False. Usually set to True when handling large datasets.
- Note: `--max_steps` must be set additionally, because the length of a streaming dataset cannot be obtained.
- interleave_prob: Default is None. When combining multiple datasets, the `concatenate_datasets` function is used by default; if this parameter is set, the `interleave_datasets` function is used instead. This parameter is typically used when combining streaming datasets and is passed to `interleave_datasets`.
- stopping_strategy: Either "first_exhausted" or "all_exhausted", default "first_exhausted". Passed to the interleave_datasets function.
- shuffle_buffer_size: Shuffle buffer size for streaming datasets. Default is 1000.
8 changes: 4 additions & 4 deletions docs/source/Instruction/支持的模型和数据集.md
@@ -173,11 +173,11 @@
|[Qwen/Qwen2.5-Math-1.5B](https://modelscope.cn/models/Qwen/Qwen2.5-Math-1.5B)|qwen2_5_math|qwen2_5_math|transformers>=4.37|✔|math|[Qwen/Qwen2.5-Math-1.5B](https://huggingface.co/Qwen/Qwen2.5-Math-1.5B)|
|[Qwen/Qwen2.5-Math-7B](https://modelscope.cn/models/Qwen/Qwen2.5-Math-7B)|qwen2_5_math|qwen2_5_math|transformers>=4.37|✔|math|[Qwen/Qwen2.5-Math-7B](https://huggingface.co/Qwen/Qwen2.5-Math-7B)|
|[Qwen/Qwen2.5-Math-72B](https://modelscope.cn/models/Qwen/Qwen2.5-Math-72B)|qwen2_5_math|qwen2_5_math|transformers>=4.37|✔|math|[Qwen/Qwen2.5-Math-72B](https://huggingface.co/Qwen/Qwen2.5-Math-72B)|
|[Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat)|
|[Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)|
|[Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat)|qwen2_moe|qwen|transformers>=4.40|✔|-|[Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat)|
|[Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B)|qwen2_moe|qwen|transformers>=4.40|✔|-|[Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B)|
|[Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4)|
|[Qwen/Qwen2-57B-A14B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-57B-A14B-Instruct)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct)|
|[Qwen/Qwen2-57B-A14B](https://modelscope.cn/models/Qwen/Qwen2-57B-A14B)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen2-57B-A14B](https://huggingface.co/Qwen/Qwen2-57B-A14B)|
|[Qwen/Qwen2-57B-A14B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-57B-A14B-Instruct)|qwen2_moe|qwen|transformers>=4.40|✔|-|[Qwen/Qwen2-57B-A14B-Instruct](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct)|
|[Qwen/Qwen2-57B-A14B](https://modelscope.cn/models/Qwen/Qwen2-57B-A14B)|qwen2_moe|qwen|transformers>=4.40|✔|-|[Qwen/Qwen2-57B-A14B](https://huggingface.co/Qwen/Qwen2-57B-A14B)|
|[Qwen/Qwen2-57B-A14B-Instruct-GPTQ-Int4](https://modelscope.cn/models/Qwen/Qwen2-57B-A14B-Instruct-GPTQ-Int4)|qwen2_moe|qwen|transformers>=4.40|✘|-|[Qwen/Qwen2-57B-A14B-Instruct-GPTQ-Int4](https://huggingface.co/Qwen/Qwen2-57B-A14B-Instruct-GPTQ-Int4)|
|[Qwen/QwQ-32B-Preview](https://modelscope.cn/models/Qwen/QwQ-32B-Preview)|qwq_preview|qwq_preview|transformers>=4.37|✔|-|[Qwen/QwQ-32B-Preview](https://huggingface.co/Qwen/QwQ-32B-Preview)|
|[Qwen/QwQ-32B](https://modelscope.cn/models/Qwen/QwQ-32B)|qwq|qwq|transformers>=4.37|✔|-|[Qwen/QwQ-32B](https://huggingface.co/Qwen/QwQ-32B)|
1 change: 1 addition & 0 deletions docs/source_en/Instruction/Command-line-parameters.md
@@ -49,6 +49,7 @@ Hints:
- Note: The shuffling in CPT/SFT consists of two parts: dataset shuffling, controlled by `dataset_shuffle`; and shuffling in the train_dataloader, controlled by `train_dataloader_shuffle`.
- val_dataset_shuffle: Whether to perform shuffling on the val_dataset. Default is False.
- 🔥streaming: Stream reading and processing of the dataset, default is False. It is typically set to True when handling large datasets.
- Note: It is necessary to set `--max_steps` additionally, as the length of the streaming dataset cannot be obtained.
- interleave_prob: Defaults to None. When combining multiple datasets, the `concatenate_datasets` function is used by default. If this parameter is set, the `interleave_datasets` function will be used instead. This parameter is typically used when combining streaming datasets and is passed to the `interleave_datasets` function.
- stopping_strategy: Can be either "first_exhausted" or "all_exhausted", with the default being "first_exhausted". This parameter is passed to the `interleave_datasets` function.
- shuffle_buffer_size: This parameter is used to specify the shuffle buffer size for streaming datasets. Defaults to 1000.