
mindnlp 0.5.0: initializing the parallel environment fails after "from transformers import Trainer" #2148

@YuanpengGao

Description


Describe the bug (Mandatory)

After updating mindnlp to 0.5.0, from mindnlp.engine import Trainer no longer works, so I switched to from transformers import Trainer. However, our parallel environment is configured for MindSpore on Ascend, and after the import the program checks for a PyTorch-based parallel environment by default, so it reports that initialization failed. mindnlp's code tries to query the PyTorch-style distributed process group (dist.get_world_size()), while we initialize distributed communication through MindSpore; the two frameworks' distributed stacks are not compatible.
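The mismatch can be distilled to a few lines. This is a minimal sketch based on the stack trace below, run under msrun on Ascend; it assumes mindnlp.core.distributed re-exports get_world_size at package level, as the traceback suggests:

from mindspore.communication import init, GlobalComm
from mindnlp.core import distributed as dist

init()                    # MindSpore HCCL init: GlobalComm.INITED becomes True
print(GlobalComm.INITED)  # True -- the MindSpore side is initialized
dist.get_world_size()     # raises ValueError: the torch-style default process
                          # group was never created, because init_process_group()
                          # was never called on mindnlp.core.distributed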

  • Hardware Environment (Ascend/GPU/CPU): Ascend


  • Software Environment (Mandatory):
    -- MindSpore version: 2.6.0
    -- Python version: 3.10
    -- OS platform and distribution: openEuler 22.03 LTS (aarch64)
    -- GCC/Compiler version: 10.3.1

  • Execute Mode (Mandatory) (PyNative/Graph): Graph


To Reproduce (Mandatory)
Steps to reproduce the behavior:

  1. Import the Trainer from transformers: from transformers import Trainer, TrainingArguments
  2. Configure the parallel environment with the function below (see the launch sketch after these steps):
import os
from typing import Any, Dict

import mindspore as ms
from mindspore import set_auto_parallel_context
from mindspore.communication import get_group_size, get_rank, init
from mindspore.context import ParallelMode


def setup_parallel_environment(config: Dict[str, Any]) -> tuple:
    """Configure the parallel environment, adapted for msrun dynamic cluster startup."""
    if not config.get('parallel', {}).get('enabled', False):
        return 0, 1

    print("=== Configuring parallel environment (msrun dynamic cluster) ===")

    # 1. Read rank_id from the environment and bind the device
    #    (must happen before init())
    rank_id = int(os.getenv('RANK_ID', '0'))
    ms.set_device("Ascend", rank_id)

    # 2. Initialize communication (HCCL)
    init()

    # 3. Query the parallel topology
    rank_id = get_rank()
    device_num = get_group_size()

    print(f"Current rank: {rank_id}, total devices: {device_num}")

    # 4. Configure the auto-parallel context for dynamic cluster startup
    set_auto_parallel_context(
        parallel_mode=ParallelMode.DATA_PARALLEL,  # data-parallel mode
        gradients_mean=True,                       # average gradients across devices
        device_num=device_num,                     # number of devices
        parameter_broadcast=True                   # broadcast parameters from rank 0
    )

    print(f"Parallel setup complete: data parallel on {device_num} devices (msrun dynamic cluster)")
    return rank_id, device_num

  3. Launch multi-card training.
  4. See the error:
    ValueError: Default process group has not been initialized, please make sure to call init_process_group.
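For context, the training entry point calls the function above and is spawned per process by msrun. This is a sketch only; the config contents, worker counts, and script invocation are placeholders, not copied from the real setup:

# Hypothetical entry point showing how setup_parallel_environment is used.
# Each process is launched by msrun dynamic cluster startup, e.g.:
#   msrun --worker_num=8 --local_worker_num=8 train_lora_multi_new.py
def main():
    config = {'parallel': {'enabled': True}}   # placeholder; the real config is loaded from file
    rank_id, device_num = setup_parallel_environment(config)
    if rank_id == 0:                           # only rank 0 logs/saves in data-parallel runs
        print(f"training on {device_num} devices")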

Expected behavior (Mandatory)
No error from parallel-environment initialization.

Screenshots / Logs (Mandatory)
Traceback (most recent call last):
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 546, in <module>
    main()
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 457, in main
    model = setup_model_for_lora(config, lora_config)
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 287, in setup_model_for_lora
    base_model = MultiModalQwen(mm_config)
  File "/home/workspace/ReactionQwen/model/reactionqwen.py", line 63, in __init__
    self.qwen = qwen_model or AutoModelForCausalLM.from_pretrained(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
    return model_class.from_pretrained(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/utils/decorators.py", line 15, in wrapper
    return fn(*args, **kwargs)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/modeling_utils.py", line 317, in _wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4918, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/transformers/modeling_utils.py", line 220, in wrapper
    if GlobalComm.INITED and dist.get_world_size() > 1:
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 1945, in get_world_size
    return _get_group_size(group)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 953, in _get_group_size
    default_pg = _get_default_group()
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 1154, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.

Additional context (Optional)
It would help to have notes on what changed after the mindnlp update, for example changes to the training arguments and the new location of the callback classes. Still, fixing the parallel-environment initialization should take priority; a possible stopgap is sketched below.
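Until this is fixed in mindnlp, one untested stopgap is to wrap dist.get_world_size so the check in mindnlp/transformers/modeling_utils.py falls back to single-process behavior instead of raising. This only works if that module calls get_world_size through the distributed module object (as the traceback suggests), and whether it is safe for sharded checkpoint loading is an assumption:

from mindnlp.core import distributed as dist

_orig_get_world_size = dist.get_world_size

def _safe_get_world_size(group=None):
    # Fall back to world size 1 when the torch-style default process
    # group was never initialized (only MindSpore's init() was called).
    try:
        return _orig_get_world_size(group)
    except ValueError:
        return 1

dist.get_world_size = _safe_get_world_size  # apply before from_pretrained()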
