Describe the bug / 问题描述 (Mandatory / 必填)

After updating mindnlp to 0.5.0, `from mindnlp.engine import Trainer` no longer works, so I switched to `from transformers import Trainer`. However, my parallel environment is configured through MindSpore on Ascend, while the imported code by default checks for a PyTorch-style parallel environment and fails because that environment was never initialized. mindnlp's code tries to query the PyTorch-style distributed process group (`dist.get_world_size()`), whereas we initialize MindSpore's distributed communication; the two frameworks' distributed setups are not interchangeable.
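A minimal sketch of the mismatch, run inside an msrun-launched job (the module path follows the traceback below; that `get_world_size` is re-exported at the package level is my assumption):

```python
from mindspore.communication import init, get_group_size

init()                   # sets up HCCL; GlobalComm.INITED becomes True
print(get_group_size())  # MindSpore's world size: works

# mindnlp's torch-compatible shim keeps its own default process group,
# which the MindSpore init() above does not create:
from mindnlp.core import distributed as dist
print(dist.get_world_size())  # ValueError: Default process group has not been initialized
```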
Hardware Environment / 硬件环境 (Mandatory / 必填): Ascend
Software Environment / 软件环境 (Mandatory / 必填):
- MindSpore version: 2.6.0
- Python version: 3.10
- OS platform and distribution: openEuler 22.03 LTS (aarch64)
- GCC/Compiler version (if compiled from source): 10.3.1
Execute Mode / 执行模式 (Mandatory / 必填) (PyNative/Graph): Graph
To Reproduce / 重现步骤 (Mandatory / 必填)
Steps to reproduce the behavior:
- Import the Trainer (mindnlp 0.5.0 no longer provides mindnlp.engine.Trainer):

```python
from transformers import Trainer, TrainingArguments
```

- Configure the parallel environment:
```python
import os
from typing import Any, Dict

import mindspore as ms
from mindspore import set_auto_parallel_context
from mindspore.communication import get_group_size, get_rank, init
from mindspore.context import ParallelMode


def setup_parallel_environment(config: Dict[str, Any]) -> tuple:
    """Configure the parallel environment, adapted to msrun dynamic networking."""
    if not config.get('parallel', {}).get('enabled', False):
        return 0, 1
    print("=== Configuring parallel environment (msrun dynamic networking) ===")
    # 1. Read rank_id from the environment and set the device (must happen before init()).
    rank_id = int(os.getenv('RANK_ID', '0'))
    ms.set_device("Ascend", rank_id)
    # 2. Initialize communication.
    init()
    # 3. Query the parallel topology.
    rank_id = get_rank()
    device_num = get_group_size()
    print(f"Current rank: {rank_id}, total devices: {device_num}")
    # 4. Configure the auto-parallel context for dynamic networking.
    set_auto_parallel_context(
        parallel_mode=ParallelMode.DATA_PARALLEL,  # data-parallel mode
        gradients_mean=True,                       # average gradients
        device_num=device_num,                     # number of devices
        parameter_broadcast=True                   # broadcast parameters
    )
    print(f"Parallel setup done: data parallel, {device_num} devices (msrun dynamic networking)")
    return rank_id, device_num
```
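For reference, a hypothetical call site for the function above (the config layout simply mirrors the `parallel.enabled` check inside it; the msrun invocation in the comment is illustrative, not the exact command used):

```python
# Hypothetical usage; the config layout mirrors the check in the function above.
config = {"parallel": {"enabled": True}}
rank_id, device_num = setup_parallel_environment(config)

# Each worker process launched by msrun gets its own RANK_ID, e.g.:
#   msrun --worker_num=8 --local_worker_num=8 train_lora_multi_new.py
```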
- Launch multi-card training.
- See the error:

```text
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```
Expected behavior / 预期结果 (Mandatory / 必填)

No error from parallel-environment initialization.
Screenshots / 日志 / 截图 (Mandatory / 必填)
```text
Traceback (most recent call last):
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 546, in <module>
    main()
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 457, in main
    model = setup_model_for_lora(config, lora_config)
  File "/home/workspace/ReactionQwen/train/train_lora_multi_new.py", line 287, in setup_model_for_lora
    base_model = MultiModalQwen(mm_config)
  File "/home/workspace/ReactionQwen/model/reactionqwen.py", line 63, in __init__
    self.qwen = qwen_model or AutoModelForCausalLM.from_pretrained(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 600, in from_pretrained
    return model_class.from_pretrained(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/utils/decorators.py", line 15, in wrapper
    return fn(*args, **kwargs)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/modeling_utils.py", line 317, in _wrapper
    return func(*args, **kwargs)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/transformers/modeling_utils.py", line 4918, in from_pretrained
    checkpoint_files, sharded_metadata = _get_resolved_checkpoint_files(
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/transformers/modeling_utils.py", line 220, in wrapper
    if GlobalComm.INITED and dist.get_world_size() > 1:
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 1945, in get_world_size
    return _get_group_size(group)
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 953, in _get_group_size
    default_pg = _get_default_group()
  File "/home/miniconda3/envs/ms-new/lib/python3.10/site-packages/mindnlp/core/distributed/distributed_c10d.py", line 1154, in _get_default_group
    raise ValueError(
ValueError: Default process group has not been initialized, please make sure to call init_process_group.
```
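The failing check is `if GlobalComm.INITED and dist.get_world_size() > 1:`, where the first condition reads MindSpore's communication state while the second queries mindnlp's torch-style shim, which keeps its own, separate process-group registry. As a hypothetical, untested stopgap (not an official fix; the module path follows the traceback, and the fallback behavior is my assumption), the shim's query could be patched to fall back to MindSpore's group size when no torch-style default group exists:

```python
# Hypothetical stopgap, untested: apply before calling from_pretrained().
# Assumes mindnlp.core.distributed re-exports get_world_size (per the traceback).
import mindnlp.core.distributed as dist
from mindspore.communication import get_group_size

_orig_get_world_size = dist.get_world_size

def _patched_get_world_size(group=None):
    try:
        return _orig_get_world_size(group)
    except ValueError:
        # No torch-style default process group; fall back to MindSpore's view.
        return get_group_size()

dist.get_world_size = _patched_get_world_size
```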
Additional context / 备注 (Optional / 选填)

It would help to have migration notes for the parts that changed after the mindnlp update, e.g., changes to TrainingArguments and the new location of the callback classes. That said, fixing the parallel-environment initialization issue is the priority.