Describe the bug
In the new version it seems the number of processes and the number of GPUs must be equal. Because my context is quite long, I train with fewer processes than GPUs; if the two are equal I run out of memory (OOM). This worked fine with the old version (3.2), but after installing 3.4 on a new server it raises AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default! (full traceback under Additional context).
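For context, here is a hedged diagnostic sketch, not part of the original report: the traceback below shows the failure happening when DDP's parameter broadcast (_sync_module_states → c10d.broadcast_) is dispatched through DTensor's __torch_dispatch__ without a usable DeviceMesh, so checking whether any parameter or buffer is already a DTensor before DDP wraps the model can help narrow the cause. The helper name find_dtensor_params is made up for illustration.

```python
# Hypothetical diagnostic sketch (assumption, not from the report): list any
# parameters/buffers that are already DTensors before DDP wraps the model.
# DDP's _sync_module_states broadcasts module states via c10d.broadcast_, and
# a DTensor reaching that path without an attached DeviceMesh produces the
# assertion shown in the traceback below.
from torch.distributed.tensor import DTensor


def find_dtensor_params(model):
    """Return the names of parameters and buffers that are DTensor instances."""
    hits = []
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if isinstance(tensor, DTensor):
            hits.append(name)
    return hits
```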
Your hardware and system info
CUDA 12
torch: I tried both 2.5 and 2.6 and neither works, so it doesn't seem to be a torch issue; rolling back to the torch version used with swift 3.2 (2.5, I think) still produces the error.
Additional context
Full error output:
[INFO:swift] model_parameter_info: PeftModelForCausalLM: 32898.0941M Params (134.2177M Trainable [0.4080%]), 0.0001M Buffers.
/GPUFS/sysu_xzhao_1/ms-swift/swift/trainers/mixin.py:81: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Seq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class PeftModelForCausalLM. Since PeftModel hides base models input arguments, if label_names is not given, label_names can't be set automatically within Trainer. Note that empty label_names list will be used instead.
[INFO:swift] The logging file will be saved in: /GPUFS/sysu_xzhao_1/ms-swift/output/Qwen2.5-32B-Instruct/v0-20250427-093426/logging.jsonl
[rank0]: Traceback (most recent call last):
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/cli/sft.py", line 10, in
[rank0]: sft_main()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 283, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 144, in run
[rank0]: return self.train(trainer)
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 204, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/trainers/mixin.py", line 294, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2374, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1446, in prepare
[rank0]: result = tuple(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1447, in
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1595, in prepare_model
[rank0]: model = torch.nn.parallel.DistributedDataParallel(
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/model/patcher.py", line 326, in
[rank0]: lambda self, model, device_ids, output_device, *args, **kwargs: _old_ddp_init(self, model, *args, **kwargs))
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 827, in init
[rank0]: _sync_module_states(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/utils.py", line 323, in _sync_module_states
[rank0]: _sync_params_and_buffers(process_group, module_states, broadcast_bucket_size, src)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/utils.py", line 334, in _sync_params_and_buffers
[rank0]: dist._broadcast_coalesced(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/_api.py", line 346, in torch_dispatch
[rank0]: return DTensor._op_dispatcher.dispatch(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 167, in dispatch
[rank0]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/dispatch.py", line 400, in unwrap_to_op_info
[rank0]: assert mesh is not None, f"found no DeviceMesh from dtensor args for {op_call}!"
[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
[rank0]:[W427 09:37:18.882445516 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0427 09:37:20.937000 1189 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1247) of binary: /GPUFS/sysu_xzhao_1/swift/bin/python
Traceback (most recent call last):
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in
main()
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
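For reference, the frame at patcher.py, line 326 in the traceback above is a monkey-patch of DistributedDataParallel.__init__ that drops the device_ids and output_device arguments. The sketch below is reconstructed only from the lambda visible in that frame; the surrounding code and the name _old_ddp_init follow the traceback, and the actual implementation in swift/llm/model/patcher.py may differ.

```python
# Sketch reconstructed from the lambda shown at patcher.py:326 above; the real
# swift implementation may differ. The patch replaces DDP.__init__ with a
# wrapper that discards device_ids and output_device before calling the
# original __init__, which is where _sync_module_states and the failing
# c10d.broadcast_ dispatch are reached.
from torch.nn.parallel import DistributedDataParallel

_old_ddp_init = DistributedDataParallel.__init__
DistributedDataParallel.__init__ = (
    lambda self, model, device_ids=None, output_device=None, *args, **kwargs:
    _old_ddp_init(self, model, *args, **kwargs))
```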