Describe the bug
In the new version it seems the number of processes and the number of GPUs must be equal. Because my context is quite long, I train with fewer processes than GPUs; if the two are equal I run out of memory (OOM). This worked fine with the old version (3.2), but after installing 3.4 on a new server it raises AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default! (full traceback under Additional context).
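For context, here is a hedged diagnostic sketch, not part of the original report: the traceback below shows the failure happening when DDP's parameter broadcast (_sync_module_states → c10d.broadcast_) is dispatched through DTensor's __torch_dispatch__ without a usable DeviceMesh, so checking whether any parameter or buffer is already a DTensor before DDP wraps the model can help narrow the cause. The helper name find_dtensor_params is made up for illustration.

```python
# Hypothetical diagnostic sketch (assumption, not from the report): list any
# parameters/buffers that are already DTensors before DDP wraps the model.
# DDP's _sync_module_states broadcasts module states via c10d.broadcast_, and
# a DTensor reaching that path without an attached DeviceMesh produces the
# assertion shown in the traceback below.
from torch.distributed.tensor import DTensor


def find_dtensor_params(model):
    """Return the names of parameters and buffers that are DTensor instances."""
    hits = []
    for name, tensor in list(model.named_parameters()) + list(model.named_buffers()):
        if isinstance(tensor, DTensor):
            hits.append(name)
    return hits
```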
Your hardware and system info
CUDA 12
torch: I tried both 2.5 and 2.6 and neither works, so it doesn't seem to be a torch issue; rolling back to the torch version used with swift 3.2 (2.5, I think) still produces the error.
Additional context
Full error output:
[INFO:swift] model_parameter_info: PeftModelForCausalLM: 32898.0941M Params (134.2177M Trainable [0.4080%]), 0.0001M Buffers.
/GPUFS/sysu_xzhao_1/ms-swift/swift/trainers/mixin.py:81: FutureWarning: tokenizer is deprecated and will be removed in version 5.0.0 for Seq2SeqTrainer.__init__. Use processing_class instead.
super().__init__(
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
No label_names provided for model class PeftModelForCausalLM. Since PeftModel hides base models input arguments, if label_names is not given, label_names can't be set automatically within Trainer. Note that empty label_names list will be used instead.
[INFO:swift] The logging file will be saved in: /GPUFS/sysu_xzhao_1/ms-swift/output/Qwen2.5-32B-Instruct/v0-20250427-093426/logging.jsonl
[rank0]: Traceback (most recent call last):
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/cli/sft.py", line 10, in
[rank0]: sft_main()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 283, in sft_main
[rank0]: return SwiftSft(args).main()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/base.py", line 47, in main
[rank0]: result = self.run()
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 144, in run
[rank0]: return self.train(trainer)
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/train/sft.py", line 204, in train
[rank0]: trainer.train(trainer.args.resume_from_checkpoint)
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/trainers/mixin.py", line 294, in train
[rank0]: res = super().train(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank0]: return inner_training_loop(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/transformers/trainer.py", line 2374, in _inner_training_loop
[rank0]: model, self.optimizer = self.accelerator.prepare(self.model, self.optimizer)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1446, in prepare
[rank0]: result = tuple(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1447, in
[rank0]: self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1289, in _prepare_one
[rank0]: return self.prepare_model(obj, device_placement=device_placement)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/accelerate/accelerator.py", line 1595, in prepare_model
[rank0]: model = torch.nn.parallel.DistributedDataParallel(
[rank0]: File "/GPUFS/sysu_xzhao_1/ms-swift/swift/llm/model/patcher.py", line 326, in
[rank0]: lambda self, model, device_ids, output_device, *args, **kwargs: _old_ddp_init(self, model, *args, **kwargs))
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 827, in init
[rank0]: _sync_module_states(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/utils.py", line 323, in _sync_module_states
[rank0]: _sync_params_and_buffers(process_group, module_states, broadcast_bucket_size, src)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/utils.py", line 334, in _sync_params_and_buffers
[rank0]: dist._broadcast_coalesced(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/_compile.py", line 32, in inner
[rank0]: return disable_fn(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 745, in _fn
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/_api.py", line 346, in torch_dispatch
[rank0]: return DTensor._op_dispatcher.dispatch(
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/_dispatch.py", line 167, in dispatch
[rank0]: op_info = self.unwrap_to_op_info(op_call, args, kwargs)
[rank0]: File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/tensor/dispatch.py", line 400, in unwrap_to_op_info
[rank0]: assert mesh is not None, f"found no DeviceMesh from dtensor args for {op_call}!"
[rank0]: AssertionError: found no DeviceMesh from dtensor args for c10d.broadcast_.default!
[rank0]:[W427 09:37:18.882445516 ProcessGroupNCCL.cpp:1496] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
E0427 09:37:20.937000 1189 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 1247) of binary: /GPUFS/sysu_xzhao_1/swift/bin/python
Traceback (most recent call last):
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 922, in
main()
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 355, in wrapper
return f(*args, **kwargs)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 918, in main
run(args)
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/run.py", line 909, in run
elastic_launch(
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/GPUFS/sysu_xzhao_1/swift/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
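For reference, the frame at patcher.py, line 326 in the traceback above is a monkey-patch of DistributedDataParallel.__init__ that drops the device_ids and output_device arguments. The sketch below is reconstructed only from the lambda visible in that frame; the surrounding code and the name _old_ddp_init follow the traceback, and the actual implementation in swift/llm/model/patcher.py may differ.

```python
# Sketch reconstructed from the lambda shown at patcher.py:326 above; the real
# swift implementation may differ. The patch replaces DDP.__init__ with a
# wrapper that discards device_ids and output_device before calling the
# original __init__, which is where _sync_module_states and the failing
# c10d.broadcast_ dispatch are reached.
from torch.nn.parallel import DistributedDataParallel

_old_ddp_init = DistributedDataParallel.__init__
DistributedDataParallel.__init__ = (
    lambda self, model, device_ids=None, output_device=None, *args, **kwargs:
    _old_ddp_init(self, model, *args, **kwargs))
```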