Closed
Labels: solved (This problem has been already solved)
Description
Reminder
- I have read the above rules and searched the existing issues.
System Info
[2025-04-03 06:02:20,328] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-03 06:02:21 [__init__.py:256] Automatically detected platform cuda.
- `llamafactory` version: 0.9.3.dev0
- Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
- Python version: 3.12.9
- PyTorch version: 2.6.0+cu118 (GPU)
- Transformers version: 4.50.0
- Datasets version: 3.4.1
- Accelerate version: 1.5.2
- PEFT version: 0.15.0
- TRL version: 0.9.6
- GPU type: NVIDIA GeForce RTX 4090
- GPU number: 8
- GPU memory: 23.65GB
- DeepSpeed version: 0.16.5
- vLLM version: 0.8.1
Reproduction
When running LoRA fine-tuning of Qwen2.5-VL-7B on 4x RTX 4090 with deepspeed=examples/deepspeed/ds_z3_config.json, the following error occurs:
Traceback (most recent call last):
File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
run(args)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
[rank1]: return inner()
[rank1]: ^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1793, in inner
[rank1]: result = forward_call(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 193, in forward
[rank1]: return self.model.forward(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1875, in forward
[rank1]: logits = self.lm_head(hidden_states)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]: return self._call_impl(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
[rank1]: return inner()
[rank1]: ^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1782, in inner
[rank1]: args_result = hook(self, args)
[rank1]: ^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 372, in _post_backward_module_hook
[rank1]: return apply_to_tensors_only(module.post_bwd_fn.apply,
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/utils.py", line 133, in apply_to_tensors_only
[rank1]: touched_output = apply_to_tensors_only(function, elem)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/utils.py", line 149, in apply_to_tensors_only
[rank1]: touched_output = function(value)
[rank1]: ^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
[rank1]: return super().apply(*args, **kwargs) # type: ignore[misc]
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 440, in forward
[rank1]: module.ds_grads_remaining += 1
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank1]: raise AttributeError(
[rank1]: AttributeError: 'Linear' object has no attribute 'ds_grads_remaining'
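For reference, a minimal LLaMA-Factory training config matching the setup described above might look like the following sketch. Only the `deepspeed` path and LoRA fine-tuning choice come from the report; the model path, dataset, template, and output directory are placeholders, and `cutoff_len`/`max_pixels` are intentionally left at their defaults as the question requires:

```yaml
### model (placeholder path; substitute your local checkpoint)
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json  # config from the report

### dataset (placeholder dataset name)
dataset: mllm_demo
template: qwen2_vl

### output (placeholder directory)
output_dir: saves/qwen2_5vl-7b/lora/sft
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
bf16: true
```

Such a config would typically be launched with `llamafactory-cli train <config>.yaml` across the available GPUs.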
Switching to ds_z2_config instead results in an OOM error.
In addition, when using ds_z3_config with the following options added:
enable_liger_kernel: True
use_unsloth_gc: True
a new error appears:
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>
[rank1]: launch()
[rank1]: File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch
[rank1]: run_exp()
[rank1]: File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/tuner.py", line 107, in run_exp
[rank1]: _training_function(config={"args": args, "callbacks": callbacks})
[rank1]: File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/tuner.py", line 69, in _training_function
[rank1]: run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
[rank1]: File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft
[rank1]: train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
[rank1]: return inner_training_loop(
[rank1]: ^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank1]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 3764, in training_step
[rank1]: self.accelerator.backward(loss, **kwargs)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/accelerate/accelerator.py", line 2351, in backward
[rank1]: self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank1]: self.engine.backward(loss, **kwargs)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2187, in backward
[rank1]: self._do_optimizer_backward(loss, retain_graph)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2133, in _do_optimizer_backward
[rank1]: self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]: ret_val = func(*args, **kwargs)
[rank1]: ^^^^^^^^^^^^^^^^^^^^^
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/stage3.py", line 2284, in backward
[rank1]: self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]: scaled_loss.backward(retain_graph=retain_graph)
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
[rank1]: torch.autograd.backward(
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]: _engine_run_backward(
[rank1]: File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank1]: return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
[rank1]: ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: The size of tensor a (0) must match the size of tensor b (3584) at non-singleton dimension 1
Is it possible to fix this by changing other settings, without modifying cutoff_len or max_pixels? Any help would be appreciated!
Others
No response