
Need Help!! qwen2.5vl7b lora sft with deepspeed zero3 #7588

@Luffy966

Description

Reminder

  • I have read the above rules and searched the existing issues.

System Info

[2025-04-03 06:02:20,328] [INFO] [real_accelerator.py:239:get_accelerator] Setting ds_accelerator to cuda (auto detect)
INFO 04-03 06:02:21 [__init__.py:256] Automatically detected platform cuda.

  • llamafactory version: 0.9.3.dev0
  • Platform: Linux-5.4.0-162-generic-x86_64-with-glibc2.31
  • Python version: 3.12.9
  • PyTorch version: 2.6.0+cu118 (GPU)
  • Transformers version: 4.50.0
  • Datasets version: 3.4.1
  • Accelerate version: 1.5.2
  • PEFT version: 0.15.0
  • TRL version: 0.9.6
  • GPU type: NVIDIA GeForce RTX 4090
  • GPU number: 8
  • GPU memory: 23.65GB
  • DeepSpeed version: 0.16.5
  • vLLM version: 0.8.1

Reproduction

While running LoRA fine-tuning of qwen2.5vl7b on 4 RTX 4090s with deepspeed=examples/deepspeed/ds_z3_config.json, the following error occurred:
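For context, the failing run corresponds to a LLaMA-Factory YAML roughly along these lines (a sketch only; every value below is an assumption reconstructed from the description, not the actual config used):

```yaml
### model -- hypothetical path, assumed from "qwen2.5vl7b"
model_name_or_path: Qwen/Qwen2.5-VL-7B-Instruct

### method
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
deepspeed: examples/deepspeed/ds_z3_config.json

### train -- placeholder values, not the reporter's settings
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
bf16: true
```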

Traceback (most recent call last):
  File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/distributed/run.py", line 918, in main
    run(args)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl                                                            
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                        
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                        
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl                                                                                               
[rank1]:     return inner()                                                                                                                                                                                                                 
[rank1]:            ^^^^^^^                                                                                                                                                                                                                 
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1793, in inner                                                                                                    
[rank1]:     result = forward_call(*args, **kwargs)                                                                                                                                                                                         
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                         
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/peft/tuners/tuners_utils.py", line 193, in forward                        
[rank1]:     return self.model.forward(*args, **kwargs)
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/models/qwen2_5_vl/modeling_qwen2_5_vl.py", line 1875, in forward
[rank1]:     logits = self.lm_head(hidden_states)
[rank1]:              ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                                                                                                                                           
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1739, in _wrapped_call_impl
[rank1]:     return self._call_impl(*args, **kwargs)                                                                                                                                                                                        
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                  
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1845, in _call_impl
[rank1]:     return inner()                                                                                                                                                                                                                 
[rank1]:            ^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1782, in inner
[rank1]:     args_result = hook(self, args)                                                                           
[rank1]:                   ^^^^^^^^^^^^^^^^                                                                           
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 372, in _post_backward_module_hook
[rank1]:     return apply_to_tensors_only(module.post_bwd_fn.apply,
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/utils.py", line 133, in apply_to_tensors_only
[rank1]:     touched_output = apply_to_tensors_only(function, elem)
[rank1]:                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/utils.py", line 149, in apply_to_tensors_only
[rank1]:     touched_output = function(value)
[rank1]:                      ^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/function.py", line 575, in apply
[rank1]:     return super().apply(*args, **kwargs)  # type: ignore[misc]
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                    
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 440, in forward
[rank1]:     module.ds_grads_remaining += 1                                                                           
[rank1]:     ^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1928, in __getattr__
[rank1]:     raise AttributeError(                 
[rank1]: AttributeError: 'Linear' object has no attribute 'ds_grads_remaining'
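The last frames can be read as a hook-state mismatch: ZeRO-3's pre-forward hook initializes a per-module counter `ds_grads_remaining`, and a backward-side hook later increments it, so any `Linear` that misses the initialization (e.g. a layer swapped in by PEFT/LoRA after the hooks were registered) fails exactly like this. A minimal sketch of that mechanism in plain PyTorch (not DeepSpeed's actual code, and the cause stated here is an assumption from the trace):

```python
import torch.nn as nn

def pre_forward(module):
    # DeepSpeed-style setup: initialize the per-module counter.
    module.ds_grads_remaining = 0

def post_backward(module):
    # DeepSpeed-style use: assumes pre_forward already ran on this module.
    module.ds_grads_remaining += 1

good = nn.Linear(4, 4)
pre_forward(good)
post_backward(good)      # fine: the counter exists
assert good.ds_grads_remaining == 1

fresh = nn.Linear(4, 4)  # stands in for a layer replaced after hook setup
try:
    post_backward(fresh)  # counter was never initialized
except AttributeError as e:
    print(e)  # 'Linear' object has no attribute 'ds_grads_remaining'
```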

Switching to ds_z2_config instead results in OOM.

In addition, when using ds_z3_config and adding the following options:

enable_liger_kernel: True
use_unsloth_gc: True

a new error appeared:

[rank1]: Traceback (most recent call last):                                                                                                                                                                                     
[rank1]:   File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/launcher.py", line 23, in <module>                                                                                                                           
[rank1]:     launch()                                                                                                                                                                                                                       
[rank1]:   File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/launcher.py", line 19, in launch                                                                                                                             
[rank1]:     run_exp()                                                                                                                                                                                                                      
[rank1]:   File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/tuner.py", line 107, in run_exp                                                                                                                        
[rank1]:     _training_function(config={"args": args, "callbacks": callbacks})                                                                                                                                                              
[rank1]:   File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/tuner.py", line 69, in _training_function                                                                                                              
[rank1]:     run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)                                                                                                                                     
[rank1]:   File "/home/mas-wang.zhenyu/download/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 102, in run_sft                                                                                                                 
[rank1]:     train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 2245, in train
[rank1]:     return inner_training_loop(
[rank1]:            ^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 2556, in _inner_training_loop
[rank1]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank1]:                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/transformers/trainer.py", line 3764, in training_step
[rank1]:     self.accelerator.backward(loss, **kwargs)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/accelerate/accelerator.py", line 2351, in backward
[rank1]:     self.deepspeed_engine_wrapped.backward(loss, **kwargs)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/accelerate/utils/deepspeed.py", line 266, in backward
[rank1]:     self.engine.backward(loss, **kwargs)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2187, in backward
[rank1]:     self._do_optimizer_backward(loss, retain_graph)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/engine.py", line 2133, in _do_optimizer_backward
[rank1]:     self.optimizer.backward(loss, retain_graph=retain_graph)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/utils/nvtx.py", line 20, in wrapped_fn
[rank1]:     ret_val = func(*args, **kwargs)
[rank1]:               ^^^^^^^^^^^^^^^^^^^^^
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/zero/stage3.py", line 2284, in backward
[rank1]:     self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/deepspeed/runtime/fp16/loss_scaler.py", line 63, in backward
[rank1]:     scaled_loss.backward(retain_graph=retain_graph)
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
[rank1]:     torch.autograd.backward(
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
[rank1]:     _engine_run_backward(
[rank1]:   File "/home/mas-wang.zhenyu/anaconda3/envs/vllm/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
[rank1]:     return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
[rank1]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank1]: RuntimeError: The size of tensor a (0) must match the size of tensor b (3584) at non-singleton dimension 1
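One plausible reading (an assumption, not a confirmed diagnosis) is that the fused Liger kernel touches a weight while it is still ZeRO-3-partitioned, i.e. while its local view is a 0-sized placeholder, so an elementwise op against the 3584-dim hidden states fails with exactly this message:

```python
import torch

# Sketch of the shape mismatch: under ZeRO-3 a parameter's local .data is an
# empty placeholder until DeepSpeed gathers the full weight. A kernel reading
# it outside the gather context sees a 0-sized tensor.

hidden = torch.randn(2, 3584)  # hidden states; 3584 is Qwen2.5-VL-7B's hidden size
stub = torch.empty(2, 0)       # stand-in for an un-gathered ZeRO-3 weight view

try:
    stub * hidden              # dim 1: 0 vs 3584, neither is 1, so no broadcast
except RuntimeError as e:
    print(e)  # The size of tensor a (0) must match the size of tensor b (3584) ...
```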

Is it possible to fix this by changing other settings, without modifying cutoff_len and max_pixels? Any help would be appreciated!!

Others

No response

Metadata

Labels: solved (this problem has been already solved)