qwen2.5-vl-72b, running via vllm_server_host: CUDA out of memory #4023

Open
quentin-wang opened this issue Apr 28, 2025 · 3 comments
Labels: enhancement (New feature or request)

Describe the bug
What the bug is, and how to reproduce, better with screenshots

qwen2.5-vl-72b: training uses GRPO with LoRA; inference runs via vllm_server_host (an external vLLM server).
At the very first training step, GPU memory is exhausted (OOM) inside _all_gather.

Your hardware and system info
Write your system info like CUDA version/system/GPU/torch version here
CUDA 12.4, 8× H800

torch==2.6
vllm==0.8.4

Additional context
Training configuration:
MASTER_ADDR=10.0.0.12 \
NPROC_PER_NODE=8 \
NNODES=1 \
NODE_RANK=0 \
swift rlhf \
    --rlhf_type grpo \
    --model /home/ap/nas_b/modelhub/Qwen2.5-VL-72B-Instruct \
    --train_type lora \
    --dataset 'AI-MO/NuminaMath-TIR#1000' \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --max_length 4096 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 200 \
    --save_steps 200 \
    --learning_rate 1e-4 \
    --save_total_limit 4 \
    --logging_steps 2 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 2048 \
    --external_plugins ms-swift/examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --num_generations 8 \
    --system ms-swift/examples/train/grpo/prompt.txt \
    --use_vllm true \
    --vllm_server_host 10.0.0.12 \
    --vllm_server_port 8000 \
    --deepspeed zero3_offload \
    --temperature 1.0 \
    --top_p 1.0 \
    --top_k 80 \
    --log_completions true \
    --num_infer_workers 1 \
    --num_iterations 1

Invalidate trace cache @ step 372: expected module 382, but got module 374
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.67e-06, 'memory(GiB)': 5.78, 'train_speed(iter/s)': 0.032201, 'completions/mean_length': 30.0, 'completions/min_length': 30.0, 'completions/max_length': 30.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.0, 'rewards/MultiModalAccuracyORM/std': 0.0, 'rewards/Format/mean': 0.0, 'rewards/Format/std': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/594', 'percentage': '0.17%', 'elapsed_time': '21s', 'remaining_time': '3h 28m 19s'}
Train: 0%| | 1/594 [00:21<3:28:13, 21.07s/it][rank4]: Traceback (most recent call last):
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/cli/rlhf.py", line 5, in
[rank4]: rlhf_main()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/rlhf.py", line 99, in rlhf_main
[rank4]: return SwiftRLHF(args).main()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/base.py", line 47, in main
[rank4]: result = self.run()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/sft.py", line 147, in run
[rank4]: return self.train(trainer)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/sft.py", line 207, in train
[rank4]: trainer.train(trainer.args.resume_from_checkpoint)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/mixin.py", line 298, in train
[rank4]: res = super().train(*args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank4]: return inner_training_loop(
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1155, in training_step
[rank4]: return super().training_step(model, inputs, num_items_in_batch)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in training_step
[rank4]: inputs = self._prepare_inputs(inputs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 899, in _prepare_inputs
[rank4]: inputs = self._generate_and_score_completions(accumulated_local_batch)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 829, in _generate_and_score_completions
[rank4]: inputs = self._generate_completions(inputs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 799, in _generate_completions
[rank4]: inputs, outputs = self._fast_infer(inputs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 742, in _fast_infer
[rank4]: self._move_model_to_vllm_lmdeploy()
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 520, in _move_model_to_vllm_lmdeploy
[rank4]: return super()._move_model_to_vllm()
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 847, in _move_model_to_vllm
[rank4]: with gather_if_zero3(list(self.model.parameters())):
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2235, in enter
[rank4]: self.params[0].all_gather(param_list=self.params)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1154, in all_gather
[rank4]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn

[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1522, in _all_gather
[rank4]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1810, in _allgather_params_coalesced
[rank4]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank4]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 4 has a total capacity of 79.11 GiB of which 462.56 MiB is free. Process 179754 has 78.61 GiB memory in use. Of the allocated memory 74.57 GiB is allocated by PyTorch, and 92.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation

hjh0119 (Collaborator) commented Apr 28, 2025

It appears that the OOM occurred while gathering the ZeRO-3-partitioned parameters. To reduce peak memory usage, you can use the move_model_batches parameter, for example --move_model_batches 20, to gather and sync the parameters in layer-wise batches, roughly as sketched below.
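
A minimal sketch of what layer-batched gathering means conceptually, assuming a DeepSpeed ZeRO-3 engine. This is not the actual ms-swift implementation, and sync_fn is a hypothetical stand-in for the call that pushes gathered weights to the vLLM engine/server:

```python
# Illustrative sketch only (not ms-swift's actual code): gather ZeRO-3 shards
# a few layers at a time instead of all at once. `sync_fn` is a hypothetical
# placeholder for the weight-update call toward vLLM.
import deepspeed

def move_model_in_batches(model, num_batches, sync_fn):
    named_params = list(model.named_parameters())
    chunk_size = (len(named_params) + num_batches - 1) // num_batches
    for start in range(0, len(named_params), chunk_size):
        group = named_params[start:start + chunk_size]
        # Materialize only this slice of parameters, so full copies of the
        # whole 72B model never coexist on a single GPU.
        with deepspeed.zero.GatheredParameters([p for _, p in group], modifier_rank=None):
            for name, p in group:
                sync_fn(name, p.data)
        # Exiting the context releases/re-partitions this slice before the next one.
```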

quentin-wang (Author) commented Apr 28, 2025

Thanks for the reply. I set --move_model_batches 20, but the OOM still occurs. It seems to happen in:

self._move_model_to_vllm_lmdeploy()

--> return super()._move_model_to_vllm()

and from there it goes into trl.
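
For reference, the trl frame in the traceback (gather_if_zero3(list(self.model.parameters()))) gathers all parameters inside a single context, roughly the pattern below (an illustration only, not trl's actual code; sync_fn is again a hypothetical placeholder), which would explain why the batching flag does not help on this path:

```python
# Illustration of the all-at-once pattern visible in the traceback above
# (trl's _move_model_to_vllm); not trl's actual code. `sync_fn` is a
# hypothetical placeholder for the weight-sync call.
import deepspeed

def move_model_all_at_once(model, sync_fn):
    params = list(model.parameters())
    # Every ZeRO-3 shard is materialized at the same time, so peak memory
    # scales with the full 72B parameter count rather than one batch of layers.
    with deepspeed.zero.GatheredParameters(params, modifier_rank=None):
        for name, p in model.named_parameters():
            sync_fn(name, p.data)
```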

hjh0119 (Collaborator) commented Apr 28, 2025

Thanks for reporting. I checked, and external vLLM server mode currently does not support the move_model_batches argument; I will add support for this soon.

@hjh0119 hjh0119 self-assigned this Apr 28, 2025
@hjh0119 hjh0119 added the enhancement New feature or request label Apr 28, 2025