It appears that the OOM occurred while gathering the ZeRO-3 partitioned parameters before syncing them to vLLM. To reduce peak memory usage, you can use the move_model_batches parameter, for example --move_model_batches 20, so that the parameters are gathered and transferred in smaller batches of layers rather than all at once.
Thanks for reporting. I checked, and the external vLLM (server) mode does not currently support the move_model_batches argument. I will add support for this feature soon.
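For reference, the comment above implies the flag is already honored outside of external-server mode; once it is supported there as well, it would simply be appended to the launch command shown under Additional context below. A minimal sketch (the value 20 is just the example from the reply, all other arguments unchanged):
swift rlhf \
    --rlhf_type grpo \
    ... (all other arguments exactly as in the command below) ... \
    --move_model_batches 20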
Describe the bug
What the bug is, and how to reproduce it (screenshots are helpful).
Training Qwen2.5-VL-72B with GRPO + LoRA; rollout/inference runs through an external vLLM server (vllm_server_host).
At the very first training step, GPU memory is exhausted (CUDA OOM) inside _all_gather.
Your hardware and system info
Write your system info such as CUDA version, OS, GPU model, and torch version here.
CUDA 12.4, 8× H800
torch==2.6
vllm==0.8.4
Additional context
Training configuration:
MASTER_ADDR=10.0.0.12 \
NPROC_PER_NODE=8 \
NNODES=1 \
NODE_RANK=0 \
swift rlhf \
    --rlhf_type grpo \
    --model /home/ap/nas_b/modelhub/Qwen2.5-VL-72B-Instruct \
    --train_type lora \
    --dataset 'AI-MO/NuminaMath-TIR#1000' \
    --torch_dtype bfloat16 \
    --num_train_epochs 2 \
    --max_length 4096 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --eval_steps 200 \
    --save_steps 200 \
    --learning_rate 1e-4 \
    --save_total_limit 4 \
    --logging_steps 2 \
    --output_dir output \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_completion_length 2048 \
    --external_plugins ms-swift/examples/train/grpo/plugin/plugin.py \
    --reward_funcs external_r1v_acc format \
    --num_generations 8 \
    --system ms-swift/examples/train/grpo/prompt.txt \
    --use_vllm true \
    --vllm_server_host 10.0.0.12 \
    --vllm_server_port 8000 \
    --deepspeed zero3_offload \
    --temperature 1.0 \
    --top_p 1.0 \
    --top_k 80 \
    --log_completions true \
    --num_infer_workers 1 \
    --num_iterations 1
Invalidate trace cache @ step 372: expected module 382, but got module 374
{'loss': 0.0, 'grad_norm': 0.0, 'learning_rate': 2.67e-06, 'memory(GiB)': 5.78, 'train_speed(iter/s)': 0.032201, 'completions/mean_length': 30.0, 'completions/min_length': 30.0, 'completions/max_length': 30.0, 'completions/clipped_ratio': 0.0, 'rewards/MultiModalAccuracyORM/mean': 0.0, 'rewards/MultiModalAccuracyORM/std': 0.0, 'rewards/Format/mean': 0.0, 'rewards/Format/std': 0.0, 'reward': 0.0, 'reward_std': 0.0, 'kl': 0.0, 'clip_ratio': 0.0, 'epoch': 0.0, 'global_step/max_steps': '1/594', 'percentage': '0.17%', 'elapsed_time': '21s', 'remaining_time': '3h 28m 19s'}
Train: 0%| | 1/594 [00:21<3:28:13, 21.07s/it][rank4]: Traceback (most recent call last):
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/cli/rlhf.py", line 5, in <module>
[rank4]: rlhf_main()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/rlhf.py", line 99, in rlhf_main
[rank4]: return SwiftRLHF(args).main()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/base.py", line 47, in main
[rank4]: result = self.run()
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/sft.py", line 147, in run
[rank4]: return self.train(trainer)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/llm/train/sft.py", line 207, in train
[rank4]: trainer.train(trainer.args.resume_from_checkpoint)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/mixin.py", line 298, in train
[rank4]: res = super().train(*args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 2245, in train
[rank4]: return inner_training_loop(
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 2560, in _inner_training_loop
[rank4]: tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 1155, in training_step
[rank4]: return super().training_step(model, inputs, num_items_in_batch)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/transformers/trainer.py", line 3730, in training_step
[rank4]: inputs = self._prepare_inputs(inputs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 899, in _prepare_inputs
[rank4]: inputs = self._generate_and_score_completions(accumulated_local_batch)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 829, in _generate_and_score_completions
[rank4]: inputs = self._generate_completions(inputs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 799, in _generate_completions
[rank4]: inputs, outputs = self._fast_infer(inputs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 742, in _fast_infer
[rank4]: self._move_model_to_vllm_lmdeploy()
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/production/wb_workspace/ms-swift.github.0426/swift/trainers/rlhf_trainer/grpo_trainer.py", line 520, in _move_model_to_vllm_lmdeploy
[rank4]: return super()._move_model_to_vllm()
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/extras/profiling.py", line 87, in wrapper
[rank4]: return func(self, *args, **kwargs)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/trl/trainer/grpo_trainer.py", line 847, in _move_model_to_vllm
[rank4]: with gather_if_zero3(list(self.model.parameters())):
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 2235, in __enter__
[rank4]: self.params[0].all_gather(param_list=self.params)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1154, in all_gather
[rank4]: return self._all_gather(param_list, async_op=async_op, hierarchy=hierarchy)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1522, in _all_gather
[rank4]: self._allgather_params_coalesced(all_gather_nonquantize_list, hierarchy, quantize=False)
[rank4]: File "/home/ap/nas_b/miniconda3/envs/qwen25vl.train4/lib/python3.10/site-packages/deepspeed/runtime/zero/partition_parameters.py", line 1810, in _allgather_params_coalesced
[rank4]: flat_tensor = torch.empty(tensor_size, dtype=param_list[0].ds_tensor.dtype,
[rank4]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 462.00 MiB. GPU 4 has a total capacity of 79.11 GiB of which 462.56 MiB is free. Process 179754 has 78.61 GiB memory in use. Of the allocated memory 74.57 GiB is allocated by PyTorch, and 92.46 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation
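As a stopgap, the allocator hint from the error message itself can also be tried; a minimal sketch, assuming the variable is exported in the same shell before running the training command above:
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
Note that this only mitigates fragmentation of already-reserved memory; the peak itself comes from gathering the full 72B ZeRO-3 weights on each rank, which is what move_model_batches is meant to address.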