OOM during full fine-tuning of Qwen3-32B on 8x H200 with DeepSpeed ZeRO-2 #8643
fangjin001024 asked this question in Q&A
System Info
8x H200, DeepSpeed ZeRO-2, full fine-tuning of Qwen3-32B, bf16 mixed precision.
Memory progression: get_dataset() 3.4 GB → load_model (from_pretrained, 67 GB) → init_adapter (67 GB) → gradient accumulation (**100 GB**) → trainer.train() (OOM).
After the model finishes loading, VRAM usage sits at a normal 67 GB. With gradient accumulation enabled, computing and storing the gradients for each batch takes over 32 GB more, which looks abnormal: by a rough estimate, the gradients on each GPU should only be 32 × 4 / 8 = 16 GB (see the sketch below). The subsequent parameter update then runs out of VRAM outright.
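For context, here is a back-of-the-envelope per-GPU budget under ZeRO-2 (a minimal sketch, not a measurement: it assumes bf16 parameters replicated on every GPU plus fp32 gradients and two fp32 Adam states sharded across 8 GPUs; the real DeepSpeed layout also holds fp32 master weights, activations, and communication buffers, so actual usage is higher):

```python
# Back-of-the-envelope per-GPU memory for ZeRO-2 full fine-tuning of a 32B model.
# Assumptions (not measured): bf16 params replicated (ZeRO-2 does not shard params),
# fp32 gradients and two fp32 Adam states sharded across all GPUs.
PARAMS = 32e9
GPUS = 8
GB = 1024**3

params_bf16 = PARAMS * 2 / GB        # ~60 GB per GPU, replicated
grads_fp32 = PARAMS * 4 / GPUS / GB  # ~15 GB per GPU, the 32*4/8 = 16 GB estimate above
adam_states = PARAMS * 8 / GPUS / GB # ~30 GB per GPU (momentum + variance in fp32)

total = params_bf16 + grads_fp32 + adam_states
print(f"params {params_bf16:.1f} GB + grads {grads_fp32:.1f} GB + "
      f"optimizer {adam_states:.1f} GB = {total:.1f} GB (+ activations/buffers)")
```

Even this optimistic estimate lands around 105 GB per GPU before activations and DeepSpeed buffers, which is consistent with the ~100 GB reading above.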
Command used:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 llamafactory-cli train examples/train_full/llama3_full_sft.yaml
```
Training config:
```yaml
### model
model_name_or_path: /opt/workspace/model/Qwen/Qwen3-32B
trust_remote_code: true

### method
stage: sft
do_train: true
finetuning_type: full
deepspeed: examples/deepspeed/ds_z2_config.json  # choices: [ds_z0_config.json, ds_z2_config.json, ds_z3_config.json]

### dataset
dataset: alpaca_zh_demo
template: qwen3
cutoff_len: 1024
max_samples: 1000
overwrite_cache: true
preprocessing_num_workers: 16
dataloader_num_workers: 4

### output
output_dir: saves/qwen3-32b/full/sft
logging_steps: 2
save_steps: 500
plot_loss: true
overwrite_output_dir: true
save_only_model: false
report_to: none  # choices: [none, wandb, tensorboard, swanlab, mlflow]

### train
per_device_train_batch_size: 1
gradient_accumulation_steps: 1
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
learning_rate: 1.0e-5
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true
pure_bf16: false
ddp_timeout: 180000000
resume_from_checkpoint: null
```
DeepSpeed config:

```json
{
  "train_batch_size": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_allow_untested_optimizer": true,
  "fp16": {
    "enabled": false,
    "loss_scale": 0,
    "loss_scale_window": 1000,
    "initial_scale_power": 16,
    "hysteresis": 2,
    "min_loss_scale": 1
  },
  "bf16": {
    "enabled": true
  },
  "zero_optimization": {
    "stage": 2,
    "allgather_partitions": true,
    "allgather_bucket_size": 5e8,
    "overlap_comm": false,
    "reduce_scatter": true,
    "reduce_bucket_size": 5e8,
    "contiguous_gradients": true,
    "round_robin_gradients": true,
    "load_from_fp32_weights": false
  },
  "flops_profiler": {
    "enabled": true,
    "profile_step": 6,
    "module_depth": -1,
    "top_modules": 1,
    "detailed": true,
    "output_file": "saves/qwen3-8b/full/sft/flops_report.txt"
  }
}
```
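To pinpoint exactly where the extra ~32 GB appears, memory can be logged around each stage with the standard torch.cuda counters (a minimal diagnostic sketch; the stage labels are just illustrative):

```python
import torch

def log_mem(tag: str) -> None:
    # allocated = live tensors; reserved = what the caching allocator holds
    # from the driver; peak = high-water mark since the last reset.
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    peak = torch.cuda.max_memory_allocated() / 1024**3
    print(f"[{tag}] allocated={alloc:.1f} GB  reserved={reserved:.1f} GB  peak={peak:.1f} GB")

# Illustrative call sites matching the stages reported above:
# log_mem("after from_pretrained")  # expect ~67 GB
# log_mem("after first backward")   # where the ~100 GB reading comes from
# log_mem("after optimizer.step")   # where the OOM occurs
```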