Replies: 3 comments 6 replies
-
Is it because bf16 is not enabled in the DeepSpeed configuration?
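For reference, a minimal sketch of what an explicitly enabled bf16 section in a DeepSpeed config could look like (the actual contents of the `ds_zero2_no_offload.json` used later in this thread are an assumption on my part):

```json
{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  }
}
```

The OVERFLOW messages come from the fp16 dynamic loss scaler, so they only appear when fp16 mixed precision is active; bf16 has the same exponent range as fp32 and does not use loss scaling.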
-
Even with float16, I am getting a similar issue:
[2023-09-03 22:41:56,720] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
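The hysteresis message comes from DeepSpeed's dynamic loss scaler, which is configured through the `fp16` section of the DeepSpeed config. A hedged sketch of the relevant knobs (the values shown are DeepSpeed's usual defaults, not taken from the original config):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

With `loss_scale: 0` the scaler is dynamic: overflowing steps are skipped, the hysteresis counter is consumed first, only then is the scale reduced, and it is raised again after `loss_scale_window` clean steps.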
-
Hi, I was able to resolve it and now I don't have overflow issues. Thank you.
-
[INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 128, reducing to 64
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32, reducing to 16
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0, but hysteresis is 2. Reducing hysteresis to 1
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0, reducing to 0
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
model_type="llama"
tokenizer_path_lang_vac="./vac_bpe.model"
dataset_dir_2=./data/vac/
data_cache=temp_data_cache_dir
per_device_train_batch_size=4
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir=vac_llama_peft_scratch_output_dir
deepspeed_config_file=ds_zero2_no_offload.json
torchrun --nnodes 1 --nproc_per_node 1 run_clm_llama_pretraining_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_type ${model_type} \
    --tokenizer_name_or_path ${tokenizer_path_lang_vac} \
    --dataset_dir ${dataset_dir_2} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.05 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed 42 \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type linear \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir True \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True \
    --ddp_find_unused_parameters False
Could you tell me why there is overflow and why the loss is not decreasing much? The model is not learning anything.
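One thing that may be worth double-checking in the launch above is that it passes `--fp16` together with `--torch_dtype bfloat16`. If the intent is to train in bf16, a hedged sketch of the corresponding DeepSpeed sections (assuming the standard HF Trainer integration, where "auto" defers to the command-line flags) would be:

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  }
}
```

With that config, passing `--bf16` instead of `--fp16` (assuming the script exposes the standard HF TrainingArguments flag) switches to bf16 mixed precision, which does not go through the loss-scale/overflow path at all.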