Replies: 3 comments 6 replies
-
Is it because bf16 is not enabled in the DeepSpeed configuration?
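For reference, a minimal sketch of what an explicitly enabled bf16 section in a DeepSpeed config could look like (the actual contents of the `ds_zero2_no_offload.json` used later in this thread are an assumption on my part):

```json
{
  "bf16": {
    "enabled": true
  },
  "fp16": {
    "enabled": false
  }
}
```

The OVERFLOW messages come from the fp16 dynamic loss scaler, so they only appear when fp16 mixed precision is active; bf16 has the same exponent range as fp32 and does not use loss scaling.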
-
Even with float16, I am getting a similar issue:
[2023-09-03 22:41:56,720] [INFO] [loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 65536, but hysteresis is 2. Reducing hysteresis to 1
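The hysteresis message comes from DeepSpeed's dynamic loss scaler, which is configured through the `fp16` section of the DeepSpeed config. A hedged sketch of the relevant knobs (the values shown are DeepSpeed's usual defaults, not taken from the original config):

```json
{
  "fp16": {
    "enabled": true,
    "loss_scale": 0,
    "initial_scale_power": 16,
    "loss_scale_window": 1000,
    "hysteresis": 2,
    "min_loss_scale": 1
  }
}
```

With `loss_scale: 0` the scaler is dynamic: overflowing steps are skipped, the hysteresis counter is consumed first, only then is the scale reduced, and it is raised again after `loss_scale_window` clean steps.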
-
Hi, I was able to resolve it and now I don't have overflow issues. Thank you.
-
[INFO] [loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 128, reducing to 64
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 32, reducing to 16
[loss_scaler.py:190:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0, but hysteresis is 2. Reducing hysteresis to 1
[loss_scaler.py:183:update_scale] [deepspeed] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 0, reducing to 0
lr=2e-4
lora_rank=64
lora_alpha=128
lora_trainable="q_proj,v_proj,k_proj,o_proj,gate_proj,down_proj,up_proj"
modules_to_save="embed_tokens,lm_head"
lora_dropout=0.05
model_type="llama"
tokenizer_path_lang_vac="./vac_bpe.model"
dataset_dir_2=./data/vac/
data_cache=temp_data_cache_dir
per_device_train_batch_size=4
per_device_eval_batch_size=1
gradient_accumulation_steps=8
output_dir=vac_llama_peft_scratch_output_dir
deepspeed_config_file=ds_zero2_no_offload.json
torchrun --nnodes 1 --nproc_per_node 1 run_clm_llama_pretraining_peft.py \
    --deepspeed ${deepspeed_config_file} \
    --model_type ${model_type} \
    --tokenizer_name_or_path ${tokenizer_path_lang_vac} \
    --dataset_dir ${dataset_dir_2} \
    --data_cache_dir ${data_cache} \
    --validation_split_percentage 0.05 \
    --per_device_train_batch_size ${per_device_train_batch_size} \
    --per_device_eval_batch_size ${per_device_eval_batch_size} \
    --do_train \
    --do_eval \
    --seed 42 \
    --fp16 \
    --num_train_epochs 1 \
    --lr_scheduler_type linear \
    --learning_rate ${lr} \
    --warmup_ratio 0.05 \
    --weight_decay 0.01 \
    --logging_strategy steps \
    --logging_steps 10 \
    --save_strategy steps \
    --save_total_limit 2 \
    --save_steps 200 \
    --gradient_accumulation_steps ${gradient_accumulation_steps} \
    --preprocessing_num_workers 8 \
    --block_size 512 \
    --output_dir ${output_dir} \
    --overwrite_output_dir True \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --lora_rank ${lora_rank} \
    --lora_alpha ${lora_alpha} \
    --trainable ${lora_trainable} \
    --lora_dropout ${lora_dropout} \
    --torch_dtype bfloat16 \
    --gradient_checkpointing True \
    --ddp_find_unused_parameters False
Could you tell me why there is overflow and why the loss is not decreasing much? The model is not learning anything.
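One thing that may be worth double-checking in the launch above is that it passes `--fp16` together with `--torch_dtype bfloat16`. If the intent is to train in bf16, a hedged sketch of the corresponding DeepSpeed sections (assuming the standard HF Trainer integration, where "auto" defers to the command-line flags) would be:

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "bf16": {
    "enabled": "auto"
  }
}
```

With that config, passing `--bf16` instead of `--fp16` (assuming the script exposes the standard HF TrainingArguments flag) switches to bf16 mixed precision, which does not go through the loss-scale/overflow path at all.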