I have trained a model using the following config:

After training, I save the model using:

model_engine.save_checkpoint(SAVE_PATH, tag=tag)

I then try to finetune it on another dataset, so I first load the model using:

Note that I only want to load the model, not the optimizer. As a side note, I have tried setting load_module_only=True as suggested here: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html , but then I get the error: TypeError: load_checkpoint() got an unexpected keyword argument 'load_module_only'.

I then run the script with the following config:

Note that I'm now using a dynamic loss scale. What happens then is that the loss keeps overflowing while the loss scaler never changes the scale; it seems to be stuck at the value assigned during the initial training phase. These scripts are run separately, meaning the model is reinitialized before training starts, so I should not be running into the issues warned about here: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html .
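For reference, dynamic loss scaling in DeepSpeed is requested by leaving "loss_scale" at 0 in the fp16 section of the config. A minimal sketch of what I mean by a dynamic fp16 config; the surrounding values here are illustrative, not my actual training settings:

```python
# Minimal sketch of a DeepSpeed config requesting dynamic fp16 loss scaling.
# "loss_scale": 0 is DeepSpeed's convention for dynamic scaling; the other
# fp16 values shown are the documented defaults, included for clarity.
ds_config = {
    "train_batch_size": 16,        # illustrative, not my real batch size
    "fp16": {
        "enabled": True,
        "loss_scale": 0,           # 0 => dynamic loss scaling
        "initial_scale_power": 16, # initial scale = 2**16
        "loss_scale_window": 1000, # overflow-free steps before scale is raised
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
}

print(ds_config["fp16"]["loss_scale"])  # 0, i.e. dynamic
```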
[2021-10-24 12:00:47,525] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:50,348] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:53,481] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:56,459] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:59,403] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:01:02,289] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:01:05,325] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
How can I activate dynamic loss scaling when I finetune the model? I have tried manually setting self.loss_scaler = LossScaler(static_loss_scale) (with both LossScaler and DynamicLossScaler), as suggested in #138, but it changed nothing. I would really appreciate some help!
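Since load_module_only is not accepted by my DeepSpeed version, the closest documented alternative I can see is the load_optimizer_states / load_lr_scheduler_states keywords of load_checkpoint, which should skip restoring the optimizer (and with it the saved loss-scaler state). A hedged sketch of how I would call it; the engine, path, and tag are placeholders:

```python
def load_model_only(engine, ckpt_dir, tag):
    """Load module weights from a DeepSpeed checkpoint while skipping
    optimizer and LR-scheduler state.

    `engine` is assumed to be a deepspeed.DeepSpeedEngine. The
    load_optimizer_states / load_lr_scheduler_states keywords are the
    documented fallback when load_module_only is unavailable."""
    load_path, client_state = engine.load_checkpoint(
        ckpt_dir,
        tag=tag,
        load_optimizer_states=False,     # do not restore optimizer state
        load_lr_scheduler_states=False,  # keep the new run's LR schedule
    )
    return load_path, client_state
```

Whether this also resets the fp16 loss scaler to the new config's dynamic scaling is exactly what I am unsure about.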