I have trained a model using the following config:

After training, I save the model using:

model_engine.save_checkpoint(SAVE_PATH, tag=tag)

I then try to finetune it on another dataset, so I first load the model using:

Note that I only want to load the model, not the optimizer. As a side note, I have tried setting load_module_only=True as suggested here: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html , but then I get the error: TypeError: load_checkpoint() got an unexpected keyword argument 'load_module_only'.

I then run the script with the following config:

Note that I'm now using a dynamic loss scale. What happens then is that the loss keeps overflowing while the loss scaler never changes the scale; it seems to be stuck at the value assigned during the initial training phase. These scripts are run separately, meaning the model is reinitialized before training starts, so I should not be running into the issues warned about here: https://deepspeed.readthedocs.io/en/latest/model-checkpointing.html .
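For reference, dynamic loss scaling in DeepSpeed is requested by leaving "loss_scale" at 0 in the fp16 section of the config. A minimal sketch of what I mean by a dynamic fp16 config; the surrounding values here are illustrative, not my actual training settings:

```python
# Minimal sketch of a DeepSpeed config requesting dynamic fp16 loss scaling.
# "loss_scale": 0 is DeepSpeed's convention for dynamic scaling; the other
# fp16 values shown are the documented defaults, included for clarity.
ds_config = {
    "train_batch_size": 16,        # illustrative, not my real batch size
    "fp16": {
        "enabled": True,
        "loss_scale": 0,           # 0 => dynamic loss scaling
        "initial_scale_power": 16, # initial scale = 2**16
        "loss_scale_window": 1000, # overflow-free steps before scale is raised
        "hysteresis": 2,
        "min_loss_scale": 1,
    },
}

print(ds_config["fp16"]["loss_scale"])  # 0, i.e. dynamic
```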
[2021-10-24 12:00:47,525] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:50,348] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:53,481] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:56,459] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:00:59,403] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:01:02,289] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
[2021-10-24 12:01:05,325] [INFO] [stage3.py:2704:_overflow_clean_up] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 1024, reducing to 1024
How can I activate dynamic loss scaling when I finetune the model? I have tried manually setting self.loss_scaler = LossScaler(static_loss_scale) (with both LossScaler and DynamicLossScaler), as suggested in #138, but it changed nothing. I would really appreciate some help!
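Since load_module_only is not accepted by my DeepSpeed version, the closest documented alternative I can see is the load_optimizer_states / load_lr_scheduler_states keywords of load_checkpoint, which should skip restoring the optimizer (and with it the saved loss-scaler state). A hedged sketch of how I would call it; the engine, path, and tag are placeholders:

```python
def load_model_only(engine, ckpt_dir, tag):
    """Load module weights from a DeepSpeed checkpoint while skipping
    optimizer and LR-scheduler state.

    `engine` is assumed to be a deepspeed.DeepSpeedEngine. The
    load_optimizer_states / load_lr_scheduler_states keywords are the
    documented fallback when load_module_only is unavailable."""
    load_path, client_state = engine.load_checkpoint(
        ckpt_dir,
        tag=tag,
        load_optimizer_states=False,     # do not restore optimizer state
        load_lr_scheduler_states=False,  # keep the new run's LR schedule
    )
    return load_path, client_state
```

Whether this also resets the fp16 loss scaler to the new config's dynamic scaling is exactly what I am unsure about.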