Is there a reason why model with activation checkpointing behaves differently #1117

Moldoteck · 2021-05-29T15:04:15Z

Moldoteck
May 29, 2021

I have the same model, same parameters, one gpu. With classical training I get good accuracy scores after first epoch. With checkpointing+ customized forward in the model, accuracy drops almost 5 times and it converges very very slow.
Can this be because of loss function?
Or maybe the problem is in the network architecture? Here it is
Here is my config:

{
    "gradient_accumulation_steps": 1,
    "train_micro_batch_size_per_gpu": 3,
    "zero_allow_untested_optimizer":true,
    "activation_checkpointing": {
        "partition_activations": true,
        "contiguous_memory_optimization": true,
        "cpu_checkpointing": true,
        "number_checkpoints": 12
    }
}

Here is how checkpointing is configured:
deepspeed.checkpointing.configure(None, deepspeed_config='./modules/ds_config.json')
Here is how checkpointing is implemented:

def checkpoint_forward(self, x, chunk_length=1):
        def custom(start, end):
            def custom_forward(inputs):
                x_ = inputs[0]
                y_array = inputs[1]
                for ind, layer in enumerate(self.layers):
                    if start<=ind<end:
                        if ind < 3:
                            x_ = self.layers[ind](x_)
                        elif ind < 7:
                            x_, y_t = layer(x_)#layers with 2 outputs
                            y_array.append(y_t)#skip connection saving
                        elif ind == 7:
                            x_ = layer(x_)
                        elif ind < 12:
                            x_ = layer(x_, y_array[-1])#skip connection using
                            y_array=y_array[:-1]
                
                return x_, y_array
            return custom_forward

        l = 0
        num_layers = len(self.layers)
        hidden_states = x
        dy_array = []
        while l < num_layers:
            end = l+chunk_length
            if end > num_layers:
                end = num_layers
            hidden_states, dy_array = checkpoint(custom(l, end),[hidden_states, dy_array])
            l = end
        return hidden_states

Layers of this network are custom nn.Modules which contain Conv2d, LeakyRelu, BatchNorm2d, Dropout2d and AvgPool2d
And results for comparison:

Without checkpointing:
Time avg per batch 0.165
Loss avg 1.8970
Jaccard avg 0.6063
WCE avg 1.2907
Acc avg 0.705
IoU avg 0.275

With checkpointing:
Time avg per batch 0.163
Loss avg 3.7500
Jaccard avg 0.9243
WCE avg 2.8257
Acc avg 0.175
IoU avg 0.033

Moldoteck · 2021-06-03T11:49:05Z

Moldoteck
Jun 3, 2021
Author

Ok, so found out a possible problem/solution. I have replaced all that custom checkpointing with just>

downCntx = checkpoint(self.downCntx, x.requires_grad_())
downCntx = checkpoint(self.downCntx2, downCntx.requires_grad_())
downCntx = checkpoint(self.downCntx3, downCntx.requires_grad_())

down0c, down0b = checkpoint(self.resBlock1, downCntx.requires_grad_())
down1c, down1b = checkpoint(self.resBlock2, down0c.requires_grad_())
down2c, down2b = checkpoint(self.resBlock3, down1c.requires_grad_())
down3c, down3b = checkpoint(self.resBlock4, down2c.requires_grad_())
down5c = checkpoint(self.resBlock5, down3c.requires_grad_())

up4e = checkpoint(self.upBlock1, down5c.requires_grad_(), down3b.requires_grad_())
up3e = checkpoint(self.upBlock2, up4e.requires_grad_(), down2b.requires_grad_())
up2e = checkpoint(self.upBlock3, up3e.requires_grad_(), down1b.requires_grad_())
up1e = checkpoint(self.upBlock4, up2e.requires_grad_(), down0b.requires_grad_())

And suddenly there is no more accuracy drop, the only downside being that this consumes a bit more GPU memory. I don't know what could be the problem here, but at least it works as expected now

0 replies

tjruwase · 2021-06-14T21:43:58Z

tjruwase
Jun 14, 2021
Maintainer

@Moldoteck, this is quite interesting. So it seems the problem is not with deepspeed checkpointing, but rather how it is being used? You could also try torch checkpointing to see how that behaves.

1 reply

Moldoteck Jun 15, 2021
Author

@Moldoteck, this is quite interesting. So it seems the problem is not with deepspeed checkpointing, but rather how it is being used? You could also try torch checkpointing to see how that behaves.

Im not sure, but I think pytorch implementation of checkpointing needs an input with gradient:

NOTE: In case of checkpointing, if all the inputs don’t require grad but the outputs do, then if the inputs are passed as is, the output of Checkpoint will be variable which don’t require grad and autograd tape will break there. To get around, you can pass a dummy input which requires grad but isn’t necessarily used in computation.
Took it from here: https://github.com/prigoyal/pytorch_memonger/blob/master/tutorial/Checkpointing_for_PyTorch_models.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a reason why model with activation checkpointing behaves differently #1117

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{title}}

Select a reply

Is there a reason why model with activation checkpointing behaves differently #1117

Moldoteck May 29, 2021

Replies: 2 comments · 1 reply

Moldoteck Jun 3, 2021 Author

tjruwase Jun 14, 2021 Maintainer

Moldoteck Jun 15, 2021 Author

Moldoteck
May 29, 2021

Replies: 2 comments 1 reply

Moldoteck
Jun 3, 2021
Author

tjruwase
Jun 14, 2021
Maintainer

Moldoteck Jun 15, 2021
Author