Replies: 2 comments 1 reply
-
Ok, so I found a possible problem/solution. I replaced all that custom checkpointing with just:
And suddenly there is no more accuracy drop; the only downside is that this consumes a bit more GPU memory. I don't know what the problem was, but at least it works as expected now.
-
@Moldoteck, this is quite interesting. So it seems the problem is not with DeepSpeed checkpointing itself, but rather with how it is being used? You could also try torch checkpointing to see how that behaves.
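In case it helps, a minimal sketch of what swapping in torch checkpointing could look like, assuming the model can be expressed as a list of sub-blocks (the `CheckpointedNet` wrapper and `blocks` argument are illustrative, not taken from the code in this thread):

```python
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    """Hypothetical wrapper: same sub-modules, but checkpointed with
    torch.utils.checkpoint instead of deepspeed.checkpointing."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            # Each block's activations are discarded in the forward pass and
            # recomputed during backward, trading compute for memory.
            x = checkpoint(block, x, use_reentrant=False)  # non-reentrant variant, newer PyTorch
        return x
```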
-
I have the same model, same parameters, one GPU. With classical training I get good accuracy scores after the first epoch. With checkpointing plus a customized forward in the model, accuracy drops almost 5x and it converges very, very slowly.
Can this be because of the loss function?
Or maybe the problem is in the network architecture? Here it is:
Here is my config:
Here is how checkpointing is configured:
```python
deepspeed.checkpointing.configure(None, deepspeed_config='./modules/ds_config.json')
```
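For completeness, the same activation-checkpointing options can also be passed to `configure` directly instead of through the JSON file; the flag values below are illustrative assumptions, not the contents of `ds_config.json`:

```python
import deepspeed

# Illustrative only: these keyword arguments mirror the
# "activation_checkpointing" section of a DeepSpeed config file.
deepspeed.checkpointing.configure(
    None,                         # mpu: no model-parallel unit on a single GPU
    partition_activations=False,  # assumed values, not taken from ds_config.json
    contiguous_checkpointing=False,
    checkpoint_in_cpu=False,
    synchronize=False,
    profile=False,
)
```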
Here is how checkpointing is implemented:
The layers of this network are custom nn.Modules which contain Conv2d, LeakyReLU, BatchNorm2d, Dropout2d, and AvgPool2d.
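The checkpointed forward itself isn't reproduced above, but the usual pattern with blocks like these looks roughly like the sketch below; the block definition, channel sizes, and classifier head are made up for illustration and are not the model from this thread:

```python
import deepspeed
import torch.nn as nn

class ConvBlock(nn.Module):
    """Illustrative block using the layer types listed above; sizes are made up."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.LeakyReLU(0.1),
            nn.Dropout2d(0.2),
            nn.AvgPool2d(2),
        )

    def forward(self, x):
        return self.body(x)

class Net(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.blocks = nn.ModuleList([ConvBlock(3, 32), ConvBlock(32, 64)])
        self.head = nn.Linear(64, num_classes)

    def forward(self, x):
        for block in self.blocks:
            # Each block is re-executed during backward instead of
            # keeping its activations in GPU memory.
            x = deepspeed.checkpointing.checkpoint(block, x)
        x = x.mean(dim=(2, 3))  # global average pool before the classifier
        return self.head(x)
```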
And results for comparison: