How to debug a failure in DeepSpeed stage 3 but not stage 2 #4429

baxelrod · 2023-09-30T22:27:25Z

I'm using PyTorch Lightning with DeepSpeed, but I'm having trouble training the model with stage 3, even though stage 2 works.

The model is part of a very large codebase and difficult to share, but it is a VAE with an adversarial loss, so nothing too out of the ordinary. I did have to implement a small workaround to get the adversarial loss to work (basically by detaching the graph at the right spot and adding a bit of redundancy so that a single backward pass would compute the right gradient).

Everything works great with stage 2, but when I run in stage 3 I get the following error:

RuntimeError: The size of tensor a (0) must match the size of tensor b (3) at non-singleton dimension 4

Are there flags one can turn on to get more informative debug information (for example which tensors have a mismatch)?
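One way to see which tensors are involved, sketched under assumptions rather than taken from any DeepSpeed debug flag: with stage 3 the parameters are partitioned across ranks, so a parameter that is read outside the module's normal forward call can show up locally as an empty tensor, which would match the "tensor a (0)" in the message. The helper below is hypothetical (the name `find_empty_params` is made up for illustration) and assumes that stage 3 stores the unpartitioned shape on each parameter as `ds_shape`. It only relies on `named_parameters()`, `numel()` and `shape`, so it accepts any `torch.nn.Module`.

```python
def find_empty_params(model):
    """List parameters that are empty on this rank.

    Returns a list of (name, local_shape, full_shape) tuples.
    full_shape is None when the parameter carries no `ds_shape` attribute.
    """
    report = []
    for name, param in model.named_parameters():
        if param.numel() == 0:
            # Assumption: under ZeRO stage 3 a partitioned parameter keeps
            # its original (unpartitioned) shape in `ds_shape`.
            full_shape = getattr(param, "ds_shape", None)
            if full_shape is not None:
                full_shape = tuple(full_shape)
            report.append((name, tuple(param.shape), full_shape))
    return report


def print_empty_params(model):
    """Print one line per empty parameter, for use right before the failing call."""
    for name, local_shape, full_shape in find_empty_params(model):
        print(f"{name}: local shape {local_shape}, full shape {full_shape}")
```

Calling `print_empty_params(self)` at the start of `training_step` would show which parameters are partitioned at that point. If one of the listed names is a parameter that the custom loss code reads directly, that access is a likely candidate for the mismatch; DeepSpeed documents `deepspeed.zero.GatheredParameters` as the context manager for temporarily gathering such parameters.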

Also, does stage 3 make assumptions about the shape of the loss? Is the user supposed to avoid aggregation over the batch, for example?
