If you try to train a GAN with a LightningModule using multiple GPUs, you may face errors like this:
[rank0]: RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step. If this is intentional, you must enable the detection of unused parameters in DDP, either by setting the string value `strategy='ddp_find_unused_parameters_true'` or by setting the flag in the strategy with `strategy=DDPStrategy(find_unused_parameters=True)`.
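When the unused parameters are intentional (e.g. a multi-scale discriminator where not every branch contributes to each step's loss), the workaround the message itself suggests can be applied at Trainer construction. A minimal sketch, assuming Lightning 2.x import paths (the device count here is just an example):

```python
import lightning as L
from lightning.pytorch.strategies import DDPStrategy

trainer = L.Trainer(
    accelerator="gpu",
    devices=2,
    # tell DDP's reducer to tolerate parameters that receive no gradient
    strategy=DDPStrategy(find_unused_parameters=True),
)
# equivalently: strategy="ddp_find_unused_parameters_true"
```

Note that this only silences the unused-parameter detection (at some communication cost); it does not explain *why* the parameters received no gradient.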
For inspection, I ran the GAN training code with the following snippet at the top of main.py:
import os
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
and then you will see errors like these:
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.conv.weight_orig did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.conv.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.3.norm.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.3.norm.weight did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.3.conv.weight_orig did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.3.conv.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.2.norm.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.2.norm.weight did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.2.conv.weight_orig did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.2.conv.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.1.norm.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.1.norm.weight did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.1.conv.weight_orig did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.1.conv.bias did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.0.conv.weight_orig did not get gradient in backwards pass.
[rank1]:[I reducer.cpp:1949] [Rank 1] Parameter: discriminator.discs.1.down_blocks.0.conv.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.conv.weight_orig did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.conv.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.3.norm.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.3.norm.weight did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.3.conv.weight_orig did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.3.conv.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.2.norm.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.2.norm.weight did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.2.conv.weight_orig did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.2.conv.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.1.norm.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.1.norm.weight did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.1.conv.weight_orig did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.1.conv.bias did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.0.conv.weight_orig did not get gradient in backwards pass.
[rank0]:[I reducer.cpp:1949] [Rank 0] Parameter: discriminator.discs.1.down_blocks.0.conv.bias did not get gradient in backwards pass.
I think this problem comes from some wrong code in the LightningModule: if the first manual_backward() call runs successfully, the other manual_backward() calls cannot run correctly, since the first call seems to remove the gradients of the other module.
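Outside of DDP, a single backward() call does not clear another module's gradients; .grad buffers only accumulate until zeroed. A minimal pure-PyTorch check of that claim (toy modules, hypothetical names):

```python
import torch
import torch.nn as nn

# Toy generator/discriminator pair; no Lightning, no DDP.
gen = nn.Linear(4, 4)
disc = nn.Linear(4, 1)
x = torch.randn(2, 4)

# Discriminator-style step: fake samples detached, as in GAN training.
d_loss = disc(gen(x).detach()).mean()
d_loss.backward()
assert disc.weight.grad is not None  # disc received gradients
assert gen.weight.grad is None       # gen untouched (graph was detached)

# Generator-style step: backward through both modules.
g_loss = -disc(gen(x)).mean()
g_loss.backward()
assert gen.weight.grad is not None   # gen now has gradients
assert disc.weight.grad is not None  # disc gradients were NOT cleared
```

So the missing gradients in the DDP log above are more likely a reducer-side bookkeeping issue (e.g. toggle_optimizer() temporarily sets requires_grad=False on the other optimizer's parameters, which DDP then reports as unused) than backward() itself erasing gradients.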
Regarding this, someone suggested calling manual_backward() only once, but I think that differs a little from the normal GAN training strategy:
self.manual_backward(d_loss + g_loss)  # I think this would be a problem, but I cannot find any other way to solve the unused-parameter issue
self.toggle_optimizer(optimizer_d)
optimizer_d.step()
optimizer_d.zero_grad()
self.untoggle_optimizer(optimizer_d)
self.toggle_optimizer(optimizer_g)
optimizer_g.step()
optimizer_g.zero_grad()
self.untoggle_optimizer(optimizer_g)
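For comparison, the standard GAN loop keeps two separate backward calls, one per optimizer; toggle_optimizer()/untoggle_optimizer() essentially flip requires_grad on the parameters owned by the non-active optimizer. A plain-PyTorch sketch of that pattern (toy modules, hypothetical names), which runs both backward passes without any gradients going missing:

```python
import torch
import torch.nn as nn

gen = nn.Sequential(nn.Linear(8, 8), nn.Tanh())
disc = nn.Linear(8, 1)
opt_g = torch.optim.Adam(gen.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(4, 8)
z = torch.randn(4, 8)

# --- Discriminator step: freeze gen (what toggle_optimizer(opt_d) does) ---
for p in gen.parameters():
    p.requires_grad_(False)
fake = gen(z)
d_loss = bce(disc(real), torch.ones(4, 1)) + \
         bce(disc(fake.detach()), torch.zeros(4, 1))
opt_d.zero_grad()
d_loss.backward()            # first backward: only disc gets grads
opt_d.step()
for p in gen.parameters():   # untoggle
    p.requires_grad_(True)

# --- Generator step: freeze disc (what toggle_optimizer(opt_g) does) ---
for p in disc.parameters():
    p.requires_grad_(False)
g_loss = bce(disc(gen(z)), torch.ones(4, 1))
opt_g.zero_grad()
g_loss.backward()            # second backward still works fine
opt_g.step()
for p in disc.parameters():  # untoggle
    p.requires_grad_(True)
```

Under DDP the frozen parameters in each half are exactly the ones the reducer reports as "did not get gradient", which is why find_unused_parameters=True (or grouping each sub-model's backward with its own toggled optimizer, as in the Lightning GAN example) is needed.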
Interestingly, if I swap the GAN training order (e.g. from "disc first, then gen" to "gen first, then disc"), I get the same error: there is no gradient in whichever module is trained later.
samsara-ku changed the title from "RuntimeError: It looks like your LightningModule has parameters that were not used in producing the loss returned by training_step." to "self.manual_backward() makes all gradients gone" on Apr 1, 2025
Bug description
Is there anyone who can solve this problem?
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
Environment
Current environment
More info
No response