Fix wrong behavior of DDPStrategy option with simple GAN training using DDP #20936
samsara-ku wants to merge 28 commits into Lightning-AI:master
Conversation
for more information, see https://pre-commit.ci
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 7 days if no further activity occurs. If you need further help see our docs: https://lightning.ai/docs/pytorch/latest/generated/CONTRIBUTING.html#pull-request or ask the assistance of a core contributor here or on Discord. Thank you for your contributions.
…dp-implementation test: cover MultiModelDDPStrategy
examples/pytorch/domain_templates/generative_adversarial_net.py
Outdated
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
@SkafteNicki could you pls check too :)
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
@samsara-ku please also mark the comments as resolved to keep track of what is left :)
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
examples/pytorch/domain_templates/generative_adversarial_net_ddp.py
Outdated
SkafteNicki left a comment:
I think we are getting close with this PR, just a few suggestions
for more information, see https://pre-commit.ci
I added a base test case, but I think I should add more test cases covering the new strategy.
Hi @Borda, I’m picking this up again and have updated the PR with refined test coverage and compatibility fixes. Could you share guidance on the expected test assertions (especially for GAN convergence/validation) and any remaining concerns with the current approach?
…evious MultiModelDDP model example w/ new pl version
for more information, see https://pre-commit.ci
1. change all testcase snippets in the right way; tested one-by-one
for more information, see https://pre-commit.ci
I changed my bad testcase code to work the right way. But there are some problems I have, so please give me some direction to solve them:
1. Currently the testcase checks training only via the Discriminator's loss, but this may not be the best way to validate the training.
2. I have tried to spend some time testing this case, but I cannot find a good check; it is hard to verify the unused-parameter behavior because of that strategy.
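One possible assertion for point 1 (a hedged sketch, not the PR's actual test): instead of looking only at the discriminator's loss, the test could assert that parameters of both models actually changed after optimization. The snippet below emulates this with plain Python floats; in a real Lightning test the `*_before`/`*_after` values would be tensor snapshots taken around `trainer.fit()`, and `params_changed` is a hypothetical helper name.

```python
# Hedged sketch: validate GAN training by checking that BOTH models'
# parameters moved after a training step, emulated with plain floats.
# In the real test these would be torch tensors snapshotted around fit().

def params_changed(before, after, eps=1e-12):
    """Return True if any parameter differs beyond a tiny tolerance."""
    return any(abs(a - b) > eps for a, b in zip(before, after))

# Illustrative weight snapshots (made-up numbers, not real training output).
gen_before, gen_after = [0.1, 0.2], [0.11, 0.19]    # generator weights
disc_before, disc_after = [0.5, 0.4], [0.48, 0.41]  # discriminator weights

assert params_changed(gen_before, gen_after)
assert params_changed(disc_before, disc_after)
print("both models updated")  # -> both models updated
```

This avoids asserting on a specific loss value, which is flaky for GANs; it only checks that the optimizer touched every model, which is exactly what the DDP wrapping bug would break.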
What does this PR do?
Fixes #20866
Fixes #20328
Fixes #18740
Fixes #17212
This PR adds a `MultiModelDDPStrategy` class and a simple execution example of it, for multi-GPU GAN training. Simply speaking:

- Currently, PyTorch Lightning's simple GAN training has a problem with the `DistributedDataParallel` strategy: it tries to wrap the whole module, not the individual `nn.Module` models inside it.
- Although we can enable the `find_unused_parameters=True` option to avoid this issue, it is not the right way; I think it is just a trick.
- So the key idea to solve this issue is to assign `DistributedDataParallel` to each model inside the module, different from the previous `DDPStrategy`.
- I already tested with my GPUs to visualize the result and tracked the gradients of the models at each epoch; it works, and you can see the visualized result in the Google Drive link.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following bullet-list:
Reviewer checklist
📚 Documentation preview 📚: https://pytorch-lightning--20936.org.readthedocs.build/en/20936/