Enable distributed training test with NCCL on ALPS (GH200) #689


Merged
merged 8 commits into from
Jul 25, 2025

RMeli (Contributor) commented Jul 24, 2025

This PR builds on #687 and runs the distributed training example in tests/distributed. The example is run on a single node, with all 4 GPUs.

The CI workflow is now the following:

Build base container -> Build MTT container -> Run distributed example
          |
          ------------> Run tox tests

The base container (used by the tox tests) is now rebuilt only if the Containerfile changes.
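The job ordering in the diagram can be sketched as a small dependency graph. This is only an illustration in Python, not the actual CI configuration: the job names are taken from the diagram, while `CI_JOBS` and `run_stages` are hypothetical names introduced here.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs it depends on (names from the diagram above).
CI_JOBS = {
    "Build base container": set(),
    "Build MTT container": {"Build base container"},
    "Run distributed example": {"Build MTT container"},
    "Run tox tests": {"Build base container"},
}


def run_stages(jobs):
    """Group jobs into stages; jobs within a stage have no mutual dependencies."""
    sorter = TopologicalSorter(jobs)
    sorter.prepare()
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for a stable order
        stages.append(ready)
        sorter.done(*ready)
    return stages


for index, stage in enumerate(run_stages(CI_JOBS), start=1):
    print(f"stage {index}: {', '.join(stage)}")
```

Under this reading, the tox tests only wait for the base container and can run alongside the MTT container build, while the distributed example waits for the MTT container.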

This PR also adds torch.distributed.destroy_process_group() to the different trainers to get rid of the corresponding warnings.
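As a rough illustration of that change, the sketch below shows the general pattern of pairing `init_process_group` with an explicit `destroy_process_group`. It is not the metatrain trainer code: `run_training_step` is a hypothetical stand-in, and the single-process `gloo` backend with an in-memory `HashStore` is an assumption made so the example runs on CPU, whereas the PR itself targets NCCL on GPUs.

```python
import torch
import torch.distributed as dist


def run_training_step(rank: int = 0, world_size: int = 1) -> float:
    """Hypothetical stand-in for a trainer's training routine."""
    # Assumption: "gloo" backend and an in-process store, so no GPUs or
    # launcher are needed; a real multi-GPU run would use NCCL instead.
    dist.init_process_group(
        backend="gloo",
        store=dist.HashStore(),
        rank=rank,
        world_size=world_size,
    )
    try:
        loss = torch.tensor([1.0])
        # Sum the loss across ranks (a no-op with a single rank), then average.
        dist.all_reduce(loss, op=dist.ReduceOp.SUM)
        return loss.item() / world_size
    finally:
        # Explicit teardown, mirroring the call this PR adds to the trainers.
        if dist.is_initialized():
            dist.destroy_process_group()


if __name__ == "__main__":
    print(run_training_step())
```

Placing the teardown in a `finally` block is one possible design choice: it releases the process group even if training raises an exception. Whether the trainers in this PR do the same is not stated here.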

This is a starting point, but we should discuss how to expand the distributed training tests (currently the example only tests SOAP-BPNN).

EDIT: For the distributed training tests, I built on top of the base container. I have not yet explored the performance of this solution.


📚 Documentation preview 📚: https://metatrain--689.org.readthedocs.build/en/689/

RMeli requested a review from Luthaf on Jul 24, 2025, 08:24

Luthaf (Member) commented Jul 24, 2025

> This is a starting point, but we should discuss how to expand the distributed training tests (currently the example only tests SOAP-BPNN).

Yes, the current tests are only intended to check the metatrain side of distributed training, and we use soap-bpnn as a "simple" model for this. We would have to add distributed tests to the relevant architectures.

Luthaf (Member) left a comment


This looks good, thanks!

RMeli requested a review from Luthaf on Jul 24, 2025, 14:37

RMeli (Contributor, Author) commented Jul 24, 2025

cscs-ci run

Luthaf merged commit 632b925 into main on Jul 25, 2025
15 checks passed
Luthaf deleted the cscs-ci-dist-clean branch on Jul 25, 2025, 12:40

Luthaf (Member) commented Jul 25, 2025

Thanks a lot!

RMeli added the "Priority: High" (critical issues needing immediate attention) and "Enhancement" (idea or improvement) labels on Jul 25, 2025