Enable distributed training test with NCCL on ALPS (GH200) #689
Conversation
Yes, the current tests are only intended to check the metatrain side of distributed training, and we use soap-bpnn as a "simple" model for this. We would have to add distributed tests to the relevant architectures.
This looks good, thanks!
Co-authored-by: Guillaume Fraux <[email protected]>
cscs-ci run
Thanks a lot!
This PR builds on #687 and runs the distributed training example in `tests/distributed`. The example is run on a single node, with all 4 GPUs.
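As a rough illustration of that setup, here is a minimal sketch (not the actual code in `tests/distributed`) of a single-node, multi-GPU NCCL run; the worker function, port number, and the printed all-reduce are placeholders chosen for this example:

```python
# Minimal sketch: spawn one process per GPU on a single node and run a NCCL
# all-reduce, mirroring the single-node / 4-GPU setup of the example.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    # Rendezvous settings for a single-node run (placeholder port).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Each rank contributes its rank index; after the all-reduce every rank
    # holds the sum over all ranks (0 + 1 + 2 + 3 = 6 on a 4-GPU node).
    payload = torch.tensor([float(rank)], device="cuda")
    dist.all_reduce(payload)
    print(f"rank {rank}: {payload.item()}")

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()  # 4 on a GH200 node
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```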
The CI workflow is now the following: the base container (used by the `tox` tests) is only re-built if the Containerfile changes.

This PR also adds `torch.distributed.destroy_process_group()` to the different trainers to get rid of the corresponding warnings.

This is a starting point, but we should discuss how to expand the distributed training tests (currently the example only exercises SOAP-BPNN).
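The cleanup added to the trainers can be sketched roughly as follows; the `Trainer` class and the `distributed` flag here are hypothetical placeholders, not metatrain's actual trainer API:

```python
# Hypothetical sketch of the cleanup pattern: tear down the process group
# once training finishes, so torch no longer warns about a process group
# that was never destroyed.
import torch.distributed as dist


class Trainer:  # placeholder name, not the real metatrain trainer
    def train(self, distributed: bool = False) -> None:
        if distributed:
            # Assumes rank/world size are provided via the environment
            # (e.g. by the launcher), as with the env:// init method.
            dist.init_process_group(backend="nccl")

        # ... training loop ...

        if distributed and dist.is_initialized():
            dist.destroy_process_group()
```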
EDIT: For the distributed training tests I built on top of the base container; I have not yet explored the performance of this solution.
📚 Documentation preview 📚: https://metatrain--689.org.readthedocs.build/en/689/