
PyTorch Lightning logs are not synchronised when using distributed training #3157

Open
@admivsn

Description


As described in the PyTorch Lightning documentation, logged metrics need to be synchronised across devices using `sync_dist=True` when training in a distributed setting.

For example, in DeepAR I think an extra parameter (`sync_dist=True`) should be passed to the logging calls when running distributed training.

I notice that when training on a multi-GPU SageMaker instance I don't see a performance uplift compared to a single-GPU instance. I also get the following warning from PyTorch Lightning:

It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
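For reference, the pattern the warning recommends looks roughly like the sketch below. This is a minimal, generic Lightning module to illustrate the `sync_dist=True` call only; the class, metric, and layer are placeholders and not GluonTS code.

```python
import lightning.pytorch as pl
import torch


class MyForecastModule(pl.LightningModule):
    # Placeholder module that only illustrates the logging pattern.
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # sync_dist=True averages the epoch-level metric across all
        # processes/GPUs instead of reporting only rank 0's value.
        self.log("train_loss", loss, on_epoch=True, sync_dist=True)
        return loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```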

To Reproduce

It's difficult to reproduce as I'm running a SageMaker Training job.

SM_CURRENT_INSTANCE_TYPE=ml.g4dn.12xlarge
SM_NUM_CPUS=48
SM_NUM_GPUS=4

You can see from the setup that there are 4 GPUs, which are detected by PyTorch Lightning, as the logs show:

2024-04-17 20:37:48 Starting - Starting the training job...
2024-04-17 20:38:05 Starting - Preparing the instances for training......
2024-04-17 20:39:10 Downloading - Downloading input data...
2024-04-17 20:39:29 Downloading - Downloading the training image............
2024-04-17 20:41:50 Training - Training image download completed. Training in progress........2024-04-17 20:42:45,440 sagemaker-training-toolkit INFO     No Neurons detected (normal if no neurons installed)

...

GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/logger_connector.py:67: Starting from v1.9.0, `tensorboardX` has been removed as a dependency of the `lightning.pytorch` package, due to potential conflicts with other packages in the ML ecosystem. For this reason, `logger=True` will use `CSVLogger` as the default logger, unless the `tensorboard` or `tensorboardX` packages are found. Please `pip install lightning[extra]` or one of them to enable TensorBoard support by default
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/configuration_validator.py:74: You defined a `validation_step` but have no `val_dataloader`. Skipping val loop.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/4
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/4
Missing logger folder: /opt/ml/code/lightning_logs
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/4
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 4 processes
----------------------------------------------------------------------------------------------------
Missing logger folder: /opt/ml/code/lightning_logs
Missing logger folder: /opt/ml/code/lightning_logs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 3 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
LOCAL_RANK: 2 - CUDA_VISIBLE_DEVICES: [0,1,2,3]
  | Name  | Type        | Params | In sizes                                                        | Out sizes  
----------------------------------------------------------------------------------------------------------------------
0 | model | DeepARModel | 25.9 K | [[1, 1], [1, 1], [1, 1102, 4], [1, 1102], [1, 1102], [1, 1, 4]] | [1, 100, 1]
----------------------------------------------------------------------------------------------------------------------
25.9 K    Trainable params
0         Non-trainable params
25.9 K    Total params
0.104     Total estimated model params size (MB)
/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
Epoch 0, global step 50: 'train_loss' reached 0.75536 (best 0.75536), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=0-step=50.ckpt' as top 1
Epoch 1, global step 100: 'train_loss' reached 0.72144 (best 0.72144), saving model to '/opt/ml/code/lightning_logs/version_0/checkpoints/epoch=1-step=100.ckpt' as top 1

Error message or code output

The particular warning of interest is:

/usr/local/lib/python3.10/site-packages/lightning/pytorch/trainer/connectors/logger_connector/result.py:433: It is recommended to use `self.log('train_loss', ..., sync_dist=True)` when logging on epoch level in distributed setting to accumulate the metric across devices.
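A possible stop-gap, until GluonTS passes `sync_dist` itself, might be to subclass the DeepAR Lightning module and log a synchronised copy of the loss. The import path, class name, and the assumption that `training_step` returns the per-step loss are guesses about GluonTS 0.14 internals, so treat this purely as a sketch.

```python
# Hypothetical workaround; module/class names are assumptions about GluonTS internals.
from gluonts.torch.model.deepar.lightning_module import DeepARLightningModule


class SyncedDeepARLightningModule(DeepARLightningModule):
    """Re-log the training loss with sync_dist=True as a synchronised metric."""

    def training_step(self, batch, batch_idx):
        # Assumes the parent computes and returns the per-step loss and logs
        # "train_loss" without synchronisation.
        loss = super().training_step(batch, batch_idx)
        # Additional epoch-level metric that is averaged across all GPU processes.
        self.log("train_loss_synced", loss, on_epoch=True, sync_dist=True)
        return loss
```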

Environment

  • Operating system:
  • Python version: 3.10
  • GluonTS version: 0.14.4
  • PyTorch version: 2.2.1
  • PyTorch Lightning version: 2.1.4

