7x slower training speed when switching from lightning 1.9.4 to 2.0 #20201

@MaiBe-ctrl


Bug description

We switched from lightning==1.9.4 to lightning>=2.0.0 and observed a significant slowdown in training of our models. We rely heavily on Lightning for the implementation of our package NeuralProphet.

These are the profiling results (PyTorch Lightning basic profiler); a sketch of how such a profile can be collected is shown below the table:

| Action | Total time, Lightning 1.9.4 (s) | Total time, Lightning 2.0 (s) |
| --- | --- | --- |
| Total | 4.9571 | 36.367 |
| run_training_epoch | 4.7887 | 35.304 |
| run_training_batch | 3.4862 | 23.234 |
| [Strategy]SingleDeviceStrategy.training_step | 3.3989 | 23.188 |
| optimizer_step | 0.52329 | 5.1963 |
| [TrainingEpochLoop].train_dataloader_next | 0.33142 | 0.41921 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.27525 | 9.7629 |
| [Callback]LearningRateFinder.on_fit_start | 0.26031 | 2.4607 |
| [LightningModule]TimeNet.transfer_batch_to_device | 0.24308 | 9.7149 |
| [Strategy]SingleDeviceStrategy.validation_step | 0.08974 | 0.5451 |
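
For reference, a minimal, self-contained sketch of how a per-action breakdown like the one above can be collected with Lightning's built-in SimpleProfiler (`profiler="simple"`). The toy module and random data below are placeholders, not NeuralProphet's actual `TimeNet`:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl


class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(16, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=1e-3)


data = TensorDataset(torch.randn(1024, 16), torch.randn(1024, 1))
loader = DataLoader(data, batch_size=64)

# profiler="simple" prints an "Action / Mean duration / Total time" summary
# after fit(), which is how totals like those above can be compared between
# Lightning versions.
trainer = pl.Trainer(
    max_epochs=2,
    profiler="simple",
    logger=False,
    enable_checkpointing=False,
)
trainer.fit(ToyModule(), loader)
```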

This issue is significantly impacting the performance of our package. Do you have any insights into what might be causing this and how we can resolve it? Your assistance would be greatly appreciated!
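
Since the profile attributes the largest regression to batch_to_device / TimeNet.transfer_batch_to_device, one way to narrow it down is to time that hook in isolation. The sketch below is hypothetical (not NeuralProphet code; the class and attribute names are made up) and simply wraps the default hook implementation on whatever LightningModule is being trained:

```python
import time

import lightning.pytorch as pl


class TimedTransferModule(pl.LightningModule):
    # Hypothetical base/mixin for the real module (e.g. TimeNet) used only
    # to measure how long host-to-device batch transfer takes per run.
    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        start = time.perf_counter()
        # Delegate to Lightning's default transfer behaviour.
        batch = super().transfer_batch_to_device(batch, device, dataloader_idx)
        # Accumulate total transfer time across the whole run.
        self.total_transfer_seconds = getattr(self, "total_transfer_seconds", 0.0) + (
            time.perf_counter() - start
        )
        return batch
```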

What version are you seeing the problem on?

v2.1, v2.2, v2.3, v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):

More info

No response
