Bug description
We switched from lightning==1.9.4 to lightning>=2.0.0 and observed a significant slowdown in training our models. We rely heavily on lightning for the implementation of our package NeuralProphet.
These are the profiling results (PyTorch Lightning profiler); a minimal sketch of how such a comparison can be produced follows the table:
| Action | Total time Lightning 1.9.4 (s) | Total time Lightning 2.0 (s) |
|---|---|---|
| Total | 4.9571 | 36.367 |
| run_training_epoch | 4.7887 | 35.304 |
| run_training_batch | 3.4862 | 23.234 |
| [Strategy]SingleDeviceStrategy.training_step | 3.3989 | 23.188 |
| optimizer_step | 0.52329 | 5.1963 |
| [TrainingEpochLoop].train_dataloader_next | 0.33142 | 0.41921 |
| [Strategy]SingleDeviceStrategy.batch_to_device | 0.27525 | 9.7629 |
| [Callback]LearningRateFinder.on_fit_start | 0.26031 | 2.4607 |
| [LightningModule]TimeNet.transfer_batch_to_device | 0.24308 | 9.7149 |
| [Strategy]SingleDeviceStrategy.validation_step | 0.08974 | 0.5451 |
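For context, the per-action timing tables above can be reproduced with Lightning's built-in simple profiler. The module below is a toy stand-in, not NeuralProphet's actual TimeNet model or data:

```python
import torch
import lightning.pytorch as pl  # in 1.9.x this was imported as `pytorch_lightning`
from torch.utils.data import DataLoader, TensorDataset


# Hypothetical stand-in for NeuralProphet's TimeNet; the real model is not shown here.
class ToyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        return torch.nn.functional.mse_loss(self.layer(x), y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_loader(n=1024):
    x, y = torch.randn(n, 32), torch.randn(n, 1)
    return DataLoader(TensorDataset(x, y), batch_size=64)


# profiler="simple" prints a per-action timing table like the one above after fit().
trainer = pl.Trainer(max_epochs=5, profiler="simple", logger=False, enable_checkpointing=False)
trainer.fit(ToyModule(), make_loader())
```

Running the same script against both lightning 1.9.4 and >=2.0.0 environments yields the two columns compared above.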
This issue is significantly impacting the performance of our package. Do you have any insights into what might be causing this and how we can resolve it? Your assistance would be greatly appreciated!
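The largest relative regression shows up in `batch_to_device` / `TimeNet.transfer_batch_to_device`. TimeNet's actual hook is not reproduced here; for illustration only, a custom transfer hook in Lightning 2.x generally has this shape (the dict handling below is a made-up example, not our real code):

```python
from lightning.pytorch import LightningModule


class ExampleModule(LightningModule):
    # Override called by the strategy whenever a batch is moved to the training
    # device; any per-batch work added here runs on every step.
    def transfer_batch_to_device(self, batch, device, dataloader_idx):
        if isinstance(batch, dict):
            # Illustrative only: move every tensor value in a dict batch.
            return {k: v.to(device) for k, v in batch.items()}
        return super().transfer_batch_to_device(batch, device, dataloader_idx)
```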
What version are you seeing the problem on?
v2.1, v2.2, v2.3, v2.4
How to reproduce the bug
No response
Error messages and logs
# Error messages and logs here please
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning (`conda`, `pip`, source):
More info
No response