Bug description
In DDP, the ModelCheckpoint callback (configuration below) does not save the best model even though lower validation losses are reached in later epochs. Configuration:
checkpoint_callback = ModelCheckpoint(
    dirpath=path_save_model,
    filename="best_loss",
    monitor="val_loss",
    mode="min",
    every_n_epochs=1,
    verbose=True,
)
Here is a snippet of the output log:
Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '/qumulo/sravan/Projects/pmc_lit/experiments/e3_uwsyn_multi_noprompt/250318_2110_30k_uw50pc_ghosting/best_loss.ckpt' as top 1
Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]Epoch 297, global step 11920: 'val_loss' was not in top 1
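Epoch 297 reports val_loss=0.0466, which is lower than the recorded best of 0.04955 from epoch 292, so the epoch 297 checkpoint should have been saved to disk.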
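To make the expected behaviour concrete, here is a minimal sketch in plain Python of top-1 tracking with mode="min". It is not Lightning's actual implementation; the helper name epochs_that_should_save is hypothetical, and the two values are the ones shown in the log above.

```python
from typing import Iterable, List, Tuple


def epochs_that_should_save(history: Iterable[Tuple[int, float]]) -> List[int]:
    """Return the epochs at which a top-1, mode="min" tracker would save.

    Hypothetical helper for illustration only, not part of Lightning.
    """
    best = float("inf")
    saved = []
    for epoch, val_loss in history:
        if val_loss < best:  # a strictly lower loss counts as an improvement
            best = val_loss
            saved.append(epoch)
    return saved


# Values from the log: best of 0.04955 at epoch 292, then 0.0466 at epoch 297.
history = [(292, 0.04955), (297, 0.0466)]
print(epochs_that_should_save(history))  # [292, 297]
```

Under this reading, both epochs trigger a save, whereas the actual run reports 'val_loss' was not in top 1 for epoch 297.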
What version are you seeing the problem on?
v2.5
How to reproduce the bug
Error messages and logs
Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '250318_2110_30k_uw50pc/best_loss.ckpt' as top 1
Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]Epoch 297, global step 11920: 'val_loss' was not in top 1
Environment
Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.0
#- PyTorch Version (e.g., 2.5): 2.3.1+cu121
#- Python version (e.g., 3.12): 3.10.12
#- OS (e.g., Linux): Ubuntu 22.04.3 LTS
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source): pip
More info
No response