ModelCheckpoint not saving best model #20657

Open
@sravan953

Bug description

In DDP, ModelCheckpoint (configuration below) does not save the best model even though later epochs reach lower validation losses. Configuration:

checkpoint_callback = ModelCheckpoint(
    dirpath=path_save_model,
    filename="best_loss",
    monitor="val_loss",
    mode="min",
    every_n_epochs=1,
    verbose=True,
)

Here is a snippet of the output log:

Epoch 292: 100% 40/40 [00:49<00:00,  1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '/qumulo/sravan/Projects/pmc_lit/experiments/e3_uwsyn_multi_noprompt/250318_2110_30k_uw50pc_ghosting/best_loss.ckpt' as top 1

Epoch 297: 100% 40/40 [00:49<00:00,  1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1

Epoch 297 reports val_loss=0.0466, which is below the recorded best of 0.04955, so it should have been saved to disk.
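One possible explanation, sketched below under assumptions: the progress bar shows the value seen on rank 0, while (as far as I understand the DDP behaviour) the checkpoint is only updated when every rank agrees that the monitored value improved. If `val_loss` is logged without `sync_dist=True`, each rank may hold a different value. The logging call is not shown in this issue, so this is a guess, and the rank 1 numbers below are hypothetical. The helper `unanimous_improvement` is an illustrative name, not a Lightning API.

```python
def unanimous_improvement(per_rank_current, per_rank_best, mode="min"):
    """Return (decision, votes): save only if every rank sees an improvement.

    per_rank_current: monitored value seen by each rank this epoch
    per_rank_best:    best value each rank has recorded so far
    """
    if mode == "min":
        better = lambda current, best: current < best
    else:
        better = lambda current, best: current > best
    votes = [better(c, b) for c, b in zip(per_rank_current, per_rank_best)]
    return all(votes), votes


# Rank 0 uses the values from the log above; rank 1 values are hypothetical.
decision, votes = unanimous_improvement(
    per_rank_current=[0.0466, 0.0520],
    per_rank_best=[0.04955, 0.0510],
)
print(decision, votes)
```

In this sketch rank 0 votes to save and rank 1 does not, so nothing is written even though the value on the progress bar improved. If that is what happens here, logging with `self.log("val_loss", loss, sync_dist=True)` should give every rank the same reduced value.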

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

Epoch 292: 100% 40/40 [00:49<00:00,  1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '250318_2110_30k_uw50pc/best_loss.ckpt' as top 1

Epoch 297: 100% 40/40 [00:49<00:00,  1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1
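The excerpt above skips epochs 293 to 296, so to rule out an intermediate improvement it may help to scan the full log for the last "reached" message. The following is a minimal sketch that only assumes the verbose message format shown above; `last_reported_best` and `beats_best` are hypothetical helper names, not part of Lightning.

```python
import re

# Matches the verbose checkpoint message format seen in the log excerpt.
REACHED = re.compile(
    r"Epoch (\d+), global step \d+: 'val_loss' reached ([0-9.]+) \(best ([0-9.]+)\)"
)


def last_reported_best(log_text):
    """Return (epoch, best) from the last 'reached' message, or None if absent."""
    last = None
    for match in REACHED.finditer(log_text):
        last = (int(match.group(1)), float(match.group(3)))
    return last


def beats_best(candidate, best, mode="min"):
    """True if the candidate value would replace the best under the given mode."""
    return candidate < best if mode == "min" else candidate > best


log_text = (
    "Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), "
    "saving model to '250318_2110_30k_uw50pc/best_loss.ckpt' as top 1\n"
    "Epoch 297, global step 11920: 'val_loss' was not in top 1\n"
)
epoch, best = last_reported_best(log_text)
print(epoch, best, beats_best(0.0466, best))
```

On the two lines quoted in this issue, the last reported best is the epoch 292 value, and 0.0466 beats it. Running the same check on the complete log would show whether any epoch in between recorded a lower best.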

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.5.0): 2.5.0
#- PyTorch Version (e.g., 2.5): 2.3.1+cu121
#- Python version (e.g., 3.12): 3.10.12
#- OS (e.g., Linux): Ubuntu 22.04.3 LTS
#- CUDA/cuDNN version: 
#- GPU models and configuration: 
#- How you installed Lightning(`conda`, `pip`, source): pip

More info

No response

Metadata

Assignees

No one assigned

    Labels

    bug (Something isn't working), needs triage (Waiting to be triaged by maintainers), ver: 2.5.x
