
ModelCheckpoint not saving best model #20657

Closed
sravan953 opened this issue Mar 19, 2025 · 2 comments
Labels
bug, needs triage, ver: 2.5.x

Comments


sravan953 commented Mar 19, 2025

Bug description

In DDP, ModelCheckpoint (configured as shown below) does not save the best model even though lower validation losses are achieved in later epochs. Configuration:

checkpoint_callback = ModelCheckpoint(
    dirpath=path_save_model,
    filename="best_loss",
    monitor="val_loss",
    mode="min",
    every_n_epochs=1,
    verbose=True,
)

Here is a snippet of the output log:

Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '/qumulo/sravan/Projects/pmc_lit/experiments/e3_uwsyn_multi_noprompt/250318_2110_30k_uw50pc_ghosting/best_loss.ckpt' as top 1

Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1

Clearly, epoch 297 (val_loss=0.0466) should have been saved to disk as the new best, since it improves on the previous best of 0.0495.
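
For reference, a minimal, self-contained sketch (not the original training script; the model, data, and dirpath below are hypothetical) of how this callback is typically attached to a Trainer under DDP. One thing worth checking in multi-GPU runs is that the monitored metric is logged with sync_dist=True, so every rank compares the same val_loss value when ModelCheckpoint decides whether to save:

import torch
from torch.utils.data import DataLoader, TensorDataset
import lightning.pytorch as pl  # or `import pytorch_lightning as pl`, depending on the install
from lightning.pytorch.callbacks import ModelCheckpoint


class LitModel(pl.LightningModule):  # hypothetical stand-in for the real model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        val_loss = torch.nn.functional.mse_loss(self.layer(x), y)
        # sync_dist=True reduces the metric across ranks before ModelCheckpoint compares it
        self.log("val_loss", val_loss, sync_dist=True)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)


def make_loader():
    x, y = torch.randn(256, 32), torch.randn(256, 1)
    return DataLoader(TensorDataset(x, y), batch_size=32)


if __name__ == "__main__":
    checkpoint_callback = ModelCheckpoint(
        dirpath="checkpoints",  # hypothetical path
        filename="best_loss",
        monitor="val_loss",
        mode="min",
        every_n_epochs=1,
        verbose=True,
    )
    trainer = pl.Trainer(
        max_epochs=5,
        accelerator="gpu",
        devices=2,
        strategy="ddp",
        callbacks=[checkpoint_callback],
    )
    trainer.fit(LitModel(), make_loader(), make_loader())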

What version are you seeing the problem on?

v2.5

How to reproduce the bug

Error messages and logs

Epoch 292: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.00589, msssim=0.00441, lpips=0.0184, loss=0.0287, lr=0.000128, val_l1=0.00746, val_msssim=0.0122, val_lpips=0.0299, val_loss=0.0495]
Epoch 292, global step 11720: 'val_loss' reached 0.04955 (best 0.04955), saving model to '250318_2110_30k_uw50pc/best_loss.ckpt' as top 1

Epoch 297: 100% 40/40 [00:49<00:00, 1.24s/it, v_num=0, l1=0.0141, msssim=0.00869, lpips=0.0113, loss=0.0341, lr=0.000124, val_l1=0.00779, val_msssim=0.0121, val_lpips=0.0266, val_loss=0.0466]
Epoch 297, global step 11920: 'val_loss' was not in top 1

Environment

- PyTorch Lightning Version: 2.5.0
- PyTorch Version: 2.3.1+cu121
- Python version: 3.10.12
- OS: Ubuntu 22.04.3 LTS
- CUDA/cuDNN version:
- GPU models and configuration:
- How you installed Lightning (`conda`, `pip`, source): pip

More info

No response

sravan953 added the bug and needs triage labels on Mar 19, 2025
Borda (Member) commented Mar 20, 2025

@sravan953 could you please share a full example to reproduce?

sravan953 (Author) commented

I am unable to reproduce this issue now, though I could swear it was happening for a couple of days. Thank you for the quick response; closing this now!
