
Checkpoint callback runs before the validation step - stale or missing monitor values used for validation metrics #20185

@PheelaV


Bug description

I am doing iterative training with check_val_every_n_epoch=None and (for example) val_check_interval=10 on my Trainer, together with the matching argument every_n_train_steps=10 on ModelCheckpoint.

e.g.

  from lightning.pytorch.callbacks import ModelCheckpoint

  # experiment_dir is a pathlib.Path; checkpoint_top_k and checkpoint_n_step
  # are experiment settings (e.g. 3 and 10).
  checkpoint_callback = ModelCheckpoint(
      dirpath=experiment_dir.joinpath("checkpoints"),
      filename="checkpoint-{epoch}-{step:06d}-{train_loss:.2f}-{val_loss:.2f}",
      save_top_k=checkpoint_top_k,
      every_n_train_steps=checkpoint_n_step,
      monitor="val_loss",
  )
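
For context, the matching Trainer configuration would look roughly like this (a sketch using the same placeholder values as above):

  import lightning.pytorch as pl

  trainer = pl.Trainer(
      check_val_every_n_epoch=None,  # switch validation to a step-based schedule
      val_check_interval=10,         # run validation every 10 training batches
      callbacks=[checkpoint_callback],
  )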

Monitoring val_loss this way is documented usage.

The problem is that these values may not exist yet (which triggers a warning), or they may be stale: the validation step runs after the checkpoint callback has been processed, so the freshly computed validation metrics are never considered when the checkpoint is saved and ranked.
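
A minimal, self-contained sketch that should exhibit this ordering (the toy model and all values here are illustrative, not taken from the original report):

  import torch
  from torch.utils.data import DataLoader, TensorDataset

  import lightning.pytorch as pl
  from lightning.pytorch.callbacks import ModelCheckpoint


  class ToyModel(pl.LightningModule):
      def __init__(self):
          super().__init__()
          self.layer = torch.nn.Linear(32, 1)

      def training_step(self, batch, batch_idx):
          loss = self.layer(batch[0]).mean()
          self.log("train_loss", loss)
          return loss

      def validation_step(self, batch, batch_idx):
          # val_loss is only (re)logged when the validation loop actually runs,
          # which happens after ModelCheckpoint has already fired for the step.
          self.log("val_loss", self.layer(batch[0]).mean())

      def configure_optimizers(self):
          return torch.optim.SGD(self.parameters(), lr=0.1)


  data = DataLoader(TensorDataset(torch.randn(64, 32)), batch_size=8)
  ckpt = ModelCheckpoint(monitor="val_loss", save_top_k=2, every_n_train_steps=10)
  trainer = pl.Trainer(
      max_steps=40,
      check_val_every_n_epoch=None,
      val_check_interval=10,
      callbacks=[ckpt],
      limit_val_batches=2,
      num_sanity_val_steps=0,  # keep the first checkpoint trigger deterministic
  )
  # At step 10 the checkpoint fires before any validation has run (warning about
  # a missing val_loss); at later steps it ranks checkpoints on the previous
  # validation's val_loss rather than the one about to be computed.
  trainer.fit(ToyModel(), data, data)

With this configuration, the checkpoint decision at each 10-step boundary is based on the metrics from the previous validation run, or on no monitor value at all on the first trigger.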

What version are you seeing the problem on?

v2.3, v2.4

How to reproduce the bug

No response

Error messages and logs

# Error messages and logs here please

Environment

Current environment
#- PyTorch Lightning Version (e.g., 2.4.0):
#- PyTorch Version (e.g., 2.4):
#- Python version (e.g., 3.12):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):

More info

No response
