Checkpointing simulations #4892

ali-ramadhan · 2025-10-30T18:27:11Z

This PR refactors the Checkpointer so that it checkpoints simulations rather than just models. This is needed because a simulation (including its output writers, callbacks, etc.) contains crucial information needed to properly restore (pick up) a simulation and continue time stepping.

Basic design idea:

We now have two new functions: prognostic_state(obj), which returns a named tuple corresponding to the prognostic state of obj, and restore_prognostic_state!(obj, state), which restores obj from the information in state (a named tuple read from a checkpoint file).
Objects are checkpointed recursively by serializing prognostic information to the JLD2 checkpoint file.
The goal is for checkpointing to be flexible enough that it can easily be used for different types of simulations, e.g. coupled simulations in ClimaOcean.jl, just by defining prognostic_state and restore_prognostic_state!.
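The recursive pattern described above can be sketched as follows. This is a minimal illustration, not the PR's implementation: ToyClock, ToyModel, and ToySimulation are hypothetical stand-ins rather than Oceananigans types, and writing the named tuple to a JLD2 file is left out.

```julia
# Hypothetical toy types, used only to illustrate the pattern.
mutable struct ToyClock
    time :: Float64
    iteration :: Int
end

struct ToyModel
    clock :: ToyClock
    velocity :: Vector{Float64}   # stands in for field data
end

struct ToySimulation
    model :: ToyModel
    Δt :: Float64                 # part of the setup, so it is not checkpointed
end

# Leaves return a named tuple with copies of whatever changes during a run.
prognostic_state(clock::ToyClock) = (time = clock.time, iteration = clock.iteration)
prognostic_state(data::Vector{Float64}) = (; data = copy(data))

# Containers recurse into the components that carry prognostic state.
prognostic_state(model::ToyModel) = (clock = prognostic_state(model.clock),
                                     velocity = prognostic_state(model.velocity))
prognostic_state(sim::ToySimulation) = (; model = prognostic_state(sim.model))

# Restoring mirrors the same recursion and mutates the existing objects.
function restore_prognostic_state!(clock::ToyClock, state)
    clock.time = state.time
    clock.iteration = state.iteration
    return clock
end

function restore_prognostic_state!(data::Vector{Float64}, state)
    data .= state.data
    return data
end

function restore_prognostic_state!(model::ToyModel, state)
    restore_prognostic_state!(model.clock, state.clock)
    restore_prognostic_state!(model.velocity, state.velocity)
    return model
end

function restore_prognostic_state!(sim::ToySimulation, state)
    restore_prognostic_state!(sim.model, state.model)
    return sim
end

# Capture the state of a simulation that has been running.
original = ToySimulation(ToyModel(ToyClock(3.5, 7), [1.0, 2.0]), 0.5)
state = prognostic_state(original)

# Set up the same simulation again and restore it from the captured state.
fresh = ToySimulation(ToyModel(ToyClock(0.0, 0), zeros(2)), 0.5)
restore_prognostic_state!(fresh, state)
```

In this sketch, supporting a new kind of object only requires adding the two methods for it; the containers that hold it recurse without knowing its internals.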

Right now I've only implemented proper checkpointing for the nonhydrostatic model, but it looks like it will be straightforward to do the same for the hydrostatic and shallow water models. I'm also working on adding comprehensive tests.

Will continue working on this PR, but any feedback is very welcome!

Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857

Rhetorical aside

In general, the checkpointer assumes that the simulation setup is the same. So only prognostic state information that changes during a run is checkpointed (e.g. field data, TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.

Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.

For example, take the SpecifiedTimes schedule. It has two properties, times and previous_actuation. Since previous_actuation changes as the simulation runs, only previous_actuation needs to be checkpointed.

This leaves open the possibility of the user changing times and then picking up previous_actuation, which can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.

Checkpointing both times and previous_actuation allows us to check that times is the same when restoring. But I don't think this is the checkpointer's responsibility.
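The trade-off in this aside can be made concrete with a small sketch. ToySpecifiedTimes is a hypothetical stand-in with the two properties named above, not the Oceananigans SpecifiedTimes type, and treating previous_actuation as an index into times is an assumption made for illustration.

```julia
# Hypothetical stand-in for a SpecifiedTimes-like schedule.
mutable struct ToySpecifiedTimes
    times :: Vector{Float64}      # fixed by the user's setup script
    previous_actuation :: Int     # advances as the simulation runs
end

# Only the property that changes during a run goes into the checkpoint.
prognostic_state(s::ToySpecifiedTimes) = (; previous_actuation = s.previous_actuation)

function restore_prognostic_state!(s::ToySpecifiedTimes, state)
    s.previous_actuation = state.previous_actuation
    return s
end

# State captured after the schedule has actuated twice.
checkpointed = prognostic_state(ToySpecifiedTimes([1.0, 2.0, 3.0], 2))

# Same setup: the restored index points at the intended next time.
same = ToySpecifiedTimes([1.0, 2.0, 3.0], 0)
restore_prognostic_state!(same, checkpointed)
next_time_same = same.times[same.previous_actuation + 1]

# Changed setup: the restore still succeeds silently, but the index now
# refers to a different list of times.
changed = ToySpecifiedTimes([10.0, 20.0, 30.0], 0)
restore_prognostic_state!(changed, checkpointed)
next_time_changed = changed.times[changed.previous_actuation + 1]
```

Nothing in the second restore signals that times was edited, which is the behavior the aside argues is acceptable as long as the user sets up the same simulation.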

glwagner · 2025-11-13T14:48:02Z

src/Models/NonhydrostaticModels/nonhydrostatic_model.jl

+    if length(model.closure_fields) > 0
+        restore_prognostic_state!(model.closure_fields, state.closure_fields)
+    end


should we handle this with dispatch?

also for dispatch I think this may need to know the closure as well. This is a unique object that is "managed" by the closure but doesn't store much identifying info. We could also change that design, but might want to dedicate / test in a prior PR

ali-ramadhan · 2025-11-13

For sure. Right now I'm still working on getting all the existing tests to pass, but once they do I want to start testing checkpointing on more and more complex simulations. As part of that, we should also test closures that use model.closure_fields.

glwagner · 2025-11-13T14:50:20Z

src/Models/NonhydrostaticModels/nonhydrostatic_model.jl

    if length(model.tracers) > 0
        restore_prognostic_state!(model.tracers, state.tracers)
    end


it might also be possible to catch this with dispatch on ::NamedTuple{} (I think that's the right way to write empty NamedTuple)

ali-ramadhan · 2025-11-13

We could! And it would make these functions simpler! Will do this soon as well.
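A minimal sketch of the dispatch idea from this thread, assuming tracers are stored as a named tuple of arrays (plain vectors stand in for fields here). The empty type is spelled NamedTuple{(), Tuple{}}, which is what typeof(NamedTuple()) returns; whether the PR adopts this exact form is not stated in the conversation.

```julia
# Empty named tuple: nothing to restore, so this method is a no-op.
restore_prognostic_state!(fields::NamedTuple{(), Tuple{}}, state) = fields

# Non-empty named tuple: restore each entry by name.
function restore_prognostic_state!(fields::NamedTuple, state)
    for name in keys(fields)
        fields[name] .= state[name]   # copy saved values into the existing arrays
    end
    return fields
end

# A model with tracers and a model without tracers go through the same call,
# so the caller no longer needs `if length(model.tracers) > 0`.
tracers = (T = zeros(2), S = zeros(2))
restore_prognostic_state!(tracers, (T = [1.0, 2.0], S = [3.0, 4.0]))

no_tracers = restore_prognostic_state!(NamedTuple(), nothing)
```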

ali-ramadhan · 2025-11-13T18:05:53Z

Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues.

glwagner · 2025-11-15T04:15:12Z

src/Models/ShallowWaterModels/shallow_water_model.jl

+        restore_prognostic_state!(model.tracers, state.tracers)
+    end
+
+    if !isnothing(model.closure_fields)


handle with dispatch?
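For the isnothing check quoted above, the dispatch alternative could look like the sketch below. ToyShallowWaterModel and restore_closure_fields! are hypothetical names introduced for illustration, and a plain vector stands in for the closure fields.

```julia
# Models without closure fields store `nothing`, so restoring is a no-op.
restore_prognostic_state!(::Nothing, state) = nothing

# Models with closure fields copy the saved values into the existing array.
function restore_prognostic_state!(field::Vector{Float64}, state)
    field .= state
    return field
end

# Hypothetical model type whose closure_fields may be `nothing`.
struct ToyShallowWaterModel{C}
    closure_fields :: C
end

# The caller has no branch; dispatch picks the right method.
function restore_closure_fields!(model, state)
    restore_prognostic_state!(model.closure_fields, state.closure_fields)
    return model
end

without_closure = ToyShallowWaterModel(nothing)
restore_closure_fields!(without_closure, (; closure_fields = nothing))

with_closure = ToyShallowWaterModel(zeros(3))
restore_closure_fields!(with_closure, (; closure_fields = [0.1, 0.2, 0.3]))
```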

glwagner · 2025-11-15T04:16:08Z

CATKE is an edge case so it should be tested

ali-ramadhan added 6 commits October 30, 2025 07:11

First stab at starting to support checkpointing simulations

390f24e

Start working on some new tests

751072c

Parameterize a couple of tests

bc39dd5

Replace old tests

d131070

Fix archs for checkpointer tests

ee79883

Merge branch 'main' into ali/checkpointing-that-works

30d4ccf

navidcy added the output 💾 label Nov 1, 2025

ali-ramadhan added 5 commits November 12, 2025 16:57

Merge branch 'main' into ali/checkpointing-that-works

0f79241

Checkpointing output writers

c3838da

Checkpointing and restoring Lagrangian particles

d721d9b

Checkpoint the hydrostatic model

50cd623

Merge branch 'ali/checkpointing-that-works' of github.com:CliMA/Oceananigans.jl into ali/checkpointing-that-works

629381f

glwagner reviewed Nov 13, 2025

src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl (outdated, resolved)

ali-ramadhan and others added 3 commits November 12, 2025 22:34

Update src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl

Co-authored-by: Gregory L. Wagner <[email protected]>

f6d8bfc

Nonhydrostatic diffusivity fields are now called closure fields

e155376

Fix model prognostic_state

71cffaa

glwagner reviewed Nov 13, 2025

ali-ramadhan added 3 commits November 13, 2025 09:19

Checkpointing MultiRegionObject

5a1e461

Checkpointing for free surfaces

d4c25bd

Properly checkpoint simulation to not override new stop criteria

1f6814c

ali-ramadhan added 3 commits November 13, 2025 17:25

Merge branch 'main' into ali/checkpointing-that-works

0802088

Checkpoint SplitRungeKutta3TimeStepper

3b3eb39

Test checkpointing hydrostatic models

c512323

ali-ramadhan mentioned this pull request Nov 14, 2025

Checkpointing doesn't seem to be bit-for-bit with NonhydrostaticModel and variable z spacing #4904

Open

ali-ramadhan added 4 commits November 14, 2025 09:42

Merge branch 'ali/checkpointing-that-works' of github.com:CliMA/Oceananigans.jl into ali/checkpointing-that-works

4af2871

Get rid of checkpointer properties

bf03663

Checkpoint shallow water models

6a7f654

Test checkpointing shallow water models

d2ef109

ali-ramadhan added 2 commits November 14, 2025 17:48

Merge branch 'main' into ali/checkpointing-that-works

3dbf637

Fix test archs for CI

4437c23

glwagner reviewed Nov 15, 2025

Update checkpointing for ImplicitFreeSurface

add807d
