-
Notifications
You must be signed in to change notification settings - Fork 252
Checkpointing simulations #4892
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…anigans.jl into ali/checkpointing-that-works
src/Models/HydrostaticFreeSurfaceModels/hydrostatic_free_surface_model.jl
Outdated
Show resolved
Hide resolved
…ce_model.jl Co-authored-by: Gregory L. Wagner <[email protected]>
| if length(model.closure_fields) > 0 | ||
| restore_prognostic_state!(model.closure_fields, state.closure_fields) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we handle this with dispatch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also for dispatch I think this may need to know the closure as well. This is a unique object that is "managed" by the closure but doesn't store much identifying info. We could also change that design, but might want to dedicate / test in a prior PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For sure. Right now I'm still working on getting all the existing tests to pass, but once they do I want to start testing checkpointing more and more complex simulations. As part of it, we should also test closures that use model.closure_fields.
| if length(model.tracers) > 0 | ||
| restore_prognostic_state!(model.tracers, state.tracers) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it might also be possible to catch this with dispatch on ::NamedTuple{} ( I think that's the rigfht way to write empty NamedTuple)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could! And it would make these functions simpler! Will do this soon as well.
|
Looks like tests will all pass 🎉 I'll start testing the checkpointing of increasingly complex simulations while iterating on the design! This way we'll be able to weed out most bugs and issues. |
…anigans.jl into ali/checkpointing-that-works
| restore_prognostic_state!(model.tracers, state.tracers) | ||
| end | ||
|
|
||
| if !isnothing(model.closure_fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle with dispatch?
|
CATKE is an edge case so it should be tested |
This PR refactors how the
Checkpointerworks by now checkpointing simulations, rather than just models. This is needed as the simulations (+ output writers, callbacks, etc.) all contain crucial information needed to properly restore/pickup a simulation and continue time stepping.Basic design idea:
prognostic_state(obj)which returns a named tuple corresponding to the prognostic state ofobjandrestore_prognostic_state!(obj, state)which restoresobjbased on information contained instate(which is a named tuple and is read from a checkpoint file).prognostic_stateandrestore_prognostic_state!.Right now I've only implemented proper checkpointing for non-hydrostatic model but it looks like it'll be straightforward to do it for hydrostatic and shallow water models. I'm working on adding comprehensive testing too.
Will continue working on this PR, but any feedback is very welcome!
Resolves #1249
Resolves #2866
Resolves #3670
Resolves #3845
Resolves #4516
Resolves #4857
Rhetorical aside
In general, the checkpointer is assuming that the simulation setup is the same. So only prognostic state information that changes will be checkpointed (e.g. field data,
TimeInterval.actuations, etc.). The approach I have been taking (based on #4857) is to only checkpoint the prognostic state.Should we operate under this assumption? I think so because not doing so can lead to a lot of undefined behavior. The checkpointer should not be responsible for checking that you set up the same simulation as the one that was checkpointed.
For example, take the
SpecifiedTimesschedule. It has two propertiestimesandprevious_actuation. Sinceprevious_actuationchanges as the simulation runs, onlyprevious_actuationneeds to be checkpointed.This leads to the possibility of the user changing
timesthen picking upprevious_actuationwhich can lead to undefined behavior. I think this is fine, because the checkpointer only works assuming you set up the same simulation as the one that was checkpointed.Checkpointing both
timesandprevious_actuationallows us to check thattimesis the same when restoring. But I don't think this is the checkpointer's responsibility.