Checkpoint/Restart framework

I gave this some more thought and it will be a pretty serious task, with some issues. Let's start with the one big issue I know of:

There is a bug in the standard library RNG shipped with gcc 4.6 (fixed in gcc 4.7): the state of the RNG is not saved or loaded correctly through a stream, so you can't just continue a calculation on the same random number sequence. Hence any test will yield different results after a restart, even if we reload the RNG. [This issue](https://stackoverflow.com/questions/5999144/how-to-save-state-of-c0x-random-number-generator) applies at least when the state is read from a stream; I am not sure about other serialization methods (see boost serialize, for example). Currently all our computers use gcc 4.6.
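To find out whether a given compiler is affected, something like the following self-check could be used. It is only a minimal sketch: the helper names `save_rng`, `load_rng` and `rng_round_trip_ok` are made up for illustration, and it assumes the RNG is a `std::mt19937` saved through the standard stream operators.

```cpp
#include <random>
#include <sstream>
#include <string>

// Hypothetical helper: write the full engine state as text
// using the standard stream insertion operator.
std::string save_rng(const std::mt19937& rng)
{
    std::ostringstream out;
    out << rng;
    return out.str();
}

// Hypothetical helper: restore an engine from text produced by save_rng.
// Returns false if the stream extraction fails.
bool load_rng(std::mt19937& rng, const std::string& state)
{
    std::istringstream in(state);
    in >> rng;
    return !in.fail();
}

// Self-check: advance an engine, checkpoint it, and verify that a restored
// engine continues the same sequence as the original one.
bool rng_round_trip_ok(unsigned seed = 42, int ndraws = 1000)
{
    std::mt19937 original(seed);
    for (int i = 0; i < 10; ++i) {
        original();  // move away from the freshly seeded state
    }
    const std::string checkpoint = save_rng(original);

    std::mt19937 restored;  // default-seeded, so its state differs
    if (!load_rng(restored, checkpoint)) {
        return false;
    }
    for (int i = 0; i < ndraws; ++i) {
        if (original() != restored()) {
            return false;
        }
    }
    return true;
}
```

On a standard library where stream save/load works, `rng_round_trip_ok()` should return `true`; a `false` result on our machines would confirm that we are hit by the bug described above.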

About the implementation:

Each module will need a **save** and a **load** function, and MC will in turn need its own save and load functions that store its own state and call the respective functions on the modules. I can't see a way around this, because some of the modules have their own RNG and a set of members that must be reinstated. That leaves the problem of I/O for all this information: it's not obvious to me how to serialize it out of C++, considering that there is also the Python layer that needs to be taken care of. Online I found a lot of references to [boost serialize](http://www.boost.org/doc/libs/1_47_0/libs/serialization/doc/index.html) for this sort of thing.
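To make the save/load delegation concrete, here is a rough sketch using plain text streams rather than boost serialize. All the names (`Checkpointable`, `StepModule`, `MC::add_module`, `checkpoint_round_trip_ok`) are hypothetical and do not reflect the existing classes; the point is only to show MC saving its own state and then forwarding to each module.

```cpp
#include <cstddef>
#include <iomanip>
#include <istream>
#include <limits>
#include <ostream>
#include <random>
#include <sstream>
#include <vector>

// Hypothetical interface: anything with state to checkpoint implements it.
class Checkpointable {
public:
    virtual ~Checkpointable() {}
    virtual void save(std::ostream& out) const = 0;
    virtual void load(std::istream& in) = 0;
};

// Hypothetical module that owns its own RNG and one member to reinstate.
class StepModule : public Checkpointable {
    std::mt19937 m_rng;
    double m_stepsize;
public:
    StepModule(unsigned seed, double stepsize) : m_rng(seed), m_stepsize(stepsize) {}
    unsigned long next() { return m_rng(); }
    double stepsize() const { return m_stepsize; }
    void save(std::ostream& out) const
    {
        // max_digits10 significant digits let the double round-trip exactly
        out << m_rng << '\n'
            << std::setprecision(std::numeric_limits<double>::max_digits10)
            << m_stepsize << '\n';
    }
    void load(std::istream& in) { in >> m_rng >> m_stepsize; }
};

// Hypothetical MC: saves its own state, then delegates to its modules.
class MC : public Checkpointable {
    std::size_t m_niter;
    std::vector<Checkpointable*> m_modules;  // non-owning pointers
public:
    MC() : m_niter(0) {}
    void add_module(Checkpointable* module) { m_modules.push_back(module); }
    void one_iteration() { ++m_niter; }
    std::size_t niter() const { return m_niter; }
    void save(std::ostream& out) const
    {
        out << m_niter << '\n';
        for (const Checkpointable* m : m_modules) m->save(out);
    }
    void load(std::istream& in)
    {
        in >> m_niter;
        // modules must be registered in the same order as when saving
        for (Checkpointable* m : m_modules) m->load(in);
    }
};

// Self-check: rebuild the objects from scratch, load, and compare.
bool checkpoint_round_trip_ok()
{
    StepModule module(7, 0.25);
    MC mc;
    mc.add_module(&module);
    for (int i = 0; i < 5; ++i) { mc.one_iteration(); module.next(); }
    std::stringstream buffer;
    mc.save(buffer);

    StepModule module2(0, 1.0);  // deliberately different initial state
    MC mc2;
    mc2.add_module(&module2);
    mc2.load(buffer);
    if (mc2.niter() != mc.niter()) return false;
    if (module2.stepsize() != module.stepsize()) return false;
    for (int i = 0; i < 100; ++i) {
        if (module.next() != module2.next()) return false;
    }
    return true;
}
```

Note that this relies on the objects being rebuilt with the same modules in the same order before `load` is called, and on the RNG stream operators working correctly, so it would still be affected by the gcc 4.6 problem.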

Boost serialize would introduce an additional dependency (it could be made optional, because checkpointing is a non-essential feature). It still requires quite a bit of writing at the C++ level, and the question remains of how to make this compatible with the Python layer, although that doesn't seem impossible: one would have to rebuild the object from Python and then call the MC::load function, which will make sure everything goes back to the old state.

If we can do this, then reinstating the Python layer state is simple, because it's just a matter of reloading `self.__dict__` before calling mc.load, unless there are unforeseen complications.

Oh, let's not forget that the potential and optimizer classes should be serialized too for proper checkpointing, so the Pele potentials and minimizers would have to undergo this revision as well, at least in principle.

This is the only way I can see to implement a proper checkpointing system without relying on something like the [BLCR system](http://crd.lbl.gov/groups-depts/ftg/projects/current-projects/BLCR) (Berkeley Lab Checkpoint/Restart).

Since this sounds like a ton of work to me, I would like to get as many suggestions as possible and get everyone to agree on the best course of action.

