Resuming training from a state that was itself saved during a resumed training run behaves strangely.
Example: 5 epochs. Save state every epoch.
- Blue line: Normal training from start to finish.
- Red line: Resume from state of epoch 2.
resume = "E:/training/output/test_1-000002-state"
- Yellow line: Resumed from the first saved state (epoch 3) of the previously resumed training.
resume = "E:/training/output/test_2-000003-state"
The first resumed training (red line) trains for 3 epochs and finishes at a total of 2+3=5 epochs, as expected.
The twice-resumed training (yellow line) trains for 4 epochs, resulting in a total of 2+1+4=7 epochs of training instead of 5.
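The numbers above would fit a counter that is stored relative to the start of the resumed run instead of cumulatively. Here is a minimal sketch of that arithmetic. It is a hypothetical model only, not the trainer's actual code, and the names `epochs_trained_after_resume`, `simulate` and `counter_is_cumulative` are made up for illustration.

```python
# Hypothetical model of the epoch bookkeeping; not the trainer's actual code.
TOTAL_EPOCHS = 5  # target number of epochs from the example


def epochs_trained_after_resume(total_epochs: int, stored_epoch: int) -> int:
    """A resumed run trains until the stored counter reaches total_epochs."""
    return max(total_epochs - stored_epoch, 0)


def simulate(counter_is_cumulative: bool) -> tuple[int, int]:
    """Return the real total epochs for the red and the yellow line."""
    # The original run saved a state after 2 real epochs.
    real_epochs = 2
    stored = 2

    # Red line: resume from the epoch-2 state and train to the end.
    red_total = real_epochs + epochs_trained_after_resume(TOTAL_EPOCHS, stored)

    # The resumed run saves its first state after one more epoch.
    real_epochs += 1
    # Assumption under test: the counter may only count epochs of this run.
    stored = real_epochs if counter_is_cumulative else 1

    # Yellow line: resume from that state.
    yellow_total = real_epochs + epochs_trained_after_resume(TOTAL_EPOCHS, stored)
    return red_total, yellow_total


print(simulate(counter_is_cumulative=True))   # (5, 5)
print(simulate(counter_is_cumulative=False))  # (5, 7)
```

With a run-relative counter the model gives 2+1+4=7 for the yellow line, which matches what I observed. That only shows the hypothesis is consistent with the numbers, not that it is the actual cause.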
This may be as simple as "current_step" being saved with the wrong number in train_state.json, but I don't know the code well enough to tell whether that is the problem.
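A quick way to check this would be to print the counters stored in both state folders and compare them. This is a rough sketch that assumes train_state.json sits directly inside each state folder and that it has a "current_epoch" key next to "current_step". The second key name is my assumption, and `read_train_state` and `describe` are just helper names I made up.

```python
import json
from pathlib import Path


def read_train_state(state_dir):
    """Load train_state.json from a state folder, or return None if absent."""
    path = Path(state_dir) / "train_state.json"
    if not path.is_file():
        return None
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def describe(state_dir, keys=("current_epoch", "current_step")):
    """Build a one-line summary of the counters stored in a state folder."""
    state = read_train_state(state_dir)
    if state is None:
        return f"{state_dir}: no train_state.json found"
    # Keys that are not present are reported instead of raising an error.
    fields = ", ".join(f"{k}={state.get(k, '<missing>')}" for k in keys)
    return f"{state_dir}: {fields}"


if __name__ == "__main__":
    for state_dir in (
        "E:/training/output/test_1-000002-state",
        "E:/training/output/test_2-000003-state",
    ):
        print(describe(state_dir))
```

If the epoch-3 state of the resumed run shows a smaller step count than the epoch-2 state of the original run, that would support the idea that the counter restarts on resume.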
Used training settings: training_tomls.zip