Resuming training from a state that was resumed training from earlier state behaves weird #2171

@Hirmuolio

Resuming training from a state that was itself saved during a resumed training run behaves strangely.

Example: 5 epochs. Save state every epoch.

  • Blue line: Normal training from start to finish.
  • Red line: Resumed from the state of epoch 2. resume = "E:/training/output/test_1-000002-state"
  • Yellow line: Resumed from the first saved state (epoch 3) of the previously resumed training. resume = "E:/training/output/test_2-000003-state"
Image

The first resumed training (red line) trains for 3 epochs and finishes at a total of 2+3=5 epochs, as expected.
The resumed-resumed training (yellow line) trains for 4 epochs, resulting in a total of 2+1+4=7 epochs of training instead of the expected 5.
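
The epoch arithmetic above can be written out as a small sketch. It assumes, hypothetically, that a resumed run trains for "configured total minus the epoch count recorded in the state"; the function name and that rule are illustrations, not taken from the training script. Under that assumption the observed 4 epochs would match a state that recorded 1 epoch (only the epochs of the resumed run) rather than 3, but this is only one reading consistent with the numbers, not a confirmed cause.

```python
def epochs_trained_after_resume(total_epochs: int, epochs_recorded_in_state: int) -> int:
    """Hypothetical rule: a resumed run trains the epochs that remain
    after subtracting whatever epoch count the saved state reports."""
    return max(total_epochs - epochs_recorded_in_state, 0)


TOTAL_EPOCHS = 5  # from the example: 5 epochs, state saved every epoch

# Red line: resumed from the epoch 2 state of the normal run.
red_epochs = epochs_trained_after_resume(TOTAL_EPOCHS, 2)

# Yellow line, expected: the state represents 3 epochs of training in total.
yellow_expected = epochs_trained_after_resume(TOTAL_EPOCHS, 3)

# Yellow line, if the state only recorded the 1 epoch trained since the resume.
yellow_if_run_local = epochs_trained_after_resume(TOTAL_EPOCHS, 1)

print("red:", red_epochs)
print("yellow (expected):", yellow_expected)
print("yellow (if state counts only the resumed run):", yellow_if_run_local)
```

The last value matches the 4 epochs seen in the report, which is why the saved counters are worth checking.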

This may be as simple as "current_step" being saved with the wrong value in train_state.json, but I don't know the code well enough to tell whether that is the problem.
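
One way to check this guess is to print the contents of train_state.json from each saved state and compare the counters. The sketch below assumes the file sits directly inside the state directory and is plain JSON; that location and the helper names are assumptions. It prints whatever keys it finds rather than relying on specific field names.

```python
import json
from pathlib import Path


def read_train_state(state_dir):
    """Load train_state.json from a saved state directory.
    Assumption: the file lives at the top level of the state directory."""
    path = Path(state_dir) / "train_state.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)


def compare_states(state_dirs):
    """Return {directory: parsed train_state.json} for directories that have the file."""
    result = {}
    for state_dir in state_dirs:
        if (Path(state_dir) / "train_state.json").is_file():
            result[str(state_dir)] = read_train_state(state_dir)
    return result


if __name__ == "__main__":
    # Paths taken from the report; missing directories are skipped.
    dirs = [
        "E:/training/output/test_1-000002-state",
        "E:/training/output/test_2-000003-state",
    ]
    for name, state in compare_states(dirs).items():
        print(name, state)
```

If the second state reports a smaller step or epoch count than the first, that would support the idea that the counter restarts on resume.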

Used training settings: training_tomls.zip
