Can't continue training with cosine scheduler #2910
-
I updated my copy of kohya recently and I can't continue training anything anymore, even if I start from scratch. If I try cosine, it will continue, but at half the training value, then one third, then one quarter, etc. In the past, it would just flip the cosine every time I continued training. With cosine with restarts, I was able to continue once. But after that, it wants to do twice as many iterations and the training value is zero. The only way I could continue the first time was to set epoch to 2, max epoch to 2 and LR cycles to 2. But if I increase to 3, it doubles the iterations. And if I leave it at 2, the training value goes to 0. Can we not continue training anymore? How is this supposed to work? This all worked fine in the past.
Replies: 2 comments
-
I'm a programmer, though I don't do much Python these days. I looked at the code and printed out some values. It seems there's a bug in saving the state. When a state is saved, it only saves the number of steps that were executed in that session, not the running total. This is why you can continue once, but the second time you try to continue, it tries to produce epoch 3 while the state says only one epoch's worth of steps has executed. This is why the number of steps doubles on the second attempt to continue. So either this bug needs to be fixed or I need to find a way to manually specify the initial step or initial epoch. Not sure how to do that yet.
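To make the doubling concrete, here is a toy model of what I think is happening. This is not kohya's actual code: `remaining_steps`, `simulate` and the `cumulative` switch are names I made up for illustration, and the 500 steps per epoch is just an example figure.

```python
def remaining_steps(target_epoch, steps_per_epoch, saved_steps):
    """Steps still to run to reach target_epoch, given what the saved state reports."""
    return target_epoch * steps_per_epoch - saved_steps


def simulate(num_sessions, steps_per_epoch, cumulative):
    """Train one more epoch per session and return the steps each session runs.

    cumulative=False models the suspected bug: the saved state only holds the
    steps executed in the most recent session (hypothetical model).
    cumulative=True models a state that stores the running total.
    """
    saved = 0
    runs = []
    for target_epoch in range(1, num_sessions + 1):
        todo = remaining_steps(target_epoch, steps_per_epoch, saved)
        runs.append(todo)
        # The buggy variant overwrites the counter instead of accumulating it.
        saved = saved + todo if cumulative else todo
    return runs


if __name__ == "__main__":
    print("session-only counter:", simulate(3, 500, cumulative=False))
    print("cumulative counter:  ", simulate(3, 500, cumulative=True))
```

In this model the first continue still works because the session count and the total happen to be equal after one epoch. They only diverge from the second continue onward, which matches what I saw.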
-
Ok, I found a workaround. If you want to use cosine with restarts, set max train epoch and max train steps both to 0. So far, this is normal. But the initial_step value will be wrong on the third epoch (the second time you continue). To get around this, calculate what the start step number should be. If you have 500 iterations per epoch and you're on the 3rd epoch, then the start iteration number is 1000. In "additional parameters", add "--initial_step 1000" without the quotes, changing the number to your own start iteration number. Remember to divide by the batch size if you use one.
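If you don't want to do the arithmetic by hand, a small helper like this works out the number to paste in. It's only a sketch: `steps_per_epoch` and `initial_step` are my own helper names, not part of kohya, and rounding up when the iteration count doesn't divide evenly by the batch size is my assumption, so check the result against the step count shown in your training log.

```python
import math


def steps_per_epoch(iterations_per_epoch, batch_size=1):
    """Steps in one epoch after dividing by the batch size.

    Rounding up for a partial final batch is an assumption, not confirmed
    against the trainer's own step counting.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(iterations_per_epoch / batch_size)


def initial_step(epoch, per_epoch_steps):
    """Start step for the given 1-based epoch: all earlier epochs are already done."""
    if epoch < 1:
        raise ValueError("epoch must be at least 1")
    return (epoch - 1) * per_epoch_steps


if __name__ == "__main__":
    # The example from this thread: 500 iterations per epoch, continuing into epoch 3.
    start = initial_step(3, steps_per_epoch(500))
    print(f"--initial_step {start}")
```

With a batch size of 2, the same 500 iterations become 250 steps per epoch, so the start step for epoch 3 would be 500 instead of 1000.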
Ok, I found a workaround.
So if you want to use cosine with restarts, set max train epoch and max train steps both to 0.
Set epoch to the epoch you are generating.
And set LR cycles as if you're generating ALL the epochs. So if you're generating one epoch at a time and you want to restart the cosine every epoch, set this value to the epoch number. So if epoch is 3, set LR cycles to 3 as well.
So far, this is normal. But the initial_step value will be wrong on the third epoch (or second time continuing). To get around this, calculate what the start step number should be. If you have 500 iterations per epoch and you're on the 3rd epoch, then the start iteration number is 1000.
In "additiona…