In this line:

```python
backward(total_loss, scaler)
```
we are accumulating the gradients and performing the optimizer step only after accumulating gradients for `accum_freq` steps.
I am wondering whether we need to divide `total_loss` by `accum_freq` to scale the loss properly.
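
For reference, here is a minimal, self-contained sketch of the plain gradient-accumulation pattern I have in mind, assuming the loss is a mean over each micro-batch; the model, optimizer, and data here are placeholders, not the actual training-script objects:

```python
import torch
from torch import nn

# Placeholder setup; the real model/optimizer/scaler come from the training script.
model = nn.Linear(10, 1).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()
accum_freq = 4  # number of micro-batches to accumulate before one optimizer step

optimizer.zero_grad()
for step in range(100):
    x = torch.randn(8, 10, device="cuda")
    y = torch.randn(8, 1, device="cuda")
    with torch.autocast(device_type="cuda"):
        total_loss = nn.functional.mse_loss(model(x), y)
    # Dividing by accum_freq makes the summed (accumulated) gradients equal
    # the gradient of the mean loss over the accum_freq micro-batches.
    scaler.scale(total_loss / accum_freq).backward()
    if (step + 1) % accum_freq == 0:
        scaler.step(optimizer)   # unscales grads, then calls optimizer.step()
        scaler.update()
        optimizer.zero_grad()
```

Without the division, the accumulated gradient corresponds to the *sum* of the micro-batch losses rather than their mean, which effectively multiplies the learning rate by `accum_freq`.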