Hello, I would like to ask about Figure 3 in SANA-Sprint. In the two subplots, does the "training gradient norm" refer to the gradients of the trainable parameters θ during training, or to the norm of dF/dt?
I ask because the loss computation already normalizes dF/dt (g = g / (||g|| + c)), so logically an excessively large dF/dt should not significantly affect training stability. Is my understanding correct?
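
To make the reasoning concrete, here is a minimal sketch of the normalization I have in mind. It is not taken from the SANA-Sprint code: the function name `normalize_tangent`, the flattened-list representation of g, and the value c = 0.1 are all assumptions for illustration.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize_tangent(g, c=0.1):
    """Hypothetical helper: apply g = g / (||g|| + c).

    g stands in for a flattened dF/dt; c is an assumed small positive constant.
    """
    scale = l2_norm(g) + c
    return [x / scale for x in g]


# A modest tangent and a very large one (made-up values).
small = normalize_tangent([0.3, 0.4])
large = normalize_tangent([3000.0, 4000.0])

# After normalization the norm is ||g|| / (||g|| + c), which stays below 1
# no matter how large the input was.
print(l2_norm(small), l2_norm(large))
```

Since the normalized norm equals ||g|| / (||g|| + c), it is bounded by 1 for any c > 0, which is why I would expect a large dF/dt alone not to destabilize training. This only bounds the tangent term, though; it says nothing directly about the gradient with respect to θ, hence my question about which quantity the figure plots.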