Inquiry about the Definition of "Training Gradient Norm" in SANA-Sprint's Figure 3 Subplots


Hello, I would like to ask about Figure 3 in SANA-Sprint. In the two subplots, does the "training gradient norm" refer to the gradient of the loss with respect to the trainable parameters \(\theta\) during training, or to the norm of \(dF/dt\)?

I ask because the loss computation already normalizes \(df/dt\) (\(g = g / (\|g\| + c)\)), so logically an excessively large \(df/dt\) should not significantly affect training stability. Is that understanding correct?
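To make the normalization I am referring to concrete, here is a minimal sketch in plain Python. The function names, the example vectors, and the value `c = 0.1` are my own assumptions for illustration and are not taken from the SANA-Sprint code.

```python
import math


def l2_norm(v):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize_tangent(g, c=0.1):
    """Return g / (||g|| + c). The constant c = 0.1 is an assumed value."""
    scale = l2_norm(g) + c
    return [x / scale for x in g]


# Two hypothetical tangents: one small, one very large.
small = [0.3, 0.4]          # ||g|| = 0.5
large = [3000.0, 4000.0]    # ||g|| = 5000

for g in (small, large):
    # After normalization the norm is ||g|| / (||g|| + c), which is always < 1.
    print(l2_norm(g), l2_norm(normalize_tangent(g)))
```

Under these assumptions the normalized norm stays below 1 no matter how large the raw tangent is, which is why I would not expect a large \(df/dt\) on its own to destabilize training.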

![Image](https://github.com/user-attachments/assets/1d8e6120-c172-46aa-8c7b-fd7bac23f07b)
