
The grad norm is nan #4

Open

sister-tong opened this issue Apr 10, 2024 · 5 comments

@sister-tong

Hi author, I'm getting the following warnings when training a Branchformer with summary_mixing:

[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:12,899 (ctc:67) WARNING: 13/34 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,133 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,263 (ctc:67) WARNING: 7/32 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,477 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,625 (ctc:67) WARNING: 21/45 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:13,858 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,022 (ctc:67) WARNING: 21/62 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,248 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,499 (ctc:67) WARNING: 37/105 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,735 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:14,875 (ctc:67) WARNING: 12/39 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,104 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,261 (ctc:67) WARNING: 23/56 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,479 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,623 (ctc:67) WARNING: 20/47 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:15,854 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:16,004 (ctc:67) WARNING: 15/53 samples got nan grad. These were ignored for CTC loss.
[autodl-container-4d6411b93c-8a044365] 2024-04-10 17:11:16,224 (build_trainer:660) WARNING: The grad norm is nan. Skipping updating the model.

Why is this happening?

@TParcollet (Contributor)

Hello there, we would need much more information about the model/trainer/data/task to give you an answer. SummaryMixing does not, in itself, induce more instability during training than MHSA. With more information on the code, we could try to help.

@sister-tong (Author)

I tried printing the output of summary_mixing, and the tensor contains NaN values. What could be the reason for this?
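
For reference, NaNs like this can be localized with forward hooks that flag the first module producing non-finite outputs. This is a minimal, generic PyTorch sketch, independent of FunASR; the encoder below is only a stand-in:

  import torch
  import torch.nn as nn

  def add_nan_hooks(model):
      # Report every module whose output contains NaN or inf, so the
      # first offending layer in the forward pass can be identified.
      def make_hook(name):
          def hook(module, inputs, output):
              outs = output if isinstance(output, (tuple, list)) else (output,)
              for t in outs:
                  if torch.is_tensor(t) and not torch.isfinite(t).all():
                      print(f"non-finite output in module: {name}")
          return hook
      for name, module in model.named_modules():
          module.register_forward_hook(make_hook(name))

  encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU())  # stand-in encoder
  add_nan_hooks(encoder)
  encoder(torch.randn(4, 100, 80))  # hooks fire during this forward pass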

@TParcollet (Contributor)

Hi, I'm afraid we need much more information to help you here. This could be due to many reasons, most of which are likely not connected to SummaryMixing. Please describe your setup.

@sister-tong (Author)

Hi, when I print the encoder input while using summary_mixing, I find NaN in it, but when I use RelPositionMultiHeadedAttention the input has no NaN.
This is my environment; the exact model configuration and the encoder structure are in the zip.

  linux = Ubuntu 20.04.4
  python = 3.8.18
  torch = 2.0.1
  funasr = 0.8.2
  modelscope = 1.9.3

code.zip
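
To narrow down whether the NaN first appears in the forward or the backward pass, PyTorch's built-in anomaly detection can be wrapped around a single training step. A sketch with placeholder model and data, not the actual FunASR trainer:

  import torch
  import torch.nn as nn

  model = nn.Linear(80, 4)        # placeholder for the real encoder/model
  feats = torch.randn(2, 80)      # placeholder batch

  # With anomaly detection on, a NaN/inf gradient raises an error whose
  # traceback points at the forward operation that produced it.
  with torch.autograd.set_detect_anomaly(True):
      loss = model(feats).sum()
      loss.backward()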

@TParcollet (Contributor)

Hello,

I've had a quick look at your code, but I am far too unfamiliar with this codebase to make any meaningful comment. My only remark is that we have never encountered a NaN issue with SummaryMixing, so it may not be plugged in properly (be careful with the masking, for instance; see the sketch below).
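
To illustrate the masking point: SummaryMixing condenses each utterance into a summary vector (in the paper, a time average of per-frame projections), so padded frames must be excluded from that average. A minimal sketch of a masked summary, with tensor names that are assumptions rather than the repository's actual API:

  import torch

  def masked_summary(x, mask):
      # x: (batch, time, feat); mask: (batch, time), 1 = real frame, 0 = padding.
      mask = mask.unsqueeze(-1).to(x.dtype)      # (B, T, 1)
      total = (x * mask).sum(dim=1)              # padded frames contribute zero
      count = mask.sum(dim=1).clamp(min=1.0)     # guard against empty utterances
      return total / count                       # (B, F) summary vector

A plain x.mean(dim=1) over padded inputs biases the summary, and if the padding was filled with -inf (as is done for attention scores before a softmax) the average becomes NaN, which would match the symptom reported above.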
