@lucidrains
While training MuLaN on a dataset of around 5.2k samples, the loss goes to NaN after some 15-16k steps.
My batch size is 4, and the text part of the data samples is tokenized using:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
text_in_numbers = tokenizer.encode(text)

Does it have something to do with a division by zero, or a square root of 0 in the loss function?
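For what it's worth, a common way such NaNs appear is an l2 normalization (or square root) applied to a zero-valued tensor. Below is a minimal, self-contained sketch (not MuLaN's actual internals; the tensors are illustrative) showing the failure mode and an epsilon-guarded alternative:

# Hypothetical example: a zero-norm embedding makes plain l2 normalization
# divide 0 by 0, producing NaNs; an epsilon in the denominator avoids this.
import torch
import torch.nn.functional as F

emb = torch.zeros(4, 128)  # illustrative batch of all-zero embeddings

# plain normalization: divides by a zero norm -> NaN
naive = emb / emb.norm(dim=-1, keepdim=True)
print(torch.isnan(naive).any())   # tensor(True)

# guarded normalization: eps keeps the denominator strictly positive
safe = F.normalize(emb, dim=-1, eps=1e-8)
print(torch.isnan(safe).any())    # tensor(False)

# during training, anomaly detection can pinpoint the op that first
# produces a NaN in the backward pass
torch.autograd.set_detect_anomaly(True)

Enabling anomaly detection (or simply asserting torch.isfinite(loss) each step) would help confirm whether the NaN originates in the contrastive loss or earlier in the forward pass.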