Open
Labels
bug (Something isn't working)
Description
What happened?
There are multiple reports of training instability: the training loss shows sudden jumps that were not observed previously.
The example below comes from a run using anemoi with this release, i.e. training: v0.8.4, models: v0.11.3, graphs: v0.8.3.
However, the issue has appeared in previous releases as well.
It appears to affect mostly (possibly only) global models.
What are the steps to reproduce the bug?
A simple training run with default configurations can reproduce the issue.
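To check whether a given run reproduces the issue, it may help to make "sudden jump" concrete. The sketch below is only an illustration, not part of anemoi: `find_loss_spikes`, `window`, and `factor` are hypothetical names and thresholds, and the loss curve is synthetic rather than taken from the reported runs. It flags any step whose loss exceeds a multiple of the median over the preceding steps.

```python
from statistics import median


def find_loss_spikes(losses, window=20, factor=2.0):
    """Return indices where the loss exceeds `factor` times the median
    of the previous `window` values (hypothetical helper, not anemoi API)."""
    spikes = []
    for i in range(window, len(losses)):
        # The median is robust to a single earlier outlier in the window.
        baseline = median(losses[i - window:i])
        if losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Synthetic, smoothly decaying loss with one injected jump at step 120.
losses = [1.0 / (1 + 0.01 * step) for step in range(200)]
losses[120] *= 5.0

print(find_loss_spikes(losses))
```

The per-step losses would have to be exported from whatever logger the run uses; the threshold values are assumptions and would need tuning for a real loss curve.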
Version
training: v0.8.4, models: v0.11.3, graphs: v0.8.3
Platform (OS and architecture)
Linux, NVIDIA A100 GPUs
Relevant log output
Accompanying data
No response
Organisation
No response
Status
To be triaged