Unstable training loops, exploding loss values #847

@MicheleCattaneo

What happened?

There are multiple reports of training instability, where the training loss displays sudden jumps that were not observed previously.
Below is an example using anemoi with this release, i.e. training: v0.8.4, models: v0.11.3, graphs: v0.8.3.

Image

However, the issue has appeared in previous releases as well.

The issue appears to occur mostly (possibly not exclusively) with global models.
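
To make "sudden jumps" easier to compare across runs and releases, one option is to flag steps where the logged loss exceeds a multiple of its recent median. The snippet below is only a rough sketch and not part of anemoi: `find_loss_spikes`, `window`, and `factor` are hypothetical names and thresholds, and the loss series is synthetic rather than taken from an actual run.

```python
from statistics import median


def find_loss_spikes(losses, window=50, factor=3.0):
    """Return indices where the loss exceeds `factor` times the median
    of the preceding `window` values (hypothetical helper, not anemoi API)."""
    spikes = []
    for i in range(window, len(losses)):
        # The median of the trailing window is robust to earlier isolated spikes.
        baseline = median(losses[i - window:i])
        if baseline > 0 and losses[i] > factor * baseline:
            spikes.append(i)
    return spikes


# Synthetic example: a slowly decaying loss with one injected jump at step 120.
losses = [1.0 / (1 + 0.01 * step) for step in range(200)]
losses[120] *= 10
print(find_loss_spikes(losses))  # prints [120]
```

Applying the same check to the logged training loss of affected and unaffected runs could help confirm which releases and model types show the jumps; the window and factor would likely need tuning per setup.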

What are the steps to reproduce the bug?

A simple training run with default configurations can reproduce the issue.

Version

training: v0.8.4, models: v0.11.3, graphs: v0.8.3

Platform (OS and architecture)

Linux, A100 Nvidia GPUs

Relevant log output

Accompanying data

No response

Organisation

No response

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Status: To be triaged
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests