Open
Labels
bug (Something isn't working), model (Related to model training or definition, not generic infra)
Description
What happened?
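There is a large gap between training loss and validation loss. In run x8d3aehb, the ERA5 MSE is 1.3673E-01 at the last logged training step but 6.7917E-01 in validation. NPPATMS and SurfaceCombined report 0.0000E+00 in validation.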
Training Loss:
0: 011 : 00500/00512 : 006132 : loss = 1.4076E-01 (lr=3.31E-06, s/sec=0.540)
0:
0: LossPhysical.ERA5.mse.avg : 1.3174E-01
0: LossPhysical.NPPATMS.mse.avg : 1.6151E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.7438E-01
0: LossPhysical.loss_avg : 1.4076E-01
0:
0:
0: Saved model to /capstor/store/cscs/userlab/ch17/shared_work/models/x8d3aehb/x8d3aehb_latest.chkpt
0: 011 : 00510/00512 : 006142 : loss = 1.4489E-01 (lr=5.52E-07, s/sec=0.066)
0:
0: LossPhysical.ERA5.mse.avg : 1.3673E-01
0: LossPhysical.NPPATMS.mse.avg : 2.0788E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.7715E-01
0: LossPhysical.loss_avg : 1.4489E-01
Validation Loss:
0: validation (x8d3aehb) : 011 :
0: 0.22639146680012345
0: LossPhysical.ERA5.mse.avg : 6.7917E-01
0: LossPhysical.NPPATMS.mse.avg : 0.0000E+00
0: LossPhysical.SurfaceCombined.mse.avg : 0.0000E+00
0: LossPhysical.loss_avg : 2.2639E-01
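A quick arithmetic check on the numbers above, as a sketch rather than a statement about the actual implementation: the logged values are consistent with loss_avg being the unweighted mean over all three streams, including the two that report 0.0 in validation. If that assumption holds, the validation loss_avg (2.2639E-01) understates the ERA5 validation error (6.7917E-01) by a factor of three, and the headline numbers for training and validation are not directly comparable. The helper name `unweighted_mean` is hypothetical; the values are copied from the logs.

```python
import math

# Per-stream MSE values copied from the x8d3aehb logs above.
train_streams = {"ERA5": 1.3673e-01, "NPPATMS": 2.0788e-02, "SurfaceCombined": 2.7715e-01}
val_streams = {"ERA5": 6.7917e-01, "NPPATMS": 0.0, "SurfaceCombined": 0.0}

# loss_avg values as logged.
train_logged = 1.4489e-01
val_logged = 2.2639e-01


def unweighted_mean(streams):
    """Assumption: loss_avg is the plain mean over every listed stream."""
    return sum(streams.values()) / len(streams)


train_mean = unweighted_mean(train_streams)
val_mean = unweighted_mean(val_streams)

# Both checks pass if the assumption matches the logged values
# to the precision printed in the log (5 significant digits).
train_matches = math.isclose(train_mean, train_logged, rel_tol=1e-4)
val_matches = math.isclose(val_mean, val_logged, rel_tol=1e-4)

print(f"train mean = {train_mean:.4e}, matches log: {train_matches}")
print(f"val mean   = {val_mean:.4e}, matches log: {val_matches}")
```

Whether the zeros mean the two streams were not evaluated, or were evaluated and produced no loss terms, cannot be determined from the logs alone.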
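When we use RoPE (run rlwkrmhj), the gap is much smaller: the ERA5 MSE is 1.2781E-01 at the last logged training step and 1.6550E-01 in validation.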
Training Loss:
0: 011 : 00500/00512 : 006132 : loss = 1.4425E-01 (lr=1.41E-04, s/sec=0.554)
0:
0: LossPhysical.ERA5.mse.avg : 1.2814E-01
0: LossPhysical.NPPATMS.mse.avg : 1.9135E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.8549E-01
0: LossPhysical.loss_avg : 1.4425E-01
0:
0:
0: Saved model to /capstor/store/cscs/userlab/ch17/shared_work/models/rlwkrmhj/rlwkrmhj_latest.chkpt
0: 011 : 00510/00512 : 006142 : loss = 1.4240E-01 (lr=1.41E-04, s/sec=0.068)
0:
0: LossPhysical.ERA5.mse.avg : 1.2781E-01
0: LossPhysical.NPPATMS.mse.avg : 1.5484E-02
0: LossPhysical.SurfaceCombined.mse.avg : 2.8392E-01
0: LossPhysical.loss_avg : 1.4240E-01
0:
Validation Loss:
0: validation (rlwkrmhj) : 011 :
0: 0.055168171995319426
0: LossPhysical.ERA5.mse.avg : 1.6550E-01
0: LossPhysical.NPPATMS.mse.avg : 0.0000E+00
0: LossPhysical.SurfaceCombined.mse.avg : 0.0000E+00
0: LossPhysical.loss_avg : 5.5168E-02
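To compare the two runs on a like-for-like basis, the sketch below computes the ratio of ERA5 validation MSE to ERA5 training MSE for each run, using only values copied from the logs. ERA5 is used because it is the only stream with a non-zero validation loss. Labelling x8d3aehb as the run without RoPE is an assumption based on the wording of this report.

```python
# ERA5 MSE at the last logged training step (00510/00512) and in validation,
# copied from the logs in this report.
runs = {
    # Assumption: x8d3aehb is the run without RoPE.
    "x8d3aehb": {"train": 1.3673e-01, "val": 6.7917e-01},
    "rlwkrmhj": {"train": 1.2781e-01, "val": 1.6550e-01},
}

# Ratio > 1 means validation error is higher than training error.
ratios = {name: mse["val"] / mse["train"] for name, mse in runs.items()}

for name, ratio in ratios.items():
    print(f"{name}: ERA5 val/train MSE ratio = {ratio:.2f}")
```

The training MSE is nearly the same in both runs, so the difference between them is almost entirely on the validation side. Note also that the logged learning rates differ at this step (5.52E-07 for x8d3aehb, 1.41E-04 for rlwkrmhj), so RoPE may not be the only difference between the two configurations.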
What are the steps to reproduce the bug?
No response
Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.
No response
Status
In Progress