Bug report: Validation of era5_nppatms_synop #1774

@csjfwang

What happened?

There is a large difference between the training loss and the validation loss (run x8d3aehb):

Training Loss:

0: 011 : 00500/00512 : 006132 : loss = 1.4076E-01 (lr=3.31E-06, s/sec=0.540)
0: 	
0: LossPhysical.ERA5.mse.avg : 1.3174E-01 	
0: LossPhysical.NPPATMS.mse.avg : 1.6151E-02 	
0: LossPhysical.SurfaceCombined.mse.avg : 2.7438E-01 	
0: LossPhysical.loss_avg : 1.4076E-01 	
0: 
0: 
0: Saved model to /capstor/store/cscs/userlab/ch17/shared_work/models/x8d3aehb/x8d3aehb_latest.chkpt
0: 011 : 00510/00512 : 006142 : loss = 1.4489E-01 (lr=5.52E-07, s/sec=0.066)
0: 	
0: LossPhysical.ERA5.mse.avg : 1.3673E-01 	
0: LossPhysical.NPPATMS.mse.avg : 2.0788E-02 	
0: LossPhysical.SurfaceCombined.mse.avg : 2.7715E-01 	
0: LossPhysical.loss_avg : 1.4489E-01 	

Validation Loss:

0: validation (x8d3aehb) : 011 : 
0:                         0.22639146680012345
0: LossPhysical.ERA5.mse.avg : 6.7917E-01 	
0: LossPhysical.NPPATMS.mse.avg : 0.0000E+00 	
0: LossPhysical.SurfaceCombined.mse.avg : 0.0000E+00 	
0: LossPhysical.loss_avg : 2.2639E-01 	
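One possible reading of the numbers above, sketched in Python: the reported `loss_avg` looks like a plain mean over the three streams, and in validation two of them are 0.0000E+00, so the average equals the ERA5 value divided by three. The helper names below are hypothetical and are not taken from the codebase. Whether the zeros mean "no validation samples for that stream" or "loss not computed" is an assumption that the logs alone cannot settle.

```python
import math

# Per-stream MSE values copied from the x8d3aehb logs above
# (training step 00510 and the validation output).
train_streams = {"ERA5": 1.3673e-01, "NPPATMS": 2.0788e-02, "SurfaceCombined": 2.7715e-01}
val_streams = {"ERA5": 6.7917e-01, "NPPATMS": 0.0, "SurfaceCombined": 0.0}

reported_train_avg = 1.4489e-01
reported_val_avg = 2.2639e-01


def mean_over_all_streams(streams):
    """Plain mean over every stream, zeros included (hypothetical helper)."""
    return sum(streams.values()) / len(streams)


def mean_over_nonzero_streams(streams):
    """Mean over streams that reported a non-zero loss (hypothetical helper)."""
    active = [v for v in streams.values() if v != 0.0]
    return sum(active) / len(active) if active else 0.0


train_avg = mean_over_all_streams(train_streams)
val_avg = mean_over_all_streams(val_streams)
val_avg_nonzero = mean_over_nonzero_streams(val_streams)

print(f"train: recomputed {train_avg:.4e}, reported {reported_train_avg:.4e}")
print(f"val:   recomputed {val_avg:.4e}, reported {reported_val_avg:.4e}")
print(f"val, non-zero streams only: {val_avg_nonzero:.4e}")
```

If this reading is right, the validation `loss_avg` is not directly comparable to the training `loss_avg`, and the per-stream ERA5 values are the safer basis for comparison.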

When we use RoPE:

Training Loss:

0: 011 : 00500/00512 : 006132 : loss = 1.4425E-01 (lr=1.41E-04, s/sec=0.554)
0: 	
0: LossPhysical.ERA5.mse.avg : 1.2814E-01 	
0: LossPhysical.NPPATMS.mse.avg : 1.9135E-02 	
0: LossPhysical.SurfaceCombined.mse.avg : 2.8549E-01 	
0: LossPhysical.loss_avg : 1.4425E-01 	
0: 
0: 
0: Saved model to /capstor/store/cscs/userlab/ch17/shared_work/models/rlwkrmhj/rlwkrmhj_latest.chkpt
0: 011 : 00510/00512 : 006142 : loss = 1.4240E-01 (lr=1.41E-04, s/sec=0.068)
0: 	
0: LossPhysical.ERA5.mse.avg : 1.2781E-01 	
0: LossPhysical.NPPATMS.mse.avg : 1.5484E-02 	
0: LossPhysical.SurfaceCombined.mse.avg : 2.8392E-01 	
0: LossPhysical.loss_avg : 1.4240E-01 	

Validation Loss:

0: validation (rlwkrmhj) : 011 : 
0:                         0.055168171995319426
0: LossPhysical.ERA5.mse.avg : 1.6550E-01 	
0: LossPhysical.NPPATMS.mse.avg : 0.0000E+00 	
0: LossPhysical.SurfaceCombined.mse.avg : 0.0000E+00 	
0: LossPhysical.loss_avg : 5.5168E-02 	
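Because validation reports 0.0000E+00 for NPPATMS and SurfaceCombined in both runs, a like-for-like comparison can only use the ERA5 stream. The sketch below computes the validation-to-training ratio of the ERA5 MSE for each run, using the last logged training step (00510) as the training value. It assumes that this single logged step is representative of training loss at that point, which the logs do not confirm.

```python
# ERA5 MSE values copied from the logs above:
# last logged training step (00510) and the validation output of each run.
era5_mse = {
    "x8d3aehb": {"train": 1.3673e-01, "val": 6.7917e-01},
    "rlwkrmhj": {"train": 1.2781e-01, "val": 1.6550e-01},  # run with RoPE
}


def val_to_train_ratio(entry):
    """Ratio of validation MSE to training MSE for one run (hypothetical helper)."""
    return entry["val"] / entry["train"]


ratios = {run: val_to_train_ratio(entry) for run, entry in era5_mse.items()}

for run, ratio in ratios.items():
    print(f"{run}: ERA5 validation / training MSE = {ratio:.2f}")
```

On these numbers the ERA5 validation MSE is close to five times the training value for x8d3aehb and about 1.3 times for the RoPE run rlwkrmhj.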

What are the steps to reproduce the bug?

No response

Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.

No response

Labels: bug, model

Status: In Progress
