error during restart in surfrd_veg_all #1446

@ckoven

Describe the issue

I have hit an odd error twice in global 2-degree historical transient simulations. Both times it occurred during a restart, but when I re-ran each case, it completed without error, so the failure does not appear to be reproducible. I'm not sure whether the cause lies in ELM or FATES, but I thought I'd post it here in any case.

Relevant log output

586:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        3476  and t=
 586:            1
 586:  sum is:   -58.8782232958150     
 586:  ENDRUN:
 586:  ERROR in surfrdUtilsMod.F90 at line 75                                         
 586:                                                                                 
 586:                                                                                 
 586:                                                                                 
 586:                                                                                 
 586:                                                                                 
 586:                                        
 586:  ERROR: Unknown error submitted to shr_abort_abort.
 586: Image              PC                Routine            Line        Source             
 586: e3sm.exe           000000000149C00D  shr_abort_mod_mp_         114  shr_abort_mod.F90
 586: e3sm.exe           0000000000564B27  abortutils_mp_end          43  abortutils.F90
 586: e3sm.exe           0000000000729DAE  surfrdutilsmod_mp          75  surfrdUtilsMod.F90
 586: e3sm.exe           0000000000716C0D  surfrdmod_mp_surf        1189  surfrdMod.F90
 586: e3sm.exe           00000000007097A6  surfrdmod_mp_surf         679  surfrdMod.F90
 586: e3sm.exe           0000000000582432  elm_initializemod         316  elm_initializeMod.F90
 586: e3sm.exe           000000000055B42B  lnd_comp_mct_mp_l         291  lnd_comp_mct.F90
 586: e3sm.exe           000000000045F4C8  component_mod_mp_         257  component_mod.F90
 586: e3sm.exe           000000000044BADF  cime_comp_mod_mp_        1502  cime_comp_mod.F90
 586: e3sm.exe           000000000045C312  MAIN__                    122  cime_driver.F90
 586: e3sm.exe           000000000043347D  Unknown               Unknown  Unknown
 586: libc-2.31.so       0000147BFF1CA1FD  __libc_start_main     Unknown  Unknown
 586: e3sm.exe           00000000004333AA  Unknown               Unknown  Unknown


<different case>

 194:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        1167  and t=
 194:            1
 194:  sum is:    1.00000783423753     
 194:  ENDRUN:
 194:  ERROR in surfrdUtilsMod.F90 at line 75                                         
 194:                                                                                 
 194:                                                                                 
 194:                                                                                 
 194:                                                                                 
 194:                                                                                 
 194:                                        
 194:  ERROR: Unknown error submitted to shr_abort_abort.
 194: Image              PC                Routine            Line        Source             
 194: e3sm.exe           000000000149C00D  shr_abort_mod_mp_         114  shr_abort_mod.F90
 194: e3sm.exe           0000000000564B27  abortutils_mp_end          43  abortutils.F90
 194: e3sm.exe           0000000000729DAE  surfrdutilsmod_mp          75  surfrdUtilsMod.F90
 194: e3sm.exe           0000000000716C0D  surfrdmod_mp_surf        1189  surfrdMod.F90
 194: e3sm.exe           00000000007097A6  surfrdmod_mp_surf         679  surfrdMod.F90
 194: e3sm.exe           0000000000582432  elm_initializemod         316  elm_initializeMod.F90
 194: e3sm.exe           000000000055B42B  lnd_comp_mct_mp_l         291  lnd_comp_mct.F90
 194: e3sm.exe           000000000045F4C8  component_mod_mp_         257  component_mod.F90
 194: e3sm.exe           000000000044BADF  cime_comp_mod_mp_        1502  cime_comp_mod.F90
 194: e3sm.exe           000000000045C312  MAIN__                    122  cime_driver.F90
 194: e3sm.exe           000000000043347D  Unknown               Unknown  Unknown
 194: libc-2.31.so       000014B6A4DCA1FD  __libc_start_main     Unknown  Unknown
 194: e3sm.exe           00000000004333AA  Unknown               Unknown  Unknown
 194: MPICH ERROR [Rank 194] [job id 41167605.0] [Tue Jul 29 15:59:29 2025] [nid004325] - Abort(1001) (rank 194 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 194
 194: 
 194: aborting job:
 194: application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 194
srun: error: nid004325: task 194: Exited with exit code 255
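
For triaging further occurrences, here is a minimal sketch that pulls the rank, the `nl` index, and the reported sum out of log text in the format shown above and reports how far each sum is from 1.0. The function names, the `ASSUMED_TOL` value, and the category labels are hypothetical and chosen only for illustration; the tolerance actually applied by the check in `surfrdUtilsMod.F90` is not visible in these logs.

```python
import re

# Hypothetical threshold for separating "small drift" from larger errors.
# This is NOT the tolerance used by the model; that value is not in the log.
ASSUMED_TOL = 1.0e-3

# Patterns matching the log lines quoted in this issue.
RANK = re.compile(r"^\s*(\d+):")
HEADER = re.compile(r"sum of (\w+) not 1\.0 at nl=\s*(\d+)")
SUM_LINE = re.compile(r"sum is:\s*([-+0-9.eEdD]+)")


def parse_sum_errors(log_text):
    """Return one dict per 'sum of ... not 1.0' error found in log_text."""
    errors = []
    current = None
    for line in log_text.splitlines():
        header = HEADER.search(line)
        if header:
            rank = RANK.match(line)
            current = {
                "rank": int(rank.group(1)) if rank else None,
                "variable": header.group(1),
                "nl": int(header.group(2)),
            }
            continue
        value = SUM_LINE.search(line)
        if value and current is not None:
            # Accept Fortran-style 'D' exponents in case they appear.
            text = value.group(1).replace("D", "E").replace("d", "e")
            current["sum"] = float(text)
            current["deviation"] = abs(current["sum"] - 1.0)
            errors.append(current)
            current = None
    return errors


def classify(error, tol=ASSUMED_TOL):
    """Label an error by how its sum relates to 1.0 (labels are illustrative)."""
    if error["sum"] < 0.0:
        # Weights are fractions, so a negative total cannot come from valid data.
        return "negative_sum"
    if error["deviation"] < tol:
        return "small_drift"
    return "large_deviation"


SAMPLE = """\
586:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        3476  and t=
 586:            1
 586:  sum is:   -58.8782232958150
 194:  surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl=        1167  and t=
 194:            1
 194:  sum is:    1.00000783423753
"""

if __name__ == "__main__":
    for err in parse_sum_errors(SAMPLE):
        print(err["rank"], err["nl"], err["sum"], err["deviation"], classify(err))
```

Applied to the two cases above, this separates the first failure (a negative sum, which valid fractional weights cannot produce) from the second (a sum within about 8e-6 of 1.0). It only summarizes what the logs report and does not identify a cause.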

FATES tag

c2da27f

Host land model tag

39e91e09b5

Machine

perlmutter

Other supported machine name

No response

Additional context

No response
