Open
Description
Describe the issue
I have hit an intermittent error twice in global 2-degree historical transient simulations. Both times it occurred during a restart, but when I re-ran each case it completed without error, so the failure does not appear to be reproducible. I'm not sure whether the cause lies in ELM or in FATES, but I thought I'd post it here in any case.
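For anyone triaging this, here is a rough Python sketch of the kind of sum-to-one check that the `surfrd_veg_all` message implies, applied to the two sums reported in the log output. It is not the actual code in surfrdUtilsMod.F90: the function names, the tolerance `TOL`, and the `DRIFT_LIMIT` cutoff are all assumptions made up for illustration.

```python
import math

# Assumed values for illustration only; the real threshold used in
# surfrdUtilsMod.F90 may differ.
TOL = 1.0e-13          # hypothetical allowed deviation from 1.0
DRIFT_LIMIT = 1.0e-2   # hypothetical cutoff between "drift" and "invalid"


def weights_sum_to_one(weights, tol=TOL):
    """Return (ok, total) for the patch weights of one grid cell."""
    total = math.fsum(weights)
    return abs(total - 1.0) <= tol, total


def classify_sum(total, tol=TOL, drift_limit=DRIFT_LIMIT):
    """Roughly classify a reported weight sum.

    "ok"      : within tolerance of 1.0
    "drift"   : close to 1.0 but outside tolerance
    "invalid" : negative, non-finite, or far from 1.0, which valid
                non-negative fractional weights could not produce
    """
    if not math.isfinite(total) or total < 0.0:
        return "invalid"
    err = abs(total - 1.0)
    if err <= tol:
        return "ok"
    return "drift" if err <= drift_limit else "invalid"


if __name__ == "__main__":
    # The two sums reported in the log output of this issue.
    for reported in (-58.8782232958150, 1.00000783423753):
        print(reported, classify_sum(reported))
```

Under these assumed thresholds the two failures fall into different classes: the first sum is negative, which non-negative weights cannot produce, while the second is only slightly off from 1.0. Whether that difference is meaningful for the root cause is not something this sketch can establish.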
Relevant log output
586: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 3476 and t=
586: 1
586: sum is: -58.8782232958150
586: ENDRUN:
586: ERROR in surfrdUtilsMod.F90 at line 75
586:
586:
586:
586:
586:
586:
586: ERROR: Unknown error submitted to shr_abort_abort.
586: Image PC Routine Line Source
586: e3sm.exe 000000000149C00D shr_abort_mod_mp_ 114 shr_abort_mod.F90
586: e3sm.exe 0000000000564B27 abortutils_mp_end 43 abortutils.F90
586: e3sm.exe 0000000000729DAE surfrdutilsmod_mp 75 surfrdUtilsMod.F90
586: e3sm.exe 0000000000716C0D surfrdmod_mp_surf 1189 surfrdMod.F90
586: e3sm.exe 00000000007097A6 surfrdmod_mp_surf 679 surfrdMod.F90
586: e3sm.exe 0000000000582432 elm_initializemod 316 elm_initializeMod.F90
586: e3sm.exe 000000000055B42B lnd_comp_mct_mp_l 291 lnd_comp_mct.F90
586: e3sm.exe 000000000045F4C8 component_mod_mp_ 257 component_mod.F90
586: e3sm.exe 000000000044BADF cime_comp_mod_mp_ 1502 cime_comp_mod.F90
586: e3sm.exe 000000000045C312 MAIN__ 122 cime_driver.F90
586: e3sm.exe 000000000043347D Unknown Unknown Unknown
586: libc-2.31.so 0000147BFF1CA1FD __libc_start_main Unknown Unknown
586: e3sm.exe 00000000004333AA Unknown Unknown Unknown
<different case>
194: surfrd_veg_all ERROR: sum of wt_nat_patch not 1.0 at nl= 1167 and t=
194: 1
194: sum is: 1.00000783423753
194: ENDRUN:
194: ERROR in surfrdUtilsMod.F90 at line 75
194:
194:
194:
194:
194:
194:
194: ERROR: Unknown error submitted to shr_abort_abort.
194: Image PC Routine Line Source
194: e3sm.exe 000000000149C00D shr_abort_mod_mp_ 114 shr_abort_mod.F90
194: e3sm.exe 0000000000564B27 abortutils_mp_end 43 abortutils.F90
194: e3sm.exe 0000000000729DAE surfrdutilsmod_mp 75 surfrdUtilsMod.F90
194: e3sm.exe 0000000000716C0D surfrdmod_mp_surf 1189 surfrdMod.F90
194: e3sm.exe 00000000007097A6 surfrdmod_mp_surf 679 surfrdMod.F90
194: e3sm.exe 0000000000582432 elm_initializemod 316 elm_initializeMod.F90
194: e3sm.exe 000000000055B42B lnd_comp_mct_mp_l 291 lnd_comp_mct.F90
194: e3sm.exe 000000000045F4C8 component_mod_mp_ 257 component_mod.F90
194: e3sm.exe 000000000044BADF cime_comp_mod_mp_ 1502 cime_comp_mod.F90
194: e3sm.exe 000000000045C312 MAIN__ 122 cime_driver.F90
194: e3sm.exe 000000000043347D Unknown Unknown Unknown
194: libc-2.31.so 000014B6A4DCA1FD __libc_start_main Unknown Unknown
194: e3sm.exe 00000000004333AA Unknown Unknown Unknown
194: MPICH ERROR [Rank 194] [job id 41167605.0] [Tue Jul 29 15:59:29 2025] [nid004325] - Abort(1001) (rank 194 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 194
194:
194: aborting job:
194: application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 194
srun: error: nid004325: task 194: Exited with exit code 255
FATES tag
Host land model tag
39e91e09b5
Machine
perlmutter
Other supported machine name
No response
Additional context
No response