Failure at restart when DOUT_S_SAVE_INTERIM_RESTART_FILES=TRUE #3351

@slevis-lmwg

Brief summary of bug

I have confirmed @adrifoster's TRENDY2025 restart failure with my ctsm5.3.062 test (original 20-yr simulation described in NCAR/LMWG_dev#109). I restarted my case with
STOP_N = 2 days
REST_N = 1 day
DOUT_S_SAVE_INTERIM_RESTART_FILES = FALSE
RESUBMIT = 1

and it completed all 4 days correctly (the initial 2-day segment plus one 2-day resubmission). Then I restarted again with
STOP_N = 2 days
REST_N = 1 day
DOUT_S_SAVE_INTERIM_RESTART_FILES = TRUE
RESUBMIT = 1

This time it completed the first 2 days but failed when it resubmitted.

General bug information

CTSM version you are using: ctsm5.3.062 (first version splitting hX files into hXa/hXi)

Does this bug cause significantly incorrect results in the model's science? No

Configurations affected: DOUT_S_SAVE_INTERIM_RESTART_FILES = TRUE

Details of bug

Important output or errors that show the problem

My case ran here
/glade/derecho/scratch/slevis/ctsm53062_f09_1850/run
and archived here
/glade/derecho/scratch/slevis/archive/ctsm53062_f09_1850

So far I have noticed the following behavior:

  1. In .../archive/.../rest, running ls 0021-01-0[5-7]* gives the listing below (recall that the model restarted successfully from 01-05 and then failed to restart from 01-07):
0021-01-05-00000:
ctsm53062_f09_1850.clm2.h0a.0020-12.nc            ctsm53062_f09_1850.mosart.h0i.0020-12.nc            rpointer.atm.0021-01-02-00000  rpointer.lnd.0006-01-01-00000  rpointer.lnd.0021-01-06-00000  rpointer.rof.0021-01-04-00000
ctsm53062_f09_1850.clm2.h0i.0020-12.nc            ctsm53062_f09_1850.mosart.r.0021-01-05-00000.nc     rpointer.atm.0021-01-03-00000  rpointer.lnd.0011-01-01-00000  rpointer.lnd.0021-01-07-00000  rpointer.rof.0021-01-05-00000
ctsm53062_f09_1850.clm2.r.0021-01-05-00000.nc     ctsm53062_f09_1850.mosart.rh0a.0021-01-05-00000.nc  rpointer.atm.0021-01-04-00000  rpointer.lnd.0016-01-01-00000  rpointer.rof.0006-01-01-00000  rpointer.rof.0021-01-06-00000
ctsm53062_f09_1850.clm2.rh0a.0021-01-05-00000.nc  ctsm53062_f09_1850.mosart.rh0i.0021-01-05-00000.nc  rpointer.atm.0021-01-05-00000  rpointer.lnd.0021-01-01-00000  rpointer.rof.0011-01-01-00000  rpointer.rof.0021-01-07-00000
ctsm53062_f09_1850.clm2.rh0i.0021-01-05-00000.nc  rpointer.atm.0006-01-01-00000                       rpointer.atm.0021-01-06-00000  rpointer.lnd.0021-01-02-00000  rpointer.rof.0016-01-01-00000
ctsm53062_f09_1850.cpl.r.0021-01-05-00000.nc      rpointer.atm.0011-01-01-00000                       rpointer.atm.0021-01-07-00000  rpointer.lnd.0021-01-03-00000  rpointer.rof.0021-01-01-00000
ctsm53062_f09_1850.datm.r.0021-01-05-00000.nc     rpointer.atm.0016-01-01-00000                       rpointer.cpl.0021-01-05-00000  rpointer.lnd.0021-01-04-00000  rpointer.rof.0021-01-02-00000
ctsm53062_f09_1850.mosart.h0a.0020-12.nc          rpointer.atm.0021-01-01-00000                       rpointer.lnd.0001-01-01-01800  rpointer.lnd.0021-01-05-00000  rpointer.rof.0021-01-03-00000

0021-01-06-00000:
 ctsm53062_f09_1850.clm2.h0a.0020-12.nc             ctsm53062_f09_1850.clm2.rh0i.0021-01-06-00000.nc   ctsm53062_f09_1850.mosart.h0i.0020-12.nc             rpointer.atm                   'rpointer.lnd$NINST_STRING'
 ctsm53062_f09_1850.clm2.h0i.0020-12.nc             ctsm53062_f09_1850.cpl.r.0021-01-06-00000.nc       ctsm53062_f09_1850.mosart.r.0021-01-06-00000.nc     'rpointer.atm$NINST_STRING'      rpointer.rof
 ctsm53062_f09_1850.clm2.r.0021-01-06-00000.nc      ctsm53062_f09_1850.datm.r.0021-01-06-00000.nc      ctsm53062_f09_1850.mosart.rh0a.0021-01-06-00000.nc   rpointer.cpl.0021-01-06-00000  'rpointer.rof$NINST_STRING'
 ctsm53062_f09_1850.clm2.rh0a.0021-01-06-00000.nc   ctsm53062_f09_1850.mosart.h0a.0020-12.nc           ctsm53062_f09_1850.mosart.rh0i.0021-01-06-00000.nc   rpointer.lnd

0021-01-07-00000:
ctsm53062_f09_1850.clm2.h0a.0020-12.nc         ctsm53062_f09_1850.clm2.rh0a.0021-01-07-00000.nc  ctsm53062_f09_1850.datm.r.0021-01-07-00000.nc  ctsm53062_f09_1850.mosart.r.0021-01-07-00000.nc     rpointer.cpl.0021-01-07-00000
ctsm53062_f09_1850.clm2.h0i.0020-12.nc         ctsm53062_f09_1850.clm2.rh0i.0021-01-07-00000.nc  ctsm53062_f09_1850.mosart.h0a.0020-12.nc       ctsm53062_f09_1850.mosart.rh0a.0021-01-07-00000.nc
ctsm53062_f09_1850.clm2.r.0021-01-07-00000.nc  ctsm53062_f09_1850.cpl.r.0021-01-07-00000.nc      ctsm53062_f09_1850.mosart.h0i.0020-12.nc       ctsm53062_f09_1850.mosart.rh0i.0021-01-07-00000.nc

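01-07 contains only rpointer.cpl.0021-01-07-00000; the rpointer.atm, rpointer.lnd, and rpointer.rof files are missing.
01-06 has no date-stamped rpointer files for atm, lnd, or rof. Instead it has unstamped rpointer.atm, rpointer.lnd, and rpointer.rof files, plus copies whose names end in the literal, unexpanded string $NINST_STRING.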

  2. In .../run I see an empty rpointer.lnd and the following error in cesm.log:
    end-of-file during read, unit 99, file /glade/derecho/scratch/slevis/ctsm53062_f09_1850/run/./rpointer.lnd
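
The two checks above can be repeated with a short script. This is only a sketch: the function names are hypothetical, and it assumes that a healthy restart directory holds one rpointer.<comp>.<timestamp> file per component (atm, cpl, lnd, rof) with the timestamp equal to the directory name, which is inferred from the 0021-01-05-00000 listing rather than taken from CIME documentation.

```python
import re
from pathlib import Path

# Components whose pointer files appear in the 0021-01-05-00000 listing (assumption).
COMPONENTS = ("atm", "cpl", "lnd", "rof")
# A date-stamped pointer name such as rpointer.lnd.0021-01-07-00000.
STAMPED = re.compile(r"^rpointer\.(\w+)\.(\d{4}-\d{2}-\d{2}-\d{5})$")


def audit_restart_dir(rest_dir):
    """Return a list of problems found in one dated restart directory."""
    rest_dir = Path(rest_dir)
    stamp = rest_dir.name  # e.g. "0021-01-07-00000"
    names = sorted(p.name for p in rest_dir.glob("rpointer.*"))
    problems = []
    # Each component should have a pointer stamped with this directory's date.
    for comp in COMPONENTS:
        if f"rpointer.{comp}.{stamp}" not in names:
            problems.append(f"missing rpointer.{comp}.{stamp}")
    # Unstamped names or names with unexpanded variables are flagged.
    for name in names:
        if not STAMPED.match(name):
            problems.append(f"unexpected pointer name: {name}")
    return problems


def empty_pointers(run_dir):
    """Return the names of zero-byte rpointer files in a run directory."""
    return sorted(
        p.name for p in Path(run_dir).glob("rpointer.*") if p.stat().st_size == 0
    )
```

Calling audit_restart_dir on each dated subdirectory of the archive's rest directory, and empty_pointers on the run directory, should flag the 01-06 and 01-07 directories and the empty rpointer.lnd described above, while leaving 01-05 clean.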

@adrifoster and others please add information that you consider helpful.

    Labels: bug (something is working incorrectly); next (this should get some attention in the next week or two, normally at each Thursday SE meeting); priority: high (high priority to fix/merge soon, e.g., because it is a problem in important configurations)

    Project status: In Progress; In progress - master
