@natalie-perlin natalie-perlin commented Aug 14, 2025

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

Updates for Derecho:

  • update job_card for PBS directive and MPI job submission line
  • rocoto modulefile location
  • ecflow library from spack-stack-1.9.2 build
  • increase the compile wall-time limit on Derecho (01:00:00 instead of 00:30:00; compilation takes > 30 min)

This PR brings MPI job submission on Derecho in line with the other PBS systems, i.e.:

#PBS -l select=@[NODES]:ncpus=@[TPN]:ompthreads=@[THRD]
mpiexec -n @[TASKS] --cpu-bind core -depth @[THRD] ./fv3.exe

where [TASKS] = [TPN] * [NODES]
(assuming THRD=1 for simplicity)
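
As a rough illustration of how the two templated lines expand, here is a minimal bash sketch. The function name `render_derecho_launch` is hypothetical and not part of the UFS-WM scripts, and the values NODES=2, TPN=128, THRD=1 are only example inputs.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of rt.sh): expand the templated PBS directive
# and mpiexec line for given NODES, TPN (tasks per node) and THRD values.
render_derecho_launch() {
  local nodes=$1 tpn=$2 thrd=${3:-1}
  local tasks=$(( tpn * nodes ))   # [TASKS] = [TPN] * [NODES]
  echo "#PBS -l select=${nodes}:ncpus=${tpn}:ompthreads=${thrd}"
  echo "mpiexec -n ${tasks} --cpu-bind core -depth ${thrd} ./fv3.exe"
}

render_derecho_launch 2 128 1
```

With these inputs the sketch requests 2 nodes and launches 256 MPI tasks. It only covers the THRD=1 case assumed above.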

Rocoto modulefile path and module loading:

     module use /glade/work/epicufsrt/contrib/derecho/modulefiles
     module load rocoto/1.3.7

Ecflow module used is from spack-stack-1.9.2:

     module use /glade/work/epicufsrt/contrib/spack-stack/derecho/spack-stack-1.9.2/envs/ue-oneapi-2024.2.1/install/modulefiles/oneapi/2024.2.1
     module load stack-python/3.11.7
     module load ecflow/5.11.4

Commit Messages:

  • UFSWM -
    several commits for derecho, i.e.
    commit 0c638f2: update a script fv3_qsub.IN_derecho
    commit 93e19fb: Merge recent changes from develop branch into fix/derecho
    commit 5af4b25: updates of modulefile and compile_qsub for derecho
    commit 913f2e6: New Derecho environment with spack-stack-1.9.2
    commit 290bd0d: Merge remote-tracking branch 'origin/develop' into fix/derecho
    commit d7fdfbc: update for ecflow path on Derecho
    commit c16d2bb: update job_card for Derecho
    commit 92149f4: use new Derecho spack-stack 1.9.2 environment built with ncarenv/24.12
    commit 8076a14: Merge branch 'develop' into fix/derecho
    commit c1a3007: Derecho: updated spack-stack, job_card, RT tests configurations

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Documentation:

  • No documentation update is required for this PR

Changes

Regression Test Changes (Please commit test_changes.list):

  • PR Adds New Tests/Baselines.
  • PR Updates/Changes Baselines.
  • No Baseline Changes.

Input data Changes:

  • None.
  • New input data.
  • Updated input data.

Library Changes/Upgrades:


Testing Logs

  1. RTs using ecflow were run as follows:

./rt.sh -a NRAL0032 -c -k -e -w -l rt.conf 2>&1 | tee rt.conf_all.out

Working directories:
/glade/derecho/scratch/nperlin/UFS-WM/ufs-wm-derecho/tests
/glade/derecho/scratch/nperlin/UFS-WM/ufs-wm-derecho/tests/logs/log_derecho_bac/
/glade/derecho/scratch/nperlin/FV3_RT/rt_107863

129 tests were run before a network disconnection halted the RTs; the last ecflow statement in the output file was:
"ECFLOW Tasks Remaining: 57/196"

(The RTs will be re-run in several chunks of tests to avoid hangs due to network disconnection.)

This test is not consistent; it may pass when launched with rocoto (see item 2).

./rt.sh -a NRAL0032 -c -k -r -w -n "cpld_control_c192_p8 intel"
The output:
/glade/derecho/scratch/nperlin/FV3_RT/rt_56593/cpld_control_c192_p8_intel
err.txt
out.txt
job_card.txt
cpld_control_c192_p8_intel.log.txt

However, it fails (with an MPICH error) if the same test is rerun by hand by submitting the job card:
qsub < job.card


Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • GaeaC6
    • Derecho
    • Ursa
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@natalie-perlin
Collaborator Author

The spack-stack-1.9.2 installation on Derecho is being updated. It is currently built in user space, but will be rebuilt in a common space as soon as the spack-stack PR is approved:
JCSDA/spack-stack#1758 (a hot-fix for the previous PR1756)

Selected UFS RTs are currently being run on Derecho with the updated spack-stack-1.9.2 (user space-built).
/glade/derecho/scratch/nperlin/FV3_RT/rt_21274

@DusanJovic-NOAA
Collaborator

Now that the Derecho job card is (finally) fixed to use the $TASKS variable, like all other job cards on all other platforms, we do not need $UFS_TASKS and $PPN in rt.sh, rt_utils.sh, and run_test.sh. Please double-check and remove them.

@natalie-perlin natalie-perlin moved this from Draft to Done in PRs to Process Sep 25, 2025
@natalie-perlin natalie-perlin moved this from Done to Evaluating in PRs to Process Sep 25, 2025
@natalie-perlin natalie-perlin moved this from Evaluating to Review/Schedule in PRs to Process Sep 25, 2025
@natalie-perlin natalie-perlin marked this pull request as ready for review September 25, 2025 15:33
@natalie-perlin
Collaborator Author

Are $UFS_TASKS and $PPN only needed for Derecho?

@natalie-perlin natalie-perlin changed the title from "Derecho updates: job card issues, rocoto, ecflow paths" to "Derecho updates: spack-stack-1.9.2, job card issues, rocoto, ecflow paths" Sep 25, 2025
@DusanJovic-NOAA
Collaborator

I think so.

$ grep -r UFS_TASKS .
./rt_utils.sh:  # TASKS is now set to UFS_TASKS
./rt_utils.sh:  # TASKS is now set to UFS_TASKS
./fv3_conf/fv3_qsub.IN_derecho:mpiexec -n @[UFS_TASKS] -ppn @[PPN] --hostfile $PBS_NODEFILE ./fv3.exe
./run_test.sh:UFS_TASKS=${TASKS}
./run_test.sh:PPN=$(( UFS_TASKS / NODES ))
./run_test.sh:if (( UFS_TASKS - ( PPN * NODES ) > 0 )); then
./run_test.sh:export UFS_TASKS
$ grep -r PPN .
./fv3_conf/fv3_qsub.IN_derecho:mpiexec -n @[UFS_TASKS] -ppn @[PPN] --hostfile $PBS_NODEFILE ./fv3.exe
./rt.sh:      PPN=$(( TASKS / NODES ))
./rt.sh:      if (( TASKS - ( PPN * NODES ) > 0 )); then
./rt.sh:          PPN=$((PPN + 1))
./run_test.sh:PPN=$(( UFS_TASKS / NODES ))
./run_test.sh:if (( UFS_TASKS - ( PPN * NODES ) > 0 )); then
./run_test.sh:  PPN=$((PPN + 1))
./run_test.sh:export PPN
./run_test.sh:  PPN=${TPN}

Only ./fv3_conf/fv3_qsub.IN_derecho uses UFS_TASKS and PPN.
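
For context, the PPN lines in the grep output amount to an integer ceiling division of tasks by nodes. A minimal sketch of that logic follows; `ceil_ppn` is a hypothetical name (the scripts inline this arithmetic), and the closing `fi` is assumed because it does not appear in the grep output.

```bash
#!/usr/bin/env bash
# Sketch of the ceiling division visible in the grep output above.
# ceil_ppn is a hypothetical wrapper; the real scripts inline this logic.
ceil_ppn() {
  local tasks=$1 nodes=$2
  local ppn=$(( tasks / nodes ))          # integer division rounds down
  if (( tasks - ( ppn * nodes ) > 0 )); then
    ppn=$(( ppn + 1 ))                    # bump when tasks do not divide evenly
  fi
  echo "${ppn}"
}

ceil_ppn 256 2   # tasks divide evenly across nodes
ceil_ppn 200 3   # remainder present, so the result is rounded up
```

Once the job card takes @[TASKS] directly, nothing consumes this value, which is why the variables can be dropped.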

@natalie-perlin natalie-perlin changed the title from "Derecho updates: spack-stack-1.9.2, job card issues, rocoto, ecflow paths" to "Derecho updates: new spack-stack-1.9.2, job card issues, rocoto, ecflow paths" Sep 25, 2025
@gspetro-NOAA
Collaborator

gspetro-NOAA commented Sep 25, 2025

@natalie-perlin Thanks for your work on this! In general, only WM CMs should be updating the PR status in the PR board. You technically have access because you're an owner in ufs-community, but in the future, please just ping us on the PR itself or on Slack to update the status. It sounds like you still had a question--will the answer result in further changes for the PR? If so, it should stay in Evaluating until that gets resolved.


if [[ $MACHINE_ID = derecho ]]; then
-  TPN=96
+  TPN=128
Collaborator

Isn't 128 the actual cores/node on Derecho? Why does it need to be specified?

Collaborator Author

Yes, this could be removed, thank you

@DeniseWorthen
Collaborator

DeniseWorthen commented Sep 25, 2025

I just ran cpld_debug_gfsv17_intel on Derecho using this branch and got the following wall clock time:

*****************RESOURCE STATISTICS*******************************
The total amount of wall time                        = 1122.732178

I don't think we need to specify a WLCLK=40 for this test.
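
As a quick check of the numbers, the sketch below rounds the reported wall time up to whole minutes. It assumes WLCLK is expressed in minutes; `wall_minutes` is a hypothetical helper, not something in the test scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: round a wall time given in seconds up to whole minutes.
wall_minutes() {
  awk -v s="$1" 'BEGIN { m = int(s / 60); if (m * 60 < s) m++; print m }'
}

wall_minutes 1122.732178   # wall time reported in the resource statistics above
```

The run needs about 19 minutes, under half of a 40-minute limit.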

@gspetro-NOAA gspetro-NOAA moved this from Schedule to Review in PRs to Process Sep 28, 2025
@natalie-perlin
Collaborator Author

Notes after consulting with the CISL/Derecho helpdesk.
Remaining issues:
job_card PBS directives need to be configured differently for different tests to work.
The CISL/Derecho helpdesk team confirmed the expected/supported form of the PBS directives.

Currently, the job_card for the test cpld_control_ciceC_p8_intel needs to contain the following for the run to complete successfully:

#PBS -l select=2:ncpus=128:ompthreads=1
#PBS -l walltime=00:30:00
export MPI_TYPE_DEPTH=20
export OMP_STACKSIZE=512M
export OMP_NUM_THREADS=1
export ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
export ESMF_RUNTIME_PROFILE=ON
export ESMF_RUNTIME_PROFILE_OUTPUT="SUMMARY"
export I_MPI_EXTRA_FILESYSTEM="ON"

mpiexec -n 256 --cpu-bind core -depth 1 ./fv3.exe

However, the test cpld_control_c192_p8_intel needs to have the following for the run to be successful (note the syntax change in PBS directive "-l select"):

#PBS -l select=7:ncpus=128:mpiprocs=128:ompthreads=1
#PBS -l walltime=00:30:00
export MPI_TYPE_DEPTH=20
export OMP_STACKSIZE=512M
export OMP_NUM_THREADS=1
export ESMF_RUNTIME_COMPLIANCECHECK=OFF:depth=4
export ESMF_RUNTIME_PROFILE=ON
export ESMF_RUNTIME_PROFILE_OUTPUT="SUMMARY"
export I_MPI_EXTRA_FILESYSTEM="ON"
mpiexec -n 896 --cpu-bind core -depth 1 ./fv3.exe

CISL/Derecho helpdesk confirms that the PBS directives syntax of
#PBS -l select=[NODES]:ncpus=[TPN]:ompthreads=[THREADS]
is the expected way to have it in the submission script.

Testing showed that the cpld_control_c192_p8_intel fails if the corresponding job_card has the following syntax:
#PBS -l select=7:ncpus=128:ompthreads=1

At the same time, the test cpld_control_ciceC_p8_intel fails if the job_card submission script has the following syntax:
#PBS -l select=2:ncpus=128:mpiprocs=128:ompthreads=1
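
The two forms differ only in the optional mpiprocs field. A minimal sketch of how a job card template could switch between them is shown below; `pbs_select` and its fourth argument are hypothetical and are not part of the current fv3_qsub.IN_derecho template.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the "-l select" directive with or without mpiprocs.
pbs_select() {
  local nodes=$1 tpn=$2 thrd=$3 with_mpiprocs=${4:-no}
  local chunk="select=${nodes}:ncpus=${tpn}"
  if [[ ${with_mpiprocs} == yes ]]; then
    chunk+=":mpiprocs=${tpn}"             # add mpiprocs only when requested
  fi
  echo "#PBS -l ${chunk}:ompthreads=${thrd}"
}

pbs_select 2 128 1        # form that worked for cpld_control_ciceC_p8_intel
pbs_select 7 128 1 yes    # form that worked for cpld_control_c192_p8_intel
```

This only reproduces the two directives that worked in testing; it does not explain why each test requires a different form.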


Labels

None yet

Projects

Status: Review

Development

Successfully merging this pull request may close these issues.

inefficient use of resources on derecho

4 participants