Derecho updates: new spack-stack-1.9.2, job card issues, rocoto, ecflow paths #2863
The spack-stack-1.9.2 installation on Derecho is being updated. It is currently built in user space, but will be rebuilt in a common space as soon as the spack-stack PR is approved. Selected UFS RTs are currently being run on Derecho with the updated spack-stack-1.9.2 (built in user space).
Now that the Derecho job card is (finally) fixed to use the $TASKS variable, like all other job cards on all other platforms, we do not need $UFS_TASKS and $PPN in rt.sh, rt_utils.sh, and run_tests.sh. Please double-check and remove them.
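One way to double-check is to list every remaining reference to the two variables. This is only a sketch: the helper name `find_leftover_vars` is hypothetical, and it assumes it is run from the directory that contains the three scripts named above.

```shell
# Hypothetical helper: print each line (with its line number) that still
# references UFS_TASKS or PPN as a whole word. TPN is not matched because
# -w requires a whole-word match.
find_leftover_vars() {
  grep -n -w -E 'UFS_TASKS|PPN' "$@"
}

# Usage (file names taken from the comment above):
#   find_leftover_vars rt.sh rt_utils.sh run_tests.sh
```

grep exits with a non-zero status when nothing matches, so an empty result with exit status 1 would indicate that no references remain.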
Are $UFS_TASKS and $PPN only needed for Derecho?
I think so.
Only
@natalie-perlin Thanks for your work on this! In general, only WM CMs should be updating the PR status in the PR board. You technically have access because you're an owner in ufs-community, but in the future, please just ping us on the PR itself or on Slack to update the status. It sounds like you still had a question--will the answer result in further changes for the PR? If so, it should stay in Evaluating until that gets resolved.
if [[ $MACHINE_ID = derecho ]]; then
TPN=96
TPN=128
Isn't 128 the actual cores/node on Derecho? Why does it need to be specified?
Yes, this could be removed, thank you
I just ran
I don't think we need to specify a
Notes after consulting with the CISL/Derecho Helpdesk. Currently, the job_card for the test cpld_control_ciceC_p8_intel needs to contain the following for the run to complete successfully:
However, the test cpld_control_c192_p8_intel needs to have the following for the run to be successful (note the syntax change in PBS directive "-l select"):
CISL/Derecho helpdesk confirms that the PBS directives syntax of
Testing showed that the cpld_control_c192_p8_intel test fails if the corresponding job_card has the following syntax:
At the same time, the test cpld_control_ciceC_p8_intel fails if the job_card submission script has the following syntax:
Commit Queue Requirements:
Description:
Updates for Derecho:
This PR brings MPI job submission on Derecho closer to that of other PBS systems, i.e.:
where [TASKS]= [TPN] * [NODES]
(for simplicity, assume THRD=1)
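As a rough illustration of the relation above, the sketch below computes [TASKS] from [TPN] and [NODES] with THRD=1. The function name `total_tasks` and the sample values (128 tasks per node, 2 nodes) are assumptions chosen for illustration; they are not taken from the PR's job card.

```shell
# Hypothetical helper: total MPI tasks = tasks per node * number of nodes.
# Assumes THRD=1 (one thread per task), as in the simplification above.
total_tasks() {
  local tpn=$1 nodes=$2
  echo $(( tpn * nodes ))
}

TPN=128    # tasks per node (assumed sample value)
NODES=2    # number of nodes (assumed sample value)
TASKS=$(total_tasks "$TPN" "$NODES")
echo "TASKS=${TASKS}"    # prints TASKS=256
```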
Rocoto modulefile path and module loading:
Ecflow module used is from spack-stack-1.9.2:
Commit Messages:
several commits for derecho, i.e.
commit 0c638f2: update a script fv3_qsub.IN_derecho
commit 93e19fb: Merge recent changes from develop branch into fix/derecho
commit 5af4b25: updates of modulefile and compile_qsub for derecho
commit 913f2e6: New Derecho environment with spack-stack-1.9.2
commit 290bd0d: Merge remote-tracking branch 'origin/develop' into fix/derecho
commit d7fdfbc: update for ecflow path on Derecho
commit c16d2bb: update job_card for Derecho
commit 92149f4: use new Derecho spack-stack 1.9.2 environment built with ncarenv/24.12
commit 8076a14: Merge branch 'develop' into fix/derecho
commit c1a3007: Derecho: updated spack-stack, job_card, RT tests configurations
Priority:
Git Tracking
UFSWM:
Sub component Pull Requests:
UFSWM Blocking Dependencies:
Documentation:
Changes
Regression Test Changes (Please commit test_changes.list):
Input data Changes:
Library Changes/Upgrades:
Testing Logs
./rt.sh -a NRAL0032 -c -k -e -w -l rt.conf 2>&1 | tee rt.conf_all.out
Working directories:
/glade/derecho/scratch/nperlin/UFS-WM/ufs-wm-derecho/tests
/glade/derecho/scratch/nperlin/UFS-WM/ufs-wm-derecho/tests/logs/log_derecho_bac/
/glade/derecho/scratch/nperlin/FV3_RT/rt_107863
129 tests were run before network disconnection halted the progress of the RTs; this was the last ecflow statement in the printout file:
"ECFLOW Tasks Remaining: 57/196"
(The RTs will be re-run in several chunks of tests to avoid hangs caused by network disconnection.)
This test is not consistent, and it may pass OK when launched with rocoto (see item 2.)
./rt.sh -a NRAL0032 -c -k -r -w -n "cpld_control_c192_p8 intel"
The output:
/glade/derecho/scratch/nperlin/FV3_RT/rt_56593/cpld_control_c192_p8_intel
err.txt
out.txt
job_card.txt
cpld_control_c192_p8_intel.log.txt
However, it will fail (with an MPICH error) if the same test is rerun by hand by submitting the job card:
qsub < job.card
Testing Log: