Skip to content

Conversation

rickgrubin-noaa
Copy link

@rickgrubin-noaa rickgrubin-noaa commented Aug 1, 2025

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR updates the gaeac6 Intel modulefiles for spack-stack [email protected] / CPE PrgEnv-intel/8.6.0

Commit Message:

* UFSWM - update gaeac6 Intel modulefiles for [email protected] / CPE PrgEnv-intel/8.6.0 stack

Priority:

  • Normal

Git Tracking

UFSWM:

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • None

Documentation:

  • No documentation update is required for this PR (please explain).

No documentation change necessary as documentation does not specifically reference making changes host-specific modulefiles, rather only how to load them, e.g. 3.5.1. Loading the Required Modules


Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes.

See attached file RegressionTests_weekly_gaeac6.log generated via ./rt.sh -a epic -r -w

Note that file test_changes.list was length=0 for ./rt.sh -a epic -r -c and ./rt.sh -a epic -r -w

./rt.sh -a epic -r -c followed by ./rt.sh -a epic -r -m generated 100%successful comparisons.

RegressionTests_weekly_gaeac6.log

Input data Changes:

  • None.

Library Changes/Upgrades:


Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • GaeaC6
    • Derecho
    • Ursa
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@gspetro-NOAA
Copy link
Collaborator

@rickgrubin-noaa Is this PR ready for review, or is there further work to do?

@gspetro-NOAA gspetro-NOAA moved this to Evaluating in PRs to Process Sep 18, 2025
@gspetro-NOAA gspetro-NOAA added the No Baseline Change No Baseline Change label Sep 18, 2025
@rickgrubin-noaa
Copy link
Author

@rickgrubin-noaa Is this PR ready for review, or is there further work to do?

It's been ready since August 1 (initial filing).

Branch is synced with HEAD of develop.

@gspetro-NOAA gspetro-NOAA moved this from Evaluating to Review in PRs to Process Sep 29, 2025
Fix typo in MODULEPATH
Fix stack compiler type to load
Require libfabric/1.20.1
Fix stack name, force libfabric/1.20.1
Fixes for gaeac6 OS upgrade
Remove module reset for gaeac6
Updates for new OS
@ulmononian
Copy link
Collaborator

what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

@JessicaMeixner-NOAA
Copy link
Collaborator

what is the timeline to merge this? the upgrade from the intel classic to oneapi stack resolves issues for high-resolution tests (@JessicaMeixner-NOAA), among some other issues on c6.

I was actually able to run with the old version of spack-stack after changing an environment variable.

@gspetro-NOAA
Copy link
Collaborator

gspetro-NOAA commented Oct 6, 2025

@ulmononian I was going to test on Ursa, but then it looked like @rickgrubin-noaa was added several more commits, and the control_c48 I ran failed, likely because he was in the midst of updating. If the PR is complete, then I will get back to testing it, and we can move it to "Schedule if everything passes. We have four PRs lined up already for this week, so it would go in on Friday at the earliest unless it can be combined with another PR. Let me know your thoughts on that.

@rickgrubin-noaa
Copy link
Author

@ulmononian I was going to test on Ursa, but then it looked like @rickgrubin-noaa was added several more commits, and the control_c48 I ran failed, likely because he was in the midst of updating. If the PR is complete, then I will get back to testing it, and we can move it to "Schedule if everything passes. We have four PRs lined up already for this week, so it would go in on Friday at the earliest unless it can be combined with another PR. Let me know your thoughts on that.

@gspetro-NOAA the changes are strictly for gaea-c6 -- should be zero impact for anything on ursa.

@gspetro-NOAA
Copy link
Collaborator

@rickgrubin-noaa Sorry-that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

@rickgrubin-noaa
Copy link
Author

@rickgrubin-noaa Sorry-that's what I meant. I did try to test on Gaea C6, and the initial test I tried failed, but you were suddenly pushing a bunch of changes. Are you done now?

Yes; done and successfully tested last week.

@ulmononian
Copy link
Collaborator

@ulmononian I was going to test on Ursa, but then it looked like @rickgrubin-noaa was added several more commits, and the control_c48 I ran failed, likely because he was in the midst of updating. If the PR is complete, then I will get back to testing it, and we can move it to "Schedule if everything passes. We have four PRs lined up already for this week, so it would go in on Friday at the earliest unless it can be combined with another PR. Let me know your thoughts on that.

would be great to merge friday or shortly after. it's fine to combine this with another PR if that helps expedite the process and reduce resource usage. thank you!

@DusanJovic-NOAA
Copy link
Collaborator

The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM based compilers.

@rickgrubin-noaa
Copy link
Author

The default compiler on Gaea C6 is now intel/2025.2, which is supposed to fix the bug causing MOM6 to fail to compile with ifx. Would it be possible to recompile the spack-stack with this compiler so that we can finally start testing the model using both Fortran and C/C++ LLVM based compilers.

@DusanJovic-NOAA there are some known bugs in [email protected] that will be fixed in the next release, however creating host-specific stack configurations for [email protected] is underway.

@gspetro-NOAA
Copy link
Collaborator

gspetro-NOAA commented Oct 7, 2025

@rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?

If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@ulmononian
Copy link
Collaborator

@rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?

If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA did control_c48 fail in the baseline comparison step or elsewhere? this is really only a compiler/lib change, but baselines could be altered. we can run the full suite without -c and share the logs if that will help.

@RatkoVasic-NOAA
Copy link
Collaborator

@rickgrubin-noaa When I run the control_c48 test, I get a failure. From what I'm seeing, all your testing ran with the -c flag, which creates new baselines, so wouldn't this be a baseline changing PR? Also, what was the reason for running with -w? Just resource conservation?

If this is a baseline changing PR, we normally need you to run the full RT suite (./rt.sh -a epic -e) and push the test_changes.list file and the log for the system you ran on. If you expect the PR to change baselines for every test, then let us know that, and I'll confer with the other CMs to see if you should do the full test or if we'll just regenerate the baselines. The main issue there is just assessing whether the particular baseline changes are reasonable.

@gspetro-NOAA

  • Yes, this is baseline changing PR (different compiler - expected different results)
  • Please use test_changes.list as a whole (replace every baseline)
  • I ran ./rt.sh -c followed by ./rt.sh -m, and it passed ALL regression tests. I believe assigned CM will do the same.
  • You can disregard -w option, it is used when you don't want to compare results, -m option did comparison.

@gspetro-NOAA gspetro-NOAA added Baseline Updates Current baselines will be updated. and removed No Baseline Change No Baseline Change labels Oct 8, 2025
@gspetro-NOAA
Copy link
Collaborator

@ulmononian Yes, it failed in the comparison stage, which is expected for a compiler change, as @RatkoVasic-NOAA said. However, there was some confusion because this is listed as a non-baseline changing PR. It doesn't matter the reason the baselines change; if they change for any reason, it's a baseline changing PR.

On the CM side, Ratko's right that before merging, we would run with the -c command to regenerate baselines. Then there are a few other steps we take. However, this only occurs after the developer has run the full RT suite (./rt.sh -a <account> usually w/-e or -r options, too) on a relevant RDHPCS (usually Ursa, but here, Gaea) and pushed the log and the test_changes.list file.

  • test_changes.list allows us to have a record of what baselines were changed in the PR, but it will only have the full list of changed tests if the full RT suite is run.
  • Pushing the failed log let's us see what the results of the developer's testing were (and the commands used). For example, here, we would expect failures in the log, but we would only expect comparison failures, not failures for other reasons, and it is important to verify that.

In short, what we need is for @rickgrubin-noaa to run the full RT suite without -c and push the resulting test_changes.list file and RegressionTest_gaea.log file. Then we can schedule it for the commit queue.

@gspetro-NOAA
Copy link
Collaborator

I ran the RTs on Gaea C6, and the tests that fail are expected failures. Failures are either:

  • UNABLE TO COMPLETE COMPARISON, which is expected with compiler/baseline changes
  • UNABLE TO START TEST, which is expected for restart tests when the control failed due to a comparison error.

Note that cpld_control_gfsv17_iau_intel failed to start, but it is a control test that depends on another control (cpld_control_gfsv17_intel), so this is also expected, even though it is not called a restart test.
Given that the failures are expected, we should be able to proceed with this PR.

@grantfirl
Copy link
Collaborator

This has been combined into #2882

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Baseline Updates Current baselines will be updated.

Projects

Status: Schedule

Development

Successfully merging this pull request may close these issues.

Update gaea-c6 Intel modulefiles to support [email protected] [email protected] / CPE PrgEnv-intel/8.6.0

7 participants