EAMxx: add pytorch ML emulator test for cld_fraction #7568

Open
wants to merge 4 commits into master
Conversation

bartgol
Contributor

@bartgol bartgol commented Aug 1, 2025

Adds a unit test showcasing how to hook up a pytorch emulator to an eamxx atm process.

[BFB]


Notes

  • The cld_fraction emulator is quite rudimentary, and therefore does not perform well, but that's beside the point here. The test simply showcases how to hook up a python ML emulator to an eamxx process.
  • So far, we have only tested the python bindings with a CPU backend. Since torch can work with GPU backends, I want to experiment with enabling the test on GPUs.
  • I have not added a baseline test for the pyml emulator.
  • I am also planning to add a purely C++ test where the ML emulator is implemented via a LAPIS-generated conversion of the pth model currently in the folder. Perhaps in a follow-up PR.

Credit to @mahf708 for creating the pytorch model.
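As a rough illustration of the pattern the test exercises (all names below are hypothetical stand-ins, not the actual test code), hooking up a small pytorch emulator boils down to building the module, switching it to eval mode, and evaluating it on a column of inputs:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CldFracNet emulator in the PR;
# the real architecture lives in the test's python module.
class ToyEmulator(nn.Module):
    def __init__(self, nlevs_in, nlevs_out):
        super().__init__()
        self.fc = nn.Linear(nlevs_in, nlevs_out)

    def forward(self, x):
        # Squash the output to (0,1), as a cloud fraction should be
        return torch.sigmoid(self.fc(x))

nlevs = 72
model = ToyEmulator(nlevs, nlevs)
model.eval()  # inference mode: no dropout/batchnorm updates

# One atmosphere column of (fake) inputs -> one column of cloud fractions
with torch.no_grad():
    col_in = torch.zeros(1, nlevs)
    col_out = model(col_in)
```

In the real test the weights would additionally be loaded from the pth file before evaluating, as the snippet quoted further down in this thread does.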

bartgol added 4 commits August 1, 2025 13:05
- Add (empty) init method to the module
- Use try/catch blocks when calling methods (for debugging)
Note: the emulator is NOT good, it just showcases the capability
@bartgol bartgol self-assigned this Aug 1, 2025
@bartgol bartgol added BFB PR leaves answers BFB EAMxx Issues related to EAMxx labels Aug 1, 2025
@bartgol bartgol requested a review from mahf708 August 1, 2025 19:16
@mahf708 mahf708 requested a review from ndkeen August 1, 2025 19:22
@mahf708
Contributor

mahf708 commented Aug 1, 2025

Looks good, I will review this carefully in a second. I added @ndkeen as awareness (no need to review, but ofc welcome to do so)

Contributor

this is "data" (not code) and as such likely doesn't belong here at all (let's either throw it on the inputdata repo, or we can create a dedicated public place on github for toy models that we can just wget)

Contributor Author

I thought about the inputdata server. But a) it's a relatively small file (<100k), and b) I want to wait until the feature is "stable" before starting to put stuff on the data server. I feel that once data is on the server, it's doomed to stay there (as folks could check out older versions of master). I'd like to give the feature/test some "probation" time...

Comment on lines +54 to +55
model = CldFracNet(nlevs,nlevs)
model.load_state_dict(torch.load(model_file,map_location=torch.device('cpu')))
Contributor

in the design (which we can revisit), I was thinking of a few things:

  1. We actually don't need the model defined above fwiw; we can just save it along with the weights and instantiate it without the code above in the class
  2. I was hoping we would have options for users to run different versions, say:
    1. option 1: run c++ stuff we have by default
    2. option 2: run regular python stuff to just reproduce the c++ (since we have that)
    3. option 3: run pytorch python stuff (which is being added in this PR)
  3. I would do some guarding (try ... except type of stuff) to help give informative errors/messages to users (this maybe should be done below)
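For point 1, one way to ship a model without its python class definition is TorchScript (a sketch of the mechanism, not necessarily what the PR should adopt): torch.jit.save stores the code and the weights together, and torch.jit.load restores a callable module with no class in scope:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical toy net; any nn.Module would do.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 2)

    def forward(self, x):
        return self.fc(x)

# Script the module: serializes both the graph and the weights
scripted = torch.jit.script(Net())
path = os.path.join(tempfile.mkdtemp(), "net.pt")
scripted.save(path)

# Elsewhere, with no Net class defined, restore and evaluate on CPU
restored = torch.jit.load(path, map_location="cpu")
with torch.no_grad():
    out = restored(torch.zeros(1, 2))
```

By contrast, torch.load of a pickled full model does need the class importable, which may be the error hit in the early implementation cycles mentioned below.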

Contributor Author

I agree on all counts. In more detail:

  1. Yes, I agree. It's just that I had never used pytorch (/* noob mode on */) and was getting torch errors when loading the full model. That was in the early cycles of the impl, so maybe I have since fixed the core issue and can use the full model pth file now...
  2. We do that for this test. There are 3 tests that do precisely that. The _py and _cpp tests are already verified to be bfb with each other, while the torch one is ofc not bfb with them, but it runs on the same input file.
  3. I do have some try/catch blocks in the C++ code, but maybe there are other places that I need to guard.
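The python-side guarding from point 3 can be sketched in pure python (the loader and file name below are hypothetical, not the test's actual code): wrap the load in try/except and re-raise with a message that tells the user what to fix:

```python
def load_emulator_weights(model_file, loader):
    """Load emulator weights, turning low-level failures into
    actionable messages. `loader` stands in for e.g. torch.load."""
    try:
        return loader(model_file)
    except FileNotFoundError as e:
        raise RuntimeError(
            f"Emulator weights '{model_file}' not found; "
            "did you fetch the test data?") from e
    except Exception as e:
        raise RuntimeError(
            f"Failed to load emulator weights '{model_file}': {e}") from e

# Usage with a dummy loader that pretends the file is missing
def missing(path):
    raise FileNotFoundError(path)

try:
    load_emulator_weights("cld_frac.pth", missing)
except RuntimeError as err:
    msg = str(err)

print(msg)
```

The same idea applies on the C++ side of the bindings: catch the pybind11 error and prepend context before rethrowing.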

@rljacob
Member

rljacob commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

@bartgol
Contributor Author

bartgol commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

The python hooks for eamxx require

  • EAMXX_ENABLE_PYTHON=ON in the cmake configuration: this is OFF by default; we're just enabling it in some CI testing for now. As the feature matures, we can think about how to make this CIME-configurable (or maybe even ON by default)
  • The pybind11 package must be available on the system. Right now, we do have it installed on the ghci-snl-cpu runner, so that works. As with the previous point, as the feature matures we can think of adding some pip install mechanism to ensure its presence (or modules logic in the cime config).

Edit: of course, I forgot to install the torch module...
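Put together, a configure step enabling the hooks might look like the following sketch. Only the EAMXX_ENABLE_PYTHON flag is confirmed by this thread; the pip packages and the placeholder for the remaining options are assumptions:

```shell
# Ensure the python deps are present (pybind11 for the bindings,
# torch for this test's emulator)
pip install pybind11 torch

# Enable the (default-OFF) eamxx python hooks at configure time
cmake -DEAMXX_ENABLE_PYTHON=ON <other-eamxx-options> /path/to/source
```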
