EAMxx: add pytorch ML emulator test for cld_fraction #7568

Open
wants to merge 4 commits into master
Conversation

bartgol
Contributor

@bartgol bartgol commented Aug 1, 2025

Adds a unit test showcasing how to hook up a pytorch emulator to an eamxx atm process.

[BFB]


Notes

  • The cld_fraction emulator is quite rudimentary, and therefore does not perform well, but that's beside the point here. The test simply showcases how to hook up a python ML emulator to an eamxx process.
  • So far, we have only tested the python bindings with a CPU backend. Since torch can work with GPU backends, I want to experiment with enabling the test on GPUs.
  • I have not added a baseline test for the pyml emulator.
  • I am also planning to add a purely C++ test where the ML emulator is implemented via a LAPIS-generated conversion of the pth model currently in the folder. Perhaps in a follow-up PR.

Credit to @mahf708 for creating the pytorch model.
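As a rough illustration of the pattern the test exercises (all names below are hypothetical stand-ins, not the actual test code), hooking up a small pytorch emulator boils down to building the module, switching it to eval mode, and evaluating it on a column of inputs:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the CldFracNet emulator in the PR;
# the real architecture lives in the test's python module.
class ToyEmulator(nn.Module):
    def __init__(self, nlevs_in, nlevs_out):
        super().__init__()
        self.fc = nn.Linear(nlevs_in, nlevs_out)

    def forward(self, x):
        # Squash the output to (0,1), as a cloud fraction should be
        return torch.sigmoid(self.fc(x))

nlevs = 72
model = ToyEmulator(nlevs, nlevs)
model.eval()  # inference mode: no dropout/batchnorm updates

# One atmosphere column of (fake) inputs -> one column of cloud fractions
with torch.no_grad():
    col_in = torch.zeros(1, nlevs)
    col_out = model(col_in)
```

In the real test the weights would additionally be loaded from the pth file before evaluating, as the snippet quoted further down in this thread does.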

bartgol added 4 commits August 1, 2025 13:05
- Add (empty) init method to the module
- Use try/catch blocks when calling methods (for debugging)
Note: the emulator is NOT good, it just showcases the capability
@bartgol bartgol self-assigned this Aug 1, 2025
@bartgol bartgol added BFB PR leaves answers BFB EAMxx Issues related to EAMxx labels Aug 1, 2025
@bartgol bartgol requested a review from mahf708 August 1, 2025 19:16
@mahf708 mahf708 requested a review from ndkeen August 1, 2025 19:22
@mahf708
Contributor

mahf708 commented Aug 1, 2025

Looks good, I will review this carefully in a second. I added @ndkeen as awareness (no need to review, but ofc welcome to do so)

Contributor

this is "data" (not code) and as such likely doesn't belong here at all (let's either throw it on the inputdata repo, or we can create a dedicated public place on github for toy models that we can just wget)

Contributor Author

I thought about the inputdata server. But a) it's a relatively small file (<100k), and b) I want to wait until the feature is "stable" before starting to put stuff on the data server. I feel that once data is on the server, it's doomed to stay there (as folks could check out older versions of master). I'd like to give the feature/test some "probation" time...

Comment on lines +54 to +55
model = CldFracNet(nlevs,nlevs)
model.load_state_dict(torch.load(model_file,map_location=torch.device('cpu')))
Contributor

in the design (which we can revisit), I was thinking of a few things:

  1. We actually don't need the model defined above fwiw; we can just save it along with the weights and instantiate it without the code above in the class
  2. I was hoping we would have options for users to run different versions, say:
    1. option 1: run c++ stuff we have by default
    2. option 2: run regular python stuff to just reproduce the c++ (since we have that)
    3. option 3: run pytorch python stuff (which is being added in this PR)
  3. I would do some guarding (try ... except type of stuff) to help give informative errors/messages to users (this maybe should be done below)
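For point 1, one way to ship a model without its python class definition is TorchScript (a sketch of the mechanism, not necessarily what the PR should adopt): torch.jit.save stores the code and the weights together, and torch.jit.load restores a callable module with no class in scope:

```python
import os
import tempfile

import torch
import torch.nn as nn

# Hypothetical toy net; any nn.Module would do.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(2, 2)

    def forward(self, x):
        return self.fc(x)

# Script the module: serializes both the graph and the weights
scripted = torch.jit.script(Net())
path = os.path.join(tempfile.mkdtemp(), "net.pt")
scripted.save(path)

# Elsewhere, with no Net class defined, restore and evaluate on CPU
restored = torch.jit.load(path, map_location="cpu")
with torch.no_grad():
    out = restored(torch.zeros(1, 2))
```

By contrast, torch.load of a pickled full model does need the class importable, which may be the error hit in the early implementation cycles mentioned below.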

Contributor Author

I agree on all counts. In more detail:

  1. Yes, I agree. It's just that I had never used pytorch (/* noob mode on */) and was getting torch errors when loading the full model. That was in the early cycles of the impl, so maybe I have since fixed the core issue and can use the full model pth file now...
  2. We do that for this test. There are 3 tests that do precisely that. The _py and _cpp tests are already verified to be bfb with each other, while the torch one is ofc not bfb with them, but it runs on the same input file.
  3. I do have some try/catch blocks in the C++ code, but maybe there are other places that I need to guard.
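The python-side guarding from point 3 can be sketched in pure python (the loader and file name below are hypothetical, not the test's actual code): wrap the load in try/except and re-raise with a message that tells the user what to fix:

```python
def load_emulator_weights(model_file, loader):
    """Load emulator weights, turning low-level failures into
    actionable messages. `loader` stands in for e.g. torch.load."""
    try:
        return loader(model_file)
    except FileNotFoundError as e:
        raise RuntimeError(
            f"Emulator weights '{model_file}' not found; "
            "did you fetch the test data?") from e
    except Exception as e:
        raise RuntimeError(
            f"Failed to load emulator weights '{model_file}': {e}") from e

# Usage with a dummy loader that pretends the file is missing
def missing(path):
    raise FileNotFoundError(path)

try:
    load_emulator_weights("cld_frac.pth", missing)
except RuntimeError as err:
    msg = str(err)

print(msg)
```

The same idea applies on the C++ side of the bindings: catch the pybind11 error and prepend context before rethrowing.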

@rljacob
Member

rljacob commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

@bartgol
Contributor Author

bartgol commented Aug 1, 2025

Why don't I see any build or runtime mods pointing to a python install?

The python hooks for eamxx require

  • EAMXX_ENABLE_PYTHON=ON in the cmake configuration: this is OFF by default; we're just enabling it in some CI testing for now. As the feature matures, we can think about how to make this CIME-configurable (or maybe even ON by default)
  • The pybind11 package must be available on the system. Right now, we do have it installed on the ghci-snl-cpu runner, so that works. As with the previous point, as the feature matures we can think of adding some pip install mechanism to ensure its presence (or modules logic in the cime config).

Edit: of course, I forgot to install the torch module...
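Put together, a configure step enabling the hooks might look like the following sketch. Only the EAMXX_ENABLE_PYTHON flag is confirmed by this thread; the pip packages and the placeholder for the remaining options are assumptions:

```shell
# Ensure the python deps are present (pybind11 for the bindings,
# torch for this test's emulator)
pip install pybind11 torch

# Enable the (default-OFF) eamxx python hooks at configure time
cmake -DEAMXX_ENABLE_PYTHON=ON <other-eamxx-options> /path/to/source
```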
