
Conversation

@mahf708
Contributor

@mahf708 mahf708 commented Jan 5, 2026

This PR introduces a modular emulator component framework to E3SM, enabling AI models to coexist alongside traditional physics components. The initial implementation provides a derived EATM component that runs ACE inference using libtorch.

[BFB]

--

The work is the result of a week-long iterative process between friends; the code was mostly written by them, but the crazy design and its drawbacks are mine.

The PR is HUGE, and I don't expect human eyes to suffer through all of it. I wanted to put it all up at once in one commit (+112 files, +15,020 lines) to get some feedback from interested colleagues on any aspect of it (and there are many). Some highlights below to encourage discussion.

  • a component-less (in cpl7/mct sense) EmulatorComp abstraction lives alone in components/emulator_comps/common along with needed abstracted utilities (coupling, basic io interface, more output stuff, logger, derived diagnostics, inference abstraction)
  • inference is designed to be modular; for now, we only have STUB (for testing) and LIBTORCH (direct C++ interface), but I am very interested in implementing PYTORCH (Python interpreter bridge), ONNX (flexible/efficient runtime in C++), and LAPIS (for future Kokkos interop)
  • LIBTORCH works well but I haven't worked on enabling a distributed version (or generally, optimizing performance by things like interweaving dataloading/inference, extending rollout horizon, caching, etc.)
  • the component (for now only EATM) respects MCT, but only as much as it has to --- it quickly moves out of the required MCT calls to C++. The component has some specialized code for its impl; these impl details are expected to be lighter and most of the code will live in the component-less common area
  • a standalone testing framework: cd components/emulator_comps; ./test --help to get started
  • CIME testing, e.g. SMS.gauss180x360_IcoswISC30E3r5.2000_EATM_SLND_SICE_SOCN_SROF_SGLC_SWAV.pm-gpu_gnugpu.eatm-libtorch is a good functioning test
  • some docs, including automatically generated docs from C++ comments via doxygen/mkdoxy

--

Any and all comments will be helpful; the docs link should be up after this run through the gh ci machines. Note that I'd like to split this into smaller commits before integrating it. For now, this is strictly for discussion and testing.

@mahf708 mahf708 added BFB PR leaves answers BFB AI and emulators labels Jan 5, 2026
@mahf708 mahf708 marked this pull request as draft January 5, 2026 00:50
@github-actions

github-actions bot commented Jan 5, 2026

PR Preview Action v1.8.0

🚀 View preview at
https://E3SM-Project.github.io/E3SM/pr-preview/pr-7964/

Built to branch gh-pages at 2026-01-05 01:44 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@AaronDonahue
Contributor

Wow, this is quite the undertaking Naser. To be fair, I'm still coming up to speed on how this is implemented. I suspect the conversation that follows will be robust. My first question is just about the mechanics of how the new component communicates with all of the other components in E3SM. You note that it plays nice with MCT to the degree that it must. How do you envision data being passed through MCT to the other components in E3SM? I believe that MCT only supports 2D fluxes, but many of the AI implementations will require full 3D fields.

@mahf708
Contributor Author

mahf708 commented Jan 6, 2026

Wow, this is quite the undertaking Naser. To be fair, I'm still coming up to speed on how this is implemented. I suspect the conversation that follows will be robust. My first question is just about the mechanics of how the new component communicates with all of the other components in E3SM. You note that it plays nice with MCT to the degree that it must. How do you envision data being passed through MCT to the other components in E3SM? I believe that MCT only supports 2D fluxes, but many of the AI implementations will require full 3D fields.

In short, it does so just like any other component. The behavior here is very similar to SCREAM fwiw. The coupler sends some fields to the atmosphere component (EATM is just an atmosphere component). These are all 2D fields, most of them at the surface --- mostly fluxes and state variables. The F90 MCT call hands this data to EATM in C++, which organizes it such that, along with the rest of the flux and state variables, it can make up the 3D structures (arranged in 2D slices) that will be fed into the emulator backend.

At the initial time step, an initial condition supplies the rest of the data not coming from the coupler. At later time steps, the previous step keeps track of the fields being predicted/propagated. The inference backend is a simple functional call of the form output = infer ( input ) where we intentionally keep both output and input as pointers (the backend usually wraps the pointer of the input data to construct its appropriate tensor). One TODO item is to intelligently and transparently keep track of how the data is organized, which I haven't done yet, to ensure that any flattening/unflattening doesn't mess up the order/organization, especially when we have more backends for testing that may require different tensor-based org...
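The output = infer(input) contract with raw pointers described above can be sketched roughly like this (a minimal illustration; the names `InferenceBackend` and `StubBackend` are assumptions, not the PR's actual API):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the infer(input) -> output contract.
// Both input and output stay raw pointers; a real backend (e.g. libtorch)
// would wrap `input` in a tensor view without copying the data blob.
struct InferenceBackend {
  virtual ~InferenceBackend() = default;
  virtual void infer(const double* input, double* output, std::size_t n) = 0;
};

// STUB-style backend for testing: an identity pass-through.
struct StubBackend : InferenceBackend {
  void infer(const double* input, double* output, std::size_t n) override {
    for (std::size_t i = 0; i < n; ++i) output[i] = input[i];
  }
};
```

A caller owns both buffers and hands them to whichever backend was selected at runtime; only the backend knows how to turn the pointer into its framework's tensor type.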

@bartgol
Contributor

bartgol commented Jan 7, 2026

I'm trying to understand the big picture: I see stuff in the common folder that doesn't really have much to do with emulators. It almost seems like you are trying to develop a bunch of interfaces/structures for a C++ based coupler. Is that part of the goal? I assume the answer is "no". But if the answer is "no", then do you still think the amount of effort spent in developing a very generic interface is going to pay off? I'm not hinting the answer is "no" (or "yes"), it's truly an open question.

To phrase it differently: it feels like you are writing a lot of code that doesn't really have much to do with AI/ML, and would basically set the ground for a common C++ component interface. Nothing against that (per se), but I wonder if it's the right thing for the AI group to work on (as opposed to simply working on the EATM impl, without worrying about generality or reusability). Again, not hinting that "yes" (or "no") is the answer. Just asking.

@mahf708
Contributor Author

mahf708 commented Jan 7, 2026

To phrase it differently: it feels like you are writing a lot of code that doesn't really have much to do with AI/ML, and would basically set the ground for a common C++ component interface. Nothing against that (per se), but I wonder if it's the right thing for the AI group to work on (as opposed to simply working on the EATM impl, without worrying about generality or reusability). Again, not hinting that "yes" (or "no") is the answer. Just asking.

That's precisely the goal. In theory, within the next year, we will have at least one EATM and one EOCN that will be official components of e3sm. In fact, I expect us to have multiple of these two (atm, ocn) that can be chosen at runtime --- and potentially EICE and ELND, and so on. The goal here is to write the generic common interface now for EATM, so that adding EOCN, ELND, etc. will only take a few files (see the eatm example). So, I see it in the interest of the AI group (actually E3SM at large) to pursue this now and quickly so that we can plug in all sorts of models in the near future.

Additionally, I personally believe the common infrastructure should be generalized for all components (things like EAMxx, OMEGA, etc.) and I think it should've been done a long time ago. That's a different argument, and I won't fight for it; but for the AI group in isolation, I definitely don't want to deal with having to write specialized code for every little component. I expect a "library" of emulators to emerge in the next few years, and users will be able to choose a combo of emulator/prognostic/data/stub [atm/eatm/datm/satm] as they see suitable for their research needs. When I say a library, I mean something having at least ~10 different flavors to choose from (maybe 10 different EATMs, 5 different EOCNs, etc.)

Does this goal make sense? I put this up as RFC because I wanted to hear people's thoughts about my strategy/vision before finalizing the remaining ~10% of the work needed. I think this can be good for all of e3sm to follow (i.e., all components should share a lot of the same infrastructure and shouldn't reinvent random details with random styles all the time), but whether or not people choose to follow that, it is up for grabs. In the emulator_comps, I think that's the right approach, and I am happy to hear counter-arguments so that I can adjust strategy if needed.

@bartgol
Contributor

bartgol commented Jan 8, 2026

I haven't finished looking at the code, but a couple of thoughts before pulling the plug for the day:

  • This PR is very large. I would consider breaking it into smaller, more self-contained chunks.
  • There are lots of concepts defined in this PR, which should prob be thought through carefully to make sure you're not putting an atmosphere perspective into generic interfaces. E.g., while all components talk to the cpl and must define a horizontal 2d grid, should you mention "levels" in a generic component interface?
  • I may be biased by having taken part in some of its development, but I think that this PR could/should consider using some of EKAT's utilities (logging, yaml parsing, comm wrapper, etc).
  • While the ultimate goal is to develop emulators (as stated in the title), lots of the code is generic, and could be used by any component. There is no "training" capability, so even the "infer" sections are just akin to any general component "eval"/"advance". Given that @rljacob mentioned in another PR that we do want to eventually write the cpl in C++, we will either have to write yet another component class, or recycle these. I'm not saying the AI group should write code that will be fine for all components, but given the likelihood that it will be used at some point across the board, we may want to spend a bit more time on designing it in a way that may require less hammering later on. Of course, we can also just do whatever is easier for AI folks now, and decide later whether to use any of the emulator_components/common stuff or write something new. I personally vote for writing more sensible code now, but I do understand the need of the AI group to get something going in the short term.

@rljacob
Member

rljacob commented Jan 8, 2026

Re: the "common" dir: there is some precedent for this in how the data models were created. The directory components/data_comps/datm/ has only the code specifically needed for a data atmosphere and most of the data model infrastructure (used by datm, docn, etc.) is in places like E3SM/share/streams/shr_dmodel_mod.F90. But this was a simpler problem since there is only one data model per component type.

My concern is similar to @bartgol that the generic code in "common" has hidden assumptions about either an atmosphere or an ACE-architecture model. Can it be used to couple GraphCast, NeuralGCM, Pangu-Weather, Stormer, etc. ?

@mahf708
Contributor Author

mahf708 commented Jan 8, 2026

There are lots of concepts that are defined in this PR, which should prob be thought carefully to make sure you're not putting an atmosphere perspective into generic interfaces. E.g., while all components talk to the cpl and must define a horizontal 2d grid, should you mention "levels" in a generic component interface?

My concern is similar to @bartgol that the generic code in "common" has hidden assumptions about either an atmosphere

That can be fixed as we iterate. The common EComp should be able to handle all our components. We can design this such that dimensionality is optional. As for the concept of a vertical dimension: atm has one, ocn has one, even lnd has one (e.g., layers of soil). Not sure about ice or other models. But this can be an optional concept.
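An optional vertical dimension could look roughly like this (an illustrative sketch only; `GridShape` and its members are hypothetical names, not the PR's types):

```cpp
#include <cstddef>
#include <optional>

// Hypothetical generic grid descriptor: every component has a horizontal
// 2d grid, but the vertical dimension is optional, so components without
// a meaningful "levels" concept (e.g. possibly ice) simply leave it unset.
struct GridShape {
  std::size_t ncols;                 // horizontal columns, always present
  std::optional<std::size_t> nlevs;  // vertical levels: atm/ocn/lnd set it
  std::size_t size() const { return ncols * nlevs.value_or(1); }
};
```

With this shape, generic code can size buffers without ever asking whether a component "has levels".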

or an ACE-architecture model. Can it be used to couple GraphCast, NeuralGCM, Pangu-Weather, Stormer, etc. ?

There's only one little assumption about how to wrap the pointer into a tensor; in other words, how to transform the data blob. Thankfully, I think there may be only two main tensor layouts we need to worry about in the context of pytorch, and those are starting to be in place here. One is [B,C,W,H] for vision/image models and the other is [B,CWH] for other types of models like language. There's a chance we will want to feed both into a net though, so we need to think more about this. Also, for a library like tensorflow, the main issue is what happens to the "C" dimension, [B,C,W,H] vs [B,W,H,C], for performance reasons.
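The channels-first vs channels-last distinction above comes down to how a flat data blob is indexed. A hypothetical sketch (using the common [B,C,H,W] / [B,H,W,C] naming; these helpers are illustrative, not the PR's code):

```cpp
#include <cstddef>

// Flat index for a channels-first blob, layout [B,C,H,W] (pytorch default):
// the W dimension varies fastest, C varies slowest after batch.
std::size_t index_nchw(std::size_t b, std::size_t c, std::size_t h, std::size_t w,
                       std::size_t C, std::size_t H, std::size_t W) {
  return ((b * C + c) * H + h) * W + w;
}

// Flat index for a channels-last blob, layout [B,H,W,C] (tensorflow default):
// here C varies fastest, so the same (b,c,h,w) maps to a different offset.
std::size_t index_nhwc(std::size_t b, std::size_t c, std::size_t h, std::size_t w,
                       std::size_t C, std::size_t H, std::size_t W) {
  return ((b * H + h) * W + w) * C + c;
}
```

Keeping this mapping explicit and centralized is exactly what would let a flatten/unflatten round trip stay consistent across backends.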

Anyway, in this design, there is only going to be one underlying assumption: that this thing is pluggable into the cpl/drv as a component, hence "emulator_comps". If we want to think about NeuralGCM for example, we will want to find a way for a component to replace all components (possible) or modify some of the cpl/drv infra to allow for "merging"/"melding" components, e.g. comp atmlnd = atm+lnd where output_atm = atm (input_atm) and output_lnd = lnd (input_lnd), then set(output_atm+output_lnd) = atmlnd(set(input_atm+input_lnd)) and so on. The idea being that atm and lnd become one component that takes the union set of their respective inputs and produces the union set of their outputs.

In summary, the stuff in common is supposed to be neural-architecture-independent (CNN-type, more basic, SFNO, etc.) and physics-independent (land, atmosphere, etc.)

@rljacob
Member

rljacob commented Jan 19, 2026

A question @jayeshkrishna and I had about I/O. Since the I/O parts of "common" are C++, are you planning to do the output from the C++ side even though the guts of the emulator are in python? And will there be a need for parallel reads of input or parallel writes of output from the emulator?

@mahf708
Contributor Author

mahf708 commented Jan 19, 2026

A question @jayeshkrishna and I had about I/O. Since the I/O parts of "common" are C++, are you planning to do the output from the C++ side even though the guts of the emulator are in python? And will there be a need for parallel reads of input or parallel writes of output from the emulator?

Yes, the plans are for the I/O to happen in C++.

While the emulators are originally written in python, we are going to experiment with calling them (just for inference, to produce the data) from several backends. Initially, these will be:

  • LibTorch will be the pure C++ call via the libtorch library
  • Python will be C++ obtaining the Python interpreter and executing Python calls (to PyTorch or any other applicable framework)
  • ONNX will be the C++ onnx runtime
  • LAPIS will essentially convert a model into Kokkos semantics that can be executed via Kokkos C++

In the future, we may add similar support for JAX and Tensorflow, etc., but I suppose that will be straightforward if we have the four backends above in place (e.g., with the exception of dimension ordering, I think the Python one should work automatically, and very likely the ONNX one will also just work)
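Runtime selection among these backends could be as simple as a small factory keyed on a name (a hypothetical sketch; the enum and function names are assumptions, not the PR's code):

```cpp
#include <stdexcept>
#include <string>

// Hypothetical backend tags matching the list above (plus the testing STUB).
enum class Backend { Stub, LibTorch, Python, Onnx, Lapis };

// Map a user-facing name (e.g. from a runtime setting) to a backend tag.
// Unknown names fail loudly rather than silently falling back.
Backend backend_from_string(const std::string& s) {
  if (s == "stub")     return Backend::Stub;
  if (s == "libtorch") return Backend::LibTorch;
  if (s == "python")   return Backend::Python;
  if (s == "onnx")     return Backend::Onnx;
  if (s == "lapis")    return Backend::Lapis;
  throw std::invalid_argument("unknown inference backend: " + s);
}
```

The tag would then pick which concrete inference implementation gets constructed behind the common abstraction.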

Back to your question: the I/O will happen at a level closer to the native driver than the inference. Here's a workflow I have in mind:

```mermaid
flowchart TD
    A[cime] -->|main| B(drv)
    B -->|x2a| C{all atm stuff goes here, basically atm drv}
    C -->|a2x| B
    C --> D(io interface)
    D --> DD(fa:fa-file)
    DD --> D
    D --> C
    C --> F(inference interface)
    F --> FF(fa:fa-hexagon-nodes-bolt)
    FF --> F
    F --> C
    C --> E(diagnostics interface)
    E --> EE(fa:fa-brain)
    EE --> E
    E --> C
```

And because that atm drv part gets its parallelism, etc., right from the drv, it will need to support parallel writing if it is run with those settings.

In summary, think of the "AI emulator" part as just the small inference abstraction part that will act as a simple function (given an input tensor, it will produce an output tensor --- nothing else). The rest of an emulator_comp is just a regular component that will need to conform to all conventions in E3SM, that's my design proposal at least. The only major difference from other components (besides inference) is that we are going to consolidate and abstract everything. All domain components (atm, ocn, etc.) will have most of their underlying design shared in common and they all will conform to strict standards of how they run. They will produce similar files and they will read in similar settings. Someone who learns how to run EATM will be able to run EOCN with perfect ease. So all of these interfaces in the diagram, while they are owned by the "atm" component in the illustration, will have most of their code in common because they all need to conform to strict design standards.
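The per-step flow in the diagram (x2a in, carried state, inference, a2x out) can be caricatured in a few lines (purely illustrative; `EmulatorStep` and its logic are stand-ins, not the PR's component code):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Toy time-step loop mirroring the flowchart: combine coupler fields with
// state carried from the previous step, run (stubbed) inference, return a2x.
struct EmulatorStep {
  std::vector<double> state;  // fields predicted/propagated between steps

  explicit EmulatorStep(std::vector<double> ic) : state(std::move(ic)) {}

  std::vector<double> run(const std::vector<double>& x2a) {
    // stand-in for assembling the input tensor from x2a + carried state
    for (std::size_t i = 0; i < state.size() && i < x2a.size(); ++i)
      state[i] += x2a[i];
    // a real component would call the inference interface here:
    //   state = infer(state);
    // and the io/diagnostics interfaces before handing fields back
    return state;  // a2x: fields returned to the cpl/drv
  }
};
```

Everything except the `infer` call is ordinary component plumbing, which is the point: the AI-specific surface area stays tiny.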

@mahf708
Contributor Author

mahf708 commented Jan 19, 2026

And btw, we would love for others to "own" the stuff in common that they find interesting. For example, if you and @jayeshkrishna want to own the IO interface we are writing, that would actually be great. We will produce the initial implementation just for speed, and then we can hand it off to you once integrated. Same for other stuff.

While we will keep this under emulator_comps, people are more than welcome to move it a level higher and replace things in their components. The abstractions will apply to different components (but it will obviously require some work to do so smoothly). We will only focus on producing this inside of emulator_comps for now. At the moment, we = Naser + Jeff + Noel

@mahf708
Contributor Author

mahf708 commented Jan 19, 2026

Status update on this PR: we will keep it up for this week, and then we will start integrating rewritten pieces of it starting next week or so. People seem to find the overall design valid. We will work out the details as we integrate smaller pieces of it one at a time. The goal is to avoid integrating more than ~1000 lines at a time. TBD if we will keep the mkdoxy stuff (which adds lots of lines)
