RFC: introducing modular emulator components, starting with eatm #7964
Conversation
Wow, this is quite the undertaking, Naser. To be fair, I'm still coming up to speed on how this is implemented, and I suspect the conversation that follows will be robust. My first question is about the mechanics of how the new component communicates with all of the other components in E3SM. You note that it plays nice with MCT to the degree that it must. How do you envision data being passed through MCT to the other components in E3SM? I believe that MCT only supports 2D fluxes, but many of the AI implementations will require full 3D fields.
In short, it does so just like any other component; the behavior here is very similar to SCREAM, fwiw. The coupler sends some fields to the atmosphere component (EATM is just an atmosphere component). These are all 2D fields, and most of them are at the surface: mostly fluxes and state variables. The F90 MCT call hands this data to EATM in C++, which organizes it such that, along with the rest of the flux and state variables, it can make up the 3D structures (arranged in 2D slices) that will be fed into the emulator backend. At the initial time step, an initial condition supplies the rest of the data not coming from the coupler; at later time steps, the previous step's prediction keeps track of the fields being predicted/propagated. The inference backend is a simple functional call: given an input tensor, it produces an output tensor.
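The data flow described above can be sketched roughly like this. This is a hypothetical NumPy illustration, not the actual EATM code; the function name `emulator_step`, the field names, and the shapes are all mine:

```python
import numpy as np

def emulator_step(state):
    """Hypothetical inference call: input tensor in, output tensor out.
    Stand-in for the real backend (libtorch, etc.); here, identity."""
    return state.copy()

ncols, nlev = 8, 4

# 2D surface field handed over by the coupler via the x2a MCT call.
surface_flux = np.random.rand(ncols)

# Fields not coming from the coupler: the initial condition at step 0,
# or the previous step's prediction afterwards.
prev_prediction = np.random.rand(nlev, ncols)

# Stack the 2D slices into the 3D structure fed to the emulator backend.
model_input = np.vstack([surface_flux[None, :], prev_prediction])
model_output = emulator_step(model_input)
assert model_output.shape == (nlev + 1, ncols)
```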
I'm trying to understand the big picture: I see stuff in the common folder that doesn't really have much to do with emulators. It almost seems like you are trying to develop a bunch of interfaces/structures for a C++-based coupler. Is that part of the goal? I assume the answer is "no". But if the answer is "no", do you still think the amount of effort spent in developing a very generic interface is going to pay off? I'm not hinting the answer is "no" (or "yes"); it's truly an open question. To phrase it differently: it feels like you are writing a lot of code that doesn't really have much to do with AI/ML and would basically set the ground for a common C++ component interface. Nothing against that (per se), but I wonder if it's the right thing for the AI group to work on (as opposed to simply working on the EATM implementation, without worrying about generality or reusability). Again, not hinting that "yes" (or "no") is the answer. Just asking.
That's precisely the goal. In theory, within the next year, we will have at least one EATM and one EOCN that will be official components of E3SM. In fact, I expect us to have multiple of these two (atm, ocn) that can be chosen at runtime --- and potentially EICE and ELND, and so on. The goal here is to write the generic common interface now for EATM, so that adding EOCN, ELND, etc. will only take a few files (see the example of eatm). So I see it in the interest of the AI group (actually E3SM at large) to pursue this now and quickly, so that we can plug in all sorts of models in the near future. Additionally, I personally believe the common infrastructure should be generalized for all components (things like EAMxx, OMEGA, etc.), and I think it should've been done a long time ago. That's a different argument, and I won't fight for it; but for the AI group in isolation, I definitely don't want to deal with having to write specialized code for every little component. I expect a "library" of emulators to emerge in the next few years, and users will be able to choose a combo of emulator/prognostic/data/stub [eatm/atm/datm/satm] as they see suitable for their research needs. When I say a library, I mean something with at least ~10 different flavors to choose from (maybe 10 different EATMs, 5 different EOCNs, etc.). Does this goal make sense? I put this up as an RFC because I wanted to hear people's thoughts about my strategy/vision before finalizing the remaining ~10% of the work needed. I think this can be good for all of E3SM to follow (i.e., all components should share a lot of the same infrastructure and shouldn't reinvent random details with random styles all the time), but whether or not people choose to follow that is up for grabs.
I haven't finished looking at the code, but a couple of thoughts before pulling the plug for the day:
Re: the "common" dir: there is some precedent for this in how the data models were created. The directory components/data_comps/datm/ has only the code specifically needed for a data atmosphere, and most of the data model infrastructure (used by datm, docn, etc.) is in places like E3SM/share/streams/shr_dmodel_mod.F90. But this was a simpler problem, since there is only one data model per component type. My concern is similar to @bartgol's: the generic code in "common" may have hidden assumptions about either an atmosphere or an ACE-architecture model. Can it be used to couple GraphCast, NeuralGCM, Pangu-Weather, Stormer, etc.?
That can be fixed as we iterate. The common EComp should be able to handle all our components. We can make dimensionality optional. As for the concept of a vertical dimension: atm has one, ocn has one, even lnd has one (e.g., layers of soil). Not sure about ice or other models. But this can be an optional concept.
There's only one little assumption about how to wrap the pointer into a tensor; in other words, how to transform the data blob. Thankfully, I think there may be only two main tensor layouts we need to worry about in the context of pytorch, and those are starting to be in here. One is [B,C,W,H] for vision/image models, and the other is [B,CWH] for other types of models, like language. There's a chance we will want to feed both into a net, though, so we need to think more about this. Also, for a library like tensorflow, the main issue is what happens to the "C" dimension, [B,C,W,H] vs [B,W,H,C], for performance reasons. Anyway, in this design there is only going to be one underlying assumption: that this thing is pluggable into the cpl/drv as a component, hence "emulator_comps". If we want to think about NeuralGCM, for example, we will want to find a way for a component to replace all components (possible), or modify some of the cpl/drv infra to allow for "merging"/"melding" components. In summary, the stuff in common is supposed to be neural-architecture-independent (CNN-type, more basic, SFNO, etc.) and physics-independent (land, atmosphere, etc.).
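For concreteness, the layouts in question relate by simple reshapes and transposes. A small NumPy sketch (dimension letters as in the discussion above; nothing here is actual emulator_comps code):

```python
import numpy as np

B, C, W, H = 2, 3, 4, 5

# Channels-first "vision" layout [B, C, W, H].
x_bcwh = np.arange(B * C * W * H, dtype=np.float32).reshape(B, C, W, H)

# Flattened [B, CWH] layout for "language-like" models: one long vector
# per batch entry.
x_flat = x_bcwh.reshape(B, C * W * H)

# Channels-last [B, W, H, C], the layout question raised for tensorflow.
x_bwhc = np.transpose(x_bcwh, (0, 2, 3, 1))

assert x_flat.shape == (B, C * W * H)
assert x_bwhc.shape == (B, W, H, C)
# Moving the "C" dimension around is lossless; only memory order changes.
assert np.array_equal(np.transpose(x_bwhc, (0, 3, 1, 2)), x_bcwh)
```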
A question @jayeshkrishna and I had about I/O. Since the I/O parts of "common" are C++, are you planning to do the output from the C++ side even though the guts of the emulator are in python? And will there be a need for parallel reads of input or parallel writes of output from the emulator?
Yes, the plan is for the I/O to happen in C++. While the emulators are originally written in python, we are going to experiment with calling them (just for inference, to produce the data) from several backends. Initially, these will be:
In the future, we may add similar support for JAX and Tensorflow, etc., but I suppose that will be straightforward once we have the four backends above in place (e.g., with the exception of dimension ordering, I think the Python one should work automatically, and very likely the ONNX one will also just work). Back to your question: the I/O will happen at a level closer to the native driver than to the inference. Here's the workflow I have in mind:

```mermaid
flowchart TD
    A[cime] -->|main| B(drv)
    B -->|x2a| C{all atm stuff goes here, basically atm drv}
    C -->|a2x| B
    C --> D(io interface)
    D --> DD(fa:fa-file)
    DD --> D
    D --> C
    C --> F(inference interface)
    F --> FF(fa:fa-hexagon-nodes-bolt)
    FF --> F
    F --> C
    C --> E(diagnostics interface)
    E --> EE(fa:fa-brain)
    EE --> E
    E --> C
```
In summary, think of the "AI emulator" part as just the small inference abstraction that will act as a simple function (given an input tensor, it will produce an output tensor --- nothing else). The rest of an emulator_comp is just a regular component that will need to conform to all conventions in E3SM; that's my design proposal, at least. The only major difference from other components (besides inference) is that we are going to consolidate and abstract everything. All domain components (atm, ocn, etc.) will have most of their underlying design shared in common, and they will all conform to strict standards of how they run. They will produce similar files, and they will read in similar settings. Someone who learns how to run EATM will be able to run EOCN with perfect ease. So for all of these interfaces in the diagram, while they are owned by the "atm" component in the illustration, most of their code will be in common, because they all need to conform to strict design standards.
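The per-step sequence in the flowchart can be sketched as follows. This is a hypothetical Python illustration of the design, not the real emulator_comps API; the class and method names (`EmulatorComponent`, `assemble_state`, etc.) are mine:

```python
class Io:
    """Stub io interface: assembles state from coupler fields, extracts exports."""
    def assemble_state(self, fields):
        return dict(fields)
    def extract_exports(self, state):
        # Only surface fields go back to the coupler (a2x).
        return {k: v for k, v in state.items() if k.startswith("srf_")}

class Diagnostics:
    """Stub diagnostics interface: just counts recorded steps."""
    def __init__(self):
        self.steps = 0
    def record(self, state):
        self.steps += 1

def inference(state):
    """Stub inference interface: tensor in, tensor out (here, identity)."""
    return state

class EmulatorComponent:
    """The 'atm drv' box in the flowchart: io -> inference -> diagnostics."""
    def __init__(self, io, infer, diags):
        self.io, self.infer, self.diags = io, infer, diags
    def run_step(self, x2a):
        state = self.io.assemble_state(x2a)    # io interface
        state = self.infer(state)              # inference interface
        self.diags.record(state)               # diagnostics interface
        return self.io.extract_exports(state)  # a2x back to drv

comp = EmulatorComponent(Io(), inference, Diagnostics())
a2x = comp.run_step({"srf_temp": 288.0, "flux_lw": 350.0})
assert a2x == {"srf_temp": 288.0}
```

The point of the sketch is the separation of concerns: only the `inference` function is AI-specific, while the io and diagnostics interfaces would live in common and be shared by EATM, EOCN, etc.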
And btw, we would love for others to "own" the stuff in common that they find interesting. For example, if you and @jayeshkrishna want to own the IO interface we are writing, that would actually be great. We will produce the initial implementation just for speed, and then we can hand it off to you once it is integrated. Same for other stuff. While we will keep this under emulator_comps, people are more than welcome to move it a level higher and replace things in their components. The abstractions will apply to different components (but it will obviously require some work to do so smoothly). We will only focus on producing this inside of emulator_comps for now. At the moment, we = Naser + Jeff + Noel.
Status update on this PR: we will keep it up for this week, and then start integrating rewritten pieces of it starting next week or so. People seem to find the overall design valid; we will work out the details as we integrate smaller pieces of it one at a time. The goal is to avoid integrating more than ~1000 lines at a time. TBD whether we will keep the mkdoxy stuff (which adds lots of lines).
This PR introduces a modular emulator component framework to E3SM, enabling AI models to coexist alongside traditional physics components. The initial implementation provides a derived EATM that runs ACE inference using libtorch.
[BFB]
--
The work is the result of a week-long iterative process between friends; the code was mostly written by the friends, but the crazy design and its drawbacks are mine.
The PR is HUGE, and I don't expect human eyes to suffer through all of it. I wanted to put it all up at once in one commit (+112 files, +15,020 lines) to get some feedback from interested colleagues on any aspect of it (and there are many). Some highlights below to encourage discussion.
Run `cd component/emulator_comps; ./test --help` to get started.
Any and all comments will be helpful; the docs link should be up after this run through the GitHub CI machines. Note that I'd like to split this into smaller commits before integrating it. For now, this is strictly for discussion and testing.