ELM-FATES Restart Error #1234
What is standing out to me, @ckoven, is that there is no vegetated area in this element. If that is true, there should be no element, unless it is an "air" element, but that seems unlikely: the Kb value is not 0.5, which is the nominal value given for the air element.
Thanks @rgknox. I have a minimal run script that reproduces the bug with just one year of runtime; I paste it below. Currently I am using this tag, which combines a few different things; I can try something closer to main as well.
|
@ckoven I ran the above script on Perlmutter using the e3sm commit. Per our discussion this morning, would you point me to Rosie's parameter file that you were using so that I can test that next?
@ckoven I'm getting passing cases for both stages using the updated parameter file with the same e3sm commit and fates tag as noted above. I noticed that the Perlmutter scratch locations:
|
Just some updates on this. It is a bit more maddening of a bug than I had originally perceived. I've now been able to replicate it a few different ways, and it is possibly related to #1237. The closest-to-main configuration I've gotten is via tag 731bf6e, which adds three canopy layers and Rosie's parameter file on top of a recent-ish main. If I run the above script, then the step-one part works fine. Step two fails in one of two ways, depending on whether or not debug is set to true.
If debug is not true, then it makes it several months into the run before crashing with a longer error message, the first few lines of which are below. This is the same failure mode that goes away in the longer continuous runs if I apply the promotion-reordering fix in #1237. But the really weird part is that if I apply the promotion-reordering fix in #1237 to this script, then the model crashes in the way that I originally encountered in this bug.
So, TL;DR, this does not seem to be related to the grazing code. I will try to go back closer to main on this, first to see if the crash in debug mode happens with just two potential canopy layers but the modified parameter file, and also to see if it happens on main with default parameters.
OK, I looked more into the minimum set of things needed to trigger this. I get the same failure mode (a crash during initialization from a prior, otherwise-identical, one-year-long run's restart file when debug is set to true) if |
That NaN check is inside a local debug block, which is only active when the debug logical in the file is hard-coded to true: https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L36
Huh... but yeah, it is set to true. I suppose that routine is not doing what it is supposed to be doing. I wonder if the simple equivalency test that we commented out will catch it: https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L374
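For context, the pattern being described looks roughly like the following: a module-local, hard-coded debug parameter that gates a NaN check, independent of any setting in the case XML. This is a minimal sketch only; the module, routine, and variable names are illustrative, not the actual FATES code.

```fortran
module debug_nan_check_sketch
  ! Minimal illustration: a module-local, hard-coded debug parameter gates a NaN
  ! check on the radiation solution vector, regardless of the case XML DEBUG setting.
  use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  logical, parameter :: debug = .true.   ! local flag; must be .true. at compile time
contains
  subroutine check_solution(b_vector)
    real(r8), intent(in) :: b_vector(:)
    if (debug) then
       if (any(ieee_is_nan(b_vector))) then
          write(*,*) 'NaN detected in two-stream solution vector'
          error stop 'solver produced NaN'   ! the real code would call the host model abort routine
       end if
    end if
  end subroutine check_solution
end module debug_nan_check_sketch
```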
Thanks, I can try that.
OK, that worked to catch the error without setting debug in the XML file:
|
Another thing that I see and am not sure how to interpret: I thought that this was happening during the first timestep, but it isn't; it happens a bit afterwards, as seen in the lnd.log:
|
To add further confusion, I tried adding a NaN check for |
That scattering element has nonsense values in it anyway, so I'm not surprised it's generating nonsense results for a scattering solution. For instance, it has a beam optical depth of exactly zero, which should be impossible if there is material to scatter. Here is where the calculation is made: My best guess is that we have a situation where there are cohorts with no stem or leaf area. No leaf area is normal, i.e. deciduous trees, but no stem area should be impossible... unless this uses the new grass allometry that has no SAI or LAI? I know that line 984 above does divide by the leaf and stem area, so it should be an inf or a NaN, but it's possible the compiler is protecting the division because it's a 0 divided by a 0.
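To illustrate the 0/0 point, here is a minimal, self-contained sketch (the variable names are placeholders, not the code at the line referenced above): under default IEEE rules an unguarded 0/0 yields NaN, while a "protected" division quietly returns zero instead.

```fortran
program zero_div_sketch
  ! Illustrates why an element with lai = sai = 0 could yield either a NaN or a
  ! silent zero for an area-weighted quantity, depending on whether the division
  ! is guarded. All names and values are placeholders for illustration only.
  use, intrinsic :: ieee_arithmetic, only : ieee_is_nan
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8) :: lai, sai, weighted_sum, kb_unguarded, kb_guarded

  lai = 0.0_r8
  sai = 0.0_r8
  weighted_sum = 0.0_r8                                       ! numerator is also zero for an empty element

  kb_unguarded = weighted_sum / (lai + sai)                   ! 0/0 -> NaN (or an FPE trap in debug builds)
  kb_guarded   = weighted_sum / max(lai + sai, tiny(1.0_r8))  ! a protective guard returns 0 here

  print *, 'unguarded result is NaN? ', ieee_is_nan(kb_unguarded)
  print *, 'guarded result:          ', kb_guarded
end program zero_div_sketch
```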
Thanks @rgknox. So if I work upstream, setting traps to try to identify scattering elements where lai and sai are both zero, would that help identify what is going on?
I would put a trap here, right after the lai and sai of the element are calculated; if the total is less than nearzero, I would fail and print out the cohort involved. Like here: Are you using Xiulin's new grass allometry? That could create a cohort with no leaf or stem area, I think. If that is the case, we just need to put in a provision when we are creating the elements to make that element an "air" element.
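A minimal sketch of that kind of trap is below. The routine name, argument names, and the nearzero threshold value are illustrative stand-ins rather than the actual FATES identifiers, and the real code would call the model's abort routine rather than error stop.

```fortran
subroutine check_element_area(elem_lai, elem_sai, cohort_pft, cohort_dbh, cohort_n)
  ! Illustrative trap: abort loudly if a scattering element ends up with
  ! essentially no leaf or stem area, and report the cohort that produced it.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), intent(in) :: elem_lai, elem_sai     ! element leaf and stem area index
  integer,  intent(in) :: cohort_pft             ! pft of the offending cohort
  real(r8), intent(in) :: cohort_dbh, cohort_n   ! cohort dbh and number density
  real(r8), parameter  :: nearzero = 1.0e-30_r8  ! placeholder small-number threshold

  if (elem_lai + elem_sai < nearzero) then
     write(*,*) 'two-stream scattering element with no vegetated area'
     write(*,*) '  pft, dbh, n, lai, sai = ', cohort_pft, cohort_dbh, cohort_n, elem_lai, elem_sai
     error stop 'empty scattering element'
  end if
end subroutine check_element_area
```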
Thanks again @rgknox. OK, I added some more diagnostics to the error message that it was crashing at before, because I noticed that if VAI + LAI was near zero, then logic just above the line you point to should have already made the pft of the scattering element the air pft.
|
The vai that it is reporting on the third line is the integrated vegetation depth (top-down coordinate) at which it is being asked to report downwelling radiation in this element. So it's reporting downwelling radiation at the top of the element. The element itself does have non-zero lai and sai, so that is good, and may help clear up some confusion. I'll look at this more and see if I can figure anything out.
Maybe the sun is at a very low inclination angle, which is why the Kb term is so high. It could be that this is making life difficult for the solver? For instance, we have exponentials that contain Kb * LAI, which could create math operations of e^(±100) or greater. Maybe we just need to cap Kb... An exponential of even e^(-300) should not generate a number larger than can be held in a 16-byte real, but it could still be making life tough on the matrix inversion. https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L506
Try reducing kb_max to something like 10? But I think it would be good to also get more info on why some of these other terms are zero... For instance, let's look at everything that builds the "Ad" term: https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L1245C11-L1266C82
Can you add print statements to your current fail statement that include more of the parameters in the line above:
|
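As a rough, self-contained illustration of the dynamic-range concern: the kb_max name echoes the parameter linked above, but the extinction formula and all values here are made up for the example.

```fortran
program kb_cap_sketch
  ! Shows why a very low sun angle can blow up the beam extinction term Kb and
  ! push factors like exp(-Kb*VAI) to the edge of floating-point range, and how
  ! capping Kb keeps the exponentials well scaled. Values are illustrative only.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: kb_max = 10.0_r8        ! candidate cap, per the suggestion above
  real(r8) :: pi, cosz, kb_raw, kb, vai

  pi     = acos(-1.0_r8)
  cosz   = cos(89.9_r8 * pi / 180.0_r8)          ! sun essentially at the horizon
  kb_raw = 0.5_r8 / cosz                         ! ~ G/cos(zenith) form; G = 0.5 for illustration
  vai    = 5.0_r8                                ! illustrative element depth (LAI + SAI)
  kb     = min(kb_raw, kb_max)

  print *, 'raw Kb, capped Kb      :', kb_raw, kb
  print *, 'exp(-Kb*VAI), uncapped :', exp(-kb_raw*vai)   ! underflows toward zero
  print *, 'exp(-Kb*VAI), capped   :', exp(-kb*vai)
end program kb_cap_sketch
```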
In the spirit of this very special issue number, I'd like to remind everyone of this scene from Spaceballs: https://www.youtube.com/watch?v=a6iW-8xPw3k This is also a good reminder about password quality while we are at it.
So I dumped all the cohorts when the error is generated (in the GetRdDn() function), and found two things that I thought were quite strange...
|
@rgknox, it is very weird that there is only one, exactly full, canopy layer. I don't see how that could happen unless all understory cohorts are getting killed somehow. I don't know enough about the call sequence to understand how point 2 could have happened.
There is some discussion right now on the CLM call about the distinction between hybrid runs (equivalent to setting finidat to a prior run's restart file) and branch runs (equivalent to continuing from a restart file). Given that this crash happens only in the hybrid-type case, it might be helpful to know exactly what happens differently during a hybrid-case restart read versus a branch-case restart read.
Edit: noting that this is controlled by the |
Just as a sanity check, I ran a slightly modified version of the above script in both a |
@ckoven, does this configuration modify, add, or remove any patches or cohorts during the restart procedure, aside from reading in the patch and cohort information from the restart file? If so, we need to be careful to call update_3dpatch_radiation() after everything is settled: that call to update_3dpatch_radiation() is where the scattering element info is prepared and the solver is run for the first time. If we perform this action and then modify the patch structure in any way before exiting the initialization, it will be problematic.
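The ordering constraint being described amounts to the following sketch. Everything here is a placeholder except the update_3dpatch_radiation name, which stands in for the FATES call of that name; the real routine takes arguments and this is not the actual initialization code.

```fortran
program restart_init_order_sketch
  ! Illustrates the constraint above: anything that changes the patch/cohort
  ! structure during initialization must happen before the radiation scattering
  ! elements are built and the two-stream solver is run for the first time.
  implicit none
  call read_restart_state()        ! placeholder: restore patches/cohorts from the restart file
  call adjust_patch_structure()    ! placeholder: any additions/removals must come before the next call
  call update_3dpatch_radiation()  ! stand-in: builds scattering elements and runs the solver
contains
  subroutine read_restart_state()
  end subroutine
  subroutine adjust_patch_structure()
  end subroutine
  subroutine update_3dpatch_radiation()
  end subroutine
end program restart_init_order_sketch
```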
@rgknox I don't believe there is anything in the restart sequence that is different in this configuration than in the standard full-FATES configuration. |
It looks like in a hybrid run, the following initialization sequence, which is maybe meant for a cold start, is run: https://github.com/NGEET/fates/blob/main/main/EDInitMod.F90#L469 Is it possible that something was run twice, or inadvertently, and that is creating problems during the hybrid run?
Thanks @rgknox, does this line here evaluate as true?
I believe it would indeed evaluate as true, based on this: I'd like to run a test to verify, though...
@rgknox that section of code in actually
EDIT: I think it only excludes finidat, but not hybrid?
The finidat will not be ' ', so this block of code will not be triggered for either branch or hybrid. I'm just guessing, but my expectation was that we would not want cold-start-type initializations to happen in this situation, and that initializations would happen during the restart() call in elmfates_interfacemod.F90. On a side note, I don't like that string compare on finidat == ' '. It seems awfully vulnerable that an unspecified finidat has to be empty with exactly one space? I feel like we should do a trim and a string length count of one or less.
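A minimal sketch of the trim-and-length check being suggested (the variable here is a local stand-in for finidat, and the one-character threshold follows the comment above):

```fortran
program finidat_check_sketch
  ! Contrasts the exact-string compare with the suggested trim/length-based test
  ! for deciding whether an initial-data (finidat) file was actually specified.
  implicit none
  character(len=256) :: finidat
  logical :: unset_by_compare, unset_by_trim

  finidat = ' '                                   ! e.g., not provided by the user
  unset_by_compare = (finidat == ' ')             ! current style of check
  unset_by_trim    = (len_trim(finidat) <= 1)     ! suggested: trim, then count characters
  print *, 'unset by compare, by trim: ', unset_by_compare, unset_by_trim
end program finidat_check_sketch
```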
Just wanted to note here that @rgknox's fixes on the two branches below appear to fix the issue. https://github.com/rgknox/fates/tree/twostream-restart-bugfixes |
@ckoven detected a problem restarting the model in between spin-up phases with ELM-FATES. I believe the last phase to complete was an AD spinup, which is usually followed by another pre-industrial run without accelerated decomposition constants. This type of run typically restarts via specifying the finidat file. @ckoven, can you provide more details and perhaps share the script that was used to generate the error?
The model fails during two-stream radiation, with what appears to be issues with uninitialized values.
When combined with the write statement code: