I am still gathering info on this issue. Here's what we know so far:
Kate Friedman:
The parallel netcdf issue on Hera is a known issue (a Lustre issue, we think), so we run low resolution with serial netcdf there. We used to force all resolutions to serial netcdf on Hera, but recent updates in our develop branch changed that (see below).
We also see this issue on WCOSS2 with C384, which I believe GDIT continues to investigate. It didn't register with me at first that folks were hitting this issue on Hera; I thought we'd walled it off with resolution recommendations and such. A true cause has yet to be determined, but in my experience it happens on Lustre filesystems; we never saw it on the prior WCOSS GPFS filesystem. I ran a LOT of tests on WCOSS2 when it cropped up there during the transition. One solution was related to chunking (set it to zero and let the model define it at runtime instead of defining it beforehand). While this remedied the issue for the GFS, it slowed down downstream models that read the output with different chunking, so we were forced to back that solution out of the C384 enkf forecast jobs in operational GFSv16. For now, we run those C384 enkf forecast jobs with serial netcdf in ops and have thrown more resources at them to speed them up.
Here are the lines in our configs that define parallel vs serial netcdf by resolution (CASE) in our develop branch:
Parallel netcdf is used only at high resolution, and we only recommend high resolution on WCOSS2 (where there are decent resources for it), so users are advised not to run it elsewhere (both for resource reasons and because of the occasional failure of the type described above and by Judy).
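For anyone reading along without the configs at hand, the resolution-based choice described above could be sketched roughly as follows. This is only an illustration in Python, not the actual lines from the develop branch: the function names and the `parallel_min_res` cutoff are hypothetical, and the real cutoff should be taken from the configs themselves.

```python
import re


def case_resolution(case: str) -> int:
    """Extract the integer resolution from a CASE string such as 'C384'."""
    match = re.fullmatch(r"C(\d+)", case)
    if match is None:
        raise ValueError(f"unrecognized CASE: {case!r}")
    return int(match.group(1))


def netcdf_mode(case: str, parallel_min_res: int) -> str:
    """Pick 'parallel' or 'serial' netcdf output for a given CASE.

    parallel_min_res is a hypothetical cutoff (not taken from the real
    configs): resolutions at or above it use parallel netcdf, everything
    below it falls back to serial netcdf.
    """
    if case_resolution(case) >= parallel_min_res:
        return "parallel"
    return "serial"


if __name__ == "__main__":
    # The cutoff value here is purely illustrative.
    for case in ("C96", "C384", "C768"):
        print(case, netcdf_mode(case, parallel_min_res=768))
```

Keeping the cutoff as an explicit argument makes it easy to force serial netcdf everywhere on a given machine (pass a value larger than any CASE you run), which is effectively what was done on Hera before.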
I am not sure if this information is useful or adequate, but on at least one of our systems the problems described above seem to have started after we upgraded our Lustre client from:
lustre-client-2.12.6_ddn66-1.el7.x86_64
to
lustre-client-2.12.8_ddn18-1.el7.x86_64
The user has confirmed that it was working fine prior to the upgrade; she repeated the same case after the upgrade and it started failing with the following error: