Failure with latest version with parallel, but not sequential writes #389

Open
edwardhartnett opened this issue Jan 17, 2023 · 2 comments

@edwardhartnett
Contributor

I am still gathering info on this issue. Here's what we know so far:

Kate Friedman:

So the parallel netCDF issue on Hera is a known issue (a Lustre issue, we think), and as such we run low resolutions with serial netCDF there. We used to force all resolutions to serial netCDF on Hera, but recent updates in our develop branch changed that (see below).

We also see this issue on WCOSS2 with C384, which I believe GDIT continues to investigate. It didn't register with me at first that folks were hitting this issue on Hera; I thought we'd walled it off with resolution recommendations and such. A true cause has yet to be determined, but in my experience it happens on Lustre filesystems; we never saw this on the prior WCOSS GPFS filesystem. I ran a LOT of tests on WCOSS2 when it cropped up there during the transition. One solution was related to chunking (set it to zero and let the model define it at runtime instead of defining it beforehand). While this remedied the issue for the GFS, it slowed down downstream models that read the output with different chunking, so we were forced to back that solution out of the C384 EnKF forecast jobs in operational GFSv16. For now, we run those C384 EnKF forecast jobs with serial netCDF in ops and have thrown more resources at them to speed them up.
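
To illustrate why a chunk layout that suits the writer can slow down a downstream reader, here is a minimal sketch that counts how many chunks a read request touches. The variable dimensions, the read pattern, and both chunk shapes are hypothetical values chosen for illustration; they are not taken from the GFS configuration. Chunked HDF5 storage reads whole chunks, so a reader whose access pattern cuts across chunks pulls in far more data than it asked for.

```python
from math import prod


def chunks_touched(start, count, chunk_shape):
    """Count the chunks intersected by the hyperslab [start, start + count)."""
    per_dim = []
    for s, c, k in zip(start, count, chunk_shape):
        first = s // k              # index of the first chunk hit in this dimension
        last = (s + c - 1) // k     # index of the last chunk hit in this dimension
        per_dim.append(last - first + 1)
    return prod(per_dim)


# ASSUMPTION: hypothetical 4-D variable laid out as (time, lev, lat, lon).
# The reader wants one full horizontal field at a single level.
read_start = (0, 10, 0, 0)
read_count = (1, 1, 1536, 3072)

# ASSUMPTION: two hypothetical chunk layouts for the same variable.
layouts = {
    "one chunk per level": (1, 1, 1536, 3072),
    "full columns, small tiles": (1, 127, 96, 192),
}

for name, chunk_shape in layouts.items():
    n_chunks = chunks_touched(read_start, read_count, chunk_shape)
    elements_read = n_chunks * prod(chunk_shape)   # whole chunks are read
    overhead = elements_read / prod(read_count)    # data read / data requested
    print(f"{name}: {n_chunks} chunks, {overhead:.0f}x the requested data")
```

With these made-up numbers, the second layout forces the reader to fetch every level of every tile to extract a single level, which is the kind of mismatch that can make a downstream job much slower even though the writer is happy.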

Here are the lines in our configs that define parallel vs serial netcdf by resolution (CASE) in our develop branch:

https://github.com/NOAA-EMC/global-workflow/blob/develop/parm/config/config.fv3#L170:L180

Parallel netCDF is used only at high resolution, but we only recommend high resolution on WCOSS2 (where there are decent resources for it), so users are advised not to run it elsewhere (because of resources and the occasional failure of the type described above and by Judy).
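
The per-resolution choice described above could be sketched roughly as follows. This is not the content of config.fv3: the function name, the C768 threshold, and the exact option strings are assumptions for illustration (the strings follow the values quoted elsewhere in this thread, and the real setting may be spelled differently).

```python
import re

# ASSUMPTION: hypothetical cutoff. The thread only says that parallel netCDF
# is used at "high resolution" and that C384 jobs run with serial netCDF.
PARALLEL_MIN_RES = 768


def output_file_for_case(case: str, parallel_min_res: int = PARALLEL_MIN_RES) -> str:
    """Pick a serial or parallel output option from a CASE string such as 'C384'."""
    match = re.fullmatch(r"C(\d+)", case)
    if match is None:
        raise ValueError(f"unrecognized CASE: {case!r}")
    resolution = int(match.group(1))
    # Low resolutions fall back to serial writes, which avoid the failure.
    return "netcdf-parallel" if resolution >= parallel_min_res else "netcdf"


for case in ("C96", "C384", "C768"):
    print(case, "->", output_file_for_case(case))
```

Keeping the cutoff in one named constant makes it easy to force serial writes everywhere on an affected machine by raising the threshold.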

@rreddy2001

I am not sure if this information is useful or adequate, but on at least one of our systems the problems described above seem to have started after we upgraded our Lustre client from:

lustre-client-2.12.6_ddn66-1.el7.x86_64

to

lustre-client-2.12.8_ddn18-1.el7.x86_64

The user has confirmed that it was working fine prior to the upgrade. When she repeated the same case after the upgrade, it failed with the following error:

921: file: module_write_netcdf.F90 line: 643 NetCDF: HDF error

She then experimented by changing the output file setting from "netcdf-parallel" to "netcdf", and it started working fine again.

Thank you!
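
For anyone collecting more occurrences of this failure from job logs, a small parser for the error line quoted above might help with triage. This is only a sketch: the line format is inferred from that single example, the leading number is assumed to be an MPI task label, and the names `WriteError` and `parse_write_error` are made up for illustration.

```python
import re
from typing import NamedTuple, Optional


class WriteError(NamedTuple):
    task: int          # ASSUMPTION: the leading number is the MPI task label
    source_file: str   # Fortran source file that reported the error
    source_line: int   # line number within that source file
    message: str       # error text, e.g. "NetCDF: HDF error"


# Pattern inferred from one example line; other log formats will not match.
_PATTERN = re.compile(
    r"^\s*(?P<task>\d+):\s*file:\s*(?P<file>\S+)\s+"
    r"line:\s*(?P<line>\d+)\s+(?P<msg>.+?)\s*$"
)


def parse_write_error(line: str) -> Optional[WriteError]:
    """Return the parsed fields, or None if the line has a different shape."""
    m = _PATTERN.match(line)
    if m is None:
        return None
    return WriteError(int(m["task"]), m["file"], int(m["line"]), m["msg"])


example = "921: file: module_write_netcdf.F90 line: 643 NetCDF: HDF error"
print(parse_write_error(example))
```

Grouping the parsed records by source line and message across several failed runs would show whether the failure always comes from the same write call.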

@edwardhartnett
Contributor Author

OK, so this looks a lot like a problem with Lustre and not with netCDF.

Can the upgrade of Lustre be rolled back?

Were the HDF5 tests run with the new version of Lustre? If HDF5 doesn't work, then netCDF-4 is not going to work.
