Description
What's the issue?
📚 Please describe the problem you’ve noticed in the documentation.
The documentation needs more information on running high-resolution assimilations with large model states, and that information needs to be more visible in the docs.
The buffer_state_io option is documented in the "IO - reading and writing of the model state" section of the "Data management in DART" page, but this page is deep in the documentation, and I feel many users never end up reading it. https://docs.dart.ucar.edu/en/latest/guide/data-management-issues.html#data-management-in-dart
This namelist option greatly improves the performance of filter with large states, especially when the state vector does not fit into memory on a single node.
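As a rough sketch of what the new section could show, here is an input.nml fragment that turns the option on. I am assuming the option lives in &state_vector_io_nml and that .true. makes filter read and write one variable at a time instead of the whole state at once; both should be checked against the linked page before this goes into the docs.

```fortran
! Sketch only: the group name and the comment are my assumptions,
! verify them against the "Data management in DART" page.
&state_vector_io_nml
   buffer_state_io = .true.   ! read/write variable by variable, not the whole state at once
/
```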
Additionally, the documentation never mentions that running on scratch dramatically improves IO speed on Derecho. For reference, in a ROMS_Rutgers run with ~650 million elements in the state, running on scratch cut the state space output from about 30 minutes to about 2 minutes, roughly a 15x speedup.
12 nodes and 300 mpiprocs on work (30 mins):
Before state space output TIME: 2025/10/30 09:36:00
After state space output TIME: 2025/10/30 10:06:10
vs 12 nodes and 300 mpiprocs on scratch (2 mins):
Before state space output TIME: 2025/11/06 23:24:23
After state space output TIME: 2025/11/06 23:26:20
This is because scratch uses a Lustre file system, which is a different file system from both work and home (https://ncar-hpc-docs-arc-iframe.readthedocs.io/storage-systems/glade/lustre/).
Finally, the options in &ensemble_manager_nml (namely layout and tasks_per_node) could also be more visible. They are detailed in both the "Data management in DART" and "MODULE ensemble_manager_mod" doc pages, but I think they should also be included in this new section of the documentation.
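For illustration, a fragment like the one below could accompany that text. The values are only an example: I am assuming layout = 2 selects round-robin placement of tasks across nodes, as I understand the ensemble_manager_mod page, and tasks_per_node = 25 simply matches the 300 mpiprocs on 12 nodes used in the run above. It is not a recommendation for every setup.

```fortran
! Illustrative values only; tasks_per_node should match the batch job request.
&ensemble_manager_nml
   layout         = 2    ! assumed: round-robin placement across nodes
   tasks_per_node = 25   ! 300 mpiprocs / 12 nodes in the run above
/
```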
What needs to be fixed?
Share what’s incorrect, unclear, missing, or outdated.
Information on these performance-enhancing options is either missing or very difficult to find in the documentation.
Suggestions for improvement
If you have any ideas to fix or generally improve this documentation, please share them.
This information could be added as a small section to the quickstart guide, which would both make the information more visible and also promote our capabilities to run with large states.