# Make wording more clear on netcdf page #96

Open · wants to merge 1 commit into `main`
## Conversation

@paigem (Contributor) commented May 8, 2023

Closes #92

@paigem requested review from paolap and hot007, May 8, 2023 23:40

As an example, if a file has dimensions `(time=744, lat=180, lon=360)`, the default approach would result in chunks of `(1, 180, 360)`, so that each disk read extracts the full spatial grid at a single time step. If this file is expected to be used for timeseries analysis (i.e. computations for a single location in space across all timesteps), we would need to read in the entire dataset to access that single location across all time. For this timeseries analysis, a better chunking strategy would be `(744, 1, 1)`, so that each disk read extracts all time steps at a single point location. See the [rechunking section](https://acdguide.github.io/BigData/computations/computations.html#rechunking) for some example tools to help with rechunking. See {ref}`cdo` and {ref}`nco` for overviews of what those tools offer, and also note that NCO has a tool for [timeseries reshaping](http://nco.sourceforge.net/nco.html#Timeseries-Reshaping-mode_002c-aka-Splitting).
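As a minimal sketch of such a rechunk (the file and variable names are hypothetical, and this simple xarray round-trip assumes the data fits in memory; the rechunking section linked above lists tools better suited to larger files):

```python
import xarray as xr

# Hypothetical hourly file stored with chunks of (1, 180, 360):
# fast for drawing maps, slow for extracting timeseries.
ds = xr.open_dataset("precip_hourly.nc")

# Rewrite with a timeseries-friendly on-disk chunking of (744, 1, 1).
ds.to_netcdf(
    "precip_hourly_rechunked.nc",
    encoding={"precip": {"chunksizes": (744, 1, 1)}},
)
```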

For data where mixed-mode analysis is required, it is best to find a chunking scheme that balances these two approaches and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`. General advice is to aim for chunks of 100-500 MB, to minimise the number of file reads while staying well within typical available memory (say 8 GB).
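For instance, a quick back-of-the-envelope check of a candidate chunk shape (here assuming 4-byte floats) could be:

```python
import numpy as np

# In-memory size of a (100, 180, 360) chunk of float32 values.
chunk_shape = (100, 180, 360)
bytes_per_value = np.dtype("float32").itemsize  # 4 bytes
chunk_mb = np.prod(chunk_shape) * bytes_per_value / 1e6
print(f"{chunk_mb:.1f} MB per chunk")  # ~25.9 MB
```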
@paigem (PR author) commented:

On disk, we should use chunk sizes of 4MB? @paolap thinks this is too small, so maybe we can say ~20-100MB.

@dougiesquire notes that this is dependent on the system used.

- We should add: experimentation is likely required to figure out the optimal chunk size.
- Chunking also affects compression capacity.
- If writing netCDF through xarray (`to_netcdf()`), you can specify chunks, but xarray will also apply a default if the user does not specify one (see the sketch at the end of this comment).
- Link to https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html

Final decision:

- Write a general sentence about the importance of chunking and link to the blog
- Add a sentence about the difference between chunking on disk vs in dask
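A minimal sketch of checking this after the fact (file and variable names are hypothetical): reopen the written file and inspect the variable's encoding to see what chunking the backend actually used.

```python
import xarray as xr

# to_netcdf() falls back to backend defaults when no chunking is specified,
# so it is worth checking what actually ended up on disk.
ds = xr.open_dataset("output.nc")
print(ds["precip"].encoding.get("chunksizes"))  # e.g. (1, 180, 360); None if contiguous
```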

@hot007 (Contributor) left a comment:

Looks good, but others please review my suggestions before accepting/merging!

For data where mixed-mode analysis is required, it is best to find a chunking scheme that balances these two approaches and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`, which would result in chunks of approximately `25MB` (for 4-byte floats). This is reasonably computationally efficient, though the chunks could be bigger. General advice is to aim for chunks of 100-500 MB, to minimise the number of file reads while staying well within typical available memory (say 8 GB).
Large data arrays are composed of smaller units which are called *chunks*. This is why some software, like xarray, can load data lazily, i.e. load into memory only the data chunks it needs to perform a specific operation (see some examples in the [Analysis section](https://acdguide.github.io/BigData/computations/computations-intro.html)).
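For example (the file and variable names here are hypothetical), opening a dataset with `chunks=` gives dask-backed arrays, and data is only read from disk when a computation is actually triggered:

```python
import xarray as xr

# Lazy open: only metadata is read at this point.
ds = xr.open_dataset("precip_hourly.nc", chunks={"time": 100})

# Only the chunks needed for this point timeseries are read when .compute() runs.
point_mean = ds["precip"].sel(lat=-35.0, lon=149.0, method="nearest").mean().compute()
```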

All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries).
Suggested change
All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries).
All data stored in netCDF files have been written to storage in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netCDF file. Ideally, chunk sizes should align with (or be a multiple of) the "block" sizes (write quanta) of the underlying storage infrastructure, e.g. a multiple of 4 MB. Note that the size of the chunks chosen can affect how compressible the resulting file is.
The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep requires loading only one or a few chunks at a time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. plotting a single timestep of the data), this chunking scheme is very slow to read (and sometimes prohibitively so) in other cases (e.g. analysing a timeseries). xarray's `to_netcdf()` will apply a default chunking unless a chunking scheme is specified, which may not be appropriate for the data. Unidata offers some [advice on chunking and netCDF performance](https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html).
The above regards the data chunks written to storage. When working with `dask`, the user can also specify a chunk size. This does not change how the data is stored on disk, only how it is held in memory. Therefore dask-specified chunks should be multiples of the file chunks, otherwise read performance can be severely compromised.
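A minimal sketch of that alignment advice, assuming a hypothetical file whose on-disk chunks are `(1, 180, 360)`:

```python
import xarray as xr

# Good: each dask chunk spans 100 whole on-disk chunks along time.
ds = xr.open_dataset("precip_hourly.nc", chunks={"time": 100})

# Poor: splitting lat/lon cuts across every on-disk chunk, so each file chunk
# ends up being read several times.
# ds = xr.open_dataset("precip_hourly.nc", chunks={"lat": 90, "lon": 180})
```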

I've tried to address the items arising from our discussion in the meeting but my changes would benefit from others' review!


## NetCDF metadata

When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware.
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of.
Reject change: the original wording avoided a dangling preposition (which may or may not really be a rule in English!)

Suggested change
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of.
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware.
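As an aside on the metadata section itself, a minimal sketch of inspecting and editing attributes in xarray (file, variable, and attribute values are hypothetical):

```python
import xarray as xr

ds = xr.open_dataset("precip_hourly.nc")

# Global (file-level) and per-variable attributes are exposed as plain dicts.
print(ds.attrs)            # e.g. title, history, Conventions
print(ds["precip"].attrs)  # e.g. units, long_name, standard_name

# Attributes can be edited before writing the data back out.
ds.attrs["history"] = "attributes tidied"
ds.to_netcdf("precip_hourly_updated.nc")
```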

Successfully merging this pull request may close these issues: Tidy up NetCDF page.