Make wording more clear on netcdf page #96
base: main
Conversation
As an example, if a file has dimensions `(time=744, lat=180, lon=360)`, the default approach would result in chunks of e.g. `(1, 180, 360)`, so that each disk read extracts an area at a single time step. If this file is expected to be used for timeseries analysis (i.e. to do computations for a single location in space across all timesteps), we would need to read in the entire dataset to access that single location across all time. For this timeseries analysis, a better chunking strategy would be `(744, 1, 1)`, so that each disk read extracts all time steps at a single point location. See the [rechunking section](https://acdguide.github.io/BigData/computations/computations.html#rechunking) for some example tools to help with rechunking. See {ref}`cdo` and {ref}`nco` for overviews of what those tools offer, and also note that NCO has a tool for [timeseries reshaping](http://nco.sourceforge.net/nco.html#Timeseries-Reshaping-mode_002c-aka-Splitting).
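To make the rechunking step concrete, here is a minimal xarray sketch that rewrites such a file with timeseries-friendly on-disk chunks. The file name `input.nc` and variable name `tas` are hypothetical, and the data is assumed to fit in memory:

```python
import xarray as xr

# Hypothetical names: "input.nc" holds a variable "tas" with dimensions
# (time=744, lat=180, lon=360) and is small enough to fit in memory.
ds = xr.open_dataset("input.nc")

# Write a copy whose on-disk chunks are (744, 1, 1), so that a single
# disk read returns the full timeseries at one grid point.
ds.to_netcdf(
    "rechunked.nc",
    encoding={"tas": {"chunksizes": (744, 1, 1)}},
)
```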
For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)`. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB).
On disk, should we use chunk sizes of 4MB? @paolap thinks this is too small, so maybe we can say ~20-100MB.
@dougiesquire notes that this is dependent on the system used.
- We should add: experimentation is likely required to figure out the optimal chunk size.
- Chunking also affects compression capacity.
- If writing netCDF through xarray (`to_netcdf()`), you can specify chunks, but xarray will also use a default if the user does not specify one (a way to check what you actually got is sketched after this comment).
- Link to https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html
Final decision:
- Write a general sentence about importance of chunking and link to blog
- Add sentence about difference between chunking on disk vs in dask
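Following on from the point about xarray defaults above, a quick way to check what on-disk chunking a file actually ended up with, using the netCDF4-python library (the file name is hypothetical):

```python
import netCDF4

# Print each variable's on-disk chunking: netCDF4 reports either
# "contiguous" or a list of chunk sizes, one per dimension.
with netCDF4.Dataset("input.nc") as nc:
    for name, var in nc.variables.items():
        print(name, var.chunking())
```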
Looks good, but others please review my suggestions before accepting/merging!
For data where mixed mode analysis is required, it is best to find a chunking scheme that balances these two approaches, and results in chunk sizes that are broadly commensurate with typical on-board memory. In other words, we might pick a chunking approach like `(100, 180, 360)` which would result in chunks that are approximately `25MB`. This is reasonably computationally efficient, though could be bigger. General advice is to aim for chunks between 100-500MB, to minimise file reads while balancing with typical available memory sizes (say 8GB).
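The approximate `25MB` figure can be verified with back-of-envelope arithmetic, assuming 4-byte (float32) values:

```python
# A chunk of shape (100, 180, 360) holding 4-byte values:
nbytes = 100 * 180 * 360 * 4
print(f"{nbytes / 1e6:.1f} MB")  # 25.9 MB, i.e. the ~25MB quoted above
```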
Large data arrays are composed of smaller units called *chunks*. This is what allows software such as xarray to load data lazily, i.e. to load into memory only the chunks needed to perform a specific operation (see some examples in the [Analysis section](https://acdguide.github.io/BigData/computations/computations-intro.html)).
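As a short illustration of lazy loading with xarray and dask (file and variable names are hypothetical):

```python
import xarray as xr

# chunks= backs the variables with dask arrays; only metadata is read here.
ds = xr.open_dataset("input.nc", chunks={"time": 100})

# Building the computation is still lazy: nothing is read yet.
time_mean = ds["tas"].mean("time")

# Only now are the required chunks actually loaded and reduced.
result = time_mean.compute()
```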
All data stored in netcdf files have been written in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply to write a netcdf file. The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep will require loading only 1 or few chunks at one time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. to plot a single timestep of the data), this chunking scheme is very slow (and sometimes prohibitively so) in other cases (e.g. to analyse a timeseries).
All data stored in netCDF files have been written to storage in chunks, following some chunking strategy. [NCO](https://acdguide.github.io/BigData/software/software-other.html#nco) has a [list of different chunking policies](https://nco.sourceforge.net/nco.html#Chunking) that you can apply when writing a netCDF file. Ideally, chunk sizes should align with, or be a multiple of, the "block" sizes (write quanta) of the underlying storage infrastructure, e.g. a multiple of 4MB. Note that the chunk sizes chosen can also affect how compressible the resulting file is.
The most common and default approach is to prioritise accessing the data as a grid, so that retrieving all grid points at one timestep requires loading only one or a few chunks at a time. This chunking strategy means that each timestep is on a different chunk. While this is ideal for some types of computations (e.g. plotting a single timestep of the data), it is very slow to read (and sometimes prohibitively so) in other cases (e.g. analysing a timeseries). xarray's `to_netcdf()` will apply a default chunking unless a chunking scheme is specified, which may not be appropriate for the data. Unidata offers some [advice on chunking and netCDF performance](https://docs.unidata.ucar.edu/nug/current/netcdf_perf_chunking.html).
The above concerns the data chunks written to storage. When working with `dask`, the user can also specify a chunk size. This does not change how the data is stored on disk, only how it is held in memory. Therefore dask-specified chunks should be multiples of the file chunks, otherwise read performance can be severely compromised; a sketch of this alignment follows below.
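A minimal sketch of that alignment, assuming the file is stored with on-disk chunks of `(time=1, lat=180, lon=360)` (file and variable names are hypothetical):

```python
import xarray as xr

# Assuming on-disk chunks of (time=1, lat=180, lon=360): dask chunks of
# (100, 180, 360) map each dask chunk onto exactly 100 whole disk chunks,
# so no disk chunk has to be read more than once.
ds = xr.open_dataset("input.nc", chunks={"time": 100, "lat": 180, "lon": 360})
```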
I've tried to address the items arising from our discussion in the meeting but my changes would benefit from others' review!
## NetCDF metadata
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware.
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues to remain aware of.
Reject this change; the original avoided a dangling preposition (which may or may not really be a rule in English!)
When using self-describing data formats (such as netCDF), it is important to understand the various attributes contained in the metadata, how to interact with them, and potential issues of which to remain aware.
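As a pointer for this section, a minimal sketch of inspecting netCDF attributes from xarray (file and variable names are hypothetical):

```python
import xarray as xr

# In xarray, global and per-variable attributes are plain dictionaries.
ds = xr.open_dataset("input.nc")
print(ds.attrs)          # global attributes, e.g. title, history
print(ds["tas"].attrs)   # variable attributes, e.g. units, long_name
```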
Closes #92