In our meeting yesterday, we started talking about providing more resources on chunking, which is an integral part of effective data processing. In my experience, the implications of chunking are easy to understand in simple or contrived cases, but the real world is always more complex. We wondered whether we could pull together some resources that are closer to real-world examples. As I write this, I'm realising that it will be quite hard (for me at least) to separate the concepts of chunking and dask, but a chapter could look something like:
- Chunking matters
  - introduce the idea of chunked data and performance implications. Refer readers elsewhere, e.g. https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters
  - introduce dask and core concepts - chunk sizes, align chunks with storage etc
  - ...
- Chunking in the real world
  - some "real-world" (geoscience) examples of where thoughtful chunking decisions had big performance implications. Ideally we could curate a few of these to each demonstrate a key concept. @ScottWales, do you have any examples from users you've helped that could help motivate these examples?
  - how to apply custom functions across chunks with xarray and dask. Geoscience-specific examples of using `apply_ufunc` with `dask="allowed"` (better) and `dask="parallelized"` (worse, but easier). Also, perhaps a more advanced case like @ScottWales's API calculation.
  - ...
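
To make the last bullet concrete, here is a rough sketch of the two `apply_ufunc` modes side by side. Everything in it is an assumption for illustration only: the `standardise` function, the synthetic `temperature` array, and the chunk sizes are hypothetical and not taken from a real workflow. With `dask="allowed"` the function is handed the dask array directly, so it has to be written with dask-compatible operations; with `dask="parallelized"` xarray applies it block by block to numpy arrays, which requires the core dimension (`time` here) to sit in a single chunk.

```python
import numpy as np
import xarray as xr


def standardise(values, axis=-1):
    """Subtract the mean and divide by the standard deviation along `axis`."""
    mean = values.mean(axis=axis, keepdims=True)
    std = values.std(axis=axis, keepdims=True)
    return (values - mean) / std


# Synthetic stand-in for a gridded field (hypothetical names and sizes).
rng = np.random.default_rng(0)
temperature = xr.DataArray(
    rng.normal(size=(120, 40, 60)),
    dims=("time", "lat", "lon"),
    name="temperature",
)

# Chunk in space but keep "time" whole, because the function works along time.
chunked = temperature.chunk({"time": -1, "lat": 20, "lon": 30})

# dask="allowed": `standardise` receives a dask array and must only use
# operations that dask arrays support (here: mean, std, arithmetic).
allowed = xr.apply_ufunc(
    standardise,
    chunked,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask="allowed",
)

# dask="parallelized": xarray applies `standardise` to each block as a
# numpy array, so the function itself does not need to know about dask.
parallelized = xr.apply_ufunc(
    standardise,
    chunked,
    input_core_dims=[["time"]],
    output_core_dims=[["time"]],
    dask="parallelized",
    output_dtypes=[chunked.dtype],
)

# Both results are still lazy; core dimensions are moved to the end,
# so the output dimension order is ("lat", "lon", "time").
print(allowed.dims, parallelized.dims)
```

Both calls should give the same numbers once computed; the difference is in who has to handle the dask array. A real example for the chapter would swap the synthetic array for a dataset opened from disk, where the chunk choice also interacts with the on-disk layout.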
Interested to hear what others think