
Resources on chunking #48

Open
dougiesquire opened this issue Feb 10, 2022 · 11 comments

@dougiesquire
Contributor

In our meeting yesterday, we started talking about providing some more resources on chunking. This is such an integral part of effective data processing. But my experience is that, while it's easy to understand the implications of chunking in simple/contrived cases, the real world is always more complex. We wondered whether maybe we could pull together some resources that are closer to real-world examples. As I'm writing this, I'm realising that it will be quite hard (for me at least) to separate the concepts of chunking and dask, but a chapter could look something like:

  • Chunking matters
  • Chunking in the real world
    • some "real-world" (geoscience) examples of where thoughtful chunking decisions had big performance implications. Ideally we could curate a few of these to each demonstrate a key concept. @ScottWales, do you have any examples from users you've helped that could help motivate these examples?
    • how to apply custom functions across chunks with xarray and dask. Geoscience-specific examples of using apply_ufunc with dask="allowed" (better) or dask="parallelized" (worse, but easier). Also, perhaps a more advanced case like @ScottWales's API calculation.
    • ...
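The apply_ufunc pattern mentioned above could be sketched roughly as follows (a minimal sketch assuming xarray and dask are installed; the `standardize` function and the array shapes are made up for illustration):

```python
import numpy as np
import xarray as xr

# Toy data, chunked in space so that "time" stays whole within each chunk
# (apply_ufunc requires core dimensions to be single chunks).
da = xr.DataArray(
    np.random.rand(120, 4, 4), dims=["time", "lat", "lon"]
).chunk({"time": -1, "lat": 2, "lon": 2})

def standardize(arr):
    # Plain numpy function applied to each chunk independently;
    # apply_ufunc moves the "time" core dimension to the last axis.
    return (arr - arr.mean(axis=-1, keepdims=True)) / arr.std(axis=-1, keepdims=True)

zscores = xr.apply_ufunc(
    standardize,
    da,
    input_core_dims=[["time"]],   # "time" becomes the last axis inside standardize
    output_core_dims=[["time"]],
    dask="parallelized",          # wrap the numpy function for dask automatically
    output_dtypes=[da.dtype],
)
```

With dask="allowed" the function itself would need to handle dask arrays; dask="parallelized" just maps a plain numpy function over the chunks, which is easier but gives dask less room to optimise.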

Interested to hear what others think

@hot007
Contributor

hot007 commented Feb 11, 2022

@dougiesquire I have a nice example - this dataset (public) contains monthly model output with default chunking, and it's not very performant.
On the internal network, we have this dataset, which is a tiled version of the same thing, chunked for timeseries analysis. The latter makes it possible to run operations across the full 40+ years of the dataset (even over space), e.g. extreme value analyses that were computationally intractable with the original dataset.
That's before we start looking at zarrification, which we've also done, with similar results.

There are also examples in the satellite space of spatially vs temporally optimised chunking, I think for publication most people choose a halfway option?

In terms of dask, I think the most important thing for us to document is ensuring dask chunks are integer multiples of the underlying file chunks (it may be necessary to rewrite the underlying data); otherwise you can get even worse performance with dask than without!

@dougiesquire
Contributor Author

That's great - thanks @hot007!

@paigem
Contributor

paigem commented Mar 9, 2022

This looks great @dougiesquire and @hot007! Thanks for starting this discussion.

I can add an example where I rechunked my data from temporal to spatial chunks so I could do frequency-domain analysis. But I had to use the package rechunker to physically rechunk the data because it was such a big dataset, which might add a level of complexity that we don't want for our examples here.
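For a dataset small enough to shuffle in memory, the temporal-to-spatial rechunk can be sketched lazily in xarray (a toy sketch assuming xarray and dask; rechunker performs the same transformation on disk when the data is too large for this):

```python
import numpy as np
import xarray as xr

# Toy data chunked "temporally": one time step per chunk, all of space.
da = xr.DataArray(
    np.random.rand(100, 8, 8), dims=["time", "lat", "lon"]
).chunk({"time": 1})

# Rechunk to "spatial" chunks: the full time series in one chunk per small
# spatial tile, which is what frequency-domain analysis along "time" needs.
spatial = da.chunk({"time": -1, "lat": 2, "lon": 2})
print(spatial.chunks)  # ((100,), (2, 2, 2, 2), (2, 2, 2, 2))
```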

@paigem
Contributor

paigem commented Mar 10, 2022

Points to address from our discussion:

  • point to rechunking paragraph
  • add new page under computations for chunking
  • also mention in data storage
  • there is a dataset guidelines working group, but chunking may make sense to add here

@paigem
Contributor

paigem commented Mar 10, 2022

@dougiesquire To follow up on my previous comment, during our meeting today, we discussed this and think it's a great idea to expand on the concept of and best practices for chunking.

  1. There is a short section about chunking at the bottom of the Computations page. We think it would be great to move all discussion of chunking to a new page under "Computations" e.g. called "Chunking".
  2. Chunking should probably also be addressed on the data storage page, with regards to chunking of newly created datasets on disk. (i.e. how datasets are chunked for storage, which is somewhat separate from how to use chunks in computations) The Dataset guidelines working group will probably link to our discussion of chunking for data storage.

@paigem paigem self-assigned this Apr 14, 2022
@paigem
Contributor

paigem commented Apr 14, 2022

I've started to work on this locally:

  • added a new page called "Data Chunking" under "Computations"
  • started writing some basic content starting from @dougiesquire 's first comment

@paigem
Contributor

paigem commented Apr 14, 2022

More content/links to add to the chunking section in this comment: #12 (comment)

@dougiesquire
Contributor Author

Apologies for dropping the ball here @paigem! The last few months have been pretty hectic and I imagine it'll stay that way until the new financial year. Happy to take a look through what you come up with, and I'll try to set aside some time to actually contribute something.

@paigem
Contributor

paigem commented Apr 14, 2022

No worries @dougiesquire - I'll create the new page and make a start and then tag you and others in this thread so you can add more detail as you see fit. (Also, I meant to tag you @dougiesquire above - I just updated the comment with the correct tag! Sorry about that!)

@paigem
Contributor

paigem commented Apr 22, 2022

@dougiesquire @hot007 A new chunking page has been added in PR #55. There is minimal content for now - just wanted to get the page added to make it easier to add content.

@paolap
Contributor

paolap commented May 10, 2022

This blog post gives a good introduction to chunks, covering enough to be comprehensive while still being easy to follow:
https://blog.dask.org/2021/11/02/choosing-dask-chunk-sizes.
Another thing we could cover in terms of real-life examples is time series analysis. We have lots of users who attempt a percentile calculation on data whose time dimension has a chunk size of 1.
It might be worth having a paragraph at the bottom on the zarr vs chunking approach.
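That percentile pitfall could be demonstrated in a few lines (a sketch assuming xarray and dask; the shapes are arbitrary). quantile cannot reduce across a chunked dimension, so data with time chunks of 1 must be rechunked first:

```python
import numpy as np
import xarray as xr

# Typical "one file per time step" layout: time chunk size of 1.
da = xr.DataArray(
    np.random.rand(365, 10, 10), dims=["time", "lat", "lon"]
).chunk({"time": 1})

# Rechunk "time" into a single chunk before reducing along it;
# da.quantile(0.9, dim="time") on the original chunking raises an error.
p90 = da.chunk({"time": -1}).quantile(0.9, dim="time")
```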

I've started adding to the chunking.md file. As it is a work in progress, I pushed my changes to a new branch, chunking-payola, but I haven't created a pull request yet. Feel free to add comments; I might work a bit more on it before the meeting on Thursday.
