Resources on chunking #48
Comments
@dougiesquire I have a nice example: this dataset (public) contains monthly model output with default chunking, and it's not very performant. There are also examples in the satellite space of spatially vs temporally optimised chunking; I think for publication most people choose a halfway option? In terms of dask, I think the most important thing for us to document is ensuring that dask chunks are integer multiples of the underlying file chunks. It may be necessary to rewrite the underlying data; otherwise you can get even worse performance with dask than without!
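To make the "integer multiples" point concrete, here is a minimal sketch in plain Python. The helper names (`is_aligned`, `align_to_file_chunks`) and all chunk sizes are hypothetical, chosen only for illustration; it shows one possible way to round a requested dask chunking down to whole multiples of the on-disk chunks, not a recommended setting for any particular dataset.

```python
def is_aligned(dask_chunks, file_chunks):
    """True if every dask chunk size is a whole multiple of the file chunk size."""
    return all(dask_chunks[dim] % file_chunks[dim] == 0 for dim in file_chunks)


def align_to_file_chunks(requested, file_chunks):
    """Round each requested chunk size down to a multiple of the file chunk.

    At least one whole file chunk is always kept per dask chunk.
    """
    return {
        dim: max(1, requested[dim] // file_chunks[dim]) * file_chunks[dim]
        for dim in file_chunks
    }


# Hypothetical on-disk chunking and a first guess at dask chunks.
file_chunks = {"time": 1, "lat": 300, "lon": 300}
requested = {"time": 12, "lat": 500, "lon": 1000}

aligned = align_to_file_chunks(requested, file_chunks)
print(is_aligned(requested, file_chunks))  # the first guess splits file chunks
print(aligned)
```

The aligned dictionary could then be passed as the `chunks` argument when opening the data with xarray, so that no dask task has to read part of a file chunk.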
That's great - thanks @hot007! |
This looks great @dougiesquire and @hot007! Thanks for starting this discussion. I can add an example where I rechunked my data from temporal to spatial chunks so I could do frequency-domain analysis. But I had to use the package
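
As a rough illustration of why the chunk layout matters for this kind of analysis, the sketch below counts how many chunks a read has to touch under three layouts. The grid size, the layouts and the function name `chunks_touched` are all made up for the example, and it assumes every read starts on a chunk boundary.

```python
import math


def chunks_touched(read_extent, chunks):
    """Number of chunks a read intersects, assuming it starts on a chunk boundary."""
    return math.prod(math.ceil(read_extent[dim] / chunks[dim]) for dim in chunks)


# Hypothetical dataset: 1200 monthly steps on a 360 x 720 grid.
layouts = {
    "one_map_per_chunk": {"time": 1, "lat": 360, "lon": 720},
    "long_time_per_chunk": {"time": 1200, "lat": 30, "lon": 30},
    "halfway": {"time": 120, "lat": 90, "lon": 90},
}

# Two access patterns: a full time series at one grid point, and one global map.
time_series = {"time": 1200, "lat": 1, "lon": 1}
single_map = {"time": 1, "lat": 360, "lon": 720}

for name, chunks in layouts.items():
    print(name, chunks_touched(time_series, chunks), chunks_touched(single_map, chunks))
```

With these numbers, a time series costs 1200 chunk reads when each chunk holds one map, but a single read when each chunk spans the whole time axis, which is the layout a frequency-domain analysis along time would prefer. The halfway layout is mediocre at both, which is the trade-off mentioned above for published data.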
Points to address from our discussion:
|
@dougiesquire To follow up on my previous comment, during our meeting today, we discussed this and think it's a great idea to expand on the concept of and best practices for chunking.
|
I've started to work on this locally:
|
More content/links to add to the chunking section in this comment: #12 (comment) |
Apologies for dropping the ball here @paigem! The last few months have been pretty hectic and I imagine it'll stay that way until the new financial year. Happy to take a look through what you come up with, and I'll try to set aside some time to actually contribute something |
No worries @dougiesquire - I'll create the new page and make a start and then tag you and others in this thread so you can add more detail as you see fit. (Also, I meant to tag you @dougiesquire above - I just updated the comment with the correct tag! Sorry about that!) |
@dougiesquire @hot007 A new chunking page has been added in PR #55. There is minimal content for now - just wanted to get the page added to make it easier to add content. |
This blog gives a good introduction to chunks; it covers enough to be comprehensive but is still easy to follow: I've started adding to the chunking.md file. As it is a work in progress, I pushed my changes to a new branch, chunking-payola, but I haven't created a pull request yet. Feel free to add comments. I might work a bit more on it before the meeting on Thursday.
In our meeting yesterday, we started talking about providing some more resources on chunking. This is such an integral part of effective data processing. But my experience is that, while it's easy to understand the implications of chunking in simple/contrived cases, the real world is always more complex. We wondered whether maybe we could pull together some resources that are closer to real-world examples. As I'm writing this, I'm realising that it will be quite hard (for me at least) to separate the concepts of chunking and dask, but a chapter could look something like:
apply_ufunc with dask="allowed" (better) or dask="parallelized" (worse, but easier). Also, perhaps a more advanced case like @ScottWales's API calculation.
Interested to hear what others think
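
For reference, a small sketch of the two apply_ufunc modes, assuming xarray, dask and NumPy are installed. The function `time_range` and the random test array are made up for illustration. With dask="allowed" the function is handed the dask array directly and must cope with it itself; with dask="parallelized" xarray applies the function block by block, which works for NumPy-only functions but needs the core dimension (here "time") to sit in a single chunk.

```python
import numpy as np
import xarray as xr


def time_range(x):
    """Max minus min along the last axis; works on NumPy and dask arrays alike."""
    return x.max(axis=-1) - x.min(axis=-1)


rng = np.random.default_rng(0)
da = xr.DataArray(
    rng.normal(size=(4, 6, 24)), dims=("lat", "lon", "time")
).chunk({"lat": 2, "lon": 3, "time": -1})  # "time" kept in one chunk

# dask="allowed": time_range receives the dask array and stays lazy by itself.
allowed = xr.apply_ufunc(
    time_range, da, input_core_dims=[["time"]], dask="allowed"
)

# dask="parallelized": xarray wraps time_range and runs it on each block.
parallelized = xr.apply_ufunc(
    time_range,
    da,
    input_core_dims=[["time"]],
    dask="parallelized",
    output_dtypes=[da.dtype],
)

# Both results are lazy until computed and should agree with the built-in reduction.
expected = da.max("time") - da.min("time")
print(np.allclose(allowed.values, expected.values))
print(np.allclose(parallelized.values, expected.values))
```

The example deliberately uses a function that already works on dask arrays so both modes can be compared side by side; in practice "parallelized" is the fallback when the function only understands NumPy arrays.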