Incremental rechunk #8
Comments
I'm glad this is helpful! 😄
It should definitely be possible in principle, but not as currently implemented. We are trying to release this soon with its current feature set. Once we stabilize the API a bit, we would be happy to have a PR that adds incremental support.
Hi @davidbrochart. We have done a first release and have some decent docs up. It would be fantastic if you wanted to tackle the incremental case. What sort of API did you have in mind?
Great @rabernat, I'll try and implement the incremental rechunking. As far as the API is concerned, we probably want to slice the source, so that we don't have to rechunk the whole dataset at once and can restart from a different position. So for the initial rechunk we could have:

    import zarr
    from rechunker import rechunk

    source = zarr.ones((4, 4), chunks=(1, 4), store="source.zarr")
    intermediate = "intermediate.zarr"
    target = "target.zarr"
    rechunked = rechunk(source,
                        target_chunks=(2, 2),
                        target_store=target,
                        max_mem=256000,
                        temp_store=intermediate,
                        source_slice=((0, 2), (0, 4)))

And for the next rechunk we need to get the next slice and specify that it should be appended to the previous target:

    rechunked = rechunk(source,
                        target_chunks=(2, 2),
                        target_store=target,
                        max_mem=256000,
                        temp_store=intermediate,
                        source_slice=((2, 4), (0, 4)),
                        target_append=True)

What do you think?
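To make the two-call pattern above concrete, here is a minimal sketch of how the sequence of slices could be generated. It assumes the `source_slice` format of `(start, stop)` pairs and the `target_append` flag exactly as proposed in this comment; neither is an existing rechunker parameter, and `incremental_slices` is a hypothetical helper name used only for illustration.

```python
def incremental_slices(shape, step, axis=0):
    """Yield (source_slice, target_append) pairs covering `shape`
    in consecutive blocks of `step` elements along `axis`.

    `source_slice` uses the ((start, stop), ...) format proposed in
    this thread; it is not part of the released rechunker API.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, shape[axis], step):
        stop = min(start + step, shape[axis])
        source_slice = tuple(
            (start, stop) if dim == axis else (0, size)
            for dim, size in enumerate(shape)
        )
        # Only the first block creates the target; later blocks append to it.
        yield source_slice, start > 0


# Plan for the 4x4 example above, two rows at a time.
plan = list(incremental_slices((4, 4), step=2))
for source_slice, target_append in plan:
    print(source_slice, target_append)
```

Each yielded pair corresponds to one hypothetical `rechunk(...)` call: the first with `target_append` unset, the following ones with `target_append=True`.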
I'm curious why we need the But I guess zarr may not support lazy slicing.
Yes, I think if we slice the Zarr array we get an in-memory NumPy array.
Thoughts on the API @TomAugspurger, @tomwhite?
This feature will be very useful. The API looks good to me. I briefly wondered if
Also, even if the source is not being written to, you may not want to rechunk the whole of it, because that can take a lot of time. Instead, you should be able to rechunk parts of it. The slice should also be optional in the incremental case; when it is omitted, the whole dataset should be rechunked.
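The "optional slice" behaviour described here could look roughly like the sketch below. It assumes the `(start, stop)` pair format for `source_slice` proposed earlier in the thread; `normalize_source_slice` is a hypothetical helper, not something that exists in rechunker.

```python
def normalize_source_slice(shape, source_slice=None):
    """Turn an optional ((start, stop), ...) spec into slice objects.

    When `source_slice` is None, the whole array is selected, which
    matches the "rechunk the whole dataset" default discussed above.
    """
    if source_slice is None:
        return tuple(slice(0, size) for size in shape)
    if len(source_slice) != len(shape):
        raise ValueError("source_slice needs one (start, stop) pair per dimension")
    slices = []
    for (start, stop), size in zip(source_slice, shape):
        if not 0 <= start < stop <= size:
            raise ValueError(f"invalid range ({start}, {stop}) for size {size}")
        slices.append(slice(start, stop))
    return tuple(slices)


whole = normalize_source_slice((4, 4))
part = normalize_source_slice((4, 4), ((2, 4), (0, 4)))
print(whole)
print(part)
```

Validating the bounds up front keeps a bad slice from surfacing only after part of the target has been written.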
rechunker solves a problem I was trying to solve, in a much cleaner way. Thanks a lot for working on that. I've tried it on the GPM dataset and it seems to work fine.
Do you know if it would work in an incremental mode? By that I mean that if I have already rechunked a part of a dataset, and want to continue later on, is it possible to rechunk only the remaining source and append that to the already rechunked destination?