Incremental rechunk #8

davidbrochart · 2020-06-11T14:43:36Z

rechunker solves, in a much cleaner way, a problem I had been trying to solve myself. Thanks a lot for working on it. I've tried it on the GPM dataset and it seems to work fine.
Do you know if it would work in an incremental mode? By that I mean: if I have already rechunked part of a dataset and want to continue later, is it possible to rechunk only the remaining part of the source and append it to the already rechunked destination?

rabernat · 2020-06-11T14:47:33Z

I'm glad this is helpful! 😄

> Do you know if it would work in an incremental mode?

It should definitely be possible in principle, but not as currently implemented.

We are trying to release this soon with its current feature set. Once we stabilize the API a bit, we would be happy to have a PR that would add incremental support.

rabernat · 2020-07-17T03:23:17Z

Hi @davidbrochart. We have done a first release and have some decent docs up. It would be fantastic if you wanted to tackle the incremental case. What sort of API did you have in mind?

davidbrochart · 2020-07-17T07:09:42Z

Great @rabernat, I'll try to implement the incremental rechunking. As far as the API is concerned, we probably want to be able to slice the source, so that we don't have to rechunk the whole dataset in one go and can resume later from a different position. So for the initial rechunk we could have:

import zarr
from rechunker import rechunk

source = zarr.ones((4, 4), chunks=(1, 4), store="source.zarr")
intermediate = "intermediate.zarr"
target = "target.zarr"
# source_slice is the proposed new argument: (start, stop) per dimension
rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((0, 2), (0, 4)))

And for the next rechunk we need to get the next slice and specify that it should be appended to the previous target:

rechunked = rechunk(source,
                    target_chunks=(2, 2),
                    target_store=target,
                    max_mem=256000,
                    temp_store=intermediate,
                    source_slice=((2, 4), (0, 4)),
                    target_append=True)

What do you think?
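
For illustration, the successive `source_slice` values for a batched run could be generated rather than written by hand. The sketch below is plain Python and does not call rechunker. `iter_source_slices` is a hypothetical helper, and `source_slice` and `target_append` are the arguments proposed above, not existing rechunker parameters. It assumes the dataset is processed along a single axis.

```python
from typing import Iterator, Tuple

Bounds = Tuple[Tuple[int, int], ...]


def iter_source_slices(shape: Tuple[int, ...], step: int, axis: int = 0) -> Iterator[Bounds]:
    """Yield (start, stop) bounds per dimension, advancing along `axis`.

    The output format mirrors the proposed `source_slice` argument
    (hypothetical, not part of the released rechunker API).
    """
    if step <= 0:
        raise ValueError("step must be positive")
    for start in range(0, shape[axis], step):
        # The last batch may be shorter than `step`.
        stop = min(start + step, shape[axis])
        yield tuple(
            (start, stop) if dim == axis else (0, size)
            for dim, size in enumerate(shape)
        )


slices = list(iter_source_slices((4, 4), step=2))
# The first batch creates the target; later batches would pass target_append=True.
for i, bounds in enumerate(slices):
    print(bounds, "append" if i > 0 else "initial")
```

Choosing `step` as a multiple of the target chunk size along the append axis (2 in the example above) would presumably keep a later batch from rewriting chunks produced by an earlier one.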

rabernat · 2020-07-17T14:23:25Z

I'm curious why we need the source_slice argument. It seems like we should be able to just pass a sliced array, no?

But I guess zarr may not support lazy slicing.

davidbrochart · 2020-07-17T14:28:08Z

> But I guess zarr may not support lazy slicing.

Yes, I think if we slice the Zarr array we get an in-memory NumPy array.
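
To make the distinction concrete, here is a minimal sketch of what a lazy slice would look like: an object that records the requested region and reads it only on demand. `LazySlice` and `FakeArray` are hypothetical names used for illustration; neither is part of zarr or rechunker, and `FakeArray` only stands in for a chunked array so that the reads can be counted.

```python
from typing import Any, Tuple


class LazySlice:
    """Remember a region of `source` without reading it (hypothetical helper)."""

    def __init__(self, source: Any, region: Tuple[slice, ...]):
        self.source = source
        self.region = region

    @property
    def shape(self) -> Tuple[int, ...]:
        # slice.indices clamps the bounds to the source extent.
        return tuple(
            len(range(*s.indices(n)))
            for s, n in zip(self.region, self.source.shape)
        )

    def read(self) -> Any:
        # Data is loaded only here, when it is actually needed.
        return self.source[self.region]


class FakeArray:
    """Stand-in for a chunked array that counts how often it is read."""

    def __init__(self, shape: Tuple[int, ...]):
        self.shape = shape
        self.reads = 0

    def __getitem__(self, key: Any) -> Any:
        self.reads += 1
        return key  # a real array would return the data for `key`


source = FakeArray((4, 4))
view = LazySlice(source, (slice(0, 2), slice(0, 4)))
print(view.shape, source.reads)  # the shape is known before anything is read
```

Passing explicit bounds such as `source_slice` gives the same effect without needing a wrapper type: the region is described up front and the data is read only by the code that does the rechunking.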

rabernat · 2020-07-17T14:30:03Z

Thoughts on the API @TomAugspurger, @tomwhite?

tomwhite · 2020-07-17T15:43:28Z

This feature will be very useful. The API looks good to me.

I briefly wondered if source_slice is needed at all, since in append mode only new data would be rechunked, but that's not safe if the source is being written to at the same time as being incrementally rechunked. So source_slice is needed. It should be optional though to support the non-incremental case.

davidbrochart · 2020-07-17T21:28:15Z

Also, even if the source is not being written to, you may not want to rechunk the whole of it, because that can take a lot of time. Instead, you should be able to rechunk parts of it. source_slice should also be optional in the incremental case; when it is omitted, the whole dataset should be rechunked.
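
One way the optional behaviour could be handled is to normalize the argument before any work starts. The sketch below is an assumption about how that might look, not the code from the linked pull request; `resolve_source_slice` is a hypothetical helper, and it takes the `(start, stop)` per dimension format used in the examples earlier in this thread.

```python
from typing import Optional, Sequence, Tuple


def resolve_source_slice(
    shape: Sequence[int],
    source_slice: Optional[Sequence[Tuple[int, int]]] = None,
) -> Tuple[slice, ...]:
    """Turn an optional source_slice into one slice per dimension."""
    if source_slice is None:
        # No slice given: rechunk the whole source.
        return tuple(slice(0, n) for n in shape)
    if len(source_slice) != len(shape):
        raise ValueError("source_slice needs one (start, stop) pair per dimension")
    resolved = []
    for (start, stop), n in zip(source_slice, shape):
        # Reject empty, reversed, or out-of-range bounds early.
        if not 0 <= start < stop <= n:
            raise ValueError(f"bounds ({start}, {stop}) fall outside 0..{n}")
        resolved.append(slice(start, stop))
    return tuple(resolved)


print(resolve_source_slice((4, 4)))
print(resolve_source_slice((4, 4), ((2, 4), (0, 4))))
```

Validating the bounds up front means a mistyped slice fails before any chunk is copied.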

davidbrochart linked a pull request Jul 18, 2020 that will close this issue

[WIP] Incremental rechunking #28

Open

jbeezley mentioned this issue Nov 29, 2023

Rechunk to an existing store #148

Open
