Our `ZarrParser` unfortunately has to list every chunk in the object store (see #850 (comment)). But I think we can make this a lot faster and less memory-intensive.
Currently we use a vendored function from zarr-python (`async def _concurrent_map(...)`) and use it to call `store.getsize` concurrently on all the (possible) keys of a zarr array (`virtualizarr/parsers/zarr.py`, line 118 in 785de91):

```python
[(k,) for k in chunk_keys], zarr_array.store.getsize
```
Then the results go into a python dict (`virtualizarr/parsers/zarr.py`, line 125 in 785de91) whose entries look like:

```python
key: {"path": p, "offset": offset, "length": length}
```
which becomes the chunk manifest for that array.
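For reference, the current flow amounts to something like this minimal sketch (not VirtualiZarr's actual code): a semaphore-bounded concurrent map over all candidate chunk keys, with the results assembled into a dict-of-dicts manifest. `fake_getsize`, the key list, and the bucket path are hypothetical stand-ins for the real store calls.

```python
import asyncio


async def concurrent_map(items, fn, limit=50):
    # Bound concurrency so we don't fire thousands of requests at once,
    # mimicking the vendored _concurrent_map from zarr-python.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await fn(*item)

    return await asyncio.gather(*(run(item) for item in items))


async def fake_getsize(key: str) -> int:
    # Stand-in for zarr_array.store.getsize, which would hit the object store.
    return 100 + len(key)


async def main():
    chunk_keys = ["c/0/0", "c/0/1", "c/1/0", "c/1/1"]
    sizes = await concurrent_map([(k,) for k in chunk_keys], fake_getsize)
    # One small dict allocated per chunk: this per-entry Python overhead
    # (plus the per-key getsize round trips) is the cost of the current approach.
    return {
        key: {"path": f"s3://bucket/array/{key}", "offset": 0, "length": length}
        for key, length in zip(chunk_keys, sizes)
    }


manifest = asyncio.run(main())
```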
Instead what we could do is:

- use `obstore.list` to do the concurrent loop in rust instead of python
- pass `return_arrow=True` to get a stream of PyArrow RecordBatches back
- construct the python `ChunkManifest` object's numpy arrays[^1] directly from the Arrow arrays, minimizing memory copies (i.e. the opposite of what I did in Pass manifests to icechunk as pyarrow arrays #861)
I haven't benchmarked the current approach but I'm pretty sure this would be waaay faster.
[^1]: Unfortunately we can't just keep the manifests as arrow arrays the whole way through because of the potential need to concatenate manifests along arbitrary dims.