Speed up ZarrParser using obstore and Arrow? #891

@TomNicholas

Description

Our ZarrParser unfortunately has to list every chunk in the object store (see #850 (comment)). But I think we can make this a lot faster and less memory-intensive.

Currently we use a vendored function from zarr-python,

```python
async def _concurrent_map(
```

and use it to call `store.getsize` concurrently on all the (possible) keys of a zarr array:

```python
[(k,) for k in chunk_keys], zarr_array.store.getsize
```

Then the results go into a python dict,

```python
key: {"path": p, "offset": offset, "length": length}
```

which becomes the chunk manifest for that array.
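A minimal, self-contained sketch of the current pattern (the chunk key names, `s3://bucket/...` paths, and the stubbed `getsize` coroutine are all hypothetical stand-ins for zarr's store API; zarr's real `_concurrent_map` also takes a concurrency limit, approximated here with a semaphore):

```python
import asyncio

async def _concurrent_map(items, func, limit=10):
    # Apply `func` to every argument tuple, with at most `limit` calls in flight.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await func(*item)

    return await asyncio.gather(*(run(item) for item in items))

# Stub standing in for zarr_array.store.getsize (sizes are made up).
async def getsize(key: str) -> int:
    return {"c/0/0": 100, "c/0/1": 200, "c/1/0": 300}[key]

async def build_manifest():
    chunk_keys = ["c/0/0", "c/0/1", "c/1/0"]
    sizes = await _concurrent_map([(k,) for k in chunk_keys], getsize)
    # Each chunk is a whole object in the store, so its offset is 0 and
    # its length is the object size.
    return {
        key: {"path": f"s3://bucket/{key}", "offset": 0, "length": length}
        for key, length in zip(chunk_keys, sizes)
    }

manifest = asyncio.run(build_manifest())
```

Every `getsize` call here is a separate Python coroutine, which is exactly the per-key overhead the proposal below avoids.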

Instead what we could do is:

  • use `obstore.list` to do the concurrent loop in rust instead of python
  • pass `return_arrow=True` to get a stream of PyArrow RecordBatches back
  • construct the python `ChunkManifest` object's numpy arrays[1] directly from the Arrow arrays, minimizing memory copies (i.e. the opposite of what I did in Pass manifests to icechunk as pyarrow arrays #861)

I haven't benchmarked the current approach, but I'm pretty sure this would be waaay faster.

Footnotes

  1. Unfortunately we can't just keep the manifests as arrow arrays the whole way through because of the potential need to concatenate manifests along arbitrary dims.
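The constraint in that footnote is concrete in numpy terms: each manifest field is an N-dimensional array with one element per chunk, and concatenating two datasets along some dimension is an N-D `np.concatenate` on each field, which a flat (1-D) Arrow array can't express without reshaping round-trips. The shapes and values below are purely illustrative:

```python
import numpy as np

# Hypothetical 2x2 chunk grids of byte lengths for the same zarr array
# in two datasets being concatenated along axis 0.
lengths_a = np.array([[100, 200], [300, 400]], dtype=np.uint64)
lengths_b = np.array([[500, 600], [700, 800]], dtype=np.uint64)

# Concatenating the manifests along an arbitrary chunk dimension is a
# plain N-D concatenate on each per-chunk field.
combined = np.concatenate([lengths_a, lengths_b], axis=0)
```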
