Speed up ZarrParser using obstore and Arrow? #891

@TomNicholas

Description

Our ZarrParser unfortunately has to list every chunk in the object store (see #850 (comment)). But I think we can make this a lot faster and less memory-intensive.

Currently we use a vendored function from zarr-python,

```python
async def _concurrent_map(
```

and use it to call `store.getsize` concurrently on all the (possible) keys of a zarr array:

```python
[(k,) for k in chunk_keys], zarr_array.store.getsize
```

Then the results go into a python dict,

```python
key: {"path": p, "offset": offset, "length": length}
```

which becomes the chunk manifest for that array.
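A minimal, self-contained sketch of the current pattern (the chunk key names, `s3://bucket/...` paths, and the stubbed `getsize` coroutine are all hypothetical stand-ins for zarr's store API; zarr's real `_concurrent_map` also takes a concurrency limit, approximated here with a semaphore):

```python
import asyncio

async def _concurrent_map(items, func, limit=10):
    # Apply `func` to every argument tuple, with at most `limit` calls in flight.
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await func(*item)

    return await asyncio.gather(*(run(item) for item in items))

# Stub standing in for zarr_array.store.getsize (sizes are made up).
async def getsize(key: str) -> int:
    return {"c/0/0": 100, "c/0/1": 200, "c/1/0": 300}[key]

async def build_manifest():
    chunk_keys = ["c/0/0", "c/0/1", "c/1/0"]
    sizes = await _concurrent_map([(k,) for k in chunk_keys], getsize)
    # Each chunk is a whole object in the store, so its offset is 0 and
    # its length is the object size.
    return {
        key: {"path": f"s3://bucket/{key}", "offset": 0, "length": length}
        for key, length in zip(chunk_keys, sizes)
    }

manifest = asyncio.run(build_manifest())
```

Every `getsize` call here is a separate Python coroutine, which is exactly the per-key overhead the proposal below avoids.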

Instead what we could do is:

  • use `obstore.list` to do the concurrent loop in rust instead of python
  • pass `return_arrow=True` to get a stream of PyArrow RecordBatches back
  • construct the python `ChunkManifest` object's numpy arrays[1] directly from the Arrow arrays, minimizing memory copies (i.e. the opposite of what I did in Pass manifests to icechunk as pyarrow arrays #861)

I haven't benchmarked the current approach, but I'm pretty sure this would be waaay faster.

Footnotes

  1. Unfortunately we can't just keep the manifests as arrow arrays the whole way through because of the potential need to concatenate manifests along arbitrary dims.
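The constraint in that footnote is concrete in numpy terms: each manifest field is an N-dimensional array with one element per chunk, and concatenating two datasets along some dimension is an N-D `np.concatenate` on each field, which a flat (1-D) Arrow array can't express without reshaping round-trips. The shapes and values below are purely illustrative:

```python
import numpy as np

# Hypothetical 2x2 chunk grids of byte lengths for the same zarr array
# in two datasets being concatenated along axis 0.
lengths_a = np.array([[100, 200], [300, 400]], dtype=np.uint64)
lengths_b = np.array([[500, 600], [700, 800]], dtype=np.uint64)

# Concatenating the manifests along an arbitrary chunk dimension is a
# plain N-D concatenate on each per-chunk field.
combined = np.concatenate([lengths_a, lengths_b], axis=0)
```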
