Minimal Zarr client #239

jeromekelleher · 2025-06-25T16:42:42Z

jeromekelleher
Jun 25, 2025
Maintainer

I've been wondering how hard it would be to implement a minimal, read-only Zarr client such as we'd need for this repo. Here's a prototype I knocked together while procrastinating from doing things I really should be working on:

import json
import pathlib
import itertools
import sys

import numpy as np
import blosc


class Store:
    def __init__(self, path):
        self.path = pathlib.Path(path)

    def fetch_chunk(self, array_name, coords):
        path = self.path / array_name
        for coord in coords:
            path /= str(coord)

        with open(path, "rb") as f:
            return f.read()

    def fetch_metadata(self, array_name):
        path = self.path / array_name / ".zarray"
        with open(path) as f:
            return f.read()

    def fetch_attrs(self, array_name):
        path = self.path / array_name / ".zattrs"
        with open(path) as f:
            return f.read()


class ArrayBlocks:
    def __init__(self, array):
        self.array = array

    def __getitem__(self, chunk_coords):
        if len(chunk_coords) != len(self.array.shape):
            raise ValueError("coordinate mismatch")
        # Fetch the raw (compressed) chunk bytes and decode them with Blosc.
        data = self.array.store.fetch_chunk(self.array.name, chunk_coords)
        decompressed = blosc.decompress(data)
        buff = np.frombuffer(decompressed, dtype=self.array.dtype)

        complete_chunk = True
        slices = []
        for dim, coord in enumerate(chunk_coords):
            mod = self.array.shape[dim] % self.array.chunks[dim]
            if coord == self.array.cdata_shape[dim] - 1 and mod != 0:
                slices.append(slice(0, mod))
                complete_chunk = False
            else:
                slices.append(slice(0, None))
        # Zarr v2 always stores a chunk with the full chunk shape, even when
        # the array is smaller than one chunk along some dimension.
        a = buff.reshape(self.array.chunks)
        if complete_chunk:
            return a
        return a[tuple(slices)]


class Array:
    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.metadata = json.loads(self.store.fetch_metadata(name))
        assert self.metadata["zarr_format"] == 2
        assert self.metadata["dimension_separator"] == "/"
        assert self.metadata["fill_value"] is None
        assert self.metadata["filters"] is None
        assert self.metadata["order"] == "C"
        self.attrs = json.loads(self.store.fetch_attrs(name))
        self.chunks = tuple(self.metadata["chunks"])
        self.shape = tuple(self.metadata["shape"])
        assert len(self.shape) == len(self.chunks)
        num_dims = len(self.shape)
        cdata_shape = []
        chunk_size = []
        for dim in range(num_dims):
            num_chunks = (self.shape[dim] // self.chunks[dim]) + (
                (self.shape[dim] % self.chunks[dim]) > 0
            )
            cdata_shape.append(max(1, num_chunks))
            chunk_size.append(min(self.chunks[dim], self.shape[dim]))
        self.dtype = self.metadata["dtype"]
        self.chunk_size = chunk_size
        self.cdata_shape = tuple(cdata_shape)
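        # Keep the raw compressor config; chunks are decoded with blosc
        # directly in ArrayBlocks, so no codec object is built here.
        self.compressor = self.metadata["compressor"]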
        self.blocks = ArrayBlocks(self)


def main():
    # store = Store("sample.vcz")
    store = Store(sys.argv[1])
    array = Array(sys.argv[2], store)
    chunk_ranges = [range(num_chunks) for num_chunks in array.cdata_shape]

    import zarr
    import numpy.testing as nt

    root = zarr.open(sys.argv[1])
    z = root[sys.argv[2]]

    assert z.chunks == array.chunks
    assert z.shape == array.shape
    assert z.cdata_shape == array.cdata_shape

    for chunk_coord in itertools.product(*chunk_ranges):
        A = array.blocks[chunk_coord]
        # print(chunk_coord, A.shape, A.dtype, np.sum(A))
        # print(A)
        B = z.blocks[chunk_coord]
        assert A.shape == B.shape
        assert A.dtype == B.dtype
        # print(A)
        # print(B)
        nt.assert_array_equal(A, B)


if __name__ == "__main__":
    main()

Other than strings (see below), this seems to work fine on the output of bio2zarr, and gets identical results to Zarr-python.

I used the upstream blosc as I was curious to see how much we depended on numcodecs. You can plug in numcodecs.blosc here, though, and it works just fine.

The Store interface here is very simple, and you can easily see how it could be extended to support Zip, S3, etc. by depending on boto3 and the like (which I would prefer to fsspec, which is very complicated and much more than we need). In practice we would probably do things a bit differently by making it async from the ground up, though, and making sure that it integrates well with plans for the readahead cache.
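As a rough illustration of how small such an extension could be, here is a sketch of a Zip-backed store that exposes the same three methods as Store. The class name ZipStore and the assumption that the archive holds the store contents at its root are hypothetical, not something from the prototype; only the standard library zipfile module is used.

```python
import zipfile


class ZipStore:
    """Read-only store with the same three methods as Store, backed by a Zip file.

    Hypothetical sketch: assumes the archive contains the store's contents
    at its root, e.g. "call_genotype/.zarray" and "call_genotype/0/1/0".
    """

    def __init__(self, path):
        self.zip = zipfile.ZipFile(path, "r")

    def _read(self, *parts):
        # Zip member names always use "/" as the separator.
        return self.zip.read("/".join(str(part) for part in parts))

    def fetch_chunk(self, array_name, coords):
        return self._read(array_name, *coords)

    def fetch_metadata(self, array_name):
        return self._read(array_name, ".zarray").decode("utf-8")

    def fetch_attrs(self, array_name):
        return self._read(array_name, ".zattrs").decode("utf-8")
```

Because Array only ever calls these three methods, an instance of this class could be passed to Array in place of Store without other changes.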

We're not using any of the fancy indexing from Zarr, so just providing the Array.blocks interface is all we need.
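To make the block-only access pattern concrete, here is a minimal sketch of the chunk-grid arithmetic it relies on, written without NumPy. The function names are hypothetical; the logic follows the cdata_shape and edge-chunk handling in the prototype.

```python
def chunk_grid_shape(shape, chunks):
    """Number of chunks along each dimension (ceiling division), at least 1."""
    return tuple(max(1, (s + c - 1) // c) for s, c in zip(shape, chunks))


def block_shape(shape, chunks, chunk_coords):
    """Shape of the valid data in the block at chunk_coords.

    Blocks on the trailing edge of a dimension are clipped to the array
    shape; all other blocks have the full chunk shape.
    """
    return tuple(
        min(c, s - coord * c) for s, c, coord in zip(shape, chunks, chunk_coords)
    )
```

For an array of shape (10, 7) with chunks (4, 7), the grid has three blocks along the first dimension and one along the second, and the last block along the first dimension holds only two valid rows.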

Strings are a pain, and I think we'd have to support the string output of bio2zarr with a different code path, most likely just using the VLenUtf8 codec from numcodecs.
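For a sense of how much work the string code path involves, here is a dependency-free sketch of a decoder for length-prefixed UTF-8 strings. The byte layout is an assumption based on my understanding of the numcodecs vlen-utf8 encoding (a little-endian uint32 item count, then a little-endian uint32 length before each string) and should be checked against numcodecs before relying on it; the function name is hypothetical.

```python
import struct


def decode_vlen_utf8(buf):
    """Decode a buffer of length-prefixed UTF-8 strings into a list of str.

    Assumed layout (to be verified against numcodecs): a little-endian
    uint32 item count, then for each item a little-endian uint32 byte
    length followed by that many bytes of UTF-8.
    """
    view = memoryview(buf)
    (num_items,) = struct.unpack_from("<I", view, 0)
    offset = 4
    items = []
    for _ in range(num_items):
        (length,) = struct.unpack_from("<I", view, offset)
        offset += 4
        if offset + length > len(view):
            raise ValueError("buffer is truncated")
        items.append(bytes(view[offset : offset + length]).decode("utf-8"))
        offset += length
    return items
```

In the prototype this would run on the output of blosc.decompress, and the resulting flat list would then be reshaped to the chunk shape in the same way as the numeric arrays.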

The basic thinking here is that we can probably get 99% of what we want done with a small subset of the Zarr protocol functionality (are filters really that useful? Shouldn't we just store the truncated data, e.g., for floating point data?) and by just using blosc as the standard compressor in the spec. Then VCZ clients can be smaller and simpler.
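One way to picture such a subset is as a check on the .zarray document that reports everything outside it, rather than failing on the first assert. The sketch below mirrors the asserts in the prototype and adds a Blosc-only compressor check; the function name and the exact wording of the messages are hypothetical.

```python
import json


def unsupported_features(zarray_json):
    """Return a list of reasons why a .zarray document is outside the subset."""
    metadata = json.loads(zarray_json)
    problems = []
    if metadata.get("zarr_format") != 2:
        problems.append("zarr_format must be 2")
    if metadata.get("dimension_separator") != "/":
        problems.append('dimension_separator must be "/"')
    if metadata.get("fill_value") is not None:
        problems.append("fill_value must be null")
    if metadata.get("filters") is not None:
        problems.append("filters are not supported")
    if metadata.get("order") != "C":
        problems.append('order must be "C"')
    # A missing or null compressor is treated as unsupported as well.
    compressor = metadata.get("compressor") or {}
    if compressor.get("id") != "blosc":
        problems.append("compressor must be blosc")
    return problems
```

An empty list means the array can be read by the minimal client; anything else gives the user a complete explanation in one pass.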

benjeffery · 2025-06-25T17:28:02Z

benjeffery
Jun 25, 2025
Maintainer

I'm +1 on a minimal Zarr implementation; we need tight control over what is being fetched and cached.
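As a sketch of what that control could look like with a store interface this small, the wrapper below keeps a bounded number of recently used chunks in memory and counts real fetches. The class name, the max_chunks parameter and the least-recently-used eviction policy are all assumptions for illustration, not a proposal for the actual cache design.

```python
from collections import OrderedDict


class CachedStore:
    """Wrap a store and keep the most recently used chunks in memory.

    Hypothetical sketch: `store` is any object with a
    fetch_chunk(array_name, coords) method.
    """

    def __init__(self, store, max_chunks=16):
        self.store = store
        self.max_chunks = max_chunks
        self.cache = OrderedDict()
        self.num_fetches = 0

    def fetch_chunk(self, array_name, coords):
        key = (array_name, tuple(coords))
        if key in self.cache:
            # Mark as most recently used and serve from memory.
            self.cache.move_to_end(key)
            return self.cache[key]
        data = self.store.fetch_chunk(array_name, coords)
        self.num_fetches += 1
        self.cache[key] = data
        if len(self.cache) > self.max_chunks:
            # Evict the least recently used chunk.
            self.cache.popitem(last=False)
        return data
```

Every byte that reaches the client passes through fetch_chunk, so the fetch count and the eviction policy are fully under our control.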

I think in Zarr v3 all the string problems go away with dtype="T", although I'd need to dig in a bit to confirm this. If true, it is a big argument for going v3-only if there are no other showstoppers.

4 replies

jeromekelleher Jun 25, 2025
Maintainer Author

Zarr python V3 or zarr format V3?

benjeffery Jun 26, 2025
Maintainer

As my testing with zarr-python v3 had resulted in a specific string data type, I had assumed it would be in the v3 spec. However, it isn't, and is defined as an "extension" instead: https://github.com/zarr-developers/zarr-extensions/tree/main/data-types/string

jeromekelleher Jun 26, 2025
Maintainer Author

So to confirm, if we have zarr-python v3 and are using zarr-format=v2, we can use a "string" dtype and this is compatible with NumPy StringDType?

My goal here ultimately would be to define a minimal working set of dtypes that can cover our uses and just support those (so, the various integers, floats, doubles and StringDType, basically).
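A rough sketch of how such a working set could be enforced for the numeric types is below. It parses the dtype string found in .zarray (for example "<i4") and checks it against a small table. The exact contents of the table, the exclusion of big-endian data and the function name are assumptions for illustration; strings would need a separate code path.

```python
import re

# Assumed minimal set: signed and unsigned integers, float32 and float64.
SUPPORTED = {
    "i": (1, 2, 4, 8),
    "u": (1, 2, 4, 8),
    "f": (4, 8),
}

# Byte order character, kind character, item size in bytes.
_TYPESTR = re.compile(r"([<>|])([a-zA-Z])(\d+)")


def is_supported_dtype(typestr):
    """True if a Zarr v2 dtype string such as "<i4" is in the minimal set."""
    if not isinstance(typestr, str):
        # Structured dtypes are encoded as lists in .zarray.
        return False
    match = _TYPESTR.fullmatch(typestr)
    if match is None:
        return False
    byteorder, kind, size = match.groups()
    if byteorder == ">":
        # Big-endian data is excluded to keep decoding trivial; this is a
        # simplifying assumption, not something decided in this thread.
        return False
    return int(size) in SUPPORTED.get(kind, ())
```

With this table, "<i4", "|u1" and "<f8" are accepted, while object arrays ("|O"), fixed-width unicode ("<U10") and half-precision floats ("<f2") are rejected.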

benjeffery Jun 26, 2025
Maintainer

Sadly no, setting zarr_format=2 gives an unsupported dtype error when creating a zarr array from a StringDType.

tomwhite · 2025-06-30T09:56:37Z

tomwhite
Jun 30, 2025
Maintainer

It's an interesting exercise, but I'd need convincing that it's a good idea to write our own Zarr implementation for production use. I'd rather use, and contribute to (when needed), the upstream Zarr efforts (Zarr Python itself, obstore, zarrs-python, zarrs).

0 replies
