Minimal Zarr client #239
Replies: 2 comments 4 replies
-
I'm +1 on a minimal Zarr implementation: we need tight control over what is being fetched and cached. I think in Zarr v3 all the string problems go away with |
-
It's an interesting exercise, but I'd need convincing that it's a good idea to write our own Zarr implementation for production use. I'd rather use, and contribute to (when needed), the upstream Zarr efforts (Zarr Python itself, obstore, zarrs-python, zarrs).
-
I've been wondering how hard it would be to implement a minimal, read-only Zarr client such as we'd need for this repo. Here's a prototype I knocked together while procrastinating from doing things I really should be working on:
Other than strings (see below), this seems to work fine on the output of bio2zarr, and gets identical results to Zarr-python.
I used the upstream blosc as I was curious to see how much we depended on numcodecs. You can plug in numcodecs.blosc here, though, and it works just fine.
The Store interface here is very simple, and it could easily be extended to support Zip, S3, etc. by depending on boto3 and similar libraries (which I would prefer to fsspec, which is very complicated and much more than we need). In practice we would probably do things a bit differently by making it async from the ground up, and by making sure it integrates well with plans for the readahead cache.
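To illustrate how small such a store layer can be, here is a minimal sketch. It is not the prototype's code: the names `Store`, `DirectoryStore`, `ZipStore` and the single `get` method are assumptions made for this example, and it is synchronous and standard-library only. An S3 variant would implement the same method on top of boto3.

```python
import zipfile
from pathlib import Path
from typing import Optional, Protocol


class Store(Protocol):
    """Hypothetical read-only store: maps a key to raw bytes, or None if absent."""

    def get(self, key: str) -> Optional[bytes]:
        ...


class DirectoryStore:
    """Reads keys as files below a root directory."""

    def __init__(self, root):
        self.root = Path(root)

    def get(self, key: str) -> Optional[bytes]:
        try:
            return (self.root / key).read_bytes()
        except FileNotFoundError:
            # A missing chunk is normal in Zarr; the reader decides what to do.
            return None


class ZipStore:
    """Reads keys as members of a Zip archive."""

    def __init__(self, path):
        self.archive = zipfile.ZipFile(path, mode="r")

    def get(self, key: str) -> Optional[bytes]:
        try:
            return self.archive.read(key)
        except KeyError:
            # ZipFile.read raises KeyError for unknown member names.
            return None
```

Anything that can answer `get(key)` qualifies, so a readahead cache could be layered on as another object with the same method that wraps an underlying store.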
We're not using any of the fancy indexing from Zarr, so just providing the Array.blocks interface is all we need.
Strings are a pain, and we'd have to support the string output of bio2zarr with a different code path I think, most likely just using the VLenUtf8 codec from numcodecs.
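As a rough illustration of the block-wise access mentioned above, the sketch below reads the Zarr v2 `.zarray` metadata and returns one decompressed chunk at a time. The class name `ReadOnlyArray`, its methods, and the `get`/`decompress` callables are hypothetical and not taken from the prototype; filters and variable-length strings are deliberately left out.

```python
import json
import math
from typing import Callable, Optional, Tuple


class ReadOnlyArray:
    """Hypothetical minimal reader for one Zarr v2 array (no filters, no strings)."""

    def __init__(
        self,
        get: Callable[[str], Optional[bytes]],  # store lookup: key -> bytes or None
        path: str,
        decompress: Callable[[bytes], bytes],  # e.g. blosc.decompress
    ):
        raw_meta = get(f"{path}/.zarray")
        if raw_meta is None:
            raise KeyError(f"no .zarray found under {path!r}")
        meta = json.loads(raw_meta)
        if meta.get("filters"):
            raise NotImplementedError("filters are not supported in this sketch")
        self.get = get
        self.path = path
        self.decompress = decompress
        self.shape = tuple(meta["shape"])
        self.chunks = tuple(meta["chunks"])
        self.dtype = meta["dtype"]
        self.fill_value = meta.get("fill_value")
        # Zarr v2 joins chunk indices with "." unless the metadata says otherwise.
        self.separator = meta.get("dimension_separator", ".")

    @property
    def num_blocks(self) -> Tuple[int, ...]:
        """Number of chunks along each dimension."""
        return tuple(math.ceil(s / c) for s, c in zip(self.shape, self.chunks))

    def block_bytes(self, index: Tuple[int, ...]) -> Optional[bytes]:
        """Return the decompressed bytes of one chunk, or None if it is not stored."""
        if len(index) != len(self.shape) or any(
            not 0 <= i < n for i, n in zip(index, self.num_blocks)
        ):
            raise IndexError(f"block index {index} out of range")
        key = f"{self.path}/" + self.separator.join(str(i) for i in index)
        raw = self.get(key)
        if raw is None:
            return None  # caller substitutes fill_value
        return self.decompress(raw)
```

With NumPy available, a caller could turn the result into an array with `np.frombuffer(buf, dtype=arr.dtype).reshape(arr.chunks)` for C-ordered data, then trim edge chunks to the array shape, since Zarr v2 stores edge chunks at full chunk size. For blosc-compressed data, `decompress` could be `blosc.decompress` from the upstream package or the `decode` method of a `numcodecs.Blosc` instance.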
The basic thinking here is that we can probably get 99% of what we want done with a small subset of the Zarr protocol functionality (are filters really that useful? Shouldn't we just store the truncated data, e.g., for floating point data?), using just blosc as the standard compressor in the spec. Then VCZ clients can be smaller and simpler.
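One possible reading of "store the truncated data" is that the writer reduces floating point precision once, before storage, so that readers need no filter step at all. The sketch below shows that idea for a single float32 value by zeroing low mantissa bits. The function name and the `keep_bits` parameter are assumptions for illustration, and this is not necessarily what any existing filter does.

```python
import struct


def truncate_float32(value: float, keep_bits: int) -> float:
    """Zero all but the top `keep_bits` of the 23-bit float32 mantissa."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    # Reinterpret the float32 as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Keep sign, exponent and the leading mantissa bits; clear the rest.
    mask = (0xFFFFFFFF << (23 - keep_bits)) & 0xFFFFFFFF
    (truncated,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return truncated
```

The trailing zero bits make the stored bytes more repetitive, which a general-purpose compressor such as blosc can exploit without the reader knowing that any truncation happened.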