Description
Feature description
GDAL's Zarr driver cannot read Zarr v3 stores that contain arrays with string data types, even when those arrays are not the ones being read. Arrays with the string (variable-length UTF-8) or fixed_length_utf32 data types fail with "Invalid or unsupported format for data_type".
Requested support:
- string: variable-length UTF-8 (Zarr v3 extension; see the extension data types link under Related Links)
- fixed_length_utf32: as far as I know, a non-standard type provided for NumPy compatibility (per the zarr-python docs)
Additional context
Summary
GDAL's Zarr driver does not support reading Zarr v3 string data types (string, fixed_length_utf32, null_terminated_bytes). The driver fails with "Invalid or unsupported format for data_type" when opening arrays with these types.
Environment
- GDAL version: 3.13.0dev-4769b527b275fdb286cba95c8b35bbd131168e54 (nightly)
- Zarr-Python version: 3.1.5
- NumPy version: 1.26.4
- Platform: Ubuntu (Docker image ghcr.io/osgeo/gdal:ubuntu-small-latest)
Tested Configurations
We tested all combinations of:
- String types: Variable-length UTF-8 (string), Fixed-length UTF-32 (fixed_length_utf32), Fixed-length bytes (null_terminated_bytes)
- Storage: Non-sharded (regular chunks) and Sharded (sharding_indexed codec)
Results Summary
| Configuration | Zarr data_type | Sharded | zarr-python | GDAL |
|---|---|---|---|---|
| Variable-length UTF-8 | string | No | OK | FAIL |
| Fixed-length UTF-32 | fixed_length_utf32 | No | OK | FAIL |
| Fixed-length bytes | null_terminated_bytes | No | OK | FAIL |
| Variable-length UTF-8 | string | Yes | OK | FAIL |
| Fixed-length UTF-32 | fixed_length_utf32 | Yes | OK | FAIL |
| Fixed-length bytes | null_terminated_bytes | Yes | OK | FAIL |
All string data types fail in GDAL regardless of sharding configuration.
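To map an arbitrary zarr.json onto the rows of the table above, the following is a minimal sketch that uses only the standard library. The helper names (`entry_name`, `classify`) and the `STRING_TYPES` set are hypothetical and exist only for illustration; they are not part of GDAL or zarr-python. It assumes `data_type` and each codec entry are either a bare string or an object with a `name` key, as in the metadata shown in this issue.

```python
import json

# Data type names reported as failing in this issue (illustrative set).
STRING_TYPES = {"string", "fixed_length_utf32", "null_terminated_bytes"}


def entry_name(entry):
    """Return the name of a data_type or codec entry (string or object form)."""
    if isinstance(entry, dict):
        return entry.get("name")
    return entry


def classify(metadata):
    """Return (data type name, is sharded, is one of the string types)."""
    name = entry_name(metadata["data_type"])
    sharded = any(
        entry_name(codec) == "sharding_indexed"
        for codec in metadata.get("codecs", [])
    )
    return name, sharded, name in STRING_TYPES


# Trimmed-down metadata in the same shape as the fixed_length_utf32 case.
example = json.loads("""
{
  "data_type": {"name": "fixed_length_utf32",
                "configuration": {"length_bytes": 20}},
  "codecs": [{"name": "bytes", "configuration": {"endian": "little"}},
             {"name": "zstd", "configuration": {"level": 0, "checksum": false}}]
}
""")
print(classify(example))
```

Running `classify` over every zarr.json in a store would show which arrays fall into the failing rows before handing the store to GDAL.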
Error Messages
Variable-length UTF-8 strings (string)
Invalid or unsupported format for data_type: string
Fixed-length UTF-32 strings (fixed_length_utf32)
Invalid or unsupported format for data_type: { "name": "fixed_length_utf32", "configuration": { "length_bytes": 20 } }
Fixed-length bytes (null_terminated_bytes)
Invalid or unsupported format for data_type: { "name": "null_terminated_bytes", "configuration": { "length_bytes": 5 } }
Example zarr.json Files
Non-sharded variable-length UTF-8 strings
{
"shape": [3, 3],
"data_type": "string",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [2, 2]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": "",
"codecs": [
{
"name": "vlen-utf8",
"configuration": {}
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
}
Non-sharded fixed-length UTF-32 strings
{
"shape": [3, 3],
"data_type": {
"name": "fixed_length_utf32",
"configuration": {
"length_bytes": 20
}
},
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [2, 2]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": "",
"codecs": [
{
"name": "bytes",
"configuration": {
"endian": "little"
}
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
}
Sharded variable-length UTF-8 strings
{
"shape": [3, 3],
"data_type": "string",
"chunk_grid": {
"name": "regular",
"configuration": {
"chunk_shape": [3, 3]
}
},
"chunk_key_encoding": {
"name": "default",
"configuration": {
"separator": "/"
}
},
"fill_value": "",
"codecs": [
{
"name": "sharding_indexed",
"configuration": {
"chunk_shape": [1, 1],
"codecs": [
{
"name": "vlen-utf8",
"configuration": {}
},
{
"name": "zstd",
"configuration": {
"level": 0,
"checksum": false
}
}
],
"index_codecs": [
{
"name": "bytes",
"configuration": {
"endian": "little"
}
},
{
"name": "crc32c"
}
],
"index_location": "end"
}
}
],
"attributes": {},
"zarr_format": 3,
"node_type": "array",
"storage_transformers": []
}
Comparison with Zarr v2
For reference, we also tested Zarr v2 string support:
- Zarr v2 fixed-length strings (<U5 dtype): GDAL recognizes the array and identifies it as a string type via the Multidimensional API, but the Python bindings fail with "String buffer data type not supported in SWIG bindings". This suggests the underlying C++ code may have some string support that is not exposed to Python.
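For context on the fixed-length case: a NumPy `<U5` element stores five little-endian UTF-32 code points at 4 bytes each, which is 20 bytes and may be why `length_bytes: 20` appears in the fixed_length_utf32 metadata earlier in this issue. The sketch below illustrates that layout with the standard library only. The function names are hypothetical, and the assumption that shorter values are padded with null code points follows NumPy's behaviour rather than anything confirmed for GDAL.

```python
CODE_POINT_BYTES = 4  # UTF-32 uses a fixed 4 bytes per code point


def encode_fixed_utf32(text, length_bytes):
    """Encode text as little-endian UTF-32, null-padded to length_bytes."""
    raw = text.encode("utf-32-le")
    if len(raw) > length_bytes:
        raise ValueError("text does not fit in the fixed-length slot")
    return raw.ljust(length_bytes, b"\x00")


def decode_fixed_utf32(raw):
    """Decode a fixed-length slot and strip trailing null padding."""
    return raw.decode("utf-32-le").rstrip("\x00")


slot = encode_fixed_utf32("bar", 5 * CODE_POINT_BYTES)
print(len(slot), decode_fixed_utf32(slot))
```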
Reproduction Script
#!/usr/bin/env python3
"""Test script to reproduce GDAL Zarr v3 string issues."""
import tempfile
from pathlib import Path

import numpy as np
import zarr
from osgeo import gdal

gdal.UseExceptions()

STRING_DATA = np.array([
    ["hello", "world", "foo"],
    ["bar", "baz", "qux"],
    ["longer_string_here", "short", "medium_len"],
], dtype=object)

with tempfile.TemporaryDirectory() as tmpdir:
    base_path = Path(tmpdir)

    # Create Zarr v3 with variable-length strings
    path = base_path / "test_strings"
    store = zarr.open_group(path, mode="w", zarr_format=3)
    arr = store.create_array(
        "data",
        shape=STRING_DATA.shape,
        chunks=(2, 2),
        dtype="str",
    )
    arr[:] = STRING_DATA

    # Verify zarr-python can read it
    arr_read = zarr.open_array(path / "data", mode="r")
    print(f"zarr-python read: {arr_read[:]}")

    # Try to read with GDAL
    try:
        ds = gdal.Open(f"ZARR:{path}/data")
        print("GDAL opened successfully")
    except Exception as e:
        print(f"GDAL error: {e}")
Expected Behavior
GDAL should be able to read Zarr v3 arrays with string data types:
- string (variable-length UTF-8)
- fixed_length_utf32
- null_terminated_bytes
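As a rough illustration of what reading the variable-length case involves, here is a sketch of decoding a vlen-utf8 chunk after any compression codec (zstd in the examples above) has already been undone. It assumes the layout used by the numcodecs VLenUTF8 codec as I understand it: a 4-byte little-endian item count, then for each item a 4-byte little-endian byte length followed by the UTF-8 bytes. This layout is an assumption to verify against the codec's documentation, and the function names are hypothetical rather than GDAL or zarr-python APIs.

```python
import struct


def encode_vlen_utf8(items):
    """Encode strings in the assumed vlen-utf8 layout (used here for round-trips)."""
    parts = [struct.pack("<I", len(items))]  # header: number of items
    for text in items:
        raw = text.encode("utf-8")
        parts.append(struct.pack("<I", len(raw)))  # per-item byte length
        parts.append(raw)
    return b"".join(parts)


def decode_vlen_utf8(buf):
    """Decode a buffer in the assumed vlen-utf8 layout into a list of str."""
    (count,) = struct.unpack_from("<I", buf, 0)
    offset = 4
    items = []
    for _ in range(count):
        (length,) = struct.unpack_from("<I", buf, offset)
        offset += 4
        items.append(buf[offset:offset + length].decode("utf-8"))
        offset += length
    return items


chunk = encode_vlen_utf8(["hello", "world", "bar", "baz"])
print(decode_vlen_utf8(chunk))
```

The decoded list is flat; reshaping it to the chunk shape (for example 2 x 2) is left to the caller.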
Related Links
- Zarr v3 data types specification: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#data-types
- Zarr extension data types (in progress): https://github.com/zarr-developers/zarr-extensions/tree/main/data-types
- GDAL Zarr driver documentation: https://gdal.org/en/stable/drivers/raster/zarr.html