GDAL Zarr v3 String Data Type Support #13782

@james-willis

Feature description

GDAL's Zarr driver cannot read Zarr v3 stores that contain arrays with string data types, even when those arrays are not the ones being read. Arrays with the string (variable-length UTF-8) or fixed_length_utf32 data type fail with "Invalid or unsupported format for data_type".

Requested support:

string - variable-length UTF-8 (Zarr v3 extension - spec)
fixed_length_utf32 - (AFAIK non-standard type for numpy compat from zarr-python docs)

Additional context

Summary

GDAL's Zarr driver does not support reading Zarr v3 string data types (string, fixed_length_utf32, null_terminated_bytes). The driver fails with "Invalid or unsupported format for data_type" when opening arrays with these types.

Environment

  • GDAL version: 3.13.0dev-4769b527b275fdb286cba95c8b35bbd131168e54 (nightly)
  • Zarr-Python version: 3.1.5
  • NumPy version: 1.26.4
  • Platform: Ubuntu (Docker image ghcr.io/osgeo/gdal:ubuntu-small-latest)

Tested Configurations

We tested all combinations of:

  • String types: Variable-length UTF-8 (string), Fixed-length UTF-32 (fixed_length_utf32), Fixed-length bytes (null_terminated_bytes)
  • Storage: Non-sharded (regular chunks) and Sharded (sharding_indexed codec)

Results Summary

| Configuration         | Zarr data_type        | Sharded | zarr-python | GDAL |
|-----------------------|-----------------------|---------|-------------|------|
| Variable-length UTF-8 | string                | No      | OK          | FAIL |
| Fixed-length UTF-32   | fixed_length_utf32    | No      | OK          | FAIL |
| Fixed-length bytes    | null_terminated_bytes | No      | OK          | FAIL |
| Variable-length UTF-8 | string                | Yes     | OK          | FAIL |
| Fixed-length UTF-32   | fixed_length_utf32    | Yes     | OK          | FAIL |
| Fixed-length bytes    | null_terminated_bytes | Yes     | OK          | FAIL |

All string data types fail in GDAL regardless of sharding configuration.

Error Messages

Variable-length UTF-8 strings (string)

Invalid or unsupported format for data_type: string

Fixed-length UTF-32 strings (fixed_length_utf32)

Invalid or unsupported format for data_type: { "name": "fixed_length_utf32", "configuration": { "length_bytes": 20 } }

Fixed-length bytes (null_terminated_bytes)

Invalid or unsupported format for data_type: { "name": "null_terminated_bytes", "configuration": { "length_bytes": 5 } }
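The three messages show the two shapes a v3 data_type value takes in these files: a bare name ("string") or an object with a name and a configuration. Below is a minimal sketch of normalizing both shapes with the Python standard library. The helper names (describe_data_type, is_string_type, utf32_code_points) are hypothetical, the set of string type names is only the three from this report, and none of this reflects how GDAL itself parses the field.

```python
import json

# Only the three string-like type names covered in this report.
STRING_TYPE_NAMES = {"string", "fixed_length_utf32", "null_terminated_bytes"}


def describe_data_type(data_type):
    """Return (name, configuration) for a bare-name or object data_type."""
    if isinstance(data_type, str):
        return data_type, {}
    return data_type["name"], data_type.get("configuration", {})


def is_string_type(data_type):
    """True if the data_type is one of the string types from this report."""
    name, _ = describe_data_type(data_type)
    return name in STRING_TYPE_NAMES


def utf32_code_points(configuration):
    """UTF-32 stores each code point in 4 bytes, so capacity is bytes // 4."""
    return configuration["length_bytes"] // 4


# data_type values copied from the error messages, plus one numeric type.
EXAMPLES = [
    '"string"',
    '{"name": "fixed_length_utf32", "configuration": {"length_bytes": 20}}',
    '{"name": "null_terminated_bytes", "configuration": {"length_bytes": 5}}',
    '"float32"',
]

for text in EXAMPLES:
    data_type = json.loads(text)
    name, configuration = describe_data_type(data_type)
    print(name, configuration, is_string_type(data_type))
```

With the fixed_length_utf32 configuration above, utf32_code_points gives 5, which lines up with the 5-character <U5 dtype used in the Zarr v2 comparison.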

Example zarr.json Files

Non-sharded variable-length UTF-8 strings

{
  "shape": [3, 3],
  "data_type": "string",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [2, 2]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": "",
  "codecs": [
    {
      "name": "vlen-utf8",
      "configuration": {}
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}

Non-sharded fixed-length UTF-32 strings

{
  "shape": [3, 3],
  "data_type": {
    "name": "fixed_length_utf32",
    "configuration": {
      "length_bytes": 20
    }
  },
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [2, 2]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": "",
  "codecs": [
    {
      "name": "bytes",
      "configuration": {
        "endian": "little"
      }
    },
    {
      "name": "zstd",
      "configuration": {
        "level": 0,
        "checksum": false
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}

Sharded variable-length UTF-8 strings

{
  "shape": [3, 3],
  "data_type": "string",
  "chunk_grid": {
    "name": "regular",
    "configuration": {
      "chunk_shape": [3, 3]
    }
  },
  "chunk_key_encoding": {
    "name": "default",
    "configuration": {
      "separator": "/"
    }
  },
  "fill_value": "",
  "codecs": [
    {
      "name": "sharding_indexed",
      "configuration": {
        "chunk_shape": [1, 1],
        "codecs": [
          {
            "name": "vlen-utf8",
            "configuration": {}
          },
          {
            "name": "zstd",
            "configuration": {
              "level": 0,
              "checksum": false
            }
          }
        ],
        "index_codecs": [
          {
            "name": "bytes",
            "configuration": {
              "endian": "little"
            }
          },
          {
            "name": "crc32c"
          }
        ],
        "index_location": "end"
      }
    }
  ],
  "attributes": {},
  "zarr_format": 3,
  "node_type": "array",
  "storage_transformers": []
}
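In the sharded example the vlen-utf8 codec is nested inside the sharding_indexed configuration rather than sitting at the top of the codecs list, so a reader has to look one level down to find how elements are serialized. Below is a rough sketch of that lookup, assuming only the codec names that appear in the examples above. find_serializer and codec_name are hypothetical helpers, not part of zarr-python or GDAL.

```python
# Array-to-bytes codecs seen in the examples above; not an exhaustive list.
ARRAY_TO_BYTES = {"vlen-utf8", "bytes"}


def codec_name(codec):
    """Codec entries are objects in these files; also accept a bare name."""
    return codec if isinstance(codec, str) else codec["name"]


def find_serializer(codecs, sharded=False):
    """Return (array_to_bytes_codec_name, sharded) for a v3 codecs list."""
    for codec in codecs:
        name = codec_name(codec)
        if name == "sharding_indexed" and isinstance(codec, dict):
            # The per-chunk pipeline lives in the sharding configuration.
            inner = codec["configuration"]["codecs"]
            return find_serializer(inner, sharded=True)
        if name in ARRAY_TO_BYTES:
            return name, sharded
    return None, sharded


# Codec lists copied from the non-sharded and sharded string examples.
ZSTD = {"name": "zstd", "configuration": {"level": 0, "checksum": False}}
NON_SHARDED = [{"name": "vlen-utf8", "configuration": {}}, ZSTD]
SHARDED = [
    {
        "name": "sharding_indexed",
        "configuration": {
            "chunk_shape": [1, 1],
            "codecs": [{"name": "vlen-utf8", "configuration": {}}, ZSTD],
            "index_codecs": [
                {"name": "bytes", "configuration": {"endian": "little"}},
                {"name": "crc32c"},
            ],
            "index_location": "end",
        },
    }
]

print(find_serializer(NON_SHARDED))
print(find_serializer(SHARDED))
```

Both lists resolve to the same vlen-utf8 serializer and differ only in the sharded flag.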

Comparison with Zarr v2

For reference, we also tested Zarr v2 string support:

  • Zarr v2 fixed-length strings (<U5 dtype): GDAL recognizes the array and identifies it as a string type via the Multidimensional API, but the Python bindings fail with "String buffer data type not supported in SWIG bindings". This suggests the underlying C++ code may have some string support, but it's not exposed to Python.
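The data type class of an array can be inspected through the Multidimensional API without reading any values, which avoids the SWIG string-buffer limitation. Below is a minimal sketch of that check. classify_arrays is a hypothetical helper, the import is guarded so the snippet still loads where GDAL is not installed, and how each GDAL build reports a v3 string array at this step is not something this sketch assumes.

```python
def classify_arrays(path):
    """Return [(array_name, status)] for the root group, or None.

    status is "string", "other", or "error: ..." per array. None means
    GDAL is not importable or the dataset could not be opened.
    """
    try:
        from osgeo import gdal  # optional dependency
    except ImportError:
        return None

    gdal.UseExceptions()
    try:
        ds = gdal.OpenEx(str(path), gdal.OF_MULTIDIM_RASTER)
    except RuntimeError:
        return None
    if ds is None:
        return None

    root = ds.GetRootGroup()
    results = []
    for name in root.GetMDArrayNames() or []:
        try:
            array = root.OpenMDArray(name)
        except RuntimeError as exc:
            results.append((name, f"error: {exc}"))
            continue
        if array is None:
            results.append((name, "error: could not open array"))
            continue
        # GEDTC_STRING marks string-typed arrays in the multidim API.
        is_string = array.GetDataType().GetClass() == gdal.GEDTC_STRING
        results.append((name, "string" if is_string else "other"))
    return results
```

Calling classify_arrays on the v2 store directory should show whether GDAL reports the array as a string type, independent of whether the Python bindings can read its values.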

Reproduction Script

#!/usr/bin/env python3
"""Test script to reproduce GDAL Zarr v3 string issues."""

import tempfile
from pathlib import Path

import numpy as np
import zarr
from osgeo import gdal

gdal.UseExceptions()

STRING_DATA = np.array([
    ["hello", "world", "foo"],
    ["bar", "baz", "qux"],
    ["longer_string_here", "short", "medium_len"],
], dtype=object)

with tempfile.TemporaryDirectory() as tmpdir:
    base_path = Path(tmpdir)
    
    # Create Zarr v3 with variable-length strings
    path = base_path / "test_strings"
    store = zarr.open_group(path, mode="w", zarr_format=3)
    arr = store.create_array(
        "data",
        shape=STRING_DATA.shape,
        chunks=(2, 2),
        dtype="str",
    )
    arr[:] = STRING_DATA
    
    # Verify zarr-python can read it
    arr_read = zarr.open_array(path / "data", mode="r")
    print(f"zarr-python read: {arr_read[:]}")
    
    # Try to read with GDAL
    try:
        ds = gdal.Open(f"ZARR:{path}/data")
        print("GDAL opened successfully")
    except Exception as e:
        print(f"GDAL error: {e}")

Expected Behavior

GDAL should be able to read Zarr v3 arrays with string data types:

  1. string (variable-length UTF-8)
  2. fixed_length_utf32
  3. null_terminated_bytes
