Add guidance on how to best use STAC and Zarr together to the Cloud-Optimized Geospatial Formats Guide #134

Open · maxrjones opened this issue Feb 10, 2025 · 8 comments

@maxrjones

Duplicate of NASA-IMPACT/veda-odd#81 at @wildintellect's request

@gadomski's talk at the Pangeo showcase prompted a great question and subsequent discussion about how best to use Zarr with STAC. Questions about using Zarr with STAC have come up many times, so I think guidance on this topic would really help the migration to cloud-native file formats. It would be great to use notes from today's discussion as the starting point for a PR to the Cloud-Optimized Geospatial Formats Guide. The PR would add a framework for deciding when/how to use Zarr with STAC. We could share the PR with the broader STAC, Pangeo, and CNG communities for comments, feedback, and additions.

@jsignell

Ok here is what I have thought through so far:

STAC (SpatioTemporal Asset Catalogs) is a specification for defining and searching any type of data that has spatial and temporal dimensions. STAC has seen significant adoption in the earth observation community. Zarr is a specification for storing groups of cloud-optimized arrays. Zarr has been adopted by the earth modeling community (led by Pangeo). Both STAC and Zarr offer a flexible nested structure with arbitrary metadata at a variety of levels -- for STAC: catalog, collection, item, asset; for Zarr: group, array (and maybe in the future chunk, with icechunk/kerchunk?). This flexibility has contributed to their popularity, but it can also make it hard to tell what they are designed to be particularly good at.

Comparison table

| STAC | Zarr |
| --- | --- |
| for data with spatial and temporal dimensions | for groups of arrays |
| supports arbitrary metadata for catalogs, collections, items, and assets | supports arbitrary metadata for groups and arrays |
| good at search and discovery | good at filtering within a dataset |
| searching returns items | filtering returns chunks |
| stores metadata separately from data | stores metadata alongside data (except when virtualized) |

Note: This mostly has to do with the STAC spec and the STAC API spec, but STAC also has extensions, which include an ontology of domain-specific metadata as well as a mechanism for adding new extensions. This is similar to the role CF conventions play for array data.

Some additional thoughts:

  • STAC makes it easy to separate the concerns of who can search and find data from who can read data. This is useful when you want to allow access only via defined interfaces (like titiler or tipg).
  • In order to read a subset of a file using STAC you always need to touch that file -- for instance, you need to read the header of a COG file. This is different from a virtualized Zarr dataset, where you can access a specific chunk directly.
  • Zarr is good for when data tends to be aligned.
    • it is an antipattern to put all of your data into one Zarr store (Tom Nicholas)
    • put things in one Icechunk store that you expect to update together (Icechunk docs)
  • STAC is good for level 1 and 2 data (Pete Gadomski)

Interface

People familiar with xarray are likely to want to get to xarray as quickly as possible and don't mind doing more filtering once they are there.

People more familiar with STAC are likely to want to use STAC tooling to do searching, so it might make sense to push STAC filters down into Zarr (maybe using a tool like xpublish?).

So there should be good tooling to go back and forth between STAC and Zarr.

Here is what exists so far:

Accessing Zarr with STAC-like patterns

Think of this as an n-dimensional data cube of regularly gridded model output.

For data producers:

  • If you have an xarray object you can write out STAC to represent it (xstac); see the sketch below
    • Exposes variables to STAC using the datacube extension
    • Includes kwargs that let xarray know how to open the dataset (xarray extension)
    • Includes ideas for how to store kerchunk references in STAC
      • Comment from Tom Augspurger: we want to avoid JSON strings inside pgstac as much as possible so that we can take advantage of pgstac's dehydration
  • FUTURE: (once STAC spec v1.1.0 is fully incorporated in tooling) take advantage of inheritance to reduce duplication at the item and asset levels.
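
As a rough illustration of the producer workflow, here is a minimal sketch of generating a STAC Collection from an xarray dataset with xstac. The store path, template fields, and dimension names are hypothetical, and the exact template contents and keyword arguments may vary between xstac versions:

```python
import xarray as xr
import xstac

# Open the (hypothetical) Zarr store we want to catalog.
ds = xr.open_zarr("s3://example-bucket/model-output.zarr")

# A bare-bones Collection template; xstac fills in the datacube
# extension fields (dimensions, variables) from the dataset itself.
template = {
    "type": "Collection",
    "id": "example-model-output",
    "description": "Regularly gridded model output",
    "license": "CC-BY-4.0",
    "links": [],
}

collection = xstac.xarray_to_stac(
    ds,
    template,
    temporal_dimension="time",
    x_dimension="lon",
    y_dimension="lat",
    reference_system=4326,
)
print(collection.to_dict())
```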

For data consumers:

  • If the Zarr dataset is already cataloged in STAC following xstac conventions, ideally you can go straight to xarray using the xpystac backend (see the sketch below)
  • If the Zarr dataset is not cataloged in STAC, I don't think anything currently exists. If we want to pursue this there are two paths: translate STAC requests on the fly directly into xarray, or do a one-time step similar to virtualization to generate an ad-hoc static STAC object.
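
For the cataloged case, a minimal sketch using xpystac, which extends xarray.open_dataset to accept pystac objects. The catalog URL, collection id, and asset key here are hypothetical:

```python
import pystac_client
import xarray as xr  # with xpystac installed, xr.open_dataset accepts pystac objects

# Hypothetical STAC API and collection; substitute a real catalog.
catalog = pystac_client.Client.open("https://example.com/stac/v1")
collection = catalog.get_collection("example-model-output")

# By convention the Zarr store is exposed as a collection-level asset.
asset = collection.assets["zarr"]

# xpystac reads the open kwargs from the asset's xarray extension
# metadata and opens the store lazily.
ds = xr.open_dataset(asset)
```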

Accessing STAC with xarray patterns

Think of this as a bunch of COGs at the same place over a period of time.

For data producers:

  • Take advantage of roles to make it clear which assets should be included in a data cube.
  • FUTURE: Potentially provide a virtual reference file to make the intended stacking explicit

For data consumers:

  • If the dataset is array-like, ideally you can go straight to xarray using the xpystac backend, or odc-stac or stackstac directly (see the sketch below).
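
A minimal sketch of stacking COG-backed items into a data cube with pystac-client and odc-stac; the API URL, collection name, bands, and search parameters are placeholders:

```python
import odc.stac
import pystac_client

# Hypothetical STAC API and search parameters.
catalog = pystac_client.Client.open("https://example.com/stac/v1")
items = catalog.search(
    collections=["example-optical-imagery"],
    bbox=[-121.0, 38.0, -120.0, 39.0],
    datetime="2024-06",
).item_collection()

# Lazily assemble the items into an xarray dataset. Note that odc-stac
# still reads each COG's header to work out the grid, which is one of
# the per-file costs discussed elsewhere in this thread.
ds = odc.stac.load(items, bands=["red", "green", "blue"], chunks={})
```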

Storing results:

This is a newer area of development. The core idea is that instead of repeatedly querying, you could store the results for easy access (see the sketch after the list below).

  • Store the result of searching STAC:
    • stac-geoparquet (using stacrs)
    • FUTURE: (maybe, if array-like): Icechunk or Kerchunk (using a VirtualiZarr STAC reader + xpystac)
  • Store the result of filtering Zarr:
    • Icechunk or Kerchunk (using VirtualiZarr)
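
A minimal sketch of both patterns, using stac-geoparquet to persist search results and VirtualiZarr to persist virtual references. File paths and search parameters are hypothetical, and both libraries' APIs are still evolving:

```python
import pystac_client
import stac_geoparquet
from virtualizarr import open_virtual_dataset

# 1. Persist the result of a STAC search as geoparquet.
catalog = pystac_client.Client.open("https://example.com/stac/v1")
items = [item.to_dict() for item in catalog.search(collections=["example"]).items()]
df = stac_geoparquet.to_geodataframe(items)
df.to_parquet("search-results.parquet")

# 2. Persist virtual references to an archival file as kerchunk JSON.
vds = open_virtual_dataset("s3://example-bucket/archive/file0001.nc")
vds.virtualize.to_kerchunk("file0001-refs.json", format="json")
```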

Based on the following discussions:

Other Comments:

STAC is also an extremely verbose representation of this metadata for regularly gridded datasets. Virtualization with Kerchunk or VirtualiZarr is often a much more efficient representation of the individual underlying files, and xarray over a virtual dataset provides an excellent “query” experience.

Sean Harkins

Same theme that comes up often, of where the line is between the external metadata that belongs in a catalog versus the internal structure of a dataset. Zarr, for instance, has a hierarchy with metadata at every level -- should they be duplicated in a catalog for the sake of searchability? Sometimes...

Martin Durant

jsignell self-assigned this Feb 12, 2025
@gadomski

@jsignell one thing that I think is important (though I'm not sure if it really is) is that Zarr describes how the data should be stored and how the metadata should look, whereas STAC only describes how the metadata should look. A STAC item/asset can point to anything, whereas Zarr has opinions on how the data are actually laid out on disk.

Please correct me if I'm wrong about this, as I'm much less familiar with Zarr. Mostly working off the chunk definition in the spec, etc.

@jsignell

> Zarr describes how the data should be stored and how the metadata should look, whereas STAC only describes how the metadata should look.

Oh yeah I think this is definitely right. That is a core difference. Zarr is a file format and STAC is not.

@maxrjones

A few notes to add to the discussion:

Centering the decision-making framework around establishing priorities

I find it's helpful to first encourage data providers/catalogers to be explicit about their goals, which impacts the decisions they might make based on general recommendations. This also helps make it clear in the guide that it's usually not possible to optimize for all benefits at the same time. Examples of common motivations include:

  1. Provide a catalog that enables the simplest possible access patterns for science users from <insert-languages>
  2. Provide a catalog that is compatible with <insert-application>
  3. Minimize the number of GET requests for accessing datasets with <insert characteristics> (for Zarr, this is likely large-scale multi-dimensional datasets)
  4. Minimize data transfer for accessing subsets of datasets with <insert characteristics> (for Zarr, this is likely large-scale multi-dimensional datasets)
  5. Minimize the amount of infrastructure required to maintain the catalog
  6. Minimize the cost of generating the catalog

In the context of this guide, I think it would be great to show how different choices impact the amount of work required to generate catalogs, the number of GET requests needed for common access patterns, and the amount of data transferred for different access patterns. I'd also like to note where development needs to happen to drive further interoperability.

How people typically organize datacubes with STAC

This is the most common approach that I've seen for organizing datacubes with STAC:

  1. Define a STAC collection for each dataset (i.e., set of data collected using the same platform, algorithms, model, etc).
  2. Define a STAC item for each unique temporal and regional extent within that dataset.
  3. Include in each STAC item assets for each variable or band, including href links to the files containing that data (a minimal pystac sketch follows).
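
A minimal pystac sketch of steps 2 and 3; the IDs, geometry, datetime, and hrefs are all hypothetical:

```python
import datetime
import pystac

# One item per unique temporal and regional extent (step 2).
item = pystac.Item(
    id="example-scene-20240601",
    geometry={
        "type": "Polygon",
        "coordinates": [[
            [-121.0, 38.0], [-120.0, 38.0], [-120.0, 39.0],
            [-121.0, 39.0], [-121.0, 38.0],
        ]],
    },
    bbox=[-121.0, 38.0, -120.0, 39.0],
    datetime=datetime.datetime(2024, 6, 1, tzinfo=datetime.timezone.utc),
    properties={},
)

# One asset per variable or band, each pointing at its own file (step 3).
for band in ["red", "nir"]:
    item.add_asset(
        band,
        pystac.Asset(
            href=f"https://example.com/scenes/20240601/{band}.tif",
            media_type=pystac.MediaType.COG,
            roles=["data"],
        ),
    )
```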

While this enables great workflows, this approach still has downsides:

  • Libraries like stackstac and odc-stac need to scan the metadata for each file in order to lazily load a data cube, which increases load time due to additional GET requests and larger payload sizes.
  • Users frequently get confused about which tools to use (e.g., stackstac vs odc-stac)
  • For simple concatenation-based datacubes (in contrast to composites/mosaics), odc-stac and stackstac involve more complicated APIs than necessary.

Where to go from here (one portion):

Demonstrate the strengths and weaknesses of different experimental approaches to dealing with the downsides listed above, including:

  1. Embedding byte ranges and keys or chunk manifests in the item properties
  2. Providing links to virtual datasets (Kerchunk or Icechunk) via collection assets (see the sketch below)
  3. Providing links to virtual datasets via STAC items
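
As a rough sketch of approach 2, here is a Kerchunk reference file exposed as a collection-level asset with pystac. The href, asset key, roles, and extent values are illustrative rather than standardized:

```python
import datetime
import pystac

collection = pystac.Collection(
    id="example-model-output",
    description="Regularly gridded model output",
    extent=pystac.Extent(
        spatial=pystac.SpatialExtent([[-180.0, -90.0, 180.0, 90.0]]),
        temporal=pystac.TemporalExtent(
            [[datetime.datetime(2000, 1, 1, tzinfo=datetime.timezone.utc), None]]
        ),
    ),
)

# Point at a virtual dataset covering the whole collection; consumers can
# open it directly (e.g., via fsspec's reference filesystem) instead of
# touching every underlying file.
collection.add_asset(
    "kerchunk-references",
    pystac.Asset(
        href="s3://example-bucket/example-model-output/combined-refs.json",
        media_type="application/json",
        roles=["references"],
        title="Kerchunk reference file for the full collection",
    ),
)
```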

@jsignell

> This is the most common approach that I've seen for organizing datacubes with STAC:
>
>   1. Define a STAC collection for each dataset (i.e., set of data collected using the same platform, algorithms, model, etc).
>   2. Define a STAC item for each unique temporal and regional extent within that dataset.
>   3. Include in each STAC item assets for each variable or band, including href links to the files containing that data.

When I first read this I thought you were saying that you'd seen people using STAC + Zarr this way, but on second read I think this is specifically describing the STAC + COG workflow. Just want to confirm that that is what you are talking about here @maxrjones?

@jsignell

I am trying to decide how much to focus on STAC + COG vs STAC + Zarr/VirtualiZarr, as opposed to discussing the different ways you can do STAC + Zarr/VirtualiZarr.

I guess it's important context-setting to mention the common STAC + COG setup.

@wildintellect

STAC + COG could also be its own page, and you can just reference it. I think the focus should be on STAC + Zarr/VirtualiZarr, maybe with some minor notes on how this differs from the existing "traditional" model.

@maxrjones

> > This is the most common approach that I've seen for organizing datacubes with STAC:
> >
> >   1. Define a STAC collection for each dataset (i.e., set of data collected using the same platform, algorithms, model, etc).
> >   2. Define a STAC item for each unique temporal and regional extent within that dataset.
> >   3. Include in each STAC item assets for each variable or band, including href links to the files containing that data.
>
> When I first read this I thought you were saying that you'd seen people using STAC + Zarr this way, but on second read I think this is specifically describing the STAC + COG workflow. Just want to confirm that that is what you are talking about here @maxrjones?

IIRC, (1) and (2) are common for both Zarr and COG, whereas (3) is unique to COGs, since most Zarr stores will have multiple variables in the same store unless it's a 1:1 transfer from NetCDF or COG.

I would look at the CIL CMIP6 data on the Planetary Computer for an example of Zarr in STAC. I'm sure there are others, but that's one I'm familiar with due to having worked on downscaling applications.
