Add guidance on how to best use STAC and Zarr together to the Cloud-Optimized Geospatial Formats Guide #134
Comments
Ok here is what I have thought through so far:

STAC (SpatioTemporal Asset Catalogs) is a specification for defining and searching any type of data that has spatial and temporal dimensions. STAC has seen significant adoption in the earth observation community.

Zarr is a specification for storing groups of cloud-optimized arrays. Zarr has been adopted by the earth modeling community (led by Pangeo).

Both STAC and Zarr offer a flexible nested structure with arbitrary metadata at a variety of levels. For STAC: catalog, collection, item, asset. For Zarr: group, array (and maybe in the future chunk, with icechunk/kerchunk??). This flexibility has contributed to their popularity, but can also make it hard to tell what they are designed to be particularly good at.

Comparison table
Note: This mostly has to do with the STAC spec and the STAC API spec, but STAC also has extensions, which include an ontology of domain-specific metadata as well as a mechanism for adding new extensions. This is similar to CF conventions on STAC but

Some additional thoughts:
Interface

People familiar with xarray are likely to want to get to xarray as quickly as possible and don't mind doing more filtering once they are there. People more familiar with STAC are likely to want to use STAC tooling to do searching, so it might make sense to push STAC filters down into Zarr (maybe using a tool like xpublish?). So there should be good tooling to go back and forth between STAC and Zarr. Here is what exists so far:

Accessing Zarr with STAC-like patterns

Think of this as an nd data cube of regularly gridded model output.

For data producers:
For data consumers:
Accessing STAC with xarray patterns

Think of this as a bunch of COGs at the same place over a period of time.

For data producers:
For data consumers:
Storing results:

This is a newer area of development. The core idea is that instead of repeatedly querying, you could store the results for easy access.
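As a rough illustration of the "store the results" idea, here is a minimal sketch that writes a list of STAC items to disk as an ItemCollection (a GeoJSON FeatureCollection) and reads it back, so a later session can skip the search. The function names, the file name, and the two placeholder items are assumptions made up for this example, not part of any existing tool.

```python
import json
import os
import tempfile


def save_item_collection(items, path):
    """Write STAC items to `path` as an ItemCollection (GeoJSON FeatureCollection)."""
    with open(path, "w") as f:
        json.dump({"type": "FeatureCollection", "features": items}, f)


def load_items(path):
    """Read the items back from a stored ItemCollection."""
    with open(path) as f:
        return json.load(f)["features"]


# Hypothetical search results; a real search would return full STAC items.
results = [
    {"type": "Feature", "id": "item-a", "properties": {"datetime": "2020-01-01T00:00:00Z"}},
    {"type": "Feature", "id": "item-b", "properties": {"datetime": "2020-01-02T00:00:00Z"}},
]

path = os.path.join(tempfile.mkdtemp(), "search-results.json")
save_item_collection(results, path)
restored = load_items(path)
```

Plain JSON is only the simplest possible choice here; the same pattern applies to whatever storage format ends up being recommended.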
Based off of the following discussions:
Other Comments:
Sean Harkins
Martin Durant |
@jsignell one thing that I think is important, but I'm not sure if it really is, is that Zarr describes how the data should be stored and how the metadata should look, whereas STAC only describes how the metadata should look. A STAC item/asset can point to anything, whereas Zarr has opinions on how the data are actually laid out on disk. Please correct me if I'm wrong about this; I'm much less familiar with Zarr and am mostly working off the chunk definition in the spec, etc. |
Oh yeah I think this is definitely right. That is a core difference. Zarr is a file format and STAC is not. |
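A minimal sketch of that difference, using only plain dictionaries. All ids, hrefs, shapes, and chunk sizes are made up for illustration, the media type is the one commonly used for Zarr assets in STAC, and `chunk_key` is a hypothetical helper that follows the Zarr v2 default chunk-key layout.

```python
# A STAC item is metadata only: the asset href can point at anything.
stac_item = {
    "type": "Feature",
    "stac_version": "1.0.0",
    "id": "example-item",  # hypothetical id
    "geometry": None,
    "properties": {"datetime": "2020-01-01T00:00:00Z"},
    "links": [],
    "assets": {
        "data": {
            "href": "s3://example-bucket/example.zarr",  # hypothetical location
            "type": "application/vnd+zarr",
        }
    },
}

# Zarr v2 array metadata (the contents of a `.zarray` document) also fixes
# how the data itself is split into chunks in the store.
zarray = {
    "zarr_format": 2,
    "shape": [365, 180, 360],  # hypothetical (time, lat, lon) array
    "chunks": [30, 90, 90],
    "dtype": "<f4",
    "compressor": None,
    "fill_value": None,
    "filters": None,
    "order": "C",
}


def chunk_key(index, chunks, separator="."):
    """Return the Zarr v2 key of the chunk that holds the element at `index`."""
    return separator.join(str(i // c) for i, c in zip(index, chunks))


# The element at (time=45, lat=100, lon=200) lives in the chunk stored under this key.
key = chunk_key((45, 100, 200), zarray["chunks"])
```

Nothing in the STAC item says where any individual value lives, while the Zarr metadata is enough to compute the exact key to fetch.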
A few notes to add to the discussion:

Centering the decision-making framework around establishing priorities

I find it's helpful to first encourage data providers/catalogers to be explicit about their goals, which impacts the decisions they might make based on general recommendations. This also helps make it clear in the guide that it's usually not possible to optimize for all benefits at the same time. Examples of common motivations include:
In the context of this guide, I think it would be great to show how different choices impact the amount of work required to generate catalogs, the number of GET requests needed for common access patterns, and the amount of data transferred for different access patterns. I'd also like to note where development needs to happen to drive further interoperability.

How people typically organize datacubes with STAC

This is the most common approach that I've seen for organizing datacubes with STAC:
While this enables great work, this approach still has downsides:
Where to go from here (one portion):

Demonstrate the strengths and weaknesses of different experimental approaches to dealing with the downsides listed above, including:
|
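The GET-request and data-transfer comparison suggested above could be roughed out along these lines. This is a back-of-the-envelope sketch, not a benchmark: it assumes one GET per chunk (no request coalescing or sharding, metadata requests ignored) and uncompressed chunks, and the array shape, chunkings, and function names are all hypothetical.

```python
import math


def chunks_touched(selection, chunks):
    """Count the chunks intersected by a selection.

    selection: one half-open (start, stop) index range per dimension
    chunks: chunk length per dimension
    """
    count = 1
    for (start, stop), size in zip(selection, chunks):
        first = start // size
        last = (stop - 1) // size
        count *= last - first + 1
    return count


def bytes_transferred(selection, chunks, itemsize=4):
    """Upper bound on bytes fetched, assuming whole uncompressed chunks are read."""
    return chunks_touched(selection, chunks) * math.prod(chunks) * itemsize


# Hypothetical (time, lat, lon) array of shape (365, 180, 360), chunked two ways.
by_time = (365, 10, 10)  # favors time series at a point
by_map = (1, 180, 360)   # favors one full map per time step

time_series = [(0, 365), (90, 91), (180, 181)]  # every time step at one pixel
single_map = [(0, 1), (0, 180), (0, 360)]       # the whole grid at one time step

for name, sel in [("time series", time_series), ("single map", single_map)]:
    for label, chunking in [("by_time", by_time), ("by_map", by_map)]:
        print(name, label, chunks_touched(sel, chunking), bytes_transferred(sel, chunking))
```

Even this crude model shows why no single chunking is best for every access pattern: the layout that needs one request for a time series needs hundreds for a single map, and vice versa.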
When I first read this I thought you were saying that you'd seen people using STAC + Zarr this way, but on second read I think this is specifically describing the STAC + COG workflow. Just want to confirm that that is what you are talking about here @maxrjones? |
I am trying to decide how much to focus on STAC + COG vs STAC + Zarr/VirtualiZarr, as opposed to discussing the different ways you can do STAC + Zarr/VirtualiZarr. I guess it's important context-setting to mention the common STAC + COG setup. |
|
IIRC (1) and (2) are common for both Zarr and COG, whereas (3) is unique to COGs, since most Zarr stores will have multiple variables in the same store if it's not a 1:1 transfer from NetCDF or COG. I would look at the CIL CMIP6 data on PC for an example of Zarr in STAC. I'm sure there are others, but that's one I'm familiar with due to having worked on downscaling applications. |
Duplicate of NASA-IMPACT/veda-odd#81 at @wildintellect's request
@gadomski's talk at the Pangeo showcase prompted a great question and subsequent discussion about how best to use Zarr with STAC. Questions about using Zarr with STAC have come up many times. Therefore, I think that guidance on this topic would really help the migration to cloud-native file formats. I think it would be great to use notes from today's discussion as the starting point for a PR to the Cloud-Optimized Geospatial Formats Guide. The PR would add a framework for deciding when/how to use Zarr with STAC. We could share the PR with the broader STAC, Pangeo, and CNG communities for comments, feedback, and additions.