Commit c890147

Update notebooks to use VirtualiZarr (#69)

Authored on Dec 2, 2024 · 1 parent f70420e · commit c890147

File tree: 8 files changed (+328 -503 lines changed)

README.md (+22 -22)

@@ -1,33 +1,33 @@
 <img src="thumbnail.png" alt="thumbnail" width="300"/>
 
-# Kerchunk Cookbook
+# Virtual Zarr Cookbook (Kerchunk and VirtualiZarr)
 
 [![nightly-build](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml/badge.svg)](https://github.com/ProjectPythia/kerchunk-cookbook/actions/workflows/nightly-build.yaml)
 [![Binder](https://binder.projectpythia.org/badge_logo.svg)](https://binder.projectpythia.org/v2/gh/ProjectPythia/kerchunk-cookbook/main?labpath=notebooks)
 [![DOI](https://zenodo.org/badge/588661659.svg)](https://zenodo.org/badge/latestdoi/588661659)
 
-This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/)
-library to access archival data formats as if they were
-ARCO (Analysis-Ready-Cloud-Optimized) data.
+This Project Pythia Cookbook covers using the [Kerchunk](https://fsspec.github.io/kerchunk/), [VirtualiZarr](https://virtualizarr.readthedocs.io/en/latest/index.html), and [Zarr-Python](https://zarr.readthedocs.io/en/stable/) libraries to access archival data formats as if they were ARCO (Analysis-Ready-Cloud-Optimized) data.
 
 ## Motivation
 
-The `Kerchunk` library allows you to access chunked and compressed
+The `Kerchunk` library pioneered access to chunked and compressed
 data formats (such as NetCDF3, HDF5, GRIB2, TIFF & FITS), many of
 which are the primary data formats for many data archives, as if
 they were in ARCO formats such as Zarr which allows for parallel,
 chunk-specific access. Instead of creating a new copy of the dataset
 in the Zarr spec/format, `Kerchunk` reads through the data archive
 and extracts the byte range and compression information of each
-chunk, then writes that information to a .json file (or alternate
-backends in future releases). For more details on how this process
-works please see this page on the
-[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html).
-These summary files can then be combined to generate a `Kerchunk`
-reference for that dataset, which can be read via
-[Zarr](https://zarr.readthedocs.io) and
+chunk, then writes that information to a "virtual Zarr store" using a
+JSON or Parquet "reference file". The `VirtualiZarr`
+library provides a simple way to create these "virtual stores" using familiar
+`xarray` syntax. Lastly, the `icechunk` library provides a new way to store and re-use these references.
+
+These virtual Zarr stores can be re-used and read via [Zarr](https://zarr.readthedocs.io) and
 [Xarray](https://docs.xarray.dev/en/stable/).
 
+For more details on how this process works, please see this page in the
+[Kerchunk docs](https://fsspec.github.io/kerchunk/detail.html).
+
 ## Authors
 
 [Raphael Hagen](https://github.com/norlandrhagen)
@@ -48,24 +48,24 @@ the creator of `Kerchunk` and the
 This cookbook is broken up into two sections,
 Foundations and Example Notebooks.
 
-### Section 1 Foundations
+### Section 1 - Foundations
 
 In the `Foundations` section we will demonstrate
-how to use `Kerchunk` to create reference sets
+how to use `Kerchunk` and `VirtualiZarr` to create reference files
 from single file sources, as well as to create
-multi-file virtual datasets from collections of files.
+multi-file virtual Zarr stores from collections of files.
 
-### Section 2 Generating Reference Files
+### Section 2 - Generating Virtual Zarr Stores
 
-The notebooks in the `Generating Reference Files` section
-demonstrate how to use `Kerchunk` to create
+The notebooks in the `Generating Virtual Zarr Stores` section
+demonstrate how to use `Kerchunk` and `VirtualiZarr` to create
 datasets for all the supported file formats.
-`Kerchunk` currently supports NetCDF3,
-NetCDF4/HDF5, GRIB2, TIFF (including CoG).
+These libraries currently support virtualizing NetCDF3,
+NetCDF4/HDF5, GRIB2, TIFF (including COG).
 
-### Section 3 Using Pre-Generated References
+### Section 3 - Using Virtual Zarr Stores
 
-The `Pre-Generated References` section contains notebooks demonstrating how to load existing references into `Xarray` and `Xarray-Datatree`, generated coordinates for GeoTiffs using `xrefcoord` and plotting using `Hvplot Datashader`.
+The `Using Virtual Zarr Stores` section contains notebooks demonstrating how to load existing references into `Xarray`, generating coordinates for GeoTiffs using `xrefcoord`, and plotting using `Hvplot Datashader`.
 
 ## Running the Notebooks
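
The README text above describes the workflow only in prose: scan an archival file, record each chunk's byte range in a reference file, and read the result back as a virtual Zarr store. As a rough sketch of that flow, using the VirtualiZarr and Kerchunk APIs that appear elsewhere in this commit, with a hypothetical local NetCDF4 file `example.nc` (exact keyword arguments may differ between library versions):

```python
import xarray as xr
from virtualizarr import open_virtual_dataset

# Scan the archival file and collect chunk byte ranges plus compression
# metadata; no data values are copied. "example.nc" is a placeholder path.
vds = open_virtual_dataset("example.nc", indexes={})

# Persist those references as a Kerchunk JSON "reference file".
vds.virtualize.to_kerchunk("example_ref.json", format="json")

# Read the virtual Zarr store back lazily through xarray's kerchunk engine.
ds = xr.open_dataset("example_ref.json", engine="kerchunk", chunks={})
print(ds)
```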

environment.yml (+2 -2)

@@ -3,7 +3,7 @@ channels:
   - conda-forge
 dependencies:
   - cfgrib
-  - dask
+  - dask>=2024.10.0
   - dask-labextension
   - datashader
   - distributed
@@ -28,12 +28,12 @@ dependencies:
   - pooch
   - pre-commit
   - pyOpenSSL
-  - python=3.10
   - rioxarray
   - s3fs
   - scipy
   - tifffile
   - ujson
+  - virtualizarr
   - xarray>=2024.10.0
   - zarr
   - sphinx-pythia-theme

notebooks/advanced/Parquet_Reference_Storage.ipynb (+70 -78)

@@ -14,7 +14,7 @@
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "# Store Kerchunk Reference Files as Parquet"
+   "# Store virtual datasets as Kerchunk Parquet references"
   ]
  },
  {
@@ -24,18 +24,18 @@
   "source": [
    "## Overview\n",
    " \n",
-   "In this notebook we will cover how to store Kerchunk references as Parquet files instead of json. For large reference datasets, using Parquet should have performance implications as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
+   "In this notebook we will cover how to store virtual datasets as Kerchunk Parquet references instead of Kerchunk JSON references. For large virtual datasets, using Parquet should have performance benefits as the overall reference file size should be smaller and the memory overhead of combining the reference files should be lower. \n",
    "\n",
    "\n",
    "This notebook builds upon the [Kerchunk Basics](notebooks/foundations/01_kerchunk_basics.ipynb), [Multi-File Datasets with Kerchunk](notebooks/foundations/02_kerchunk_multi_file.ipynb) and the [Kerchunk and Dask](notebooks/foundations/03_kerchunk_dask.ipynb) notebooks. \n",
    "\n",
    "## Prerequisites\n",
    "| Concepts | Importance | Notes |\n",
    "| --- | --- | --- |\n",
-   "| [Kerchunk Basics](../foundations/kerchunk_basics) | Required | Core |\n",
-   "| [Multiple Files and Kerchunk](../foundations/kerchunk_multi_file) | Required | Core |\n",
-   "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Recommended | IO/Visualization |\n",
-   "| [Intro to Dask](https://tutorial.dask.org/00_overview.html) | Required | Parallel Processing |\n",
+   "| [Basics of virtual Zarr stores](../foundations/01_kerchunk_basics.ipynb) | Required | Core |\n",
+   "| [Multi-file virtual datasets with VirtualiZarr](../foundations/02_kerchunk_multi_file.ipynb) | Required | Core |\n",
+   "| [Parallel virtual dataset creation with VirtualiZarr, Kerchunk, and Dask](../foundations/03_kerchunk_dask) | Required | Core |\n",
+   "| [Introduction to Xarray](https://foundations.projectpythia.org/core/xarray/xarray-intro.html) | Required | IO/Visualization |\n",
    "\n",
    "- **Time to learn**: 30 minutes\n",
    "---"
@@ -47,8 +47,6 @@
   "metadata": {},
   "source": [
    "## Imports\n",
-   "In addition to the previous imports we used throughout the tutorial, we are adding a few imports:\n",
-   "- `LazyReferenceMapper` and `ReferenceFileSystem` from `fsspec.implementations.reference` for lazy Parquet.\n",
    "\n",
    "\n"
   ]
@@ -60,15 +58,12 @@
   "outputs": [],
   "source": [
    "import logging\n",
-   "import os\n",
    "\n",
    "import dask\n",
    "import fsspec\n",
    "import xarray as xr\n",
    "from distributed import Client\n",
-   "from fsspec.implementations.reference import LazyReferenceMapper, ReferenceFileSystem\n",
-   "from kerchunk.combine import MultiZarrToZarr\n",
-   "from kerchunk.hdf import SingleHdf5ToZarr"
+   "from virtualizarr import open_virtual_dataset"
   ]
  },
  {
@@ -111,10 +106,28 @@
    "files_paths = fs_read.glob(\"s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*\")\n",
    "\n",
    "# Here we prepend the prefix 's3://', which points to AWS.\n",
-   "file_pattern = sorted([\"s3://\" + f for f in files_paths])\n",
-   "\n",
-   "# Grab the first seven files to speed up example.\n",
-   "file_pattern = file_pattern[0:2]"
+   "files_paths = sorted([\"s3://\" + f for f in files_paths])"
+  ]
+ },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "### Subset the Data\n",
+   "To speed up our example, let's take a subset of the year of data."
+  ]
+ },
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "# If subset_flag == True (default), the list of input files will\n",
+   "# be subset to speed up the processing\n",
+   "subset_flag = True\n",
+   "if subset_flag:\n",
+   "    files_paths = files_paths[0:4]"
   ]
  },
  {
@@ -123,7 +136,8 @@
   "metadata": {},
   "source": [
    "# Generate Lazy References\n",
-   "Below we will create a `fsspec` filesystem to read the references from `s3` and create a function to generate dask delayed tasks."
+   "\n",
+   "Here we create a function to generate a list of Dask delayed objects."
   ]
  },
  {
@@ -132,20 +146,18 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "# Use Kerchunk's `SingleHdf5ToZarr` method to create a `Kerchunk`\n",
-   "# index from a NetCDF file.\n",
-   "fs_read = fsspec.filesystem(\"s3\", anon=True, skip_instance_cache=True)\n",
-   "so = dict(mode=\"rb\", anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
-   "\n",
-   "\n",
-   "def generate_json_reference(fil):\n",
-   "    with fs_read.open(fil, **so) as infile:\n",
-   "        h5chunks = SingleHdf5ToZarr(infile, fil, inline_threshold=300)\n",
-   "        return h5chunks.translate()  # outf\n",
+   "def generate_virtual_dataset(file, storage_options):\n",
+   "    return open_virtual_dataset(\n",
+   "        file, indexes={}, reader_options={\"storage_options\": storage_options}\n",
+   "    )\n",
    "\n",
    "\n",
+   "storage_options = dict(anon=True, default_fill_cache=False, default_cache_type=\"first\")\n",
    "# Generate Dask Delayed objects\n",
-   "tasks = [dask.delayed(generate_json_reference)(fil) for fil in file_pattern]"
+   "tasks = [\n",
+   "    dask.delayed(generate_virtual_dataset)(file, storage_options)\n",
+   "    for file in files_paths\n",
+   "]"
   ]
  },
  {
@@ -163,7 +175,15 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "single_refs = dask.compute(tasks)[0]"
+   "virtual_datasets = list(dask.compute(*tasks))"
+  ]
+ },
+ {
+  "attachments": {},
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "## Combine virtual datasets using VirtualiZarr"
   ]
  },
  {
@@ -172,23 +192,17 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "len(single_refs)"
+   "combined_vds = xr.combine_nested(\n",
+   "    virtual_datasets, concat_dim=[\"time\"], coords=\"minimal\", compat=\"override\"\n",
+   ")\n",
+   "combined_vds"
   ]
  },
  {
-  "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-   "## Combine In-Memory References with MultiZarrToZarr\n",
-   "This section will look notably different than the previous examples that have written to `.json`.\n",
-   "\n",
-   "In the following code block we are:\n",
-   "- Creating an `fsspec` filesystem.\n",
-   "- Create a empty `parquet` file to write to. \n",
-   "- Creating an `fsspec` `LazyReferenceMapper` to pass into `MultiZarrToZarr`\n",
-   "- Building a `MultiZarrToZarr` object of the combined references.\n",
-   "- Calling `.flush()` on our LazyReferenceMapper, to write the combined reference to our `parquet` file."
+   "## Write the virtual dataset to a Kerchunk Parquet reference"
   ]
  },
  {
@@ -197,26 +211,7 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "fs = fsspec.filesystem(\"file\")\n",
-   "\n",
-   "if os.path.exists(\"combined.parq\"):\n",
-   "    import shutil\n",
-   "\n",
-   "    shutil.rmtree(\"combined.parq\")\n",
-   "os.makedirs(\"combined.parq\")\n",
-   "\n",
-   "out = LazyReferenceMapper.create(root=\"combined.parq\", fs=fs, record_size=1000)\n",
-   "\n",
-   "mzz = MultiZarrToZarr(\n",
-   "    single_refs,\n",
-   "    remote_protocol=\"s3\",\n",
-   "    concat_dims=[\"time\"],\n",
-   "    identical_dims=[\"y\", \"x\"],\n",
-   "    remote_options={\"anon\": True},\n",
-   "    out=out,\n",
-   ").translate()\n",
-   "\n",
-   "out.flush()"
+   "combined_vds.virtualize.to_kerchunk(\"combined.parq\", format=\"parquet\")"
   ]
  },
  {
@@ -255,31 +250,28 @@
   "metadata": {},
   "outputs": [],
   "source": [
-   "fs = ReferenceFileSystem(\n",
+   "storage_options = {\n",
+   "    \"remote_protocol\": \"s3\",\n",
+   "    \"skip_instance_cache\": True,\n",
+   "    \"remote_options\": {\"anon\": True},\n",
+   "    \"target_protocol\": \"file\",\n",
+   "    \"lazy\": True,\n",
+   "}  # options passed to fsspec\n",
+   "open_dataset_options = {\"chunks\": {}}  # options passed to xarray\n",
+   "\n",
+   "ds = xr.open_dataset(\n",
    "    \"combined.parq\",\n",
-   "    remote_protocol=\"s3\",\n",
-   "    target_protocol=\"file\",\n",
-   "    lazy=True,\n",
-   "    remote_options={\"anon\": True},\n",
+   "    engine=\"kerchunk\",\n",
+   "    storage_options=storage_options,\n",
+   "    open_dataset_options=open_dataset_options,\n",
    ")\n",
-   "ds = xr.open_dataset(\n",
-   "    fs.get_mapper(), engine=\"zarr\", backend_kwargs={\"consolidated\": False}\n",
-   ")"
-  ]
- },
- {
-  "cell_type": "code",
-  "execution_count": null,
-  "metadata": {},
-  "outputs": [],
-  "source": [
    "ds"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
-   "display_name": "kerchunk-cookbook-dev",
+   "display_name": "kerchunk-cookbook",
   "language": "python",
   "name": "python3"
  },
@@ -293,7 +285,7 @@
  "name": "python",
  "nbconvert_exporter": "python",
  "pygments_lexer": "ipython3",
-  "version": "3.11.9"
+  "version": "3.12.7"
 }
},
"nbformat": 4,

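
Read as plain Python rather than notebook JSON, the cells this commit adds amount to roughly the end-to-end sketch below (Dask cluster setup omitted; the four-file subset mirrors the `subset_flag` logic, and argument names follow the diff rather than any later API):

```python
import dask
import fsspec
import xarray as xr
from virtualizarr import open_virtual_dataset

# List the WRF output files on S3 (anonymous access) and keep a small subset.
fs_read = fsspec.filesystem("s3", anon=True, skip_instance_cache=True)
files_paths = sorted(
    "s3://" + f
    for f in fs_read.glob("s3://smn-ar-wrf/DATA/WRF/DET/2022/12/31/12/*")
)[0:4]

storage_options = dict(anon=True, default_fill_cache=False, default_cache_type="first")


def generate_virtual_dataset(file, storage_options):
    # Build a virtual dataset: chunk references only, no data copied.
    return open_virtual_dataset(
        file, indexes={}, reader_options={"storage_options": storage_options}
    )


# Create the virtual datasets in parallel with Dask, then concatenate along time.
tasks = [
    dask.delayed(generate_virtual_dataset)(file, storage_options)
    for file in files_paths
]
virtual_datasets = list(dask.compute(*tasks))
combined_vds = xr.combine_nested(
    virtual_datasets, concat_dim=["time"], coords="minimal", compat="override"
)

# Write the combined virtual dataset as a Kerchunk Parquet reference ...
combined_vds.virtualize.to_kerchunk("combined.parq", format="parquet")

# ... and read it back lazily through xarray's kerchunk engine.
ds = xr.open_dataset(
    "combined.parq",
    engine="kerchunk",
    storage_options={
        "remote_protocol": "s3",
        "skip_instance_cache": True,
        "remote_options": {"anon": True},
        "target_protocol": "file",
        "lazy": True,
    },
    open_dataset_options={"chunks": {}},
)
print(ds)
```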