Description
Documenting this here for permanence (already brought up during the km-scale-hackathon).
The current implementation effectively hits a hard scaling limit determined by the client's machine, since we rely on a labelled xarray coordinate (`cell_ids`), which by default gets loaded into memory.
Here is a small example with public data that will blow up my laptop at higher zoom levels, since the labelled cell dimension itself becomes larger than my system memory:
```python
import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15
zoom = 5

path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# label the cells with a lazy (dask-backed) range of cell ids;
# assigning a dimension coordinate also creates a default index for it
cell_ids = xr.DataArray(
    dsa.arange(ds.cell.size, chunks=(1048576,)),
    dims=['cell_ids'],
    name='cell_ids',
)
ds = ds.assign_coords({'cell_ids': cell_ids})
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
```
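As a quick sanity check (a sketch, assuming I read xarray's default-index behaviour correctly): assigning a dimension coordinate like this creates a default `PandasIndex`, which is what forces the full label array into memory:

```python
# quick check (sketch): the assignment above built a default PandasIndex for
# cell_ids, materializing all of its labels; at zoom=15 this is what blows up
print("cell_ids" in ds.xindexes)  # True: an in-memory index now exists
```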
The same problem shows up even if I assign the coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)):
```python
import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15
zoom = 5

path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# the coordinate labels are wayyy bigger than my memory! So xarray loading them by default is a non-starter.
# Trying https://github.com/pydata/xarray/issues/1650#issuecomment-1697282386:
# build the coordinate explicitly with `indexes={}` so no default index is created
cell_ids = xr.Coordinates(
    {"cell_ids": ("cell_ids", dsa.arange(ds.cell.size, chunks=(1048576,)))},
    indexes={},
)
ds = ds.assign_coords(cell_ids)
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
```
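This time, as a quick check (again a sketch, assuming the same default-index behaviour), no index is created and the labels stay lazy until something forces them into memory:

```python
# quick check (sketch): no default index was built for cell_ids this time,
# and its labels are still a lazy dask array
print("cell_ids" in ds.xindexes)  # False: no PandasIndex was created
print(type(ds.cell_ids.data))     # dask array, i.e. labels not loaded yet
```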
However, upon applying `xdggs.decode(ds)` the coordinates will still be loaded into memory. I am not quite sure what the options are here, but we should probably treat this as the general scenario, and aim to eventually support a workflow like the following without issue (a rough sketch follows the list):
- Open a massively large DGGS dataset
- Decode, subset, and plot it
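In code, that target workflow might look roughly like this (just a sketch: the accessor method `sel_latlon`, its arguments, and the assumption that the store already carries a decodable `cell_ids` coordinate are placeholders; the point is that no step should ever load the full coordinate):

```python
import xarray as xr
import xdggs

zoom = 15  # a size whose cell_ids cannot fit into client memory
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"

# open lazily; assume the store already carries a `cell_ids` coordinate
# with the DGGS attrs set
ds = xr.open_dataset(path, engine='zarr', chunks={})

# decode the DGGS metadata without materializing the cell ids
ds = xdggs.decode(ds)

# subset to a small region; `sel_latlon` stands in for whatever DGGS-aware
# selection we end up with (signature assumed), and the values are made up
small = ds.dggs.sel_latlon(latitude=[50.0], longitude=[8.0])

# only this small subset ever touches memory, e.g. for plotting
small.compute()
```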