Design/Datamodel Decision: Very large datasets #143

@jbusecke

Documenting this here for permanence (already brought up during the km-scale-hackathon).

The current implementation hits a hard scaling limit set by the client's machine, since we rely on a labelled xarray coordinate ('cell_ids') that is loaded into memory by default.

Here is a small example with public data that will blow up my laptop at higher zoom levels, since the labels for the cell dimension alone will be larger than my system memory:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15  # realistic target, but the cell_ids labels alone exceed my RAM
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# assigning a dimension coordinate named like its dimension makes xarray build a
# default pandas index for it, which materializes all labels in memory
cell_ids = xr.DataArray(dsa.arange(ds.cell.size, chunks=1048576), dims=['cell_ids'], name='cell_ids')
ds = ds.assign_coords({'cell_ids': cell_ids})

ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
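
For context, a back-of-the-envelope estimate of how big that index gets (assuming a nested HEALPix grid with npix = 12 * 4**level and int64 cell ids, which is what the arange above produces):

# rough in-memory size of an int64 cell_ids index per refinement level
for level in (5, 10, 15):
    ncells = 12 * 4**level  # HEALPix cell count: 12 * nside**2 with nside = 2**level
    print(level, f"{ncells * 8 / 1e9:.1f} GB")
# prints roughly: 5 -> 0.0 GB, 10 -> 0.1 GB, 15 -> 103.1 GB

At zoom 15 the labels alone are on the order of 100 GB, far beyond any laptop.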

Even assigning the coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)) only postpones the problem:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15  # realistic target
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# the coordinate labels are way bigger than my memory, so letting xarray load them
# by default is a non-starter.
# Trying https://github.com/pydata/xarray/issues/1650#issuecomment-1697282386:
# passing indexes={} skips the default pandas index, so the coordinate stays a
# lazy dask array at this point.
cell_ids = xr.Coordinates({"cell_ids": ("cell_ids", dsa.arange(ds.cell.size, chunks=1048576))}, indexes={})

ds = ds.assign_coords(cell_ids)
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
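
At this point the coordinate is indeed still lazy, which a quick sanity check with plain xarray confirms (just a check, not part of the proposal):

# the labels are still a lazy dask array and no index has been created for them
print(type(ds.cell_ids.variable.data))  # <class 'dask.array.core.Array'>
print("cell_ids" in ds.xindexes)        # False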

However, as soon as xdggs.decode(ds) is applied, the coordinate is loaded into memory anyway. I am not quite sure what the options are here, but we should probably treat this as the general scenario, with the goal of eventually supporting a workflow like the following without issue (roughly sketched in code after the list):

  • Open a massively large DGGS dataset
  • Decode, subset, and plot it
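
For illustration, the target workflow could look roughly like the sketch below. This is hypothetical: only xdggs.decode exists today, the store path and dimension name are placeholders, and the plotting step is left open.

import xarray as xr
import xdggs

# placeholder: any store that already carries a cell_ids coordinate plus DGGS attrs
path = "some_massively_large_dggs_store.zarr"

ds = xr.open_dataset(path, engine='zarr', chunks={})  # fully lazy open
ds = xdggs.decode(ds)                                 # should not materialize cell_ids
subset = ds.isel(cells=slice(0, 1_000_000))           # lazy positional subset ('cells' is a stand-in dimension name)
subset = subset.load()                                # only the small subset ever hits memory
# ... plot the small subset (plotting entry point TBD)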
