Design/Datamodel Decision: Very large datasets #143

@jbusecke

Documenting this here for permanence (already brought up during the km-scale-hackathon).

The current implementation hits a hard scaling limit set by the client's machine, since we rely on a labelled xarray coordinate ('cell_ids') that is loaded into memory by default.

Here is a small example with public data that will blow up my laptop at higher zoom levels, since the labels for the cell dimension alone will be larger than my system memory:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15  # realistic target, but the cell_ids labels alone exceed my RAM
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# assigning a dimension coordinate named like its dimension makes xarray build a
# default pandas index for it, which materializes all labels in memory
cell_ids = xr.DataArray(dsa.arange(ds.cell.size, chunks=1048576), dims=['cell_ids'], name='cell_ids')
ds = ds.assign_coords({'cell_ids': cell_ids})

ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
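
For context, a back-of-the-envelope estimate of how big that index gets (assuming a nested HEALPix grid with npix = 12 * 4**level and int64 cell ids, which is what the arange above produces):

# rough in-memory size of an int64 cell_ids index per refinement level
for level in (5, 10, 15):
    ncells = 12 * 4**level  # HEALPix cell count: 12 * nside**2 with nside = 2**level
    print(level, f"{ncells * 8 / 1e9:.1f} GB")
# prints roughly: 5 -> 0.0 GB, 10 -> 0.1 GB, 15 -> 103.1 GB

At zoom 15 the labels alone are on the order of 100 GB, far beyond any laptop.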

Even assigning the coordinate without an index (see @benbovy's comment in pydata/xarray#1650 (comment)) only postpones the problem:

import xdggs
import xarray as xr
import numpy as np
import dask.array as dsa

# zoom = 15  # realistic target
zoom = 5
path = f"https://nowake.nicam.jp/files/220m/data_healpix_{zoom}.zarr"
ds = xr.open_dataset(path, engine='zarr', chunks={})

# the coordinate labels are way bigger than my memory, so letting xarray load them
# by default is a non-starter.
# Trying https://github.com/pydata/xarray/issues/1650#issuecomment-1697282386:
# passing indexes={} skips the default pandas index, so the coordinate stays a
# lazy dask array at this point.
cell_ids = xr.Coordinates({"cell_ids": ("cell_ids", dsa.arange(ds.cell.size, chunks=1048576))}, indexes={})

ds = ds.assign_coords(cell_ids)
ds.cell_ids.attrs = {
    "grid_name": "healpix",
    "level": zoom,
    "indexing_scheme": "nested",
}
ds
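
At this point the coordinate is indeed still lazy, which a quick sanity check with plain xarray confirms (just a check, not part of the proposal):

# the labels are still a lazy dask array and no index has been created for them
print(type(ds.cell_ids.variable.data))  # <class 'dask.array.core.Array'>
print("cell_ids" in ds.xindexes)        # False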

However, as soon as xdggs.decode(ds) is applied, the coordinate is loaded into memory anyway. I am not quite sure what the options are here, but we should probably treat this as the general scenario, with the goal of eventually supporting a workflow like the following without issue (roughly sketched in code after the list):

  • Open a massively large DGGS dataset
  • Decode, subset, and plot it
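
For illustration, the target workflow could look roughly like the sketch below. This is hypothetical: only xdggs.decode exists today, the store path and dimension name are placeholders, and the plotting step is left open.

import xarray as xr
import xdggs

# placeholder: any store that already carries a cell_ids coordinate plus DGGS attrs
path = "some_massively_large_dggs_store.zarr"

ds = xr.open_dataset(path, engine='zarr', chunks={})  # fully lazy open
ds = xdggs.decode(ds)                                 # should not materialize cell_ids
subset = ds.isel(cells=slice(0, 1_000_000))           # lazy positional subset ('cells' is a stand-in dimension name)
subset = subset.load()                                # only the small subset ever hits memory
# ... plot the small subset (plotting entry point TBD)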
