Description
Is your feature request related to a problem? Please describe.
Loading an STF netCDF file from disk via xarray, as is and without further processing, yields the following in-memory dataset:
These days, such a dataset would arguably be designed using strings wherever possible for coordinates. In particular, slicing through the data to retrieve a given station can then be done with a statement such as `rain.sel(station_id="123456A")`.

Trying a similar selection on the dataset as loaded, `rain.sel(station_id=123456)`, leads to `KeyError: "'station_id' is not a valid dimension or coordinate"`.
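A minimal sketch of the decoding step that would make such label-based selection work. The dataset layout below is a hypothetical miniature stand-in for an STF file as loaded (integer station dimension plus a separate integer `station_id` data variable); the real file structure may differ.

```python
import numpy as np
import xarray as xr

# Hypothetical miniature of an STF-like dataset as loaded from disk:
# a 1-based integer "station" dimension plus a separate integer
# station_id data variable, rather than a string station_id coordinate.
raw = xr.Dataset(
    data_vars={
        "rain_obs": (("time", "station"), np.arange(6.0).reshape(3, 2)),
        "station_id": (("station",), np.array([123456, 456789])),
    },
    coords={"time": np.array([1, 2, 3]), "station": np.array([1, 2])},
)

# Decode: promote station_id to a string coordinate and index the
# station dimension by it, so label-based selection works.
decoded = raw.set_coords("station_id")
decoded = decoded.assign_coords(station_id=decoded["station_id"].astype(str))
decoded = decoded.swap_dims({"station": "station_id"})

subset = decoded["rain_obs"].sel(station_id="123456")
```

After this decoding, `decoded["rain_obs"].sel(station_id="123456")` returns the time series for that station instead of raising a `KeyError`.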
A similar observation applies to the time dimension, which should use a date/time representation, perhaps even more so than for station_id.
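To illustrate, a sketch of decoding an integer time axis into datetime64 coordinates. The "hours since an origin, 1-based" convention here is an assumption for illustration; xarray decodes CF-compliant time units automatically, so this manual step would only be needed when the on-disk units are non-standard.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Assumed (hypothetical) on-disk convention: 1-based integer hours
# since an origin date.
hours = np.arange(1, 25)
origin = pd.Timestamp("2020-01-01")
times = origin + pd.to_timedelta(hours - 1, unit="h")

da = xr.DataArray(np.arange(24.0), coords={"time": times}, dims="time")

# Label-based slicing with date/time strings now works.
day_slice = da.sel(time=slice("2020-01-01 00:00", "2020-01-01 05:00"))
```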
Note also that coordinates starting at one (i.e. Fortran indexing) rather than zero (C/Python) are not inconsequential in practice; this is notoriously fertile ground for "off-by-one" bugs when slicing data for use in modelling.
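A small example of the off-by-one trap above: with a 1-based integer coordinate, label-based and positional selection silently disagree by one element.

```python
import numpy as np
import xarray as xr

# A 1-based integer coordinate as loaded from disk (Fortran-style).
da = xr.DataArray(
    np.array([10.0, 20.0, 30.0]),
    coords={"station": np.array([1, 2, 3])},
    dims="station",
)

by_label = da.sel(station=1)      # the FIRST element (label 1)
by_position = da.isel(station=1)  # the SECOND element (position 1)
```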
Describe the solution you'd like
The in-memory representation of STF via xarray has the following coordinates:

- `time` (`pandas.Timestamp` or `np.datetime64`)
- `station_id` (`str`)
- `ens_member` (`int`) (note: probably can only be `int` in Python anyway, likely 64 bits; `int32` retained on disk)
- `lead_time` (`int`)
Additional context
Possible downsides:
- What is the performance downside, if any noticeable, when converting the in-memory data back for writing to disk with `to_netcdf`?
- This would likely make xarray's "lazy-loading" capabilities unusable. Does it really matter "these days"?
Other considerations:
- "RAM windowed" time series for very large data not fitting in RAM.
xarray
backed bynetcdf
anyway does not offer native capabilities for this so far as I know. One would need to use zarr as an xarray backend, or perhaps other options (time series DBs?)