Speedup: Add optional LightDataArray backend and migrate datetimes to np.datetime64 #397

Open
dfulu wants to merge 1 commit into dev_feb2026_speedups from lightdataarray

Conversation


@dfulu dfulu commented Feb 24, 2026

Pull Request

Description

There are two main changes in this PR which combine to speed up the data-sampler.

  1. Removing/reducing the use of the pandas DatetimeIndex class and migrating to numpy datetime64 everywhere
  2. Replacing the xarray DataArray with a custom class which mimics a limited portion of xr.DataArray's functionality and is therefore faster

These changes were motivated by code profiling with pyinstrument.

1. Getting rid of pandas timestamps

When profiling the code I found that we were spending a silly amount of time doing things with timestamps. This is simply because the pandas Timestamp, DatetimeIndex and date_range() are slow. I have moved us to use numpy datetime64 objects everywhere to handle datetimes. The pandas timestamp machinery is built on top of numpy datetime64 anyway, so we've just removed a layer of overhead. Unfortunately this costs us some simplicity in the code. The pandas Timestamps have those nice .hour, .day_of_year and .ceil() attributes and methods which numpy does not have. I have added a new file time_utils.py with a bunch of helper functions to make working with numpy datetime64 objects easier and to replace some of the utility we were getting from pandas.
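To illustrate the kind of helpers involved, something along these lines can recover the pandas conveniences on raw datetime64 arrays. The function names here are hypothetical, not necessarily those in time_utils.py:

```python
import numpy as np


def datetime64_hour(dt: np.ndarray) -> np.ndarray:
    """Hour-of-day for datetime64 values, mirroring pd.Timestamp.hour."""
    # Truncate to hours and to days, then take the difference in hours
    return (dt.astype("datetime64[h]") - dt.astype("datetime64[D]")).astype(int)


def datetime64_ceil(dt: np.ndarray, freq: str = "h") -> np.ndarray:
    """Ceil datetime64 values to a frequency, mirroring pd.Timestamp.ceil()."""
    truncated = dt.astype(f"datetime64[{freq}]")
    # Bump up any value that was not already exactly on the boundary
    needs_bump = truncated.astype(dt.dtype) < dt
    return (truncated + needs_bump.astype(int)).astype(dt.dtype)
```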

I've added tests for these utility functions which check them against the pandas equivalents.

2. Replacing xarray DataArray

This is the biggest part of the speedup. In profiling we were spending a lot of time on xarray's internal methods which we don't really need. I've added a new class LightDataArray which replaces the xarray DataArray in the code path under _get_sample(). This means that we can use xarray for the initialisation of the torch Dataset (where the speed of xarray doesn't matter) but then we use the new class when we are slicing out samples. We do not need all of xarray's functionality when slicing these samples, and so we can do it faster with the custom class.

I have based the custom class on the xarray DataArray class. For example, I've created .isel(), .sel() and .load() methods, and it has a .values property. This means we can get away with fewer code changes here, and it should also be pretty trivial to switch back to xarray if we want to. In the current code I have left it as an option to use the new LightDataArray class when calling _get_sample(), and it works when pulling from either xr.DataArrays or LightDataArrays. This has the benefit of making it easy to test.
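To make the idea concrete, here is a heavily simplified sketch of what such a class can look like. This is not the actual implementation: it assumes all coords are 1-D, sorted, named after their dimension, and indexed only by slices.

```python
import numpy as np


class LightDataArray:
    """Minimal DataArray stand-in (illustrative sketch, reduced scope)."""

    def __init__(self, values, dims, coords):
        self.values = np.asarray(values)
        self.dims = tuple(dims)
        self.coords = {k: np.asarray(v) for k, v in coords.items()}

    def isel(self, **indexers: slice) -> "LightDataArray":
        # Positional slicing along named dims, like xr.DataArray.isel()
        slicer = tuple(indexers.get(d, slice(None)) for d in self.dims)
        coords = {d: c[indexers.get(d, slice(None))] for d, c in self.coords.items()}
        return LightDataArray(self.values[slicer], self.dims, coords)

    def sel(self, **labels: slice) -> "LightDataArray":
        # Label-based slicing: translate coord labels to positions, like .sel()
        indexers = {}
        for dim, lab in labels.items():
            coord = self.coords[dim]
            start = np.searchsorted(coord, lab.start, side="left")
            stop = np.searchsorted(coord, lab.stop, side="right")
            indexers[dim] = slice(start, stop)
        return self.isel(**indexers)

    def load(self) -> "LightDataArray":
        # Data is already in memory here; a lazy backend would read it now
        return self
```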

The LightDataArray class has a from_xarray() method which instantiates a LightDataArray from an xr.DataArray, and a to_xarray() method which instantiates an xr.DataArray from a LightDataArray instance. This makes it easy to test that the methods of LightDataArray match the xarray methods. For example, we can run something like

# Create a LightDataArray from the xarray DataArray
lda = LightDataArray.from_xarray(da)

# Run some xarray method
da_result = da.sel(...)

# Run the same method on the LightDataArray
lda_result = lda.sel(...)

# Convert the result back to xarray and check the results are the same
assert lda_result.to_xarray().equals(da_result)

So we use the xarray behaviour as the expected behaviour for our tests.

Checklist:

  • My code follows OCF's coding style guidelines
  • I have performed a self-review of my own code
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked my code and corrected any misspellings

else:
# Get the coordinates of the sample
t0, location_id = self.valid_t0_and_location_ids.iloc[idx]
t0 = self.valid_t0_and_location_ids["t0"].values[idx]
This change is required since self.valid_t0_and_location_ids.iloc[idx] wasn't playing well with returning a np.datetime64 object. It stubbornly returned a pandas Timestamp.


def tensorstore_read(xarray_dict: dict) -> dict:
"""Start reading a nested dictionary of xarray-tensorstore DataArrays."""
def read_data_dict(xarray_dict: dict) -> dict:

This function still does the same thing; it has just been renamed for clarity.

ds_2 = ds_nwp_ecmwf.copy(deep=True)
ds_2["init_time"] = pd.date_range(
start=ds_nwp_ecmwf.init_time.max().values + pd.Timedelta("6h"),
start=ds_nwp_ecmwf.init_time.values.max() + pd.Timedelta("6h"),

Accessing .values first returns an array of np.datetime64 objects, so start is cast to np.datetime64. I didn't change the pandas Timedelta here. Maybe I should have, but I wanted to keep the number of changes low, so some pandas timestamps currently remain in the tests.
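For reference, mixing the two types works because pandas handles the addition. A minimal standalone illustration (array values here are made up):

```python
import numpy as np
import pandas as pd

init_times = np.array(["2026-02-24T00", "2026-02-24T06"], dtype="datetime64[ns]")

# .max() on the raw numpy values returns a np.datetime64; adding a
# pd.Timedelta to it yields a pd.Timestamp, which pd.date_range() accepts
start = init_times.max() + pd.Timedelta("6h")
```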

data_vars={
"generation_mw": (("time_utc", "location_id"), np.random.randint(0, 100, (10, 2))),
"capacity_mwp": (("location_id",), [90.0, 110.0]),
"capacity_mwp": (("location_id",), [90, 110]),

This test is to check that an error is raised if the dtype of the generation is int. Since we now concat the generation and capacities together into a single array, the generation would be promoted to float if the capacities were floats, so this test would fail.
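The promotion behaviour is easy to see directly in numpy. This is a standalone illustration, not the project's code:

```python
import numpy as np

gen = np.random.randint(0, 100, (10, 2))   # int generation values
cap_float = np.array([90.0, 110.0])        # float capacities
cap_int = np.array([90, 110])              # int capacities

# Concatenating int generation with float capacities silently promotes
# everything to float, so an int-dtype check on the result never fires
assert np.concatenate([gen, cap_float[None, :]]).dtype == np.float64

# With int capacities the combined array stays int and the check can trigger
assert np.concatenate([gen, cap_int[None, :]]).dtype == gen.dtype
```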

ds = ds.assign_coords(capacity_mwp=ds.capacity_mwp)

da = ds.generation_mw
da = ds.to_dataarray("gen_param").transpose("time_utc", "location_id", "gen_param")

I wanted to move the capacities out of the coords. This makes the DataArray simpler and allows us to enforce that all coords are 1-dimensional, which in turn allows the LightDataArray to be simpler by reducing its scope.

# We only look at the dimensional coords. It is possible that other coordinate systems are
# included as non-dimensional coords
dimensional_coords = set(da.xindexes)
dimensional_coords = set(da.dims)

This is functionally the same. I just switched it so that the LightDataArray doesn't have to have .xindexes and only needs .dims.

# This is the max staleness we can use considering the max step of the input data
max_possible_staleness = (
pd.Timedelta(da["step"].max().item())
da["step"].values.max()

This change is just to return a np.timedelta64 object rather than a pd.Timedelta


# Find the first forecast step
first_forecast_step = pd.Timedelta(da["step"].min().item())
first_forecast_step = da["step"].values[0]

We assume the steps are sorted from low to high here. I think we implicitly assume that in our slicing anyway, and we should probably make it more explicit when loading datasets (we don't enforce it right now).
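A cheap guard could make that assumption explicit at load time. A sketch only; the function name is made up:

```python
import numpy as np


def check_steps_sorted(steps: np.ndarray) -> None:
    """Raise if forecast steps are not sorted in ascending order."""
    if not np.all(steps[:-1] <= steps[1:]):
        raise ValueError("Forecast steps must be sorted in ascending order")


steps = np.array([0, 6, 12, 18], dtype="timedelta64[h]")
check_steps_sorted(steps)         # passes for sorted steps
first_forecast_step = steps[0]    # safe once sortedness is guaranteed
```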

channel_stds = self.stds_dict["nwp"][nwp_key]

da_nwp = (da_nwp - channel_means) / channel_stds
da_nwp.data = (da_nwp.data - channel_means) / channel_stds

Replacing the xarray .data directly is faster than the original line, and this also works with the LightDataArray.

Args:
datetimes: the datetimes to get POSIX timestamps for
"""
return datetimes.astype("datetime64[ns]").astype(np.float64) * 1e-9

To match the pd.Timestamp.timestamp() method, this needs to use float64 to avoid losing accuracy
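The precision point can be checked directly: POSIX times near 1.8e9 seconds need more than float32's ~7 significant digits. A standalone sketch; the function name is illustrative:

```python
import numpy as np


def datetimes_to_posix(datetimes: np.ndarray) -> np.ndarray:
    """Convert datetime64 values to POSIX seconds as float64."""
    return datetimes.astype("datetime64[ns]").astype(np.float64) * 1e-9

dt = np.array(["2026-02-24T15:34:56"], dtype="datetime64[ns]")
posix = datetimes_to_posix(dt)

# float64 preserves the seconds component to sub-microsecond precision;
# float32 is off by whole minutes at this magnitude (its spacing is ~128 s)
assert abs(posix[0] % 60 - 56.0) < 1e-5
assert abs(np.float32(posix[0]) % 60 - 56.0) > 1.0
```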


I made the get_day_fraction() function also float64 to match this

from xarray_tensorstore import read as xtr_read


def minutes(minutes: int | list[float]) -> pd.Timedelta | pd.TimedeltaIndex:

This was just moved to time_utils.py


if "generation" in dataset_dict:
da_generation = dataset_dict["generation"]
da_generation = da_generation / da_generation.capacity_mwp.values

The normalisation here was hard to emulate with the new generation data structure, where the generation and capacity are in the same array but on different indices. It would be hard to normalise the generation values without also normalising the capacity values. I moved the normalisation into the convert_generation_to_numpy_sample() function instead.

da: Xarray DataArray containing generation data
t0_idx: Index of the t0 timestamp in the time dimension of the generation data
"""
generation_values = da.sel(gen_param="generation_mw").values

I moved the normalisation in here instead of the main body of the .process_and_combine_datasets() method. See the comment on that method for the reasoning.
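A minimal illustration of normalising inside the sample conversion, assuming the combined layout where generation_mw and capacity_mwp sit on a shared gen_param axis. The array values and layout here are hypothetical:

```python
import numpy as np

# Shape (time, location, gen_param), with gen_param = [generation_mw, capacity_mwp]
values = np.array(
    [[[50.0, 100.0], [55.0, 110.0]],
     [[25.0, 100.0], [11.0, 110.0]]]
)

generation = values[..., 0]   # slice out the generation values
capacity = values[..., 1]     # slice out the matching capacities

# Normalise generation by capacity only after separating the two slices,
# so the capacities themselves are left untouched
normalised = generation / capacity
```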

@dfulu dfulu marked this pull request as ready for review February 24, 2026 15:34
@dfulu dfulu requested a review from Sukh-P February 24, 2026 15:34