Speedup: Add optional LightDataArray backend and migrate datetimes to np.datetime64#397
dfulu wants to merge 1 commit into dev_feb2026_speedups from
Conversation
Force-pushed e114c2e to 3126fae
  else:
      # Get the coordinates of the sample
-     t0, location_id = self.valid_t0_and_location_ids.iloc[idx]
+     t0 = self.valid_t0_and_location_ids["t0"].values[idx]
This change is required since self.valid_t0_and_location_ids.iloc[idx] wasn't working well for returning a np.datetime64 object. It stubbornly returned a pandas Timestamp.
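The boxing difference can be reproduced standalone; the DataFrame below is a stand-in for valid_t0_and_location_ids:

```python
import numpy as np
import pandas as pd

# Stand-in for self.valid_t0_and_location_ids
df = pd.DataFrame({
    "t0": pd.date_range("2024-01-01", periods=3, freq="h"),
    "location_id": [101, 102, 103],
})

# .iloc boxes the value back into a pandas Timestamp
boxed = df["t0"].iloc[0]

# Going through .values gives the raw numpy array, so indexing
# yields a plain np.datetime64
raw = df["t0"].values[0]
```

Both values compare equal; only the type differs, which is what the diff is about.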
- def tensorstore_read(xarray_dict: dict) -> dict:
+ def read_data_dict(xarray_dict: dict) -> dict:
      """Start reading a nested dictionary of xarray-tensorstore DataArrays."""
This function still does the same thing; it has just been renamed for clarity.
  ds_2 = ds_nwp_ecmwf.copy(deep=True)
  ds_2["init_time"] = pd.date_range(
-     start=ds_nwp_ecmwf.init_time.max().values + pd.Timedelta("6h"),
+     start=ds_nwp_ecmwf.init_time.values.max() + pd.Timedelta("6h"),
Calling .values first returns an array of np.datetime64 objects, so start is cast to np.datetime64. I didn't change the pandas Timedelta here; maybe I should have, but I wanted to keep the number of changes low, so some pandas timestamps currently remain in the tests.
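A minimal illustration of the difference; init_times here is a stand-in for ds_nwp_ecmwf.init_time:

```python
import numpy as np
import pandas as pd

init_times = pd.date_range("2023-01-01", periods=4, freq="6h")

# .max() on the index boxes the result as a pd.Timestamp
boxed_max = init_times.max()

# .values.max() stays in numpy, returning np.datetime64
raw_max = init_times.values.max()

# Adding a pd.Timedelta still works on the numpy value
start = raw_max + pd.Timedelta("6h")
```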
  data_vars={
      "generation_mw": (("time_utc", "location_id"), np.random.randint(0, 100, (10, 2))),
-     "capacity_mwp": (("location_id",), [90.0, 110.0]),
+     "capacity_mwp": (("location_id",), [90, 110]),
This test checks that an error is raised if the dtype of the generation is int. Since we now concat the generation and capacities together into a single array, the generation would be promoted to float if the capacities were floats, so this test would fail.
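The promotion behaviour can be reproduced with plain numpy; this is a sketch of why float capacities would mask an int generation dtype:

```python
import numpy as np

gen_int = np.random.randint(0, 100, (10, 2), dtype=np.int64)   # int generation
cap_float = np.broadcast_to([90.0, 110.0], (1, 2))             # float capacities
cap_int = np.broadcast_to(np.array([90, 110], dtype=np.int64), (1, 2))

# Concatenating promotes to the common dtype, so float capacities would
# silently turn int generation into float and the dtype check would pass
promoted = np.concatenate([gen_int, cap_float], axis=0)
kept = np.concatenate([gen_int, cap_int], axis=0)
```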
Force-pushed 3126fae to 8b78420
  ds = ds.assign_coords(capacity_mwp=ds.capacity_mwp)

- da = ds.generation_mw
+ da = ds.to_dataarray("gen_param").transpose("time_utc", "location_id", "gen_param")
I wanted to move the capacities out of the coords. This makes the DataArray simpler and allows us to enforce that all coords are 1-dimensional, which in turn lets the LightDataArray be simpler by reducing its scope.
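The resulting layout can be sketched with plain numpy: generation and capacity become slices along a new trailing gen_param axis (a sketch of the shape, not the actual implementation):

```python
import numpy as np

generation = np.random.rand(10, 2) * 100  # (time_utc, location_id)
capacity = np.array([90.0, 110.0])        # (location_id,)

# Broadcast the capacities along time and stack as a trailing axis, mirroring
# ds.to_dataarray("gen_param").transpose("time_utc", "location_id", "gen_param")
cap_2d = np.broadcast_to(capacity, generation.shape)
stacked = np.stack([generation, cap_2d], axis=-1)
```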
  # We only look at the dimensional coords. It is possible that other coordinate systems are
  # included as non-dimensional coords
- dimensional_coords = set(da.xindexes)
+ dimensional_coords = set(da.dims)
This is functionally the same. I just switched it so that the LightDataArray only needs .dims and doesn't have to have .xindexes.
  # This is the max staleness we can use considering the max step of the input data
  max_possible_staleness = (
-     pd.Timedelta(da["step"].max().item())
+     da["step"].values.max()
This change is just so it returns a np.timedelta64 object.
  # Find the first forecast step
- first_forecast_step = pd.Timedelta(da["step"].min().item())
+ first_forecast_step = da["step"].values[0]
We assume the steps are sorted from low to high here. I think we implicitly assume that in our slicing anyway, and we should probably make it explicit when loading datasets (we don't right now).
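Both changes keep the steps as numpy timedelta64 values, assuming the step axis is sorted ascending:

```python
import numpy as np
import pandas as pd

steps = pd.timedelta_range("0h", "48h", freq="6h").values  # timedelta64[ns]

# Stays a np.timedelta64 rather than being boxed into pd.Timedelta
max_possible_staleness = steps.max()

# Relies on steps being sorted low to high
first_forecast_step = steps[0]
```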
  channel_stds = self.stds_dict["nwp"][nwp_key]

- da_nwp = (da_nwp - channel_means) / channel_stds
+ da_nwp.data = (da_nwp.data - channel_means) / channel_stds
Replacing the xarray .data directly is faster than the original line, and it also works with the LightDataArray.
      Args:
          datetimes: the datetimes to get POSIX timestamps for
      """
+     return datetimes.astype("datetime64[ns]").astype(np.float64) * 1e-9
To match the pd.Timestamp.timestamp() method, this needs to use float64 to avoid losing accuracy. I made the get_day_fraction() function also use float64 to match this.
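A sketch of the precision argument: the helper below mirrors the line in the diff (the name is illustrative), and its output matches pd.Timestamp.timestamp(); a float32 intermediate would only carry about 7 significant digits, i.e. errors on the order of minutes at current epochs:

```python
import numpy as np
import pandas as pd

def posix_timestamps(datetimes: np.ndarray) -> np.ndarray:
    # float64 keeps sub-microsecond accuracy for ns-since-epoch values
    return datetimes.astype("datetime64[ns]").astype(np.float64) * 1e-9

times = pd.date_range("2024-06-01 12:34:56", periods=3, freq="s")
expected = np.array([t.timestamp() for t in times])
```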
  from xarray_tensorstore import read as xtr_read

  def minutes(minutes: int | list[float]) -> pd.Timedelta | pd.TimedeltaIndex:
This was just moved to time_utils.py
  if "generation" in dataset_dict:
      da_generation = dataset_dict["generation"]
-     da_generation = da_generation / da_generation.capacity_mwp.values
The normalisation here was hard to emulate with the new generation data structure, where the generation and capacity are in the same array but on different indices: it would be hard to normalise the generation values without also normalising the capacity values. I moved the normalisation into the convert_generation_to_numpy_sample() function instead.
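With the stacked layout, normalising just the generation slice looks roughly like this inside the conversion step (a numpy sketch; the gen_param index order is an assumption):

```python
import numpy as np

# stacked: (time_utc, location_id, gen_param) with
# gen_param = ["generation_mw", "capacity_mwp"] (assumed order)
generation = np.full((5, 2), 50.0)
capacity = np.broadcast_to(np.array([100.0, 200.0]), (5, 2))
stacked = np.stack([generation, capacity], axis=-1)

# Normalise only the generation slice; dividing the whole array by the
# capacities would rescale the capacity slice too
generation_norm = stacked[..., 0] / stacked[..., 1]
```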
      da: Xarray DataArray containing generation data
      t0_idx: Index of the t0 timestamp in the time dimension of the generation data
      """
+     generation_values = da.sel(gen_param="generation_mw").values
I moved the normalisation in here instead of the main body of the process_and_combine_datasets() method. See the comment on that method for the reasoning.
Force-pushed 8b78420 to 21c5289
Pull Request
Description
There are two main changes in this PR which combine to speed up the data-sampler. These changes were motivated by running code profiling with pyinstrument.

1. Getting rid of pandas timestamps

When profiling the code I found that we were spending a silly amount of time doing things with timestamps. This is just because the pandas Timestamp, DatetimeIndex and date_range() are slow. I have moved us to use numpy datetime64 objects everywhere to handle datetimes. The pandas timestamp machinery is built on top of numpy datetime64 anyway, so we have just removed a layer of overhead. Unfortunately this costs us some simplicity in the code: pandas Timestamps have those nice .hour, .day_of_year and .ceil() attributes and methods, which numpy does not. I have added a new file, time_utils.py, with a bunch of helper functions to make working with numpy datetime64 objects easier and to replace some of the utility we were getting from pandas. I've added tests for these utility functions which check them against the pandas equivalents.
2. Replacing the xarray DataArray

This is the biggest part of the speedup. In profiling, we were spending a lot of time in xarray's internal methods which we don't really need. I've added a new class, LightDataArray, which replaces the xarray DataArray in the code path under _get_sample(). This means we can use xarray for the initialisation of the torch Dataset (where the speed of xarray doesn't matter) but use the new class when we are slicing out samples. We do not need all of xarray's functionality when slicing these samples, so we can do it faster with the custom class.

I have based the custom class on the xarray DataArray class. For example, I've created isel(), sel() and load() methods, and it has a .values property. This means we get away with fewer code changes here, and it should be pretty trivial to switch back to xarray if we want to. In the current code I have left using the new LightDataArray class as an option when calling _get_sample(), and it works pulling from either xr.DataArrays or LightDataArrays. This has the benefit of making it easy to test.

The LightDataArray object has a from_xarray() method which instantiates a LightDataArray from an xr.DataArray, and a to_xarray() method which instantiates an xr.DataArray from a LightDataArray instance. This makes it easy to test that the methods of LightDataArray match the xarray methods, so we use the xarray behaviour as the expected behaviour for our tests.
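A minimal sketch of the idea, using only numpy so it is self-contained (illustrative only; the real LightDataArray has more methods plus the from_xarray()/to_xarray() conversions):

```python
import numpy as np

class LightDataArraySketch:
    """Plain numpy data plus 1-D coords, mimicking the xarray surface we use."""

    def __init__(self, data, dims, coords):
        self.data = data
        self.dims = tuple(dims)
        self.coords = {name: np.asarray(c) for name, c in coords.items()}

    @property
    def values(self):
        return self.data

    def isel(self, **indexers):
        # slice-only indexing for brevity; the real class handles more cases
        key = tuple(indexers.get(d, slice(None)) for d in self.dims)
        coords = {d: (c[indexers[d]] if d in indexers else c)
                  for d, c in self.coords.items()}
        return LightDataArraySketch(self.data[key], self.dims, coords)

da = LightDataArraySketch(
    np.arange(12).reshape(3, 4),
    dims=("time_utc", "location_id"),
    coords={
        "time_utc": np.array(["2024-01-01", "2024-01-02", "2024-01-03"],
                             dtype="datetime64[D]"),
        "location_id": np.arange(4),
    },
)
sub = da.isel(time_utc=slice(0, 2))
```

In the tests, the same operation would be run through xarray and through this class, and the results compared.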
Checklist: