
Performance: speeding up pickling of cftime arrays #253

@aulemahal

Description

I used cftime 1.4.1 and 1.5.0 when exploring this.

My workflows involve large datasets and complex functions. I use xarray, backed by dask. In one of the more complex processing steps, I use xarray's map_blocks and a handful of other dask-lazy methods on a large dataset that uses the NoLeap calendar: 950 chunks and a 55114-element time coordinate. It seems a lot of time is spent pickling the latter.

More precisely, this line of dask: https://github.com/dask/dask/blob/1c4a84225d1bd26e58d716d2844190cc23ebcfec/dask/base.py#L1028 calls pickle.dumps on the object-dtype numpy array that stores the cftime.datetime objects.
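A minimal sketch of how I understand that path (assuming tokenize's fallback for object-dtype arrays is what ends up pickling the coordinate; the calendar and date range below are just examples):

import numpy as np
import xarray as xr
from dask.base import tokenize

# Object-dtype arrays have no fast hashing path, so tokenize()
# falls back to pickling the array to get deterministic bytes.
time = xr.cftime_range('1950-01-01', '2100-01-01', freq='D', calendar='noleap')
token = tokenize(np.asarray(time))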

When profiling the graph creation (no computation triggered yet), I can see that this pickling step is the one that takes the most time, slightly more than another function involved in xarray's CFTimeIndex creation.
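To show what I mean (a hypothetical, scaled-down stand-in for the real workflow; the identity function and chunk size are arbitrary):

import cProfile
import numpy as np
import xarray as xr

# Build a small cftime-indexed dataset and profile only the graph
# construction of map_blocks; nothing is computed here.
time = xr.cftime_range('1950-01-01', '2100-01-01', freq='D', calendar='noleap')
ds = xr.Dataset({'x': ('time', np.arange(time.size))}, coords={'time': time}).chunk({'time': 100})
cProfile.run('ds.map_blocks(lambda d: d)', sort='cumtime')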

MWE:

import pickle
import numpy as np
import pandas as pd
import xarray as xr


cft = xr.cftime_range('1950-01-01', '2100-01-01', freq='D')  # CFTimeIndex backed by an object array of cftime datetimes
npy = pd.date_range('1950-01-01', '2100-01-01', freq='D')  # same shape, but numpy datetime64 under the hood
oar = np.array([1] * npy.size, dtype='O')  # sanity check: object-dtype array with a builtin element type

timeit calls in a notebook:

[Screenshot (2021-08-10): %timeit results for pickling each of the three arrays]
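The calls were along these lines (a sketch; the exact timings are only in the screenshot, the relative ordering is described below):

%timeit pickle.dumps(np.array(cft))  # object array of cftime datetimes: slowest by far
%timeit pickle.dumps(np.array(npy))  # datetime64 array: fast
%timeit pickle.dumps(oar)            # object array of builtin ints: slow, but far less so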

So even if it is normal that pickling an object array is slower, the cftime array is still two orders of magnitude slower than a basic object array. I am not very knowledgeable about how pickle works, but I believe something could be done to speed this up.
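For what it's worth, encoding the dates to numbers before pickling is much cheaper on my side. This is only a user-side workaround sketch, not necessarily the right fix inside cftime; the units string and calendar are arbitrary choices:

import pickle
import numpy as np
import cftime

units, cal = 'days since 1950-01-01', 'noleap'
times = cftime.num2date(np.arange(55114), units, calendar=cal)  # object array of cftime datetimes

# Encode to a numeric array before pickling; decode after unpickling.
payload = pickle.dumps((cftime.date2num(times, units, calendar=cal), units, cal))
nums, units2, cal2 = pickle.loads(payload)
times2 = cftime.num2date(nums, units2, calendar=cal2)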

Any ideas?
