Description
I used cftime 1.4.1 and 1.5.0 when exploring this.
My worklfows involve large datasets and complex functions. I use xarray, backed-up by dask. In one of the more complex processing, I use xarray's map_blocks and a handful of other dask-lazy methods on a large dataset that uses the NoLeap calendar. The dataset is large with 950 chunks and a 55114-element time coordinate. It seems a lot of time is spent in pickling the latter.
More precisely, this line of dask: https://github.com/dask/dask/blob/1c4a84225d1bd26e58d716d2844190cc23ebcfec/dask/base.py#L1028 calls pickle.dumps on the object-dtype numpy array that stores the cftime datetime objects.
When profiling the graph creation (no computation triggered yet), I can see that this step is the one that takes the most time, slightly more than another function in xarray's CFTimeIndex creation.
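To reproduce that code path in isolation (a minimal sketch, assuming dask is installed; tokenize is the public entry point that ends up on the linked line for object-dtype arrays):

from dask.base import tokenize
import numpy as np
import xarray as xr

cft = np.array(xr.cftime_range('1950-01-01', '2100-01-01', freq='D'))
# tokenize() on an object-dtype array falls back to hashing the pickled
# bytes of the whole array, so its cost is dominated by pickle.dumps
token = tokenize(cft)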
MWE:
import pickle
import numpy as np
import pandas as pd
import xarray as xr
cft = xr.cftime_range('1950-01-01', '2100-01-01', freq='D') # cftime array
npy = pd.date_range('1950-01-01', '2100-01-01', freq='D') # same shape but numpy's datetime objects
oar = np.array([1] * npy.size, dtype='O')  # sanity check: normal array of object dtype, but with builtin element type

So even if it is normal that pickling an object array is slower, the cftime array is still two orders of magnitude slower than a basic object array. I am not very knowledgeable in how pickle works, but I believe something could be done to speed this up.
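For reference, a minimal way to time the three cases, continuing from the MWE above (the timeit loop is my own scaffolding; absolute numbers will vary by machine):

import pickle
import timeit

for name, arr in [('cftime', cft), ('datetime64', npy), ('object ints', oar)]:
    # average pickling time over a few repetitions
    t = timeit.timeit(lambda: pickle.dumps(arr), number=5)
    print(f'{name}: {t / 5:.4f} s per pickle.dumps')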
Any ideas?
