First invocation of `open_dataset` takes 3 seconds due to backend entrypoint discovery being slow #10178
Comments
@keewis suggested just now that passing the instance of the
Does
Can confirm that is significantly faster (0.661 s)
no, because we first assemble a (cached) mapping of names to valid entrypoints.
for reference, how long this takes depends on your environment: if you have a lot of packages installed this will take a long time. In my own local env this is pretty much instantaneous.
Do we have a fast path if `engine` is provided and is one of the built-in engines?
not at the moment, no. If I remember correctly, the idea was to use the same code path for all backend engines.
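The fast path being discussed could look roughly like this. All names below (`BUILTIN_ENGINES`, `resolve_engine`) are illustrative, not xarray's actual implementation; the set of built-in engine names is taken from the `BACKEND_ENTRYPOINTS` dict shown later in this thread:

```python
# Sketch of a fast path: if `engine` names a known built-in backend,
# skip the full entrypoint scan and import only that one backend.
BUILTIN_ENGINES = {"store", "netcdf4", "h5netcdf", "pydap", "scipy", "zarr"}

def resolve_engine(engine=None):
    """Return which code path would handle the requested engine."""
    if engine in BUILTIN_ENGINES:
        # fast path: lazily import just the one requested backend module
        return ("builtin", engine)
    # slow path: discover and load every installed entrypoint, then
    # guess (or look up) the matching backend
    return ("entrypoint-scan", engine)

print(resolve_engine("zarr"))        # fast path
print(resolve_engine("my_backend"))  # falls back to full discovery
```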
Would it be as simple as inspecting the global `BACKEND_ENTRYPOINTS`?

```python
In [1]: import xarray as xr

In [2]: xr.backends.common.BACKEND_ENTRYPOINTS
Out[2]:
{'store': (None, xarray.backends.store.StoreBackendEntrypoint),
 'netcdf4': ('netCDF4', xarray.backends.netCDF4_.NetCDF4BackendEntrypoint),
 'h5netcdf': ('h5netcdf', xarray.backends.h5netcdf_.H5netcdfBackendEntrypoint),
 'pydap': ('pydap', xarray.backends.pydap_.PydapBackendEntrypoint),
 'scipy': ('scipy', xarray.backends.scipy_.ScipyBackendEntrypoint),
 'zarr': ('zarr', xarray.backends.zarr.ZarrBackendEntrypoint)}
```
this should be possible, yes. We'd have to figure out how to provide good error messages in case one of the backend dependencies can't be imported, hopefully without duplicating the error reporting machinery for the entrypoint-based backends. Another potential option would be to populate the list of available entrypoints with all installed entrypoints without performing the import (so without calling `entrypoint.load()`).

Edit: I guess that only works if we don't need to guess, because for that all entrypoints need to be loaded.
What happened?
The first time I open an xarray dataset (any dataset), it takes around 3 seconds. Any subsequent invocation of `open_dataset` is much faster.
Here's an example and a profile file.
4938479 function calls (4808880 primitive calls) in 3.812 seconds
open_dataset.prof.zip
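A profile like the attached one can be generated with the standard-library `cProfile`/`pstats` modules. The sketch below profiles a stand-in function so it is self-contained; in the real reproduction, the profiled call would be `xr.open_dataset(...)` on some file:

```python
# Profile a function and print the summary line, mirroring how the
# attached open_dataset.prof was produced.
import cProfile
import io
import pstats

def first_open():
    # stand-in for: xr.open_dataset("<some file>")
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
first_open()
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(out.getvalue().splitlines()[0].strip())
```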
The bulk of the time is spent in this function:

xarray/xarray/backends/plugins.py, lines 66 to 79 at commit 66f6c17

and specifically in the `entrypoint.load()` line.

What did you expect to happen?
This is an unacceptable overhead for low-latency applications, e.g. a serverless application that needs to quickly open a dataset. I expect the load time to be in ms for data on disk.
Minimal Complete Verifiable Example
MVCE confirmation
Relevant log output
Anything else we need to know?
No response
Environment
INSTALLED VERSIONS
commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2
xarray: 2025.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.6
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.10.0
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None