
First invocation of open_dataset takes 3 seconds due to backend entrypoint discovery being slow #10178


Open · 5 tasks done
rabernat opened this issue Mar 26, 2025 · 9 comments

@rabernat
Contributor

What happened?

The first time I open an Xarray dataset--any dataset--it takes around 3 seconds. Any subsequent invocation of open_dataset is relatively much faster.

Here's an example and a profile file.

import xarray as xr

%%prun -D open_dataset.prof
xr.open_dataset("~/.cache/xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc", engine="netcdf4")
# alternatively, but make sure it's already downloaded:
# xr.tutorial.open_dataset("air_temperature")

4938479 function calls (4808880 primitive calls) in 3.812 seconds

open_dataset.prof.zip

[screenshot of the profile flame graph]

The bulk of the time is spent in this function:

def backends_dict_from_pkg(
    entrypoints: list[EntryPoint],
) -> dict[str, type[BackendEntrypoint]]:
    backend_entrypoints = {}
    for entrypoint in entrypoints:
        name = entrypoint.name
        try:
            backend = entrypoint.load()
            backend_entrypoints[name] = backend
        except Exception as ex:
            warnings.warn(
                f"Engine {name!r} loading failed:\n{ex}", RuntimeWarning, stacklevel=2
            )
    return backend_entrypoints

And specifically, the entrypoint.load() line.

What did you expect to happen?

This is an unacceptable overhead for low-latency applications, e.g. a serverless application that needs to open a dataset quickly. I expect the open time to be on the order of milliseconds for data on local disk.

Minimal Complete Verifiable Example

import xarray as xr
xr.tutorial.open_dataset("air_temperature")

MVCE confirmation

  • Minimal example — the example is as focused as reasonably possible to demonstrate the underlying issue in xarray.
  • Complete example — the example is self-contained, including all data and the text of any traceback.
  • Verifiable example — the example copy & pastes into an IPython prompt or Binder notebook, returning the result.
  • New issue — a search of GitHub Issues suggests this is not a duplicate.
  • Recent environment — the issue occurs with the latest version of xarray and its dependencies.

Relevant log output

Anything else we need to know?

No response

Environment

INSTALLED VERSIONS

commit: None
python: 3.12.4 | packaged by conda-forge | (main, Jun 17 2024, 10:23:07) [GCC 12.3.0]
python-bits: 64
OS: Linux
OS-release: 5.10.220-209.869.amzn2.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: ('C', 'UTF-8')
libhdf5: 1.14.3
libnetcdf: 4.9.2

xarray: 2025.3.0
pandas: 2.2.2
numpy: 1.26.4
scipy: 1.13.1
netCDF4: 1.6.5
pydap: installed
h5netcdf: 1.3.0
h5py: 3.11.0
zarr: 3.0.6
cftime: 1.6.4
nc_time_axis: 1.4.1
iris: None
bottleneck: 1.4.0
dask: 2024.6.2
distributed: 2024.6.2
matplotlib: 3.8.4
cartopy: 0.23.0
seaborn: 0.13.2
numbagg: 0.8.1
fsspec: 2024.6.0
cupy: None
pint: 0.23
sparse: 0.15.4
flox: 0.10.0
numpy_groupies: 0.11.1
setuptools: 70.1.0
pip: 24.0
conda: None
pytest: 8.2.2
mypy: None
IPython: 8.25.0
sphinx: None

@rabernat rabernat added bug needs triage Issue that has not been reviewed by xarray team member labels Mar 26, 2025
@TomNicholas TomNicholas added topic-backends topic-performance and removed needs triage Issue that has not been reviewed by xarray team member labels Mar 26, 2025
@TomNicholas
Member

@keewis suggested just now that passing the instance of the BackendEntrypoint class explicitly to open_dataset might be faster. That should hopefully avoid importlib having to search through all the metadata of all the packages installed in your environment in order to find the correct entrypoint. If that is faster, that would suggest a fastpath we could add (at least for the backends that ship with xarray).

@rabernat
Contributor Author

Passing the instance of the BackendEntrypoint class explicitly to open_dataset might be faster.

Does engine="netcdf4" not do this already?

@rabernat
Contributor Author

Can confirm that

xr.open_dataset(
    "~/.cache/xarray_tutorial_data/69c68be1605878a6c8efdd34d85b4ca1-air_temperature.nc",
    engine=xr.backends.NetCDF4BackendEntrypoint
)

is significantly faster (0.661 s).

@keewis
Collaborator

keewis commented Mar 26, 2025

Does engine="netcdf4" not do this already?

no, because we first assemble a (cached) mapping of names to valid entrypoints

@keewis
Collaborator

keewis commented Mar 26, 2025

for reference, how long this takes depends on your environment: if you have a lot of packages installed this will take a long time. In my own local env this is pretty much instantaneous:

In [1]: %load_ext pyinstrument
   ...: import xarray as xr

In [2]: %%pyinstrument
   ...: _ = xr.tutorial.open_dataset("air_temperature")
   ...: 
   ...: 

  _     ._   __/__   _ _  _  _ _/_   Recorded: 17:14:55  Samples:  164
 /_//_/// /_\ / //_// / //_'/ //     Duration: 0.364     CPU time: 0.364
/   _/                      v5.0.1

Cell [2]

0.362 <module>  <ipython-input-2-cc9cf5075b8f>:1
`- 0.362 open_dataset  xarray/tutorial.py:83
   |- 0.357 open_dataset  xarray/backends/api.py:479
   |  |- 0.343 NetCDF4BackendEntrypoint.open_dataset  xarray/backends/netCDF4_.py:644
   |  |  |- 0.334 StoreBackendEntrypoint.open_dataset  xarray/backends/store.py:30
   |  |  |  |- 0.258 decode_cf_variables  xarray/conventions.py:345
   |  |  |  |  `- 0.258 decode_cf_variable  xarray/conventions.py:109
   |  |  |  |     `- 0.258 CFDatetimeCoder.decode  xarray/coding/times.py:1352
   |  |  |  |        `- 0.258 _decode_cf_datetime_dtype  xarray/coding/times.py:301
   |  |  |  |           `- 0.255 last_item  xarray/core/formatting.py:124
   |  |  |  |              `- 0.255 to_numpy  xarray/namedarray/pycompat.py:99
   |  |  |  |                 `- 0.255 array_type  xarray/namedarray/pycompat.py:81
   |  |  |  |                    `- 0.255 _get_cached_duck_array_module  xarray/namedarray/pycompat.py:72
   |  |  |  |                       `- 0.255 DuckArrayModule.__init__  xarray/namedarray/pycompat.py:34
   |  |  |  |                          `- 0.255 import_module  importlib/__init__.py:73
   |  |  |  |                                [137 frames hidden]  sparse, numba, importlib, coverage, i...
   |  |  |  `- 0.076 Dataset.__init__  xarray/core/dataset.py:419
   |  |  |     `- 0.076 merge_data_and_coords  xarray/core/merge.py:1068
   |  |  |        `- 0.076 merge_core  xarray/core/merge.py:632
   |  |  |           `- 0.076 collect_variables_and_indexes  xarray/core/merge.py:312
   |  |  |              `- 0.076 create_default_index_implicit  xarray/core/indexes.py:1504
   |  |  |                 `- 0.076 PandasIndex.from_variables  xarray/core/indexes.py:615
   |  |  |                    `- 0.076 PandasIndex.__init__  xarray/core/indexes.py:581
   |  |  |                       `- 0.076 safe_cast_to_index  xarray/core/indexes.py:434
   |  |  |                          `- 0.076 to_numpy  xarray/namedarray/pycompat.py:99
   |  |  |                             `- 0.076 is_chunked_array  xarray/namedarray/pycompat.py:91
   |  |  |                                `- 0.076 is_duck_dask_array  xarray/namedarray/utils.py:89
   |  |  |                                   `- 0.076 is_dask_collection  xarray/namedarray/utils.py:63
   |  |  |                                      `- 0.074 <module>  dask/__init__.py:1
   |  |  |                                            [63 frames hidden]  dask, importlib, psutil, jinja2, re, ...
   |  |  `- 0.009 NetCDF4DataStore.open  xarray/backends/netCDF4_.py:399
   |  |     |- 0.005 <module>  netCDF4/__init__.py:1
   |  |     `- 0.004 NetCDF4DataStore.__init__  xarray/backends/netCDF4_.py:373
   |  |        `- 0.004 NetCDF4DataStore.ds  xarray/backends/netCDF4_.py:459
   |  |           `- 0.004 NetCDF4DataStore._acquire  xarray/backends/netCDF4_.py:454
   |  |              `- 0.004 _GeneratorContextManager.__enter__  contextlib.py:132
   |  |                 `- 0.004 CachingFileManager.acquire_context  xarray/backends/file_manager.py:196
   |  |                    `- 0.004 CachingFileManager._acquire_with_cache_info  xarray/backends/file_manager.py:207
   |  `- 0.014 guess_engine  xarray/backends/plugins.py:140
   |     `- 0.014 list_engines  xarray/backends/plugins.py:117
   |        `- 0.014 entry_points  importlib/metadata/__init__.py:901
   |              [6 frames hidden]  importlib
   `- 0.004 _check_netcdf_engine_installed  xarray/tutorial.py:56

@dcherian
Contributor

Do we have a fast path if engine is provided and is one of the built-in engines?

@keewis
Collaborator

keewis commented Mar 26, 2025

not at the moment, no. If I remember correctly, the idea was to use the same code path for all backend engines.

@TomNicholas
Member

TomNicholas commented Mar 26, 2025

Would it be as simple as inspecting the global xarray.backends.common.BACKEND_ENTRYPOINTS dict there? That already seems to get populated with each of the built-in BackendEntrypoint classes at import time.

In [1]: import xarray as xr

In [2]: xr.backends.common.BACKEND_ENTRYPOINTS
Out[2]: 
{'store': (None, xarray.backends.store.StoreBackendEntrypoint),
 'netcdf4': ('netCDF4', xarray.backends.netCDF4_.NetCDF4BackendEntrypoint),
 'h5netcdf': ('h5netcdf', xarray.backends.h5netcdf_.H5netcdfBackendEntrypoint),
 'pydap': ('pydap', xarray.backends.pydap_.PydapBackendEntrypoint),
 'scipy': ('scipy', xarray.backends.scipy_.ScipyBackendEntrypoint),
 'zarr': ('zarr', xarray.backends.zarr.ZarrBackendEntrypoint)}
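A fast path along these lines could consult that dict before touching importlib.metadata at all. A rough sketch (resolve_engine and its arguments are hypothetical, not xarray API; the registry values mirror the (module_name, entrypoint_class) tuples shown above):

```python
def resolve_engine(engine, builtin_registry):
    """Return a backend entrypoint instance for `engine`, preferring
    statically registered built-ins over an importlib.metadata scan.

    `builtin_registry` maps names to (module_name, entrypoint_class)
    tuples, mirroring the shape of BACKEND_ENTRYPOINTS above.
    """
    entry = builtin_registry.get(engine)
    if entry is not None:
        _module_name, entrypoint_cls = entry
        return entrypoint_cls()  # no package-metadata scan needed
    # Unknown name: this is where the existing (slow) entrypoint
    # discovery would take over.
    raise KeyError(f"unknown engine {engine!r}")


# Toy registry standing in for BACKEND_ENTRYPOINTS:
class FakeNetCDF4Entrypoint:
    pass


registry = {"netcdf4": ("netCDF4", FakeNetCDF4Entrypoint)}
backend = resolve_engine("netcdf4", registry)
```

The error-reporting question remains: the built-in classes are registered at import time whether or not their dependency (e.g. netCDF4) actually imports, so the fast path still needs a clear failure mode.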

@keewis
Collaborator

keewis commented Mar 27, 2025

this should be possible, yes. We'd have to figure out how to provide good error messages in case one of the backend dependencies can't be imported, hopefully without duplicating the error reporting machinery for the entrypoint-based backends.

Another potential option would be to populate the list of available entrypoints with all installed entrypoints without performing the import (so without calling entrypoint.load(), which it seems is what takes the longest), and only once that particular entrypoint is requested do we perform the import / loading of the entrypoint. That might even make it easier to debug backends that currently are not detected because a dependency fails to import (for example, cfgrib used to fail because of eccodes).

Edit: I guess that only works if we don't need to guess, because for that all entrypoints need to be loaded.
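The lazy-loading idea can be sketched as a mapping that keeps the un-loaded entrypoints and only calls load() on first access (LazyEngineRegistry is hypothetical; `loaders` stands in for the EntryPoint.load callables):

```python
from collections.abc import Mapping


class LazyEngineRegistry(Mapping):
    """Name -> backend mapping that defers entrypoint.load() until an
    engine is first requested, then caches the result.

    Hypothetical sketch: `loaders` maps engine names to zero-argument
    callables standing in for EntryPoint.load.
    """

    def __init__(self, loaders):
        self._loaders = dict(loaders)
        self._cache = {}

    def __getitem__(self, name):
        if name not in self._cache:
            # The expensive import happens here, once per engine.
            self._cache[name] = self._loaders[name]()
        return self._cache[name]

    def __iter__(self):
        # Names are known from metadata alone; no imports needed.
        return iter(self._loaders)

    def __len__(self):
        return len(self._loaders)


loaders = {"netcdf4": lambda: "NetCDF4BackendEntrypoint (loaded)"}
registry = LazyEngineRegistry(loaders)
names = list(registry)         # triggers no load
backend = registry["netcdf4"]  # load happens here and is cached
```

As the edit above notes, guessing an engine still forces every backend to load, so this only helps when engine is given explicitly.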


4 participants