Description
The changes from intake_esm 2025.2.3 to 2025.7.9 seem to have degraded performance for our use cases. I suspect this is related to the new polars backend, but I'm not totally sure.
In what follows I am comparing the "previous" environment:

```
intake-esm    2025.2.3   pyhd8ed1ab_1        conda-forge
pandas        2.3.1      py313h08cd8bf_0     conda-forge
pyarrow       21.0.0     py313h78bf25f_0     conda-forge
```

with the "new" environment:

```
intake_esm @ main
pandas        2.3.3      py313h08cd8bf_1     conda-forge
polars        1.32.3     default_h3512890_0  conda-forge
pyarrow       21.0.0     py313h78bf25f_0     conda-forge
```

Python 3.13.5 in both cases.
I use a medium catalog with 159,277 rows (shared below) and a larger one with 3,472,904 rows (too big to share).
I see two areas where the regression is significant.
RAM consumption
Using memory_profiler, I did:

```python
import intake

@profile
def func():
    cat = intake.open_esm_datastore('simulation.json')
    scat = cat.search(variable='tasmin')
    srcs = scat.unique()
```

The profile for "previous" gives:
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    183.2 MiB    183.2 MiB           1   @profile
    51                                         def func(kwargs):
    52    297.9 MiB    114.6 MiB           1       cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json', read_csv_kwargs=kwargs)
    53    299.3 MiB      1.5 MiB           1       scat = cat.search(variable='tasmin')
    54    299.7 MiB      0.4 MiB           1       srcs = scat.unique()
```
And for "new":
```
Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    221.9 MiB    221.9 MiB           1   @profile
    51                                         def func(kwargs):
    52    229.0 MiB      7.1 MiB           1       cat = intake.open_esm_datastore('simulation.json')
    53    695.3 MiB    466.3 MiB           1       scat = cat.search(variable='tasmin')
    54    703.3 MiB      8.0 MiB           1       srcs = scat.unique()
```
As expected, the polars backend is lazy and RAM is filled only at the first search. However, it uses more than twice the RAM! As no converters or schemas are passed, everything is a string here.
This example is with the medium-sized catalog. With the huge one, I get 1500 vs 9000 MB.
Walltime
The new polars backend makes the "opening" phase much faster, which is not surprising: fast CSV reading is what polars is good at, and the "lazy frame" feature defers the actual read entirely.
However, once I start searching the catalog, the new intake_esm seems slower than before. Using the same code as above, but this time in an IPython cell so I can use %timeit:
"previous":
In [5]: import intake
In [6]: cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')
In [7]: %timeit cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')
431 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit cat.search(variable='tasmin')
16.3 ms ± 394 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
and "new":
In [1]: import intake
In [2]: cat = intake.open_esm_datastore('simulation.json')
In [3]: %timeit cat = intake.open_esm_datastore('simulation.json')
326 μs ± 9.58 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# To avoid effects of the first search ?
In [13]: cat.search(variable='tasmin')
Out[13]: <IPython.core.display.HTML object>
In [14]: %timeit cat.search(variable='tasmin')
79.5 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Again, I used the medium catalog in these examples because it is small enough to share. With my large catalog, however, I get:
- Previous: open = 11 s, search = 120 ms
- New: open = 360 ms, search = 1.05 s

Clearly, the new intake_esm has helped a lot with opening! But searching is much slower...
Thanks for all the work on this package! We use it extensively at Ouranos. I hope this performance report can help the development. Tell me if I can add anything to the analysis!