
Significant RAM and wall time increase with intake_esm 2025.7.9 #753

@aulemahal

Description

The changes from intake_esm 2025.2.3 to 2025.7.9 seem to have degraded performance for our use cases. I suspect this is related to the new polars backend, but I'm not totally sure.

In what follows I am comparing the "previous" environment with

intake-esm                2025.2.3           pyhd8ed1ab_1    conda-forge
pandas                    2.3.1           py313h08cd8bf_0    conda-forge
pyarrow                   21.0.0          py313h78bf25f_0    conda-forge

with the "new" environment:

intake_esm @ main
pandas                    2.3.3           py313h08cd8bf_1    conda-forge
polars                    1.32.3          default_h3512890_0    conda-forge
pyarrow                   21.0.0          py313h78bf25f_0    conda-forge

Python 3.13.5 in both cases.

I use a medium catalog with 159,277 rows (shared below) and a larger one with 3,472,904 rows (too big to share).

medium-catalog.zip

I see two areas where the regression is significant.

RAM consumption

Using memory_profiler, I did:

import intake
from memory_profiler import profile

@profile
def func():
    cat = intake.open_esm_datastore('simulation.json')
    scat = cat.search(variable='tasmin')
    srcs = scat.unique()

func()

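The line-by-line reports below come from running that script through memory_profiler, roughly like this (the script name is only a placeholder):

python -m memory_profiler profile_catalog.py
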
The profile for "previous" gives:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    183.2 MiB    183.2 MiB           1   @profile
    51                                         def func(kwargs):
    52    297.9 MiB    114.6 MiB           1   	cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json', read_csv_kwargs=kwargs)
    53    299.3 MiB      1.5 MiB           1   	scat = cat.search(variable='tasmin')
    54    299.7 MiB      0.4 MiB           1   	srcs = scat.unique()

And for "new":

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    221.9 MiB    221.9 MiB           1   @profile
    51                                         def func(kwargs):
    52    229.0 MiB      7.1 MiB           1   	cat = intake.open_esm_datastore('simulation.json')
    53    695.3 MiB    466.3 MiB           1   	scat = cat.search(variable='tasmin')
    54    703.3 MiB      8.0 MiB           1   	srcs = scat.unique()

As expected, the polars backend is lazy and RAM is only allocated at the first search. However, it uses more than twice the RAM! Since no converters or schemas are passed, every column is a plain string here.
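
For context, by "no converters or schemas" I mean that the datastore is opened without any read_csv_kwargs, so nothing constrains the column types. With the pandas backend that keyword is where I would pass dtypes; a rough sketch (column names and dtypes are purely illustrative, and I'm not sure how this maps onto the new polars backend):

import intake

# Hypothetical: force some repetitive string columns to categorical
# to shrink the in-memory footprint. Column names are made up.
kwargs = {"dtype": {"variable": "category", "frequency": "category"}}
cat = intake.open_esm_datastore("simulation.json", read_csv_kwargs=kwargs)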

This example uses the medium-sized catalog. With the huge one, I get roughly 1500 MB vs 9000 MB.

Wall time

The new polars backend makes the "opening" phase much faster, which is not surprising: this is what polars is good at, and the backend now builds a lazy frame.
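
My rough mental model of the lazy behaviour (a generic polars sketch, not intake_esm's actual internals):

import polars as pl

# Scanning is cheap: nothing is read from disk yet.
lf = pl.scan_csv("simulation.csv")

# The file is only parsed when the query is collected, so the cost
# shows up at the first search instead of at open time.
df = lf.filter(pl.col("variable") == "tasmin").collect()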

However, once I start searching the catalog, the new intake_esm seems slower than before. Using the same code as above, but this time in an IPython cell so I can use timeit:

"previous":

In [5]: import intake

In [6]: cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')

In [7]: %timeit cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')
431 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit cat.search(variable='tasmin')
16.3 ms ± 394 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and "new":

In [1]: import intake

In [2]: cat = intake.open_esm_datastore('simulation.json')

In [3]: %timeit cat = intake.open_esm_datastore('simulation.json')
326 μs ± 9.58 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# To avoid effects of the first search?
In [13]: cat.search(variable='tasmin')
Out[13]: <IPython.core.display.HTML object>

In [14]: %timeit cat.search(variable='tasmin')
79.5 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Again, I used the medium catalog in these examples because it is small enough to share. However, with my large catalog, I get:

  • Previous: open = 11 s, search = 120 ms
  • New: open = 360 ms, search = 1.05 s

Clearly, the new intake_esm has helped a lot with opening! But the search is much slower...

Thanks for all the work on this package! We use it extensively at Ouranos. I hope this performance report helps with development. Let me know if I can add anything to the analysis!
