
Significant RAM and wall time increase with intake_esm 2025.7.9 #753

@aulemahal

Description

The changes from intake_esm 2025.2.3 to 2025.7.9 seem to have degraded performance for our use cases. I suspect this is related to the new polars backend, but I'm not totally sure.

In what follows I am comparing the "previous" environment with

intake-esm                2025.2.3           pyhd8ed1ab_1    conda-forge
pandas                    2.3.1           py313h08cd8bf_0    conda-forge
pyarrow                   21.0.0          py313h78bf25f_0    conda-forge

with the "new" environment:

intake_esm @ main
pandas                    2.3.3           py313h08cd8bf_1    conda-forge
polars                    1.32.3          default_h3512890_0    conda-forge
pyarrow                   21.0.0          py313h78bf25f_0    conda-forge

Python 3.13.5 in both cases.

I use a medium catalog with 159,277 rows (shared below) and a larger one with 3,472,904 rows (too big to share).

medium-catalog.zip

I see two areas where the regression is significant.

RAM consumption

Using memory_profiler, I did:

import intake
from memory_profiler import profile

@profile
def func():
    cat = intake.open_esm_datastore('simulation.json')
    scat = cat.search(variable='tasmin')
    srcs = scat.unique()

func()

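The line-by-line reports below come from running that script through memory_profiler, roughly like this (the script name is only a placeholder):

python -m memory_profiler profile_catalog.py
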
The profile for "previous" gives:

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    183.2 MiB    183.2 MiB           1   @profile
    51                                         def func(kwargs):
    52    297.9 MiB    114.6 MiB           1   	cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json', read_csv_kwargs=kwargs)
    53    299.3 MiB      1.5 MiB           1   	scat = cat.search(variable='tasmin')
    54    299.7 MiB      0.4 MiB           1   	srcs = scat.unique()

And for "new":

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    50    221.9 MiB    221.9 MiB           1   @profile
    51                                         def func(kwargs):
    52    229.0 MiB      7.1 MiB           1   	cat = intake.open_esm_datastore('simulation.json')
    53    695.3 MiB    466.3 MiB           1   	scat = cat.search(variable='tasmin')
    54    703.3 MiB      8.0 MiB           1   	srcs = scat.unique()

As expected, the polars backend is lazy and RAM is only allocated at the first search. However, it uses more than twice the RAM! Since no converters or schemas are passed, every column is a plain string here.
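
For context, by "no converters or schemas" I mean that the datastore is opened without any read_csv_kwargs, so nothing constrains the column types. With the pandas backend that keyword is where I would pass dtypes; a rough sketch (column names and dtypes are purely illustrative, and I'm not sure how this maps onto the new polars backend):

import intake

# Hypothetical: force some repetitive string columns to categorical
# to shrink the in-memory footprint. Column names are made up.
kwargs = {"dtype": {"variable": "category", "frequency": "category"}}
cat = intake.open_esm_datastore("simulation.json", read_csv_kwargs=kwargs)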

This example uses the medium-sized catalog. With the huge one, I get roughly 1500 MB vs 9000 MB.

Wall time

The new polars backend makes the "opening" phase much faster, which is not surprising: this is what polars is good at, and the backend now builds a lazy frame.
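
My rough mental model of the lazy behaviour (a generic polars sketch, not intake_esm's actual internals):

import polars as pl

# Scanning is cheap: nothing is read from disk yet.
lf = pl.scan_csv("simulation.csv")

# The file is only parsed when the query is collected, so the cost
# shows up at the first search instead of at open time.
df = lf.filter(pl.col("variable") == "tasmin").collect()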

However, once I start searching the catalog, the new intake_esm seems slower than before. Using the same code as above, but this time in an IPython cell so I can use timeit:

"previous":

In [5]: import intake

In [6]: cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')

In [7]: %timeit cat = intake.open_esm_datastore('/baril/scenario/catalogues/simulation.json')
431 ms ± 4.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [8]: %timeit cat.search(variable='tasmin')
16.3 ms ± 394 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)

and "new":

In [1]: import intake

In [2]: cat = intake.open_esm_datastore('simulation.json')

In [3]: %timeit cat = intake.open_esm_datastore('simulation.json')
326 μs ± 9.58 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

# To avoid effects of the first search?
In [13]: cat.search(variable='tasmin')
Out[13]: <IPython.core.display.HTML object>

In [14]: %timeit cat.search(variable='tasmin')
79.5 ms ± 4.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Again, I used the medium catalog in these examples because it is small enough to share. However, with my large catalog, I get:

  • Previous: open = 11 s, search = 120 ms
  • New: open = 360 ms, search = 1.05 s

Clearly, the new intake_esm has helped a lot with opening! But the search is much slower...

Thanks for all the work on this package! We use it extensively at Ouranos. I hope this performance report helps with development. Let me know if I can add anything to the analysis!
