Serialising a File will also serialise the cache which can grow very large #1747

Open · phofl opened this issue Nov 6, 2024 · 5 comments

phofl commented Nov 6, 2024

We've run into this when using Xarray together with Dask. The default way of opening these datasets currently looks like this:

import s3fs
import xarray as xr

# Anonymous access assumed for this public bucket
fs = s3fs.S3FileSystem(anon=True)

model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list = [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here; each file object carries its own read cache
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

The files are accessed initially to extract the metadata before they are serialised. That initial access populates the cache with a lot of data.

This leaves a very large cache on 4 of the 130 files, which is pretty bad when serialising them.
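
For illustration, here is a rough way to observe the effect (a sketch only: it assumes anonymous access to the public bucket and relies on the default pickling behaviour described above; exact sizes depend on block size and cache type):

import pickle
import s3fs

fs = s3fs.S3FileSystem(anon=True)
path = fs.glob(data_dir)[0]           # data_dir as in the snippet above

f = fs.open(path)
before = len(pickle.dumps(f))         # freshly opened, cache still empty
f.read(16 * 2**20)                    # reading populates the in-memory cache
after = len(pickle.dumps(f))          # the pickled payload now carries those bytes
print(before, after)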

Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it simply something that hasn't been a problem so far?

Ideally, we could purge the cache before things are serialised, in fsspec / the libraries that inherit from it. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.

martindurant (Member) commented

You are right, AbstractBufferedFile could do with a __reduce__ method to control what gets serialised. Certainly the cache should not be serialised, and indeed probably the file should be re-opened from fresh on each deserialise (perhaps with a seek to the correct place). Write-mode files should not be serialisable at all.
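
A rough sketch of the shape such a __reduce__ could take (illustrative only, not the actual fsspec implementation; it uses attributes AbstractBufferedFile already exposes: fs, path, mode, blocksize and loc):

def _reopen(fs, path, mode, blocksize, loc):
    # Runs on deserialisation: open the file afresh and restore the position;
    # nothing from the old cache travels with the pickle.
    f = fs.open(path, mode=mode, block_size=blocksize)
    f.seek(loc)
    return f

def _buffered_file_reduce(self):
    # Candidate __reduce__ for AbstractBufferedFile (sketch): refuse write mode,
    # otherwise describe the file by filesystem, path, mode, block size and position only.
    if "r" not in self.mode:
        raise TypeError("write-mode fsspec files cannot be serialised")
    return _reopen, (self.fs, self.path, self.mode, self.blocksize, self.loc)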

Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state - these are what should really be passed around.
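
For example (reusing the glob pattern from the snippet above; anonymous access assumed):

import fsspec

# OpenFile objects are lightweight, picklable "recipes" for opening a path;
# the real S3File (and its cache) only exists inside the context manager.
ofs = fsspec.open_files(
    "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/hurs/*.nc",
    mode="rb",
    anon=True,
)
with ofs[0] as f:          # the file object is created only inside the context
    header = f.read(8)     # e.g. the HDF5 magic bytes of a netCDF4 file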

Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

phofl (Author) commented Nov 12, 2024

> Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

No, not the xarray objects themselves. Xarray extracts some metadata for the netCDF files using the opened file object, and then that object is put into the graph (this happens inside open_mfdataset, fwiw). That is what populates the cache. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
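
A rough way to check which of the file objects ended up with a populated cache (a sketch; it assumes the read-ahead/bytes cache types, whose buffered bytes live on a .cache attribute of the cache object):

for f in files:
    cached = getattr(f.cache, "cache", b"")   # bytes currently held by the cache
    print(f.path, len(cached))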

dcherian commented

> OpenFile instances

These don't appear to fit the open_mfdataset pattern in the OP.

martindurant (Member) commented

Would someone like to write a reasonable reduce function for AbstractBufferedFile?

phofl (Author) commented Nov 12, 2024

Yep, I'll take a look at this.
