Serialising a File will also serialise the cache which can grow very large #1747

Open · phofl opened this issue Nov 6, 2024 · 5 comments

phofl commented Nov 6, 2024

We've run into this when using Xarray together with Dask. The default way of opening these datasets currently looks like this:

import s3fs
import xarray as xr

# Anonymous access assumed for this public bucket
fs = s3fs.S3FileSystem(anon=True)

model = "ACCESS-CM2"
variable = "hurs"
data_dir = f"s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/{model}/historical/r1i1p1f1/{variable}/*.nc"
file_list = [f"s3://{path}" for path in fs.glob(data_dir)]

# Opening the files here; each file object carries its own read cache
files = [fs.open(f) for f in file_list]

ds = xr.open_mfdataset(
    files,
    engine="h5netcdf",
    combine="nested",
    concat_dim="time",
    data_vars="minimal",
    coords="minimal",
    compat="override",
    parallel=True,
)

The files are accessed initially to extract the metadata before they are serialised. That initial access populates the cache with a lot of data.

This leaves a very large cache on 4 of the 130 files, which is pretty bad when serialising them.
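
For illustration, here is a rough way to observe the effect (a sketch only: it assumes anonymous access to the public bucket and relies on the default pickling behaviour described above; exact sizes depend on block size and cache type):

import pickle
import s3fs

fs = s3fs.S3FileSystem(anon=True)
path = fs.glob(data_dir)[0]           # data_dir as in the snippet above

f = fs.open(path)
before = len(pickle.dumps(f))         # freshly opened, cache still empty
f.read(16 * 2**20)                    # reading populates the in-memory cache
after = len(pickle.dumps(f))          # the pickled payload now carries those bytes
print(before, after)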

Serialising the cache doesn't seem like a great idea in general, and especially not for remote file systems. Was this decision made intentionally, or is it simply something that hasn't been a problem so far?

Ideally, we could purge the cache before things are serialised, in fsspec / the libraries that inherit from it. Is this something you would consider? I'd be happy to put up a PR if there is agreement on this.

martindurant (Member) commented

You are right, AbstractBufferedFile could do with a __reduce__ method to control what gets serialised. Certainly the cache should not be serialised, and indeed probably the file should be re-opened from fresh on each deserialise (perhaps with a seek to the correct place). Write-mode files should not be serialisable at all.
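
A rough sketch of the shape such a __reduce__ could take (illustrative only, not the actual fsspec implementation; it uses attributes AbstractBufferedFile already exposes: fs, path, mode, blocksize and loc):

def _reopen(fs, path, mode, blocksize, loc):
    # Runs on deserialisation: open the file afresh and restore the position;
    # nothing from the old cache travels with the pickle.
    f = fs.open(path, mode=mode, block_size=blocksize)
    f.seek(loc)
    return f

def _buffered_file_reduce(self):
    # Candidate __reduce__ for AbstractBufferedFile (sketch): refuse write mode,
    # otherwise describe the file by filesystem, path, mode, block size and position only.
    if "r" not in self.mode:
        raise TypeError("write-mode fsspec files cannot be serialised")
    return _reopen, (self.fs, self.path, self.mode, self.blocksize, self.loc)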

Note that OpenFile instances are designed exactly to encapsulate this kind of information without caches and other state - these are what should really be passed around.
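
For example (reusing the glob pattern from the snippet above; anonymous access assumed):

import fsspec

# OpenFile objects are lightweight, picklable "recipes" for opening a path;
# the real S3File (and its cache) only exists inside the context manager.
ofs = fsspec.open_files(
    "s3://nex-gddp-cmip6/NEX-GDDP-CMIP6/ACCESS-CM2/historical/r1i1p1f1/hurs/*.nc",
    mode="rb",
    anon=True,
)
with ofs[0] as f:          # the file object is created only inside the context
    header = f.read(8)     # e.g. the HDF5 magic bytes of a netCDF4 file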

Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

phofl (Author) commented Nov 12, 2024

> Additionally, I don't really understand exactly what you are pickling - the xarray object itself? I don't know that such a use is really anticipated.

No, not the xarray objects themselves. Xarray extracts some metadata for the netCDF files using the opened file object, and then that object is put into the graph (this happens inside open_mfdataset, fwiw). That is what populates the cache. I haven't dug through everything in the API yet, so I can't tell you exactly when and where this happens.
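
A rough way to check which of the file objects ended up with a populated cache (a sketch; it assumes the read-ahead/bytes cache types, whose buffered bytes live on a .cache attribute of the cache object):

for f in files:
    cached = getattr(f.cache, "cache", b"")   # bytes currently held by the cache
    print(f.path, len(cached))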

dcherian commented

> OpenFile instances

These don't appear to fit the open_mfdataset pattern in the OP.

martindurant (Member) commented

Would someone like to write a reasonable reduce function for AbstractBufferedFile?

phofl (Author) commented Nov 12, 2024

Yep, I'll take a look at this.
