LazyReference: Passing options to .open of the reference file #1746

Open
wachsylon opened this issue Nov 6, 2024 · 6 comments


@wachsylon

I created references with kerchunk. Because I am short on storage space, I am thinking of simply compressing the referenced files with lz4. My question is: could I leave the references as they are and just pass the compression="infer" option somewhere in the reference file system, so that it is used when opening the referenced file? I guess it would make sense to cache the uncompressed referenced files so that they can be reused if multiple chunks are in one referenced file. Does that work somehow?

If this is not feasible, could we kerchunk files that are compressed with lz4 or zstandard, similar to zip and tar? As far as I understand, cat_ranges is also possible on the zstandard-compressed parquet tables.

If both approaches are possible, what would you recommend?
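
The caching idea from the first paragraph could look roughly like the sketch below. It is only an illustration under assumptions: gzip from the standard library stands in for lz4, and `load_decompressed` and `read_range` are hypothetical helpers, not part of fsspec or kerchunk. The whole file is decompressed once, kept in memory, and the byte offsets stored in the references are then applied to the uncompressed bytes.

```python
import gzip
import os
import tempfile
from functools import lru_cache


@lru_cache(maxsize=4)
def load_decompressed(path: str) -> bytes:
    """Decompress the whole file once and keep the result in memory."""
    with gzip.open(path, "rb") as f:
        return f.read()


def read_range(path: str, offset: int, length: int) -> bytes:
    """Return `length` bytes at `offset` of the *uncompressed* content."""
    return load_decompressed(path)[offset:offset + length]


# Demo data: ten "chunks" of 100 bytes each, written as one compressed file.
original = b"".join(bytes([i]) * 100 for i in range(10))
tmpdir = tempfile.mkdtemp()
compressed_path = os.path.join(tmpdir, "file1.gz")
with gzip.open(compressed_path, "wb") as f:
    f.write(original)

# Two references into the same file: the second read reuses the cached bytes.
chunk_a = read_range(compressed_path, 200, 100)
chunk_b = read_range(compressed_path, 700, 100)
```

The cost of this approach is that the first access to any chunk pays for decompressing the entire file, and the cache holds the full uncompressed content in memory.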

@wachsylon
Author

By the way, I tried to pass this option with:

import fsspec

# Attempt: hand the compression option to every options dict that is accepted
cd = {"compression": "infer"}
fs = fsspec.filesystem(
    "reference",
    fo="test.parq",
    target_options=cd,
    remote_options=cd,
    storage_options=cd,
    **cd,
)

but none of these settings makes a difference. I added a print of the kwargs in the cat_ranges function, and the dict is empty.

@martindurant
Member

If you are using parquet storage, the references are already compressed internally by the design of the parquet file format. It uses the Zstd algorithm, which appears to be a good choice for this kind of data. Further compression would not be useful.

@wachsylon
Author

wachsylon commented Nov 7, 2024

I think there is a misunderstanding. Let me try to be clearer.

Originally:
File1 <- reference table
Anywhere in the reference file system, when the data is accessed through the references:
fs.open(File1)

After compression:
File1compr <- reference table

where File1compr has the same name as before and would also contain the same chunks after decompression.

Why can't I pass options to fs.open? For example:
fs.open(File1compr, **kwargs)
with kwargs = dict(compression="infer"). I would like to leave the table as it is.

@martindurant
Member

The question is, is this parquet? If yes, I don't think there's a code path to add arguments to open(); but, again, I am doubtful that compression gains you much for this case.

@wachsylon
Author

Parquet or JSON should not be relevant to my issue. It is about the files that are referenced inside the JSON or parquet files, in the path column. I gain storage space if I compress these referenced files.

I thought I had to specify remote_options to pass something to the filesystem that is used to work with the referenced files. But it seems these options are not passed on correctly.

One related thing that I find suspicious is that there are functions that accept **kwargs and in turn call functions that also accept **kwargs, but do not forward them. E.g.:

out.append(self.cat_file(p, s, e))
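
To make the pattern concrete, here is a minimal sketch with a toy class. `ToyFS` and both `cat_ranges_*` methods are hypothetical and only mimic the shape of the call above; this is not the fsspec implementation. The first variant accepts `**kwargs` and silently drops them, the second forwards them.

```python
class ToyFS:
    """Hypothetical stand-in for a filesystem class; not fsspec code."""

    def __init__(self, data):
        self.data = data          # maps path -> bytes
        self.seen_kwargs = []     # records what cat_file actually received

    def cat_file(self, path, start=None, end=None, **kwargs):
        self.seen_kwargs.append(kwargs)
        return self.data[path][start:end]

    def cat_ranges_dropping(self, paths, starts, ends, **kwargs):
        # kwargs are accepted here but never reach cat_file
        return [self.cat_file(p, s, e) for p, s, e in zip(paths, starts, ends)]

    def cat_ranges_forwarding(self, paths, starts, ends, **kwargs):
        # kwargs are passed through to cat_file
        return [
            self.cat_file(p, s, e, **kwargs)
            for p, s, e in zip(paths, starts, ends)
        ]


fs = ToyFS({"a": b"0123456789"})
dropped = fs.cat_ranges_dropping(["a"], [0], [3], compression="infer")
forwarded = fs.cat_ranges_forwarding(["a"], [2], [5], compression="infer")
```

After both calls, `fs.seen_kwargs` holds an empty dict for the first variant and the `compression` option for the second, which matches the empty dict observed with the print statement.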

@martindurant
Member

> the files that are referenced inside

Ah, sorry I misunderstood you. In typical use with zarr, the compression of data blocks is handled by zarr not fsspec, which is why this idea didn't come up before. The storage_options are indeed used to configure the filesystem with which to get the contents of each reference, and not used in open(); in fact, it uses cat/cat_ranges, which has no compression option at all.

Is your use case zarr? Then, you could add the lz4 codec to your .zarray spec. However, it will only work if the compression is for whole blocks, not for a file containing blocks. This is because lz4 (and other compressors) do not support random access within the compressed stream. When you open a file with compression and seek(), you actually need to stream through the data to get to the requested location.
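
The difference between compressing whole blocks and compressing a file that contains blocks can be sketched as follows. This is an illustration under assumptions: zlib from the standard library stands in for lz4, and the `index` list and `read_block` helper are hypothetical, playing the role that the offsets and sizes in a reference table would play.

```python
import zlib

# Four uncompressed "chunks" of 1000 bytes each.
blocks = [bytes([i]) * 1000 for i in range(4)]

# Layout 1: the whole file is compressed as a single stream.
whole = zlib.compress(b"".join(blocks))

# Layout 2: every block is compressed on its own, then concatenated.
parts = [zlib.compress(b) for b in blocks]
blob = b"".join(parts)

# Record (offset, size) of each compressed block inside the blob.
index = []
pos = 0
for part in parts:
    index.append((pos, len(part)))
    pos += len(part)


def read_block(blob: bytes, index: list, i: int) -> bytes:
    """Fetch one compressed block by byte range and decompress only that."""
    offset, size = index[i]
    return zlib.decompress(blob[offset:offset + size])


# Layout 2 allows random access: only block 2 is read and decompressed.
block2 = read_block(blob, index, 2)

# Layout 1 does not: the uncompressed offsets 2000..3000 do not exist in the
# compressed stream, so the whole stream has to be decompressed first.
block2_from_whole = zlib.decompress(whole)[2000:3000]
```

With layout 2 the references have to point at the compressed byte ranges, so an existing reference table built against the uncompressed file would need to be regenerated.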
