LazyReference: Passing options to .open of the reference file
#1746
Comments
Btw, I tried to pass this option with
but none of these settings make a difference. I added a
If you are using parquet storage, the references are already compressed internally by the design of the parquet file format. It uses the Zstd algorithm, which appears to be a good choice for this kind of data. Further compression would not be useful.
I think there is a misunderstanding. Let me try to be clearer: After compression: where file1compr has the same name as before and would also have the same chunks after decompressing. Why can't I pass options to the
The question is, is this parquet? If yes, I don't think there's a code path to add arguments to open(); but, again, I am doubtful that compression gains you much for this case.
Parquet or json should not be relevant to my issue. It is about the files that are referenced inside the jsons or parquets in the I thought I have to specify One related thing that I find suspicious is that there are functions that accept filesystem_spec/fsspec/spec.py Line 836 in 9a16171
Ah, sorry I misunderstood you. In typical use with zarr, the compression of data blocks is handled by zarr not fsspec, which is why this idea didn't come up before. The storage_options are indeed used to configure the filesystem with which to get the contents of each reference, and not used in open(); in fact, it uses cat/cat_ranges, which has no compression option at all.
Is your use case zarr? Then, you could add the lz4 codec to your .zarray spec. However, it will only work if the compression is for whole blocks, not for a file containing blocks. This is because lz4 (and other compressors) do not support random access within the compressed stream. When you open a file with compression and seek(), you actually need to stream through the data to get to the requested location.
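To make the random-access point concrete, here is a minimal sketch. It uses zlib from the Python standard library as a stand-in for lz4 (an assumption made only to keep the example dependency-free; the same limitation applies to lz4 streams). The chunk contents and the (offset, length) table are made up for illustration and are not kerchunk's actual reference format.

```python
import zlib

# Three raw "chunks", as a chunked array store might hold them (illustrative data).
chunks = [bytes([i]) * 1000 for i in range(3)]

# Case 1: the whole file is compressed as a single stream.
whole = zlib.compress(b"".join(chunks))

# A byte range cut out of the compressed stream cannot be decoded on its own.
try:
    zlib.decompress(whole[len(whole) // 2:])
    range_read_ok = True
except zlib.error:
    range_read_ok = False

# Case 2: each chunk is compressed separately and the results are concatenated.
blocks = [zlib.compress(c) for c in chunks]
offsets, pos = [], 0
for b in blocks:
    offsets.append((pos, len(b)))  # (offset, length), like a byte-range reference
    pos += len(b)
stored = b"".join(blocks)

# Now a plain byte-range read followed by decompression recovers one chunk.
start, length = offsets[1]
second = zlib.decompress(stored[start:start + length])
```

In the second layout a cat_ranges-style read is enough, because every referenced range is a complete compressed block that a codec can decode by itself.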
I created references with kerchunk. Because of missing storage space, I was thinking of just compressing the files which are referenced with lz4. My question is: Could I leave the references as is and just pass the compress="infer" option somewhere in the reference file system so that it is used for opening the referenced file? I guess it would make sense to cache the uncompressed referenced files so that they can be reused if multiple chunks are in one reference file. Does that work somehow?
If this is not feasible, could we kerchunk files that are compressed with lz4 or zstandard, similar to zip and tar? As far as I understood, cat ranges is also possible on the parquet zstandard-compressed tables.
If both approaches are possible, what would you recommend?
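The caching idea from the question could look roughly like the sketch below. This is not something fsspec's reference filesystem does; load_decompressed and read_range are hypothetical helpers, and gzip from the standard library stands in for lz4. Each compressed file is decompressed in full once, and later range requests are sliced from the cached bytes.

```python
import gzip
import os
import tempfile
from functools import lru_cache

@lru_cache(maxsize=8)
def load_decompressed(path):
    # Hypothetical helper: decompress the whole file once and keep it in memory.
    with gzip.open(path, "rb") as f:
        return f.read()

def read_range(path, offset, length):
    # Hypothetical helper: serve an uncompressed byte range from the cached copy.
    return load_decompressed(path)[offset:offset + length]

payload = b"AAAABBBBCCCC"  # three 4-byte "chunks" stored in one file

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "file1.bin.gz")
    with gzip.open(path, "wb") as f:
        f.write(payload)
    first = read_range(path, 0, 4)   # decompresses the file (cache miss)
    second = read_range(path, 4, 4)  # reuses the cached bytes (cache hit)

info = load_decompressed.cache_info()
```

The trade-off is that the first access to any chunk pays for decompressing the whole file, and the cache holds full uncompressed files in memory.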