Using `seek_points()` to obtain valid decompression ranges #114
@martindurant you may also be interested in this.
I think the point is that, for gzip, you must also have the compressor state at the seek point, which is why indexed_gzip stores the 32kB window along with every point.
Ah, OK. Here is a link to a sample gzipped image in S3. Do you think it would be possible to obtain the compressor state on the fly, say by always downloading the 32KB of data directly preceding the seek point? In my specific use case, the uncompressed byte ranges I want to obtain data for are known ahead of time, so I'm hoping that I don't need to store the full index, just the index information I need to get each byte range.
I have the same hope as you: that we can pick and choose which seek points get saved, immediately before the points of interest, to keep the index file size down. I don't know whether it's possible to simply read the 32kB before a seek point and set the state like that, or whether the 32kB saved in the index is somehow different. I suspect the latter, else why save it? To demonstrate that indexing does work for the remote file right now:

```python
import indexed_gzip as igzip
import fsspec

h = fsspec.filesystem("https")
u = "https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz"

# Build an index and export it to a local file
f = h.open(u)
i = igzip.IndexedGzipFile(f)
i.build_full_index()  # reads whole file, I think, or at least chunks all the way through
i.export_index("tiff.iindex")

# Re-open the remote file and import the pre-built index
fsspec.utils.setup_logging(logger_name="fsspec.http")
f = h.open(u)
# 2023-01-30 10:21:01,665 - fsspec.http - DEBUG - _file_info -- Retrieve file size for https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz
i2 = igzip.IndexedGzipFile(f)
i2.import_index("tiff.iindex")
i2.seek(100_000_000)
i2.read(20)
# 2023-01-30 10:22:25,659 - fsspec.http - DEBUG - async_fetch_range -- Fetch range for <File-like object HTTPFileSystem, https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz>: 62814387-68057268
# 2023-01-30 10:22:25,659 - fsspec.http - DEBUG - async_fetch_range -- https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz : bytes=62814387-68057267
# 2023-01-30 10:22:27,095 - fsspec.http - DEBUG - async_fetch_range -- Fetch range for <File-like object HTTPFileSystem, https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz>: 68057268-73300148
# 2023-01-30 10:22:27,095 - fsspec.http - DEBUG - async_fetch_range -- https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz : bytes=68057268-73300147
# b'\x00\x14\x00T\x00D\x00;\x00B\x00G\x00K\x007\x00>\x00\xb9'
```

So in this case it needed two range requests to get any data, with the standard readahead buffer size at the default 5MB.
Hi @forrestfwilliams @martindurant, this is one of the main usage scenarios - pre-generate an index, and then use that index on subsequent reads to improve access speed.
Obtaining the compressor state on the fly isn't possible at the moment, but should be easy to implement, as I suggested in #112.
Seek points can't be located just anywhere - they need to be at deflate block boundaries, which are usually somewhat arbitrarily placed throughout a deflate stream (although I have to confess that I'm not familiar with the different ways in which deflate streams are generated).
Yes, understood, but is it necessary to store the 32kB per point in the index file, or can that in theory be re-read from the file?
No - we need the 32kB of uncompressed data to initialise zlib for inflation - this is passed to the zlib `inflateSetDictionary` function.
I thought so, thanks for the clarification. That makes it all the more important to be able to pick seek points as well as possible. This is on my long-term roadmap, along with other block-based compression codecs for kerchunk, but gzip is by far the most common and therefore the most important.
For more background (and for my own interest, as I only ever learned the bare minimum to get this library working), the details on why we need that 32kB are in the DEFLATE RFC, section 3.2 - basically, the encoding dictionary used to compress a section of data is dynamically [re-]generated from the previous 32kB of uncompressed data.

There is also the possibility of coming across deflate streams which use a preset dictionary, although I'm assuming that these are not very common.
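For illustration (this is Python's standard zlib, not indexed_gzip's own API): the zdict parameter exposes the same dictionary-priming mechanism, which is essentially what importing a seek point's 32kB window does. A minimal sketch:

```python
import zlib

# Stand-in for the 32kB of uncompressed data preceding a seek point;
# zlib only ever uses the last 32768 bytes of a dictionary.
window = (b"previously decompressed data " * 2000)[-32768:]

# A raw deflate stream (wbits=-15: no gzip/zlib header) whose
# back-references may reach into that window.
comp = zlib.compressobj(wbits=-15, zdict=window)
data = comp.compress(b"previously decompressed data, then something new")
data += comp.flush()

# Restoring the window as a preset dictionary lets inflation resume;
# without it, decompression fails or produces garbage.
decomp = zlib.decompressobj(wbits=-15, zdict=window)
print(decomp.decompress(data))
```

One caveat: zran seek points also carry a bit offset into the compressed stream, and the pure-Python zlib module has no equivalent of the C-level inflatePrime needed to resume mid-byte, so this trick alone only covers byte-aligned points.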
I may be able to dedicate some time to this, and to other required changes (specifically this), at some point in the near future. But I can't make any guarantees, I'm afraid - this library is low priority for me at the moment, as it covers my own usage scenarios just fine. I'm happy to consult/suggest/review PRs though 👍
Looking at the export_index format here, would it be valid to:

1. build and export a full index,
2. parse the exported index file,
3. drop the seek points (and their windows) that aren't needed for the byte ranges of interest, and
4. write the trimmed index back out?

I'm not sure if removing undesired seek points would leave the index in a state that indexed_gzip can still import.
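A rough sketch of the trimming step - hypothetical code, assuming the exported index has already been parsed into (uncompressed offset, compressed offset, bits, window) tuples sorted by uncompressed offset:

```python
import bisect

def select_points(points, wanted_ranges):
    """Keep only the seek points needed to cover the given uncompressed
    byte ranges.

    points: sorted list of (uncmp_offset, cmp_offset, bits, window);
    wanted_ranges: iterable of (uncmp_start, uncmp_end) pairs.
    """
    offsets = [p[0] for p in points]
    keep = set()
    for start, end in wanted_ranges:
        # A seek must begin at the last point at or before `start`...
        keep.add(max(bisect.bisect_right(offsets, start) - 1, 0))
        # ...and the first point at or after `end` bounds how much
        # compressed data has to be downloaded for this range.
        keep.add(min(bisect.bisect_left(offsets, end), len(points) - 1))
    return [points[i] for i in sorted(keep)]
```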
@forrestfwilliams I think that would actually work just fine! Although, as I mentioned in #112, I don't think it would be particularly difficult to enhance indexed_gzip so that it only generates seek points where they are actually needed.
@pauldmccarthy sounds good! I'll try developing the index-parsing approach I described above, since I don't have the C skills to confidently implement the approach you described in #112. Reading through DEFLATE RFC section 3.2, it appears that there are special cases where you wouldn't need window data at all, because there are no back-references in the DEFLATE block. Are these types of special cases identifiable via the index data?
@forrestfwilliams No, I'm afraid not - indexed_gzip delegates inflation to zlib, which doesn't expose that kind of block-level information.
It occurs to me that, since the window data we store is uncompressed data and the largest component of index files, index files should be very amenable to compression - at least as well as the original data was :)

Wikipedia agrees on this. However, "stored" (uncompressed) blocks might be common for some data, like random floats, and I don't suppose those strictly need the uncompressed window either. Obviously, data which has a lot of those should not be gzip-compressed in the first place!
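For instance, compressing the exported index from the snippet above is a one-liner, and since the windows are raw uncompressed data, the index should shrink roughly as well as the original file did:

```python
import gzip
import shutil

# Re-compress the exported index; the 32kB windows dominate its size
# and are stored uncompressed, so this recovers most of that space.
with open("tiff.iindex", "rb") as src:
    with gzip.open("tiff.iindex.gz", "wb") as dst:
        shutil.copyfileobj(src, dst)
```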
@martindurant I've created some utilities that allow you to directly parse the index files created by indexed_gzip. I'm unsure if these utilities belong in indexed_gzip itself or in a separate package or fork.
That would be up to @pauldmccarthy - maybe a PR would be appropriate? I don't think there's any reason to fork the repo unless we need to make substantial changes to the code that @pauldmccarthy is not happy to oversee. Until we are happy with a full production-ready workflow it won't matter anyway; and this, for me, means use in conjunction with kerchunk (which will require more work yet).
I will of course try out your code when I have the time! Since zran explicitly exports/imports from a real local file, it makes sense to do the full scan and then grab the points we actually want, so the initial point spacing can be quite small. `_zran_get_point_at` does not seem to need the value of the spacing, only the list of points, so I expect it should work out well.
Sounds good. From my testing, ~8KB looks like about the smallest point spacing that still works. Index data size is the bigger issue: it balloons from 10MB to 563MB (330 to 17998 seek points) for this test case, so you definitely want to subset to a relevant set of points.
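(As a sanity check on those numbers: the index is essentially all window data - 17998 points × 32 KiB ≈ 562 MiB, and 330 × 32 KiB ≈ 10 MiB, matching the reported sizes.)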
Somewhere in the C code I thought I saw a check that the window size (32kB) should be smaller than the point spacing. In any case, my target is more like ~20-100MB buffers in >GB files (or much bigger for tar.gz), so 1MB spacing would be plenty good enough. As I said, the index file ought to be compressible too.

Before going too far down the road, we should also consider how to store the indexes of the multiple files in a tar.gz (one index, multiple files) or ZIP (one index per member file). I am wondering whether the binary data might not store nicely in a zarr structure: record arrays for the point details and fixed-width bytes for the windows. Then you can chunk your data and apply one of several good compressors, and only read the index pieces that you need to access some part of the target file. Just a thought...
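A hypothetical sketch of that zarr layout - every name and dtype here is an assumption, not an existing kerchunk or indexed_gzip format:

```python
import numpy as np
import zarr

# One fixed-size record per seek point; the 32kB windows live in a
# separate fixed-width byte array so they can be chunked, compressed,
# and fetched lazily, a few per chunk.
point_dtype = np.dtype([
    ("cmp_offset",   "<u8"),  # offset into the compressed stream
    ("uncmp_offset", "<u8"),  # corresponding uncompressed offset
    ("bits",         "<u1"),  # bit offset within the first compressed byte
])
points = np.zeros(330, dtype=point_dtype)
windows = np.zeros((330, 32768), dtype="u1")

root = zarr.group()  # zarr v2-style API
root.create_dataset("points", data=points)
root.create_dataset("windows", data=windows, chunks=(8, 32768))
```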
Hmm... I think one index per compressed stream is what it comes down to - a single index for a whole tar.gz, but one index per member for ZIP, since each ZIP member is deflated independently.
cc @milesgranger - thought you would find this interesting.
Hi, I'm chiming in because I want to implement many of the stated index improvement ideas in mxmlnkn/rapidgzip#17.

As rapidgzip is written completely from scratch and comes with its own inflate implementation, it is possible to implement this there. I have already tried out the simpler idea of checking for the farthest back-reference in order to simply truncate windows, but checking each symbol in the window for actual usage should save even more space. The idea would then be to replace all unused window bytes with zeros and then compress the windows with deflate. While the masking of window values would not require a new index file format, compressing each window would. I would be glad for any input or requests regarding the specification of such a format. I doubt that I could simply increment the version of the zran index, because I would have to add a "compressed window size" member, and probably also an uncompressed window size member for good measure; with compression, the window data would no longer sit at offset N*W. But maybe there is no requirement for backward compatibility, and I can change the index format completely in version 3?
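A sketch of the masking idea - hypothetical code, assuming the per-byte usage flags come from a custom inflate implementation such as rapidgzip's:

```python
import zlib

def shrink_window(window, used):
    """Mask and compress one seek point's window.

    window: the 32kB of uncompressed history stored for a seek point;
    used: one boolean per byte, True where some back-reference in the
    downstream deflate data actually reads that byte.
    """
    # Zero every byte that no back-reference ever touches...
    masked = bytes(b if u else 0 for b, u in zip(window, used))
    # ...so the long zero runs deflate away to almost nothing.
    return zlib.compress(masked, 9)
```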
I cannot speak for @pauldmccarthy on how widely adopted this code has been so far. Obviously, the more production use out there, the more you would want to care about maintaining compatibility. However, from my point of view, kerchunk has not yet started work on generalised gzip/zip byte ranges; I am planning it for later this summer, when I get the time. Since we will typically be working with remote files and need to store the indexes elsewhere, or inline with references, anything that can be done to reduce the amount of stored data will help greatly. Plus, we would not be particularly constrained by the index file format, but would be setting the index data directly at run time (this is the Python wrapper, after all!).
So if I understand this correctly, for your use case it would already suffice to simply compress the whole index file (with gzip or zstd etc.), because you can decompress it at run time before importing it? That wouldn't even require any support from indexed_gzip. I think there are two downsides to importing the whole index uncompressed, though:

1. the whole index, including every 32kB window, has to be held in memory, and
2. the whole index has to be read up front, even when only a few of the seek points will ever be used.

The second point ties in closely with this feature idea: mxmlnkn/rapidgzip#10. Would that help your use case, or am I misunderstanding?
The index points could be stored in a number of ways. For example, kerchunk now supports parquet storage, which allows the reference data to be partitioned into a fixed number of references per file. That was optimised for datasets with a very large number of references, rather than for big data blocks per reference point. But no, mmap would not work for us, since the references are not normally stored on the local filesystem.
Hi @mxmlnkn @martindurant, there are no promises of forward compatibility w.r.t. the index file format - in fact, there is a guard in the code which will cause it to abort if given a file newer than what it is able to support. So I'm personally not opposed to the index file format being changed. Unfortunately I don't have much time for this project these days, but I'm more than happy to advise and review PRs 👍
Thanks for your input. I also did a quick check that implements the "count the actually used window symbols" idea. Assuming there are no errors in the code, the results look very promising with respect to memory savings, for both the Silesia corpus and wikidata.json.gz: it looks like this could save 80-98% of the window data.

But doing this kind of accounting will slow down decompression even further and adds more complexity. For example, these measurements only cover blocks that decompress to more than 32 KiB, so that I know the window will not be used directly anymore; blocks that decompress to less than 32 KiB also need to be handled. And then, some heuristic is necessary to find good seek points even if they aren't distributed exactly every 4 MiB. I guess, to a first approximation, seek points that do not require any window at all can always be created, because they are basically free - e.g. before non-compressed blocks.
I would like to use `indexed_gzip` to download only the relevant portions of a `.gz` file via a ranged `GET` request. I understand that you cannot use `indexed_gzip` to download and decompress from an arbitrary point (see #112). However, I am hoping that it is possible to use the index generated by `indexed_gzip`, and made accessible via the `seek_points` method, to download and decompress a small portion of the larger file that contains the data I'm interested in. This is what I have so far:
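A minimal sketch of the idea - hypothetical code rather than the verbatim snippet, assuming `seek_points()` yields `(uncompressed_offset, compressed_offset)` pairs and ignoring the bit offsets that real seek points carry:

```python
import zlib

import indexed_gzip as igzip


def get_data(file_path, start_point, end_point):
    """Decompress the data lying between two seek points.

    start_point and end_point index into the seek point list; in the
    remote case, the plain file read below would become a ranged GET
    request against the .gz object.
    """
    with igzip.IndexedGzipFile(file_path) as f:
        f.build_full_index()
        # Assumed to yield (uncompressed_offset, compressed_offset) pairs.
        points = list(f.seek_points())

    cmp_start = points[start_point][1]
    cmp_end = points[end_point][1]

    with open(file_path, "rb") as f:
        f.seek(cmp_start)
        compressed = f.read(cmp_end - cmp_start)

    # Raw deflate, no gzip header. This only succeeds from the start of
    # the deflate stream; an interior seek point additionally needs its
    # 32kB window restored (and any mid-byte bit offset handled).
    d = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    return d.decompress(compressed)
```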
This function performs as expected when passing the arguments `get_data(file_path, 0, 1)`, but when not starting from the first index location (e.g., `get_data(file_path, 1, 2)`), it fails in the `zlib` decompression step with the message `zlib.error: Error -3 while decompressing data: invalid block type`.

I'm guessing that the root of this issue is that I do not fully understand how `zlib` decompression works and what the required data formatting is. If you have any suggestions on how to modify this function to achieve my goal, I'd appreciate it!