
Using seek_points() to obtain valid decompression ranges. #114

Open
forrestfwilliams opened this issue Jan 27, 2023 · 28 comments

@forrestfwilliams

I would like to use indexed_gzip to download only the relevant portions of a .gz file via a ranged GET request. I understand that you cannot use indexed_gzip to download and decompress from an arbitrary point (see #112). However I am hoping that it is possible to use the index generated by indexed_gzip and made accessible via the seek_points method to download and decompress a small portion of the larger file that contains the data I'm interested in. This is what I have so far:

import zlib

import indexed_gzip as igzip
import numpy as np


def get_data(gz_path: str, index1: int, index2: int) -> bytes:
    with igzip.IndexedGzipFile(gz_path) as f:
        f.build_full_index()
        seek_points = list(f.seek_points())

    array = np.array(seek_points)
    start = array[index1, 1]
    stop = array[index2, 1]

    # --- stand-in for a ranged GET request --- #
    with open(gz_path, 'rb') as f:
        f.seek(start)
        compressed = f.read(stop - start)
    # ----------------------------------------- #

    decompressed = zlib.decompressobj(-zlib.MAX_WBITS).decompress(compressed)
    return decompressed

This function performs as expected when passing the arguments get_data(file_path, 0, 1), but when not starting from the first index location (e.g., get_data(file_path, 1, 2)) the function fails in the zlib decompression step with the message: zlib.error: Error -3 while decompressing data: invalid block type.

I'm guessing that the root of this issue is that I do not fully understand how zlib decompression works and what the required data formatting is. If you have any suggestion on how to modify this function to achieve my goal, I'd appreciate it!
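[Editor's note] A related failure can be reproduced with plain zlib, no indexed_gzip involved. This sketch creates a byte-aligned block boundary with Z_SYNC_FLUSH and then tries to inflate only the second half; it is illustrative rather than an exact reproduction of the error above, where the chosen offset may not even fall on a block boundary:

```python
import zlib

# Build one raw deflate stream with a byte-aligned block boundary in the
# middle, then attempt to inflate only the second half with no history.
data = b"The quick brown fox jumps over the lazy dog. " * 500
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
part1 = co.compress(data) + co.flush(zlib.Z_SYNC_FLUSH)  # boundary here
part2 = co.compress(data) + co.flush(zlib.Z_FINISH)

try:
    zlib.decompressobj(wbits=-zlib.MAX_WBITS).decompress(part2)
except zlib.error as exc:
    # fails: part2's back-references reach into part1's output
    print("decompression failed:", exc)
```

Starting mid-stream therefore fails even at a genuine boundary, because deflate back-references can reach up to 32 KiB into earlier output.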

@forrestfwilliams (Author)

@martindurant you may also be interested in this.

@martindurant

I think the point is that for gzip you must also have the compressor state at the seek point, which is why indexed_gzip stores the 32kB window along with every point.
However, if you open the file like

with fsspec.open("http://...") as f:

then the f file-like object can be used directly with indexed_gzip. You can export the index data to a file, and load it again with that file later to get the random access range requests you are after. If you have a .gz in mind, I can write down the whole workflow.

@forrestfwilliams (Author)

forrestfwilliams commented Jan 30, 2023

Ah, OK. Here is a link to a sample gzipped image in S3. Do you think it would be possible to obtain the compressor state on the fly, say by always downloading the 32KB of data directly preceding the seek_point and initializing indexed_gzip with it? My goal here is to reduce the amount of index information we need to store as much as possible.

In my specific use case, the uncompressed byte ranges I want to obtain data for are known ahead of time. In this case I'm hoping that I don't need to store full index, just the index information I need to get that byte range.

@martindurant

I have the same hope as you: that we can pick and choose which seek points get saved, immediately before points of interest, to keep the index file size down. I don't know if it's possible to simply read 32kB before the seek point and set the state like that, or if the 32kB saved in the index is somehow different. I suspect the latter, else why save it?

To demonstrate that indexing does work for the remote file right now:

import indexed_gzip as igzip
import fsspec

h = fsspec.filesystem("https")
u = "https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz"
f = h.open(u)
i = igzip.IndexedGzipFile(f)
i.build_full_index()  # reads whole file, I think, or at least chunks all the way through
i.export_index("tiff.iindex")

fsspec.utils.setup_logging(logger_name="fsspec.http")
f = h.open(u)
2023-01-30 10:21:01,665 - fsspec.http - DEBUG - _file_info -- Retrieve file size for https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz

i2 = igzip.IndexedGzipFile(f)
i2.import_index("tiff.iindex")
i2.seek(100_000_000)
i2.read(20)
2023-01-30 10:22:25,659 - fsspec.http - DEBUG - async_fetch_range -- Fetch range for <File-like object HTTPFileSystem, https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz>: 62814387-68057268
2023-01-30 10:22:25,659 - fsspec.http - DEBUG - async_fetch_range -- https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz : bytes=62814387-68057267
2023-01-30 10:22:27,095 - fsspec.http - DEBUG - async_fetch_range -- Fetch range for <File-like object HTTPFileSystem, https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz>: 68057268-73300148
2023-01-30 10:22:27,095 - fsspec.http - DEBUG - async_fetch_range -- https://ffwilliams2-shenanigans.s3.us-west-2.amazonaws.com/bursts/s1a-iw2-slc-vv-20200604t022253-20200604t022318-032861-03ce65-005.tiff.gz : bytes=68057268-73300147
b'\x00\x14\x00T\x00D\x00;\x00B\x00G\x00K\x007\x00>\x00\xb9'

So in this case it needed two range requests to get any data with the standard readahead buffer size at the default 5MB.

@pauldmccarthy (Owner)

pauldmccarthy commented Jan 30, 2023

Hi @forrestfwilliams @martindurant, this is one of the main usage scenarios - pre-generate an index, and then use that index on subsequent reads to improve access speed. The IndexedGzipFile and zran modules can be used with Python file-likes, although with some limitations.

The build_full_index method does a full pass through the data, creating seek points at intervals of approximately spacing bytes.

> I have the same hope as you: that we can pick and choose which seek points get saved, immediately before points of interest, to keep the index file size down.

This isn't possible at the moment, but should be easy to implement as I suggested in #112

> I don't know if it's possible to simply read 32kB before the seek point and set the state like that, or if the 32kB saved in the index is somehow different. I suspect the latter, else why save it?

Seek points can't be located just anywhere - they need to be located at deflate block boundaries, which are usually somewhat arbitrarily placed throughout a deflate stream (although I have to confess that I'm not familiar with the different ways in which deflate streams are generated).
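[Editor's note] A deflate block header is just three bits (BFINAL and BTYPE), packed LSB-first, which is why block boundaries can land at arbitrary bit offsets. A quick way to peek at the first block's header with plain zlib (a sketch; the `[2:]` slice assumes zlib-wrapped output, which carries a 2-byte header):

```python
import zlib

# Compress something small, strip the 2-byte zlib wrapper, and read the
# three header bits of the first deflate block.
raw = zlib.compress(b"x" * 1000)[2:]
bfinal = raw[0] & 1            # 1 = last block in the stream
btype = (raw[0] >> 1) & 0b11   # 0 = stored, 1 = fixed Huffman, 2 = dynamic Huffman
print("BFINAL:", bfinal, "BTYPE:", btype)
```

Subsequent block headers start wherever the previous block's bit stream happens to end, so locating them in general requires actually inflating the stream.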

@martindurant

> Seek points can't be located just anywhere

Yes, understood, but is it necessary to store the 32kB per point in the index file, or can that in theory be re-read from the file?

@pauldmccarthy (Owner)

> Yes, understood, but is it necessary to store the 32kB per point in the index file, or can that in theory be re-read from the file?

No - we need 32kB of uncompressed data to initialise inflation - this is passed to the zlib inflateSetDictionary function.
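[Editor's note] Python's zlib exposes inflateSetDictionary through the zdict argument of decompressobj, so the mechanism can be sketched without indexed_gzip (illustrative only, not indexed_gzip's actual code):

```python
import zlib

data1 = b"hello indexed world " * 2000    # > 32 KiB of history
data2 = b"goodbye indexed world " * 2000
co = zlib.compressobj(wbits=-zlib.MAX_WBITS)
part1 = co.compress(data1) + co.flush(zlib.Z_SYNC_FLUSH)  # byte-aligned boundary
part2 = co.compress(data2) + co.flush(zlib.Z_FINISH)

# Prime a raw inflater with the preceding 32 KiB of *uncompressed* output,
# which is what a seek point's stored window provides.
window = data1[-32768:]
do = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=window)
out = do.decompress(part2)
assert out == data2
```

For raw deflate streams (negative wbits), the dictionary is applied immediately, so back-references into the window resolve correctly from the very first symbol.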

@martindurant

I thought so, thanks for the clarification. That makes it all the more important to be able to pick seek points as well as possible. This is on my long-term roadmap, along with other block-based compression codecs for kerchunk, but gzip is by far the most common and therefore the most important.

@pauldmccarthy (Owner)

For more background (and for my own interest, as I only ever learned the bare minimum to get this library working), the details on why we need that 32kB are in the DEFLATE RFC section 3.2 - basically, the encoding dictionary used to compress a section of data is dynamically [re-]generated from the previous 32kB:

> As noted above, encoded data blocks in the "deflate" format consist of sequences of symbols drawn from three conceptually distinct alphabets: either literal bytes, from the alphabet of byte values (0..255), or <length, backward distance> pairs, where the length is drawn from (3..258) and the distance is drawn from (1..32,768).

There is also the possibility of coming across deflate streams which use a pre-set dictionary, although I'm assuming that these are not very common.

@pauldmccarthy (Owner)

pauldmccarthy commented Jan 30, 2023

I may be able to dedicate some time to this, and to other required changes (specifically this), at some point in the near future. But I can't make any guarantees I'm afraid - this library is low priority for me at the moment, as it covers my own usage scenarios just fine. I'm happy to consult/suggest/review PRs though 👍

@forrestfwilliams (Author)

Looking at the export_index format here, would it be valid to:

  1. build_full_index for a data file and export_index
  2. Parse this file as binary, remove undesired seek_points (and associated windows) and update header
  3. Save this altered index to a new file
  4. Import the altered index file and use it to read from the data file

I'm not sure if removing undesired seek_points would affect the validity of the index.
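[Editor's note] A sketch of steps 2-3 against a deliberately simplified, hypothetical layout (a point count, fixed-size point records, then one window per point). The real GZIDX header and field order differ, so consult indexed_gzip's format description before adapting this:

```python
import struct

# Illustrative layout only -- NOT the actual GZIDX file format.
# header: npoints (uint32); per point: cmp_offset, uncmp_offset (uint64);
# then one 32 KiB window per point.
WINDOW = 32768
HDR = struct.Struct('<I')
PT = struct.Struct('<QQ')

def subset_index(raw: bytes, keep: list) -> bytes:
    """Keep only the seek points (and their windows) listed in `keep`."""
    n, = HDR.unpack_from(raw, 0)
    pts = [PT.unpack_from(raw, HDR.size + i * PT.size) for i in range(n)]
    wbase = HDR.size + n * PT.size
    wins = [raw[wbase + i * WINDOW:wbase + (i + 1) * WINDOW] for i in range(n)]
    out = [HDR.pack(len(keep))]
    out += [PT.pack(*pts[i]) for i in keep]
    out += [wins[i] for i in keep]
    return b''.join(out)
```

Because each record and window is fixed-size, dropping points is pure slicing plus a header update, which is why the approach should be valid as long as the kept points' offsets are untouched.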

@pauldmccarthy (Owner)

@forrestfwilliams I think that would actually work just fine! Although as I mentioned in #112, I don't think it would be particularly difficult to enhance zran.c to build an index with pre-specified seek points.

@forrestfwilliams (Author)

forrestfwilliams commented Jan 30, 2023

@pauldmccarthy sounds good! I'll keep developing the approach I outlined, since I don't have the C skills to confidently implement the approach you described in #112.

Reading through DEFLATE RFC section 3.2 it appears that there are special cases where you wouldn't need window data because there are no back-references in the DEFLATE block. Are these types of special cases identifiable via indexed_gzip, say by looking at whether the seek points have a data flag value of 1 or 0?

@pauldmccarthy (Owner)

> Reading through DEFLATE RFC section 3.2 it appears that there are special cases where you wouldn't need window data because there are no back-references in the DEFLATE block. Are these types of special cases identifiable via indexed_gzip, say by looking at whether the seek points have a data flag value of 1 or 0?

@forrestfwilliams No, indexed_gzip is not privy to that information as it is not exposed by zlib; the seek point data flag in an exported index is only set to 0 for seek points at the beginning of streams. I would hazard a guess that the dynamic/adaptive format is probably much more common anyway.

@martindurant

It occurs to me that, since the window data we store is uncompressed and is the largest component of index files, index files should be very amenable to compression, at least as well as the original data was :)

> the dynamic/adaptive format is probably much more common

Wikipedia agrees on this. However, "stored" (uncompressed) blocks might be common for some data, like random floats, and I don't suppose those strictly need the uncompressed window either. Obviously, data which has a lot of those should not be compressed in the first place!

@forrestfwilliams (Author)

forrestfwilliams commented Jan 31, 2023

@martindurant I've created some utilities that allow you to directly parse the index files created by export_index and subset them to your desired seek_points. Here are two basic tests for these functions as well. The tests require you to have your own gz and gzidx files, but they should pass once you provide these.

I'm unsure if these utilities belong in indexed_gzip or in some external library. Thoughts?

@martindurant

That would be up to @pauldmccarthy - maybe a PR would be appropriate? I don't think there's any reason to fork the repo unless we need to make substantial changes to the code that @pauldmccarthy is not happy to oversee. Until we are happy with a full production-ready workflow it won't matter anyway; and that, for me, means use in conjunction with kerchunk (which will require more work yet).

@martindurant

I will of course try out your code when I have the time! Since zran explicitly exports/imports from a real local file, it makes sense to do the full scan and then grab the points we actually want. The initial point spacing can be quite small then. _zran_get_point_at does not seem to need the value of the spacing, only the list of points, so I expect it should work out well.

@forrestfwilliams (Author)

forrestfwilliams commented Jan 31, 2023

Sounds good. From my testing it looks like ~64KB (spacing = 2**16) is the smallest spacing achievable. Compared to a ~4MB spacing (spacing = 2**22), it takes ~20% longer to build the index for a 1.2 GB (uncompressed) file. This is a pretty small price to pay since you would only need to pay it once per file.

Index data size is the bigger issue. It balloons from 10MB to 563MB (330 to 17998 seek points) for this test case, so you definitely want to subset to a relevant set of points.
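[Editor's note] These figures are consistent with a back-of-envelope estimate, since index size is dominated by one 32 KiB window per seek point:

```python
# Rough estimate: index size ~= (file size / spacing) windows of 32 KiB each.
uncompressed_size = 1.2e9   # the 1.2 GB file above
window_size = 32768

for spacing in (2**22, 2**16):
    npoints = uncompressed_size / spacing
    size_mb = npoints * window_size / 1e6
    print(f"spacing 2**{spacing.bit_length() - 1}: "
          f"~{npoints:.0f} points, ~{size_mb:.0f} MB of windows")
```

This gives roughly 290 points / 9 MB at 4 MiB spacing and roughly 18,000 points / 600 MB at 64 KiB spacing, close to the 330 / 10 MB and 17,998 / 563 MB reported (per-point metadata and stream compressibility account for the remainder).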

@martindurant

Somewhere in the C code I thought I saw a check that the window size (32kB) should be smaller than the point spacing. In any case, my target is more like ~20-100MB buffers in >GB files (or much bigger in the tar.gz case), so 1MB spacing would be plenty good enough. As I said, the index file ought to be compressible too.

Before going too far down the road, we should also consider how to store the indexes of the multiple files in a tar.gz (one index, multiple files) or ZIP (index per member file).

I am wondering if the binary data might not store nicely in a zarr structure: record arrays for the point details and fixed-width bytes for the windows. Then you can chunk your data and apply one of several good compressors, and only read the index pieces that you need to access some part of the target file. Just a thought...
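[Editor's note] A minimal sketch of that layout idea using numpy (field names here are illustrative; a zarr store would hold these two arrays, with per-window chunking so a reader fetches only the windows it needs):

```python
import numpy as np

# Seek-point metadata as a record array...
point_dtype = np.dtype([('cmp_offset', '<u8'),
                        ('uncmp_offset', '<u8'),
                        ('bit_offset', 'u1')])
points = np.zeros(3, dtype=point_dtype)
points['cmp_offset'] = [0, 1_000_000, 2_000_000]

# ...and the windows as a fixed-width bytes array, one 32 KiB row per point.
windows = np.zeros((3, 32768), dtype='u1')
# e.g. zarr.array(windows, chunks=(1, 32768)) would then store one
# compressed chunk per window, readable independently.
```

Chunking at one window per chunk means random access to the target file costs one small metadata read plus exactly one window chunk.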

@forrestfwilliams (Author)

Hmm... I think one index per tar.gz and one index per zip member makes sense. Yeah zarr would be a good format for this type of data. How do you envision kerchunk interacting with this data?

@martindurant

cc @milesgranger thought you would find this interesting

@mxmlnkn (Contributor)

mxmlnkn commented Jun 27, 2023

Hi, I'm chiming in because I want to implement many of the stated index improvement ideas in mxmlnkn/rapidgzip#17.

> Reading through DEFLATE RFC section 3.2 it appears that there are special cases where you wouldn't need window data because there are no back-references in the DEFLATE block. Are these types of special cases identifiable via indexed_gzip, say by looking at whether the seek points have a data flag value of 1 or 0?

As rapidgzip is completely written from scratch and comes with its own inflate implementation, it is possible to implement this. I have already tried out the simpler idea of checking for the farthest back-reference in order to simply truncate windows. But checking each symbol in the window for actual usage should save even more space. And the idea would then be to replace all unused window bytes with zero and then compress the windows with deflate.

While masking window values would not require a new index file format, compressing each window would. I would be glad for any input or requests regarding the specification of such a format. I doubt that I could simply increment the version of the zran index, because I would have to add a "compressed window size" member, and probably also an uncompressed window size member for good measure. With compression, the window data would no longer be at offset N*W. But maybe there is no requirement for backward compatibility and I can change the index format completely in version 3?
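[Editor's note] The expected saving from zeroing unused window bytes can be sketched with plain zlib (the "every 5th byte used" pattern below is an arbitrary stand-in for a real usage mask):

```python
import os
import zlib

# A worst-case 32 KiB window of incompressible bytes...
window = bytearray(os.urandom(32768))

# ...with ~80% of its bytes zeroed, as if only ~20% of symbols were ever
# referenced by later back-references.
for i in range(len(window)):
    if i % 5 != 0:
        window[i] = 0

packed = zlib.compress(bytes(window), 9)
print(f"{len(packed)} bytes instead of {len(window)}")
```

The runs of zeros compress away almost entirely, so the stored size approaches the size of the genuinely-used bytes.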

@martindurant

I cannot speak for @pauldmccarthy on how widely adopted this code has been so far. Obviously, the more production use out there, the more you would want to care about maintaining compatibility.

However, from my point of view, kerchunk has not yet started work on generalised gzip/zip byte ranges. I am planning it for later this summer, when I get the time. Since we will typically be working with remote files and need to store the indexes elsewhere or inline with references, anything that can be done to reduce the amount of stored data will help greatly. Plus, we would not be particularly constrained by the index file format, but would be setting the index data directly at run time (this is the Python wrapper, after all!).

@mxmlnkn (Contributor)

mxmlnkn commented Jun 27, 2023

So if I understand this correctly, for your use case, it would already suffice to simply compress the whole index file (with gzip or zstd etc.) because you can decompress it at run time before importing it? That wouldn't even require any support from indexed_gzip. I think there are two downsides to importing the whole index uncompressed, though:

  1. Main memory usage is higher and might even be limiting for very large files (1+ TB uncompressed like this one)
  2. If you want to reduce data transfers for accessing only a small subrange, then you still have to load the whole index even though it might have sufficed to load the offsets and then only load that one window that is actually needed to start decompression from the desired offset.

The second point ties in closely with this feature idea: mxmlnkn/rapidgzip#10 Would that help your use case or am I misunderstanding?

@martindurant

The index points could be stored in a number of ways. For example, kerchunk now supports parquet storage, allowing the reference data to be partitioned into a fixed number of references per file. This was optimised for datasets with a very large number of references, rather than for big data blocks per reference point.

But no, mmap would not work for us, since the references are not normally stored on the local filesystem.

@pauldmccarthy (Owner)

pauldmccarthy commented Jun 29, 2023

Hi @mxmlnkn @martindurant, there are no promises of forward compatibility w.r.t. the index file format - in fact, there is a guard in the code which will cause it to abort if given a file newer than what it is able to support. So I'm personally not opposed to the index file format being changed. Unfortunately I don't have much time for this project these days, but I'm more than happy to advise and review PRs 👍

@mxmlnkn (Contributor)

mxmlnkn commented Jun 29, 2023

Thanks for your input. I also did a quick check that implements the "count the actually used window symbols" idea. Assuming there are no errors in the code, the results look very promising with respect to memory savings. Results for Silesia:

Used window symbols: 6276 (19.1528 %)
Used window symbols: 6455 (19.6991 %)
Used window symbols: 6789 (20.7184 %)
Used window symbols: 5923 (18.0756 %)
Used window symbols: 6615 (20.1874 %)
Used window symbols: 7015 (21.4081 %)
Used window symbols: 7760 (23.6816 %)
Used window symbols: 6687 (20.4071 %)
Used window symbols: 7089 (21.6339 %)
Used window symbols: 7103 (21.6766 %)
Used window symbols: 7356 (22.4487 %)
Used window symbols: 6723 (20.517 %)
Used window symbols: 7307 (22.2992 %)
Used window symbols: 7674 (23.4192 %)
Used window symbols: 7648 (23.3398 %)
Used window symbols: 6785 (20.7062 %)
Used window symbols: 6139 (18.7347 %)
Used window symbols: 6956 (21.228 %)
Used window symbols: 7293 (22.2565 %)
Used window symbols: 7395 (22.5677 %)
Used window symbols: 7065 (21.5607 %)
Used window symbols: 6860 (20.9351 %)
Used window symbols: 7605 (23.2086 %)
Used window symbols: 6938 (21.1731 %)
Used window symbols: 6488 (19.7998 %)
Used window symbols: 7816 (23.8525 %)
Used window symbols: 6816 (20.8008 %)
Used window symbols: 7083 (21.6156 %)
Used window symbols: 7354 (22.4426 %)
Used window symbols: 6823 (20.8221 %)
Used window symbols: 7016 (21.4111 %)
[...]
Used window symbols: 1759 (5.36804 %)
Used window symbols: 3749 (11.441 %)
Used window symbols: 465 (1.41907 %)
Used window symbols: 400 (1.2207 %)
Used window symbols: 1995 (6.08826 %)
Used window symbols: 2444 (7.4585 %)
Used window symbols: 1544 (4.71191 %)
Used window symbols: 2535 (7.73621 %)
Used window symbols: 4404 (13.4399 %)
Used window symbols: 551 (1.68152 %)
Used window symbols: 1533 (4.67834 %)
Used window symbols: 1031 (3.14636 %)
Used window symbols: 3219 (9.82361 %)
Used window symbols: 4365 (13.3209 %)
Used window symbols: 2529 (7.7179 %)
Used window symbols: 2144 (6.54297 %)
Used window symbols: 3118 (9.51538 %)
Used window symbols: 875 (2.67029 %)
Used window symbols: 656 (2.00195 %)
Used window symbols: 493 (1.50452 %)
Used window symbols: 3505 (10.6964 %)
Used window symbols: 8755 (26.7181 %)
[...]
Used window symbols: 4215 (12.8632 %)
Used window symbols: 4200 (12.8174 %)
Used window symbols: 4262 (13.0066 %)
Used window symbols: 4165 (12.7106 %)
Used window symbols: 4132 (12.6099 %)
Used window symbols: 4226 (12.8967 %)
Used window symbols: 4246 (12.9578 %)
Used window symbols: 4241 (12.9425 %)
Used window symbols: 4057 (12.381 %)
Used window symbols: 3881 (11.8439 %)
Used window symbols: 4084 (12.4634 %)
Used window symbols: 4151 (12.6678 %)
Used window symbols: 4213 (12.8571 %)
Used window symbols: 4208 (12.8418 %)
Used window symbols: 4127 (12.5946 %)
Used window symbols: 4236 (12.9272 %)
Used window symbols: 4102 (12.5183 %)
Used window symbols: 3987 (12.1674 %)
Used window symbols: 4158 (12.6892 %)
Used window symbols: 4194 (12.7991 %)
Used window symbols: 4062 (12.3962 %)
Used window symbols: 3834 (11.7004 %)
Used window symbols: 4305 (13.1378 %)
Used window symbols: 4341 (13.2477 %)
Used window symbols: 4213 (12.8571 %)
Used window symbols: 4315 (13.1683 %)
Used window symbols: 4026 (12.2864 %)
Used window symbols: 4321 (13.1866 %)
Used window symbols: 4275 (13.0463 %)
Used window symbols: 3517 (10.733 %)
Used window symbols: 4150 (12.6648 %)
Used window symbols: 4207 (12.8387 %)
Used window symbols: 4340 (13.2446 %)
Used window symbols: 4372 (13.3423 %)
Used window symbols: 4124 (12.5854 %)
Used window symbols: 4306 (13.1409 %)
Used window symbols: 4265 (13.0157 %)
Used window symbols: 3392 (10.3516 %)
Used window symbols: 4101 (12.5153 %)
Used window symbols: 4218 (12.8723 %)
Used window symbols: 4077 (12.442 %)
Used window symbols: 3469 (10.5865 %)
Validated CRC32 0x1bb20219 for gzip stream!

wikidata.json.gz

Used window symbols: 3563 (10.8734 %)
Used window symbols: 6877 (20.9869 %)
Used window symbols: 6368 (19.4336 %)
Used window symbols: 3953 (12.0636 %)
Used window symbols: 4578 (13.9709 %)
Used window symbols: 5168 (15.7715 %)
Used window symbols: 3739 (11.4105 %)
Used window symbols: 4666 (14.2395 %)
Used window symbols: 3550 (10.8337 %)
Used window symbols: 4737 (14.4562 %)
[...]
Used window symbols: 6895 (21.0419 %)
Used window symbols: 6308 (19.2505 %)
Used window symbols: 4892 (14.9292 %)
Used window symbols: 3910 (11.9324 %)
Used window symbols: 4179 (12.7533 %)
Used window symbols: 4240 (12.9395 %)
Used window symbols: 6461 (19.7174 %)
Used window symbols: 5772 (17.6147 %)
Used window symbols: 6434 (19.635 %)
Used window symbols: 6706 (20.4651 %)
Used window symbols: 5852 (17.8589 %)
Used window symbols: 7716 (23.5474 %)
Used window symbols: 4012 (12.2437 %)
Used window symbols: 5446 (16.6199 %)
Used window symbols: 7760 (23.6816 %)
Used window symbols: 3215 (9.8114 %)
[...]

It looks like this could save 80-98% of data.

But doing this kind of accounting will slow down decompression even further and adds more complexity. For example, the output above is only for blocks that decompress to more than 32 KiB so that I know that the window will not be used directly anymore. But there are also blocks that decompress to less than 32 KiB, which also need to be handled.

And then, some heuristic is necessary to find good seek points even if they aren't exactly distributed every 4 MiB. I guess in some approximation, seek points that do not require any window at all can always be created because they are basically free, e.g. before non-compressed blocks.
