Description
I would like to use indexed_gzip to download only the relevant portions of a .gz file via a ranged GET request. I understand that you cannot use indexed_gzip to download and decompress from an arbitrary point (see #112). However, I am hoping that it is possible to use the index generated by indexed_gzip, made accessible via the seek_points method, to download and decompress a small portion of the larger file that contains the data I'm interested in. This is what I have so far:
import zlib

import indexed_gzip as igzip
import numpy as np


def get_data(gz_path: str, index1: int, index2: int) -> bytes:
    with igzip.IndexedGzipFile(gz_path) as f:
        f.build_full_index()
        seek_points = list(f.seek_points())
    array = np.array(seek_points)
    start = array[index1, 1]
    stop = array[index2, 1]
    # stand-in for a ranged GET request~~~~ #
    with open(gz_path, 'rb') as f:
        f.seek(start)
        compressed = f.read(stop - start)
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #
    decompressed = zlib.decompressobj(-1 * zlib.MAX_WBITS).decompress(compressed)
    return decompressed
This function performs as expected when called as get_data(file_path, 0, 1), but when not starting from the first index location (e.g., get_data(file_path, 1, 2)) the function fails in the zlib decompression step with the message: zlib.error: Error -3 while decompressing data: invalid block type.
I'm guessing that the root of this issue is that I do not fully understand how zlib decompression works and what the required data formatting is. If you have any suggestions on how to modify this function to achieve my goal, I'd appreciate it!
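
For reference, here is a small standard-library-only sketch of what I suspect is going on. It does not use indexed_gzip at all, and the helper names (compress_two_parts, inflate_raw) are made up for illustration. My assumption is that a raw deflate stream can only be resumed mid-stream if the decompressor is given the uncompressed data that preceded the resume point (up to 32 KiB), unless the compressor reset its state there with Z_FULL_FLUSH. I have not confirmed that this is exactly what happens at indexed_gzip's seek points, which may additionally not fall on a byte boundary.

```python
import zlib
from typing import Optional, Tuple


def compress_two_parts(first: bytes, second: bytes, flush_mode: int) -> Tuple[bytes, bytes]:
    """Raw-deflate two chunks, flushing between them with flush_mode.

    Returns (head, tail). head + tail is one complete raw deflate stream,
    and tail starts on a byte-aligned block boundary.
    """
    comp = zlib.compressobj(wbits=-zlib.MAX_WBITS)
    head = comp.compress(first) + comp.flush(flush_mode)
    tail = comp.compress(second) + comp.flush()
    return head, tail


def inflate_raw(chunk: bytes, window: Optional[bytes] = None) -> bytes:
    """Inflate raw deflate data, optionally priming the history window."""
    if window is None:
        decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS)
    else:
        # zdict supplies the uncompressed bytes that preceded the chunk.
        decomp = zlib.decompressobj(wbits=-zlib.MAX_WBITS, zdict=window)
    return decomp.decompress(chunk)


if __name__ == "__main__":
    first = b"The quick brown fox jumps over the lazy dog. "
    second = first  # makes the second chunk refer back to the first

    # Z_FULL_FLUSH resets the compressor, so the tail stands alone.
    _, tail = compress_two_parts(first, second, zlib.Z_FULL_FLUSH)
    print(inflate_raw(tail) == second)

    # Z_SYNC_FLUSH keeps the history, so the tail depends on `first`.
    _, tail = compress_two_parts(first, second, zlib.Z_SYNC_FLUSH)
    try:
        inflate_raw(tail)
    except zlib.error as exc:
        print("failed without window:", exc)
    print(inflate_raw(tail, window=first) == second)
```

If this picture is right, get_data(file_path, 1, 2) fails because the bytes at the second seek point are inflated with no history, while index 0 works only because it is the true start of the deflate data. In that case the fix would need whatever window data the index stores for each seek point, not just the offsets.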