
Using seek_points() to obtain valid decompression ranges. #114

Open
@forrestfwilliams


I would like to use indexed_gzip to download only the relevant portions of a .gz file via a ranged GET request. I understand that indexed_gzip cannot be used to download and decompress from an arbitrary point (see #112). However, I am hoping it is possible to use the index that indexed_gzip generates, which is accessible via the seek_points method, to download and decompress only the small portion of the larger file that contains the data I'm interested in. This is what I have so far:

import zlib

import indexed_gzip as igzip
import numpy as np


def get_data(gz_path: str, index1: int, index2: int) -> bytes:
    with igzip.IndexedGzipFile(gz_path) as f:
        f.build_full_index()
        seek_points = list(f.seek_points())

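    # Each seek point is an (uncompressed_offset, compressed_offset) pair,
    # so column 1 holds the byte offsets into the compressed file.
    array = np.array(seek_points)
    start = array[index1, 1]
    stop = array[index2, 1]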

    # stand-in for a ranged GET request~~~~ #
    with open(gz_path, 'rb') as f:
        f.seek(start)
        compressed = f.read(stop - start)
    # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ #

    # Negative wbits selects a raw deflate stream (no gzip or zlib header).
    decompressed = zlib.decompressobj(-1 * zlib.MAX_WBITS).decompress(compressed)
    return decompressed
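
For context, the local read between the marked comments stands in for something like the sketch below. It is not tested against a real server: `build_range_request` and `fetch_range` are hypothetical helper names, the URL is whatever the file is served from, and it assumes the server honors `Range` headers.

```python
import urllib.request


def build_range_request(url: str, start: int, stop: int) -> urllib.request.Request:
    """Build a GET request for the half-open byte interval [start, stop).

    HTTP byte ranges are inclusive at both ends, so [start, stop)
    becomes "bytes=start-(stop - 1)".
    """
    if not 0 <= start < stop:
        raise ValueError("expected 0 <= start < stop")
    headers = {"Range": f"bytes={start}-{stop - 1}"}
    return urllib.request.Request(url, headers=headers)


def fetch_range(url: str, start: int, stop: int) -> bytes:
    """Download bytes [start, stop) of url (hypothetical helper)."""
    request = build_range_request(url, start, stop)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the range was honored; a server that
        # ignores the Range header replies 200 with the whole file instead.
        if response.status != 206:
            raise RuntimeError(f"range not honored, got HTTP {response.status}")
        return response.read()
```

With this, `fetch_range(url, start, stop)` would return the same `stop - start` bytes that the `f.seek(start)` / `f.read(stop - start)` stand-in reads from disk.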

The get_data function performs as expected when called as get_data(file_path, 0, 1). When it does not start from the first seek point (e.g., get_data(file_path, 1, 2)), it fails in the zlib decompression step with the message: zlib.error: Error -3 while decompressing data: invalid block type.

I'm guessing that the root of this issue is that I do not fully understand how zlib decompression works and what data formatting it requires. If you have any suggestions on how to modify this function to achieve my goal, I'd appreciate it!
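
My tentative understanding, which may be wrong, is that a deflate stream can only be resumed mid-stream if the decompressor (a) starts exactly at a block boundary, which is not necessarily byte-aligned, and (b) already holds up to 32 KiB of the preceding uncompressed data, because later blocks can contain back-references into it. The sketch below illustrates point (b) with the standard library only, without using indexed_gzip. Its assumptions: the helper names `deflate_with_sync_points` and `inflate_from` are made up for illustration, and the stream is written with `Z_SYNC_FLUSH` so that the resume points are byte-aligned, which sidesteps point (a) and is not true of arbitrary .gz files.

```python
import random
import zlib

WINDOW_SIZE = 32 * 1024  # maximum back-reference distance in deflate


def deflate_with_sync_points(chunks):
    """Raw-deflate the chunks, sync-flushing after each one.

    Returns the compressed bytes and the byte offset at which each
    chunk's compressed data begins. A sync flush byte-aligns the output
    but keeps the compression history, so later chunks may still refer
    back to earlier ones.
    """
    comp = zlib.compressobj(6, zlib.DEFLATED, -zlib.MAX_WBITS)
    out = bytearray()
    offsets = []
    for chunk in chunks:
        offsets.append(len(out))
        out += comp.compress(chunk) + comp.flush(zlib.Z_SYNC_FLUSH)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def inflate_from(compressed, offset, nbytes, window=b""):
    """Inflate up to nbytes starting at a byte-aligned block boundary.

    `window` is the uncompressed data that precedes the resume point;
    only its last 32 KiB is used, supplied as a preset dictionary.
    """
    if window:
        decomp = zlib.decompressobj(-zlib.MAX_WBITS, zdict=window[-WINDOW_SIZE:])
    else:
        decomp = zlib.decompressobj(-zlib.MAX_WBITS)
    return decomp.decompress(compressed[offset:], nbytes)


# Two identical, incompressible chunks: the second chunk can only be
# encoded compactly as back-references into the first.
rng = random.Random(0)
first = bytes(rng.getrandbits(8) for _ in range(20_000))
compressed, offsets = deflate_with_sync_points([first, first])

# Resuming at the second chunk works once the preceding data is supplied.
restored = inflate_from(compressed, offsets[1], len(first), window=first)
print(restored == first)
```

Calling `inflate_from(compressed, offsets[1], len(first))` without the `window` argument would be the analogue of what get_data does, and I would expect it to fail or return wrong data. If this picture is right, the two compressed offsets from seek_points alone are not enough for my use case: I would also need the window contents for each seek point, plus a way to handle a non-byte-aligned start, which Python's zlib module does not appear to expose.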
