Add compression for the index format #17

Closed
mxmlnkn opened this issue May 29, 2023 · 2 comments
Labels
enhancement New feature or request

Comments

mxmlnkn (Owner) commented May 29, 2023

For files with very large compression ratios, the index might actually grow larger than the original file because the seek points store 32 KiB of uncompressed data for each chunk (every 4 MiB per default). That is, at 32 KiB of window per 4 MiB of data, any compression ratio > 4 MiB / 32 KiB = 128 means the seek points take up as much space as the original file. However, in these cases, the seek points would also compress very well even without back-references to the previous data. Therefore, the seek points should be compressed. There could be:

  • Compression of the whole index file: Probably not advisable because it would make "Access large indexes directly from the file system" (#10) impossible.
  • Compression of each seek point: The compression ratio might not be optimal but should be sufficient; especially the Huffman coding should help. A sketch of this approach follows after this list.
  • Compressing bunches of seek points, e.g., bunches of 10: More complex to implement.
    • This would make resuming decompression more difficult. If each window is compressed independently, then we could simply prepend the compressed window in front of the actual compressed chunk data and then skip the first window-sized amount of decompressed bytes. This feels almost trivial to implement.
    • In tests with wikidata.json, it didn't improve compression significantly (less than 1%), so for now, don't do this.
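
A minimal sketch of the per-seek-point variant using Python's zlib, assuming each 32 KiB window is stored as an independent raw deflate stream; the function names and constants are illustrative, not part of rapidgzip:

import zlib

WINDOW_SIZE = 32 * 1024  # deflate's maximum back-reference distance

def compress_window(window: bytes) -> bytes:
    # wbits=-15 yields a raw deflate stream without zlib/gzip framing,
    # and each window compresses independently of all others.
    compressor = zlib.compressobj(level=9, wbits=-15)
    return compressor.compress(window) + compressor.flush()

def decompress_window(compressed: bytes) -> bytes:
    return zlib.decompress(compressed, wbits=-15)

# Any single seek point can be restored without touching the rest of the index.
window = bytes(WINDOW_SIZE)  # highly compressible dummy window
compressed = compress_window(window)
assert decompress_window(compressed) == window
print(f"window: {len(window)} B -> {len(compressed)} B compressed")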

Other, unrelated compression ideas:

  • Shave off bytes by analyzing what is actually referenced in the upcoming data. It is very likely that the very next back-reference will not reach 32 KiB back, i.e., to the start of the window.
    • The farthest back-reference distance alone often spans up to 90% of the window and therefore doesn't help compression much on its own, but see below.
  • In combination with compression, we might go even further and replace unused values in the window with zeros in order to increase compressibility (see the sketch after this list). However, keeping track of which values are used and which aren't would become more expensive.
    • For wikidata.json, only ~15% of the symbols in the window are actually referenced. However, after compression, this pruning of the other 85% "only" improves the compressed output by a bit more than 2x; compression by itself already yields an ~8x reduction in size.
    • For 4GiB-base.gz, this pruning shaves off much more even after compression because the windows don't compress well. In total, the index is now ~400 kB instead of 25 MB.
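
A minimal sketch of this pruning, assuming the set of referenced window offsets is already known (tracking those offsets is the expensive part mentioned above); all names and numbers are illustrative:

import zlib

def prune_window(window: bytes, referenced_offsets: set) -> bytes:
    # Keep the window's size and layout so back-reference distances stay
    # valid, but zero out every byte that no later back-reference touches.
    # The resulting zero runs compress far better.
    return bytes(
        byte if offset in referenced_offsets else 0
        for offset, byte in enumerate(window)
    )

# Hypothetical case: only the last 4 KiB of a 32 KiB window are referenced.
window = bytes(range(256)) * 128  # 32 KiB of poorly compressible data
referenced = set(range(28 * 1024, 32 * 1024))
pruned = prune_window(window, referenced)

print("unpruned:", len(zlib.compress(window, 9)), "B")
print("pruned:  ", len(zlib.compress(pruned, 9)), "B")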

All of these optimizations would make the format incompatible with that of indexed_gzip. This is problematic, especially in combination with ratarmount, which currently stores indexes in the indexed_gzip format. However, the shaving off of bytes could be implemented by simply replacing them with zeros so that they compress more easily! Then, ratarmount could compress the whole index file, or even compress it in chunks, and provide a Python file object abstraction to rapidgzip once #10 has been implemented. When implemented correctly, this would keep ratarmount indexes compatible regardless of the backend (indexed_gzip, rapidgzip).
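
For illustration, a sketch of that file-object idea, under the assumption that #10 lets rapidgzip read an index from any file-like object; gzip.GzipFile merely stands in for whatever transparent decompression wrapper ratarmount would use:

import gzip

# The compressed index behaves like a regular read-only binary file:
# read(), seek(), and tell() all work, decompressing transparently on the fly.
with gzip.open("archive.gz.index.gz", "rb") as index_file:
    magic_bytes = index_file.read(8)
    index_file.seek(0)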

Still, for standalone use, the compression per seek point would be useful.

mxmlnkn (Owner, Author) commented Sep 23, 2023

Index Format Outline:

 - Magic Bytes: These don't have to be super short; their size is negligible compared to a window anyway.
 - Format Version
 - Flags
   - Offset Format:
     - Continuous: The encoded chunk size is given by the distance to the next chunk's offset. (This already holds for the decompressed size anyway.)
     - Patches: Chunks are stored as (decompressed offset, encoded offset, encoded size, window offset) tuples.
                Note that this means that, in general, encoded patches could overlap.
                For zip, this is used for zip bombs. Therefore, it might be forbidden in the future,
                although I don't see why it should be. This index format isn't meant for extracting to disk
                anyway, so a decompression bomb like this would only affect runtime, which seems OK to me
                because rapidgzip is meant for multi-gigabyte or even multi-terabyte files anyway.
                It is meant to support, e.g., making all the files inside a zip file accessible as a single
                seekable file object.
     - Continuous without windows: Intended for bzip2 index files
     - Patches without windows:
       -> Maybe instead of a weird enum, use flags to enable the encoded size and window offset tuple members
       -> Same for the CRC32 checksum
         -> Or use an enum for the checksum instead: None, CRC32, Adler, SHA, ...
         -> Store the size of the checksum member so that unknown checksums can simply be ignored.
            Else, we wouldn't know how long a member in the chunk database is.
     - Flag whether chunks are sorted: Could be helpful to enable bisection search based on the encoded offset,
       for whatever reason. Else, it acts kind of like a checksum.
     - Flag whether windows are sorted? Not sure why that could be wanted.
       To decide whether access would be random or not? Or simply as a checksum, i.e., if they are not actually sorted, then error out.
   - Window Compression:
     The windows should either be uncompressed or compressed with the same algorithm as the accompanying
     archive. This is for dependency reasons: decoders that only want to use this for LZ4 files should not
     be required to support all possible compression schemes.
     - None
     - Deflate
     - gzip (in contrast to Deflate, this means that the windows are also checksummed)
     - Deflate + Shared Dictionary (?) 
     - LZ4
     - ZSTD
     - bzip2
 - Compressed File Size
 - Decompressed File Size
 - Number of Chunks
 - Chunk offset / size list in the format according to "Offset Format". Must be sorted by decoded offset.
   All offsets should be unsigned 64-bit.
   The encoded size could probably be 32-bit, but to be safe, simply make it 64-bit as well.
   Each entry should be fixed-size in order to theoretically facilitate bisection search instead of
   having to read all entries (see the sketch after the General Notes).
   - (decoded offset, encoded offset, encoded size, window offset, window size?, CRC32 checksum for this chunk!)
   - (decoded offset, encoded offset, encoded size, window offset, window size?, CRC32 checksum for this chunk!)
   - ...
 - Windows: Raw concatenated windows of varying size. In order to get the required window in O(1), offsets into
            this stream are also stored with the chunk offsets. Because of that window offset, these windows
            could be ordered randomly, or windows might even be reused by different chunks. This is allowed,
            although I suspect that the usual case would be one window per chunk with the windows in the same
            order as the chunks. A lookup sketch follows below.
   - Window 1:
     - Size of this window block, to facilitate jumping to the next window.
       Theoretically, the window offsets above would suffice, so this is more a redundancy that can
       double as a consistency check.
     - Window data
   - Window 2
   - ...
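
A minimal sketch of that O(1) window lookup, assuming each chunk entry carries a (window offset, window size) pair into the concatenated stream; the data and names are illustrative:

# Concatenated window stream as it would be read from the index file.
windows_stream = b"".join([b"window-A...", b"window-B...", b"shared-window"])

# Hypothetical (window offset, window size) pairs taken from the chunk entries.
# The order is arbitrary, and two chunks may share one window.
window_locations = [(0, 11), (11, 11), (22, 13), (22, 13)]

def window_for_chunk(chunk_index: int) -> bytes:
    offset, size = window_locations[chunk_index]
    return windows_stream[offset : offset + size]

assert window_for_chunk(3) == b"shared-window"  # chunks 2 and 3 share a window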

General Notes:

 - All multi-byte numbers in the format described here are stored with
   the least-significant byte first (at the lower memory address).
 - Encoded/compressed offsets and sizes are always given in bits; decompressed offsets and sizes in bytes.
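
Putting the chunk table and these notes together, a sketch of how fixed-size, little-endian entries would enable bisection search over the decoded offsets; the exact field layout here is an assumption, not a finalized format:

import struct

# (decoded offset [B], encoded offset [b], encoded size [b],
#  window offset [B], window size [B], CRC32 of this chunk),
# all little-endian, offsets as unsigned 64-bit integers.
CHUNK_ENTRY = struct.Struct("<QQQQQI")

def find_chunk(entries: bytes, decoded_offset: int) -> tuple:
    # Binary-search the fixed-size entry table for the last chunk starting
    # at or before the requested decoded offset, reading only O(log n) entries.
    def decoded_offset_at(i: int) -> int:
        return CHUNK_ENTRY.unpack_from(entries, i * CHUNK_ENTRY.size)[0]

    lo, hi = 0, len(entries) // CHUNK_ENTRY.size
    while lo < hi:
        mid = (lo + hi) // 2
        if decoded_offset_at(mid) <= decoded_offset:
            lo = mid + 1
        else:
            hi = mid
    return CHUNK_ENTRY.unpack_from(entries, (lo - 1) * CHUNK_ENTRY.size)

# Two dummy entries: chunks beginning at decoded offsets 0 B and 4 MiB.
entries = CHUNK_ENTRY.pack(0, 0, 8000, 0, 0, 0) \
        + CHUNK_ENTRY.pack(4 * 1024**2, 8000, 9600, 0, 32 * 1024, 0)
assert find_chunk(entries, 5 * 1024**2)[0] == 4 * 1024**2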

mxmlnkn (Owner, Author) commented May 27, 2024

Kinda fixed with the gztool index format support. There might still be yet another index format that works across compression formats and fixes some small issues with gztool indexes.

mxmlnkn closed this as completed May 27, 2024