I'm looking into producing a proper CRAMv4.tex document as, unlike v3.1, we have a number of other changes to the layout (32-bit vs 64-bit, dedup of read names, better variable-sized integer encoding, explicitly signed values, MD/NM/RG locator tags, TLEN flag, etc).
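As a minimal sketch of what a variable-sized integer could look like, here is a 7-bits-per-byte encoding with a continuation bit, in the spirit of LEB128. This is purely illustrative; the actual CRAM 4 encoding may well differ, and the function names are made up for this sketch. Explicitly signed values would then typically go through a zigzag-style mapping before the unsigned encoding, though that too is an assumption here.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: a 7-bits-per-byte varint with a continuation bit,
 * similar in spirit to LEB128.  The real CRAM 4 variable-sized integer
 * encoding may differ; this just shows the general idea of replacing
 * fixed-width fields with a size-proportional-to-magnitude encoding. */
static size_t varint_put64(uint8_t *out, uint64_t v) {
    size_t n = 0;
    do {
        uint8_t b = v & 0x7f;
        v >>= 7;
        out[n++] = b | (v ? 0x80 : 0);   /* top bit set => more bytes follow */
    } while (v);
    return n;                            /* number of bytes written */
}

static size_t varint_get64(const uint8_t *in, uint64_t *v) {
    uint64_t val = 0;
    int shift = 0;
    size_t n = 0;
    uint8_t b;
    do {
        b = in[n++];
        val |= (uint64_t)(b & 0x7f) << shift;
        shift += 7;
    } while (b & 0x80);
    *v = val;
    return n;                            /* number of bytes consumed */
}
```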
One thing that occurs to me is that there are an awful lot of unused features in CRAM 3. Right now I very occasionally use BETA (for alignment pos when data is not coordinate sorted), but that could be removed, and my only use of HUFFMAN is a misuse of it for storing constant values (where I construct a huffman tree with one node - the root - giving a single code with 0 bits).
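As a rough illustration of that constant-value trick (struct and function names below are hypothetical, not htslib's API): with a one-symbol alphabet the Huffman code collapses to a single code of length 0 bits, so decoding reads nothing from CORE and simply returns the stored value.

```c
#include <stdint.h>

/* Illustrative sketch of "HUFFMAN as a constant": a codebook with a
 * single symbol gets one code of length 0 bits, so decoding consumes
 * nothing from the CORE block and always yields the same value.
 * Names are made up for illustration, not htslib's actual API. */
typedef struct {
    int64_t value;     /* the single symbol in the alphabet  */
    int     code_len;  /* 0 bits: nothing is read from CORE  */
} const_codec;

static int64_t const_decode(const const_codec *c) {
    /* A real HUFFMAN decoder would read c->code_len bits from CORE
     * here; with code_len == 0 there is nothing to read at all. */
    return c->value;
}
```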
Even from day 1 CRAM had a bunch of unused encodings. We did cull some for CRAM 3.0, but more can go. We don't use the remaining bit-based ones currently because their output all gets mixed together in the CORE block, which (normally) makes compression poor.
There are two logical ways forward:

1. We keep the bit-based codecs, but define them to be able to output to any block, not just CORE.
2. We completely remove all the bit-based encoders, and the CORE block too. Everything becomes EXTERNAL, BYTE_ARRAY_LEN or BYTE_ARRAY_STOP and is byte based, with the compression codecs themselves handling the encoding.
I favour 2. Although it offers less flexibility, we haven't really exploited the bit-based codecs. In theory it could be better to do bit-based packing to an external block and then run an "external" compression tool over it, but we either have to bit-pack with 1, 2 or 4 bits to keep data byte aligned, or we need bit-oriented compression codecs, which are traditionally much slower than byte-oriented ones. (Hence we don't have any right now.)
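To make the byte-alignment point concrete, here is a hypothetical sketch (not code from any CRAM implementation) of packing 1-, 2- or 4-bit symbols into bytes destined for an external block. Because the width divides 8, each output byte is self-contained and a byte-oriented compressor (gzip, bzip2, rANS, ...) still sees a plain byte stream.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical sketch: pack values of width 1, 2 or 4 bits into bytes,
 * suitable for writing to an external block and then compressing with a
 * byte-oriented codec.  A width that doesn't divide 8 would break the
 * byte alignment referred to above. */
static size_t bit_pack(const uint8_t *vals, size_t nvals,
                       int width, uint8_t *out) {
    size_t i, o = 0;
    int shift = 0;
    uint8_t cur = 0;

    for (i = 0; i < nvals; i++) {
        cur |= (uint8_t)((vals[i] & ((1 << width) - 1)) << shift);
        shift += width;
        if (shift == 8) {                /* byte full: emit and reset */
            out[o++] = cur;
            cur = 0;
            shift = 0;
        }
    }
    if (shift)
        out[o++] = cur;                  /* flush the final partial byte */
    return o;                            /* number of bytes written */
}
```

With 2-bit symbols this gives 4 values per byte, so the external compressor never has to straddle a symbol across a byte boundary.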
Other candidates for simplification
- Remove the "Array" type and just make it explicit: int nvals + int[nvals]. Less fluff.
- Remove slice entirely and fix at 1 slice per container. Does that flexibility ever get used? Any real benefit?
- Replace the encoding + type combo with just an encoding, and define type-specific codes: e.g. EXTERNAL -> EXTERNAL_UINT, EXTERNAL_SINT, EXTERNAL_BYTE? Unsure if it buys us much, but it's one level of indirection that could go.
- Ditch block size as uint32 and make it a variable-sized integer like everything else. It's a bizarre outlier right now.
- Strictly enforce one data series = one (or more) block? If so then block IDs could themselves be fixed, instead of needing to look them up in a table. We lose flexibility, but coping with mixed data is a PITA for efficient working and AFAIK it's no longer used that way.
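Sketch for the "Array" point above: with the explicit layout, a parameter array is just a count followed by that many values, e.g. reusing the hypothetical varint_put64() from the earlier sketch. Again purely illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* From the varint sketch earlier in this issue (illustrative only). */
size_t varint_put64(uint8_t *out, uint64_t v);

/* Hypothetical serialisation of the explicit array layout proposed
 * above: a count followed by that many values, with no separate
 * "Array" type in the spec. */
static size_t array_put(uint8_t *out, const uint64_t *vals, size_t nvals) {
    size_t n = varint_put64(out, (uint64_t)nvals);   /* int nvals  */
    for (size_t i = 0; i < nvals; i++)
        n += varint_put64(out + n, vals[i]);         /* int[nvals] */
    return n;                                        /* bytes written */
}
```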