Rework the CRAM bit encoding example and clarify text #820
Conversation
CRAMv3.tex (outdated)
For example, we may be reading bits using a BETA encoding whose parameters
indicate each value is 6 bits.
So we read the next 6 bits into a 32-bit integer to get a value
between 0 and 31.
Suggested change:
- between 0 and 31.
+ between 0 and 63.
Sigh! A rookie mistake. Thanks
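For anyone wanting to see the corrected example concretely, here is a minimal sketch of a decoder pulling those 6 bits into an integer; the `bit_reader`/`read_bits` names are mine rather than anything from the spec or htslib, and it assumes the usual high-bit-first order within each byte:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal bit-reader state: a byte buffer plus a bit cursor (sketch only). */
typedef struct {
    const uint8_t *buf;   /* encoded byte stream         */
    size_t bit;           /* next bit to read, MSB first */
} bit_reader;

/* Read n bits (n <= 32) from the stream, most significant bit first,
 * and return them as an unsigned 32-bit integer. */
static uint32_t read_bits(bit_reader *br, int n) {
    uint32_t val = 0;
    for (int i = 0; i < n; i++) {
        uint8_t byte = br->buf[br->bit >> 3];
        val = (val << 1) | ((byte >> (7 - (br->bit & 7))) & 1);
        br->bit++;
    }
    return val;
}

int main(void) {
    /* First 6 bits are 101101 = 45, i.e. a value in the 0..63 range. */
    const uint8_t stream[] = { 0xB4 };   /* 1011 0100 */
    bit_reader br = { stream, 0 };
    printf("%u\n", read_bits(&br, 6));   /* prints 45 */
    return 0;
}
```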
The bit stream itself does not explicitly store the number of bits
per value, and it will vary by context, so we must know this by other means.
This answers #812.
The examples that follow are useful but would be better defined with their respective codecs. E.g., § 13.5 "Beta encoding: codec ID 6" should define that the resulting value from the bit stream is a 32-bit integer, meaning the length value is limited to 32 (or 31 if the value is always supposed to be non-negative?). This is the context required to meaningfully read from the bit stream.
Well, CRAM defines the encoded format and is rather agnostic as to the type of integer used to hold the decoded value. I think this is a good thing, as being overly prescriptive would forbid us from moving to a larger integer size. E.g. SAM dictates 32-bit for some values when frankly it's just ASCII and totally agnostic to bit sizes, so we could happily use CRAM for storing data aligned against long chromosomes. Htslib actually permits this, going against the precise wording of the SAM spec, as a workaround for deficiencies in BAM, although I'm not aware of anyone using it (despite it being pretty performant).
I'd argue you don't necessarily have to know whether it's 32-bit or 64-bit if you just use a 64-bit data type to decode it, as it'll work regardless. (Well, 63 bits plus sign, but frankly the difference between 63-bit and 64-bit is highly unlikely to ever crop up in real-world data.)
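As a sketch of that point (the names are hypothetical, and it assumes BETA decodes as the raw bits minus the codec's offset), decoding into a 64-bit signed type works regardless of the width the writer had in mind:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: decode into a 64-bit signed type and the same code
 * works whether the values were written with 32-bit or 64-bit producers in
 * mind.  Assumes BETA decodes as (raw bits) - offset. */

typedef struct { const uint8_t *buf; size_t bit; } bit_reader;

/* MSB-first read of n bits (n <= 63) into an unsigned 64-bit accumulator. */
static uint64_t read_bits64(bit_reader *br, int n) {
    uint64_t v = 0;
    while (n--) {
        v = (v << 1) | ((br->buf[br->bit >> 3] >> (7 - (br->bit & 7))) & 1);
        br->bit++;
    }
    return v;
}

typedef struct {
    int64_t offset;  /* BETA parameter: subtracted after reading */
    int     nbits;   /* BETA parameter: bits per value           */
} beta_codec;

static int64_t beta_decode(const beta_codec *c, bit_reader *br) {
    return (int64_t)read_bits64(br, c->nbits) - c->offset;
}

int main(void) {
    const uint8_t stream[] = { 0xB4 };          /* 101101.. = 45 in 6 bits */
    bit_reader br = { stream, 0 };
    beta_codec c = { 0, 6 };
    printf("%lld\n", (long long)beta_decode(&c, &br));  /* prints 45 */
    return 0;
}
```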
force-pushed from 7cb8ea0 to fa12e46
force-pushed from fa12e46 to 0d7f877
Given that it's working in bits, the example is much clearer if we describe the bits rather than hex values, especially distinguishing set/unset bits from bits yet to be used.
I also reworked the note at the end of the section, as it was quite hard to follow. I gave it real examples of BETA and HUFFMAN to clarify what is meant by the decoder needing to know the number of bits to consume.
Fixes #812.
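To make the BETA/HUFFMAN contrast in that note concrete: BETA always consumes the fixed number of bits given in its codec parameters, whereas HUFFMAN keeps consuming bits until the prefix read so far matches a codeword, so the bit count is a property of the value itself. A hypothetical sketch (the three-symbol code table and names are mine, not from the spec):

```c
#include <stdint.h>
#include <stdio.h>

/* A made-up three-symbol prefix code, purely to illustrate that the number
 * of bits per value varies: 7 -> "0" (1 bit), 8 -> "10", 9 -> "11" (2 bits). */
typedef struct { uint32_t code; int len; int64_t symbol; } hcode;
static const hcode table[] = {
    { 0x0, 1, 7 },  /* 0  */
    { 0x2, 2, 8 },  /* 10 */
    { 0x3, 2, 9 },  /* 11 */
};

/* MSB-first bit reader over a byte buffer (sketch, as before). */
typedef struct { const uint8_t *buf; size_t bit; } bit_reader;
static int next_bit(bit_reader *br) {
    int b = (br->buf[br->bit >> 3] >> (7 - (br->bit & 7))) & 1;
    br->bit++;
    return b;
}

/* HUFFMAN-style decode: keep pulling bits until the prefix read so far
 * matches a codeword.  Assumes well-formed input, so it always terminates. */
static int64_t huffman_decode(bit_reader *br) {
    uint32_t code = 0;
    int len = 0;
    for (;;) {
        code = (code << 1) | (uint32_t)next_bit(br);
        len++;
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
            if (table[i].len == len && table[i].code == code)
                return table[i].symbol;
    }
}

int main(void) {
    const uint8_t stream[] = { 0x58 };  /* bits: 0 10 11 000 -> 7, 8, 9 */
    bit_reader br = { stream, 0 };
    for (int i = 0; i < 3; i++)
        printf("%lld\n", (long long)huffman_decode(&br));  /* 7, 8, 9 */
    return 0;
}
```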