Skip to content

Rework the CRAM bit encoding example and clarify text #820

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 15, 2025

Conversation

jkbonfield
Copy link
Contributor

Given it's working in bits, the example is much clearer if we describe the bits instead of hex, especially distinguishing bits set/unset from bits-yet-to-use.

I also reworked the note at the end of the section as it was quite hard to follow. I gave it real examples of BETA and HUFFMAN to clarify what is meant by the decoder needing to know the number of bits to consume.

Fixes #812.

Copy link

Changed PDFs as of 7cb8ea0: CRAMv3 (diff).

@jmarshall jmarshall added the cram label Mar 25, 2025
CRAMv3.tex Outdated
For example, we may be reading bits using a BETA encoding whose parameters
indicate each value is 6 bits.
So we read the next 6 bits into a 32-bit integer to get a value
between 0 and 31.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
between 0 and 31.
between 0 and 63.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sigh! A rookie mistake. Thanks

Comment on lines +176 to +177
The bit stream itself does not explicitly store the number of bits
per value, and it will vary by context, so we must know this by other means.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This answers #812.

The examples that follow are useful but would be better defined with their respective codecs. E.g., § 13.5 "Beta encoding: codec ID 6" should define the resulting value from the bit stream is a 32-bit integer, meaning the length value is limited to 32 (or 31 if the value is always supposed be non-negative?). This is the context required to meaningfully read from the bit stream.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, CRAM is defining the encoded format and it's rather agnostic as to the type of integer used to hold the decoded value. I think this is a good thing as being overly prescriptive forbids us from moving to a larger integer size. E.g. SAM dictates 32-bit for some values when frankly it's just ASCII. It's totally agnostic to bit sizes and we could happily use CRAM for storing data aligned against long chromosomes. Htslib actually permits this, rallying against the precise wording of the SAM spec, as a workaround for deficiencies in BAM, although I'm not aware of anyone using it (despite it being pretty performant).

I'd argue you don't necessarily have to know if it's 32-bit or 64-bit if you just use a 64-bit data type to decode it as it'll work regardless. (Well 63-bit + sign, as frankly the difference between 63-bit and 64-bit is highly unlikely to ever crop up in real world data.)

Copy link

github-actions bot commented Apr 7, 2025

Changed PDFs as of fa12e46: CRAMv3 (diff).

Given it's working in bits, the example is much clearer if we describe
the bits instead of hex, especially distinguishing bits set/unset from
bits-yet-to-use.

I also reworked the note at the end of the section as it was quite
hard to follow.  I gave it real examples of BETA and HUFFMAN to
clarify what is meant by the decoder needing to know the number of
bits to consume.

Fixes samtools#812.
@jkbonfield jkbonfield merged commit 0d7f877 into samtools:master Apr 15, 2025
1 check passed
Copy link

Changed PDFs as of 0d7f877: CRAMv3 (diff).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cram: Define bit stream value
3 participants