Rework the CRAM bit encoding example and clarify text #820
Conversation
CRAMv3.tex (outdated)
For example, we may be reading bits using a BETA encoding whose parameters
indicate each value is 6 bits.
So we read the next 6 bits into a 32-bit integer to get a value
between 0 and 31.
Suggested change:
- between 0 and 31.
+ between 0 and 63.
Sigh! A rookie mistake. Thanks
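For anyone wanting to see the corrected example concretely, here is a minimal sketch of a decoder pulling those 6 bits into an integer; the `bit_reader`/`read_bits` names are mine rather than anything from the spec or htslib, and it assumes the usual high-bit-first order within each byte:

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal bit-reader state: a byte buffer plus a bit cursor (sketch only). */
typedef struct {
    const uint8_t *buf;   /* encoded byte stream         */
    size_t bit;           /* next bit to read, MSB first */
} bit_reader;

/* Read n bits (n <= 32) from the stream, most significant bit first,
 * and return them as an unsigned 32-bit integer. */
static uint32_t read_bits(bit_reader *br, int n) {
    uint32_t val = 0;
    for (int i = 0; i < n; i++) {
        uint8_t byte = br->buf[br->bit >> 3];
        val = (val << 1) | ((byte >> (7 - (br->bit & 7))) & 1);
        br->bit++;
    }
    return val;
}

int main(void) {
    /* First 6 bits are 101101 = 45, i.e. a value in the 0..63 range. */
    const uint8_t stream[] = { 0xB4 };   /* 1011 0100 */
    bit_reader br = { stream, 0 };
    printf("%u\n", read_bits(&br, 6));   /* prints 45 */
    return 0;
}
```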
The bit stream itself does not explicitly store the number of bits
per value, and it will vary by context, so we must know this by other means.
This answers #812.
The examples that follow are useful but would be better defined with their respective codecs. E.g., § 13.5 "Beta encoding: codec ID 6" should define that the resulting value from the bit stream is a 32-bit integer, meaning the length value is limited to 32 (or 31 if the value is always supposed to be non-negative?). This is the context required to meaningfully read from the bit stream.
Well, CRAM defines the encoded format and is rather agnostic as to the type of integer used to hold the decoded value. I think this is a good thing, as being overly prescriptive would forbid us from moving to a larger integer size. E.g. SAM dictates 32-bit for some values when frankly it's just ASCII and totally agnostic to bit sizes, so we could happily use CRAM for storing data aligned against long chromosomes. Htslib actually permits this, going against the precise wording of the SAM spec, as a workaround for deficiencies in BAM, although I'm not aware of anyone using it (despite it being pretty performant).
I'd argue you don't necessarily have to know whether it's 32-bit or 64-bit if you just use a 64-bit data type to decode it, as it'll work regardless. (Well, 63 bits plus sign, but frankly the difference between 63-bit and 64-bit is highly unlikely to ever crop up in real-world data.)
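As a sketch of that point (the names are hypothetical, and it assumes BETA decodes as the raw bits minus the codec's offset), decoding into a 64-bit signed type works regardless of the width the writer had in mind:

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical sketch: decode into a 64-bit signed type and the same code
 * works whether the values were written with 32-bit or 64-bit producers in
 * mind.  Assumes BETA decodes as (raw bits) - offset. */

typedef struct { const uint8_t *buf; size_t bit; } bit_reader;

/* MSB-first read of n bits (n <= 63) into an unsigned 64-bit accumulator. */
static uint64_t read_bits64(bit_reader *br, int n) {
    uint64_t v = 0;
    while (n--) {
        v = (v << 1) | ((br->buf[br->bit >> 3] >> (7 - (br->bit & 7))) & 1);
        br->bit++;
    }
    return v;
}

typedef struct {
    int64_t offset;  /* BETA parameter: subtracted after reading */
    int     nbits;   /* BETA parameter: bits per value           */
} beta_codec;

static int64_t beta_decode(const beta_codec *c, bit_reader *br) {
    return (int64_t)read_bits64(br, c->nbits) - c->offset;
}

int main(void) {
    const uint8_t stream[] = { 0xB4 };          /* 101101.. = 45 in 6 bits */
    bit_reader br = { stream, 0 };
    beta_codec c = { 0, 6 };
    printf("%lld\n", (long long)beta_decode(&c, &br));  /* prints 45 */
    return 0;
}
```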
force-pushed from 7cb8ea0 to fa12e46
force-pushed from fa12e46 to 0d7f877
Given that it's working in bits, the example is much clearer if we describe the bits rather than hex values, especially distinguishing set/unset bits from bits yet to be used.
I also reworked the note at the end of the section, as it was quite hard to follow. I gave it real examples of BETA and HUFFMAN to clarify what is meant by the decoder needing to know the number of bits to consume.
Fixes #812.
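To make the BETA/HUFFMAN contrast in that note concrete: BETA always consumes the fixed number of bits given in its codec parameters, whereas HUFFMAN keeps consuming bits until the prefix read so far matches a codeword, so the bit count is a property of the value itself. A hypothetical sketch (the three-symbol code table and names are mine, not from the spec):

```c
#include <stdint.h>
#include <stdio.h>

/* A made-up three-symbol prefix code, purely to illustrate that the number
 * of bits per value varies: 7 -> "0" (1 bit), 8 -> "10", 9 -> "11" (2 bits). */
typedef struct { uint32_t code; int len; int64_t symbol; } hcode;
static const hcode table[] = {
    { 0x0, 1, 7 },  /* 0  */
    { 0x2, 2, 8 },  /* 10 */
    { 0x3, 2, 9 },  /* 11 */
};

/* MSB-first bit reader over a byte buffer (sketch, as before). */
typedef struct { const uint8_t *buf; size_t bit; } bit_reader;
static int next_bit(bit_reader *br) {
    int b = (br->buf[br->bit >> 3] >> (7 - (br->bit & 7))) & 1;
    br->bit++;
    return b;
}

/* HUFFMAN-style decode: keep pulling bits until the prefix read so far
 * matches a codeword.  Assumes well-formed input, so it always terminates. */
static int64_t huffman_decode(bit_reader *br) {
    uint32_t code = 0;
    int len = 0;
    for (;;) {
        code = (code << 1) | (uint32_t)next_bit(br);
        len++;
        for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
            if (table[i].len == len && table[i].code == code)
                return table[i].symbol;
    }
}

int main(void) {
    const uint8_t stream[] = { 0x58 };  /* bits: 0 10 11 000 -> 7, 8, 9 */
    bit_reader br = { stream, 0 };
    for (int i = 0; i < 3; i++)
        printf("%lld\n", (long long)huffman_decode(&br));  /* 7, 8, 9 */
    return 0;
}
```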