Skip to content

Fix the rANS 4x8 order-1 frequency encoding example. #819

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 7, 2025

Conversation

jkbonfield
Copy link
Contributor

The text erroneously divided 43 into 11,11,11,10 when it would be 10,10,10,13.

Instead of 4x11 minus 1 I chose a new example of 4x11 plus 1, so the word boundaries ended at the same points.

Fixes #817

Copy link

Changed PDFs as of 6259dcd: CRAMcodecs (diff).

CRAMcodecs.tex Outdated
Comment on lines 455 to 458
a & a & 614 \\
& b & 1638 \\
& c & 819 \\
& d & 1024 \\
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a & a & 614 \\
& b & 1638 \\
& c & 819 \\
& d & 1024 \\
a & a & 614 \\
& b & 1639 \\
& c & 819 \\
& d & 1023 \\

I know the spec doesn't define the adjustment to the normalized sum,1 but consider being consistent with the (implicit) method used in the order-0 example, i.e., the symbol with the highest frequency is adjusted.

Footnotes

  1. § 2.1 "Frequency table" (2023-03-15): "Given rounding differences when renormalising to a fixed sum, it is up to the encoder to decide how to distribute any remainder or remove excess frequencies."

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy to change it. I think neither of these are actually the values that came out of htscodecs (until I modified it a bit anyway) as it was dividing by 21 instead of 20 initially due to an extra line for incrementing the first (or last, I forget) character frequency.

Infact I'm wondering if that's a bug in hist1_4. I see the line that bumped T0[a] to 21 for my little test sequence was from moving something out of a loop:

        while (in < in_end) {
            F0[l][c = *in++]++;
-           T0[l]++;
            l = c;
        }
+       T0[l]++;

I don't see why I did this and it looks wrong! I wonder if I can come up with some strings that break it...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok I can see my code above is a weird quirk, but harmless. It's correct to include that last char in T0, but T0 is not the same as the sum of frequencies for transitions from a->b as that inevitably misses one char. Anyway, it's a harmless thing and it's all an approximation anyway. If the last char is something we've never seen before, not including it in T0 still leads to a correct decode table (as we don't ever have last-char to next-char occuring anyway).

That segway aside, the numbers do work out correctly in my code as I'm strangely scaling to 4096 and then performing the correction to 4095 (which was erroneous and I removed that foible when I did the 4x16 frequency encoding, but it was probably to avoid some overflow in the initial implementation).

Before norm:

T[a]=20
    F[a][a]=3
    F[a][b]=8
    F[a][c]=4
    F[a][d]=5

Multiplicative factor is 204.8 if we divide 4096 by 20, which gives the numbers I have above, exactly summing to 4095 by luck! If we divide 4095/20 we get multiplicative factor of 204.75 which gives 614, 1638, 819, 1023, and then as you say we increase the 1638 to 1639 instead. Arguably this would be a better implementation so I'll change it, but none of this have any material impact on the spec as we are defining the format of the frequency table along with the rules to decode it. It's totally up to the implementation to decide how normalisation should work and what I'm currently doing is a really poor strategy compared to a more meaningful redistribution strategy.

Copy link

Changed PDFs as of 5f24b21: CRAMcodecs (diff).

CRAMcodecs.tex Outdated
Comment on lines 493 to 496
0x61 0x82 0x66 # a <614>
0x62 0x02 0x86 0x66 # b <+2: c,d> <1638>
0x83 0x33 # c (implicit) <819>
0x84 0x00 # d (implicit) <1024>
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
0x61 0x82 0x66 # a <614>
0x62 0x02 0x86 0x66 # b <+2: c,d> <1638>
0x83 0x33 # c (implicit) <819>
0x84 0x00 # d (implicit) <1024>
0x61 0x82 0x66 # a <614>
0x62 0x02 0x86 0x67 # b <+2: c,d> <1639>
0x83 0x33 # c (implicit) <819>
0x83 0xff # d (implicit) <1023>

Copy link

github-actions bot commented Apr 7, 2025

Changed PDFs as of 562cc61: CRAMcodecs (diff).

The text erroneously divided 43 into 11,11,11,10 when it would be
10,10,10,13.

Instead of 4x11 minus 1 I chose a new example of 4x11 plus 1, so the
word boundaries ended at the same points.

Fixes samtools#817
Copy link

github-actions bot commented Apr 7, 2025

Changed PDFs as of 563e8ab: CRAMcodecs (diff).

@jkbonfield jkbonfield merged commit 563e8ab into samtools:master Apr 7, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cram-codecs: rANS 4x8 order-1 frequency table example does not follow description
2 participants