-
Notifications
You must be signed in to change notification settings - Fork 179
Fix the rANS 4x8 order-1 frequency encoding example. #819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Changed PDFs as of 6259dcd: CRAMcodecs (diff). |
CRAMcodecs.tex
Outdated
a & a & 614 \\ | ||
& b & 1638 \\ | ||
& c & 819 \\ | ||
& d & 1024 \\ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a & a & 614 \\ | |
& b & 1638 \\ | |
& c & 819 \\ | |
& d & 1024 \\ | |
a & a & 614 \\ | |
& b & 1639 \\ | |
& c & 819 \\ | |
& d & 1023 \\ |
I know the spec doesn't define the adjustment to the normalized sum,1 but consider being consistent with the (implicit) method used in the order-0 example, i.e., the symbol with the highest frequency is adjusted.
Footnotes
-
§ 2.1 "Frequency table" (2023-03-15): "Given rounding differences when renormalising to a fixed sum, it is up to the encoder to decide how to distribute any remainder or remove excess frequencies." ↩
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm happy to change it. I think neither of these are actually the values that came out of htscodecs (until I modified it a bit anyway) as it was dividing by 21 instead of 20 initially due to an extra line for incrementing the first (or last, I forget) character frequency.
Infact I'm wondering if that's a bug in hist1_4
. I see the line that bumped T0[a] to 21 for my little test sequence was from moving something out of a loop:
while (in < in_end) {
F0[l][c = *in++]++;
- T0[l]++;
l = c;
}
+ T0[l]++;
I don't see why I did this and it looks wrong! I wonder if I can come up with some strings that break it...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I can see my code above is a weird quirk, but harmless. It's correct to include that last char in T0, but T0 is not the same as the sum of frequencies for transitions from a->b as that inevitably misses one char. Anyway, it's a harmless thing and it's all an approximation anyway. If the last char is something we've never seen before, not including it in T0 still leads to a correct decode table (as we don't ever have last-char to next-char occuring anyway).
That segway aside, the numbers do work out correctly in my code as I'm strangely scaling to 4096 and then performing the correction to 4095 (which was erroneous and I removed that foible when I did the 4x16 frequency encoding, but it was probably to avoid some overflow in the initial implementation).
Before norm:
T[a]=20
F[a][a]=3
F[a][b]=8
F[a][c]=4
F[a][d]=5
Multiplicative factor is 204.8 if we divide 4096 by 20, which gives the numbers I have above, exactly summing to 4095 by luck! If we divide 4095/20 we get multiplicative factor of 204.75 which gives 614, 1638, 819, 1023, and then as you say we increase the 1638 to 1639 instead. Arguably this would be a better implementation so I'll change it, but none of this have any material impact on the spec as we are defining the format of the frequency table along with the rules to decode it. It's totally up to the implementation to decide how normalisation should work and what I'm currently doing is a really poor strategy compared to a more meaningful redistribution strategy.
6259dcd
to
5f24b21
Compare
Changed PDFs as of 5f24b21: CRAMcodecs (diff). |
CRAMcodecs.tex
Outdated
0x61 0x82 0x66 # a <614> | ||
0x62 0x02 0x86 0x66 # b <+2: c,d> <1638> | ||
0x83 0x33 # c (implicit) <819> | ||
0x84 0x00 # d (implicit) <1024> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0x61 0x82 0x66 # a <614> | |
0x62 0x02 0x86 0x66 # b <+2: c,d> <1638> | |
0x83 0x33 # c (implicit) <819> | |
0x84 0x00 # d (implicit) <1024> | |
0x61 0x82 0x66 # a <614> | |
0x62 0x02 0x86 0x67 # b <+2: c,d> <1639> | |
0x83 0x33 # c (implicit) <819> | |
0x83 0xff # d (implicit) <1023> |
5f24b21
to
562cc61
Compare
Changed PDFs as of 562cc61: CRAMcodecs (diff). |
The text erroneously divided 43 into 11,11,11,10 when it would be 10,10,10,13. Instead of 4x11 minus 1 I chose a new example of 4x11 plus 1, so the word boundaries ended at the same points. Fixes samtools#817
562cc61
to
563e8ab
Compare
Changed PDFs as of 563e8ab: CRAMcodecs (diff). |
The text erroneously divided 43 into 11,11,11,10 when it would be 10,10,10,13.
Instead of 4x11 minus 1 I chose a new example of 4x11 plus 1, so the word boundaries ended at the same points.
Fixes #817