-
Notifications
You must be signed in to change notification settings - Fork 24
Switch to asymmetric randstrobes #492
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
By always using the first syncmer as base Is-new-baseline: yes
... as it is now always true. The results change because the hash contains one more bit. Is-new-baseline: yes
Since the first strobe is now always the main one, partial_start is always equal to start and partial_end is always equal to start + k.
We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. Since a partial hit is now always the full hit reduced to the first syncmer, this cannot happen anymore.
We lost some accuracy when switching to asymmetric randstrobes, and with this commit, we recover all of it. That is, use of asymmetric randstrobes becomes as accurate as using symmetric randstrobes. The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe *or its reversed version* occur in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the highly repetitive chrY of CHM13). Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count. This involves some extra hash computations and additional index lookups, but overall, strobealign actually becomes faster (up to 30% of chrY for certain read lengths, more moderate improvements for the other references), likely because fewer "spurious" hits are produced Is-new-baseline: yes
|
Wow, impressive speed-ups AND accuracy improvement on SIM3 CHM13. Then we will probably observe even better accuracy improvements on SIM5. Very nice work! Approved!! Update: I already had a look at the commits in this branch as stated over email. One idea I had was to potentially further speed up filtering checks by checking if not filtered using prefix vector like: Then checking the full vector if needed. One can also upper bound the FW and RC counts before the other calls as This does not have to be investigated/incorporated before merging this PR though. |
|
I more closely looked into the runtime improvement. First, I noticed that the runtimes for Updated plots: ends-se-time.pdf This PR is still an improvement, but now mainly for the short reads up to 100 bp (and also on maize). I also did some profiling on sim5 maize 150 to see which parts get faster and which get slower.
|
|
Alright! Seems like it is what we expect then, which is good to confirm. As for the benchmark on maize sim5, the extension alignment time will be quite dominant since nearly all the map sites will try extension because of frequent indels (esp for longer reads because we to a semi-global alignment). These are the datasets where minimap2 (due to its piece-wise extension) catch up with our runtime. As seen in the plots, the mapping-only and extension alignment time curves are very far from each other. However, when looking at at the sim0, the curves are quite close. While any further big runtime gains will come from an advancement in extension alignment (on most datasets), the 4% spent in find_hits on sim5 might be 8% on sim0 (150nt) and may even be a bigger fraction for, e.g., 50nt reads. So it might still be worth to, at some point, looking into if there is an easy way to speed up find hits as discussed over email, but I don't think its a high priority task unless you want to investigate it. |
Is-new-baseline: yes
1ce2252 to
8ae88ef
Compare
Switch to asymmetric randstrobes
This makes strobealign use asymmetric randstrobe hashes (by always using the first syncmer as "base") instead of symmetric ones. The accuracy is identical to what we get with symmetric hashes, except for very short reads, where it becomes slightly higher. We also get better runtime and simplified code. Details below.
Accuracy
We have earlier had the problem that switching to asymmetric hashes reduced accuracy somewhat. Commit 6f30807 solves this.
The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe or its reversed version occur in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the
highly repetitive chrY of CHM13).
Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count.
See runtime measurements in ends-se-accuracy.pdf (MCS-BR4 is this PR).
Speed
Counting how often the "reversed randstrobe" occurs involves additional index lookups, but overall, strobealign actually becomes faster, with no to moderate improvements for CHM13 and drosophila and significant improvements for the other references (maximum speedup is ~30% for chrY). This is likely due to fewer spurious hits (as was the intention), but I haven’t measured which parts exactly get faster.
See ends-se-time.pdf
Code simplification
QueryRandstrobeattributesfirst_strobe_is_main,partial_startandpartial_endare gone because the first strobe is now always main (thus partial_start is alwaysequal to start and partial_end is always equal to start + k).
partial_queriedvector and thePartialHitstruct are gone. We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. But since a partial hit is now always the full hit reduced to its first syncmer, this cannot happen anymore.