Conversation

@marcelm marcelm commented Mar 28, 2025

This makes strobealign use asymmetric randstrobe hashes (by always using the first syncmer as "base") instead of symmetric ones. The accuracy is identical to what we get with symmetric hashes, except for very short reads, where it becomes slightly higher. We also get better runtime and simplified code. Details below.
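For intuition, a toy sketch (not strobealign's actual hash functions) of the difference: a symmetric combination yields the same value regardless of which syncmer comes first, while the asymmetric variant always treats the first syncmer as the base, so strobe order matters.

```python
def symmetric_hash(s1: int, s2: int) -> int:
    # XOR is commutative: symmetric_hash(a, b) == symmetric_hash(b, a)
    return s1 ^ s2

def asymmetric_hash(s1: int, s2: int) -> int:
    # Order matters: the first syncmer is always the "base".
    # The multiplier is an arbitrary 64-bit odd constant for illustration.
    return ((s1 * 0x9E3779B97F4A7C15) ^ s2) & 0xFFFFFFFFFFFFFFFF

assert symmetric_hash(3, 5) == symmetric_hash(5, 3)
assert asymmetric_hash(3, 5) != asymmetric_hash(5, 3)
```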

Accuracy

We previously had the problem that switching to asymmetric hashes reduced accuracy somewhat. Commit 6f30807 solves this.

The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe or its reversed version occur in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the
highly repetitive chrY of CHM13).

Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count.
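The restored rule can be sketched as follows (a minimal illustration; `index_counts`, the hash values, and the cutoff are made up for the example, not strobealign's actual API):

```python
def should_filter(index_counts: dict, fw_hash: int, rc_hash: int,
                  filter_cutoff: int) -> bool:
    # Filter a hit only when the forward and reversed randstrobe
    # *together* occur more than filter_cutoff times in the reference,
    # so neither direction is privileged.
    total = index_counts.get(fw_hash, 0) + index_counts.get(rc_hash, 0)
    return total > filter_cutoff

counts = {0xAB: 90, 0xBA: 30}  # toy occurrence counts (fw, rc)
assert should_filter(counts, 0xAB, 0xBA, filter_cutoff=100)      # 120 > 100
assert not should_filter(counts, 0xAB, 0xBA, filter_cutoff=150)  # 120 <= 150
```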

See accuracy measurements in ends-se-accuracy.pdf (MCS-BR4 is this PR).

Speed

Counting how often the "reversed randstrobe" occurs involves additional index lookups, but overall, strobealign actually becomes faster: no to moderate improvements for CHM13 and Drosophila, and significant improvements for the other references (maximum speedup is ~30% for chrY). This is likely due to fewer spurious hits (as was the intention), but I haven’t measured which parts exactly get faster.

See ends-se-time.pdf

Code simplification

  • The QueryRandstrobe attributes first_strobe_is_main, partial_start and partial_end are gone because the first strobe is now always main (thus partial_start is always
    equal to start and partial_end is always equal to start + k).
  • The partial_queried vector and the PartialHit struct are gone. We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. But since a partial hit is now always the full hit reduced to its first syncmer, this cannot happen anymore.
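The invariant behind the first simplification can be illustrated with a Python stand-in for the C++ struct (field names follow the PR text; everything else is illustrative): since the first strobe is always the main one, the partial-hit span is derivable and need not be stored.

```python
from dataclasses import dataclass

@dataclass
class QueryRandstrobe:
    hash: int
    start: int  # position of the first (main) syncmer
    end: int

def partial_span(rs: QueryRandstrobe, k: int) -> tuple:
    # A partial hit is the full hit reduced to its first syncmer,
    # so it always spans [start, start + k).
    return rs.start, rs.start + k

rs = QueryRandstrobe(hash=0xDEADBEEF, start=10, end=42)
assert partial_span(rs, k=20) == (10, 30)
```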

marcelm added 5 commits March 28, 2025 06:17

  • By always using the first syncmer as base.
    Is-new-baseline: yes
  • ... as it is now always true. The results change because the hash contains one more bit.
    Is-new-baseline: yes
  • Since the first strobe is now always the main one, partial_start is always equal to start and partial_end is always equal to start + k.
  • We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. Since a partial hit is now always the full hit reduced to the first syncmer, this cannot happen anymore.
  • We lost some accuracy when switching to asymmetric randstrobes, and with this commit, we recover all of it. That is, use of asymmetric randstrobes becomes as accurate as using symmetric randstrobes.
    The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe *or its reversed version* occurs in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the highly repetitive chrY of CHM13).
    Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count.
    This involves some extra hash computations and additional index lookups, but overall, strobealign actually becomes faster (up to 30% on chrY for certain read lengths, more moderate improvements for the other references), likely because fewer "spurious" hits are produced.
    Is-new-baseline: yes
ksahlin commented Mar 28, 2025

Wow, impressive speed-ups AND accuracy improvement on SIM3 CHM13. Then we will probably observe even better accuracy improvements on SIM5.

Very nice work! Approved!!

Update:

I already had a look at the commits in this branch, as stated over email. One idea I had was to potentially speed up the filtering checks further by first checking whether a hit is not filtered using a prefix vector, like:

def is_not_filtered_using_prefix_vector(query_hash):
    top_N_hash = query_hash >> (64 - bits)
    count = randstrobe_start_indices[top_N_hash + 1] - randstrobe_start_indices[top_N_hash]
    return count < filter_cutoff

Then checking the full vector if needed.

One can also upper bound the FW and RC counts before the other calls as:

tot_upper_bound_count = (randstrobe_start_indices[top_N_hash + 1] - randstrobe_start_indices[top_N_hash])
                      + (randstrobe_start_indices[top_N_hash_RC + 1] - randstrobe_start_indices[top_N_hash_RC])

This does not have to be investigated/incorporated before merging this PR though.
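A self-contained toy version of the two ideas above (the bucket sizes, `BITS` value, and all names are illustrative; a real prefix vector would use many more bits):

```python
BITS = 2  # toy value: 4 buckets over the top 2 bits of a 64-bit hash

# Cumulative start index per bucket (4 buckets + sentinel);
# bucket sizes here are 3, 0, 5, 1.
randstrobe_start_indices = [0, 3, 3, 8, 9]

def bucket_count(query_hash: int, bits: int = BITS) -> int:
    # Number of index entries sharing the hash's top `bits` bits;
    # this upper-bounds the true count for any hash in the bucket.
    top = query_hash >> (64 - bits)
    return randstrobe_start_indices[top + 1] - randstrobe_start_indices[top]

def is_not_filtered_using_prefix_vector(query_hash: int, filter_cutoff: int) -> bool:
    # A small bucket guarantees the hit passes the filter
    # without consulting the full index.
    return bucket_count(query_hash) < filter_cutoff

def upper_bound_total(fw_hash: int, rc_hash: int) -> int:
    # Upper bound on combined FW + RC occurrences from the prefix vector alone
    return bucket_count(fw_hash) + bucket_count(rc_hash)

h0 = 0 << 62  # falls in bucket 0 (3 entries)
h2 = 2 << 62  # falls in bucket 2 (5 entries)
assert is_not_filtered_using_prefix_vector(h0, filter_cutoff=4)
assert not is_not_filtered_using_prefix_vector(h2, filter_cutoff=4)
assert upper_bound_total(h0, h2) == 8
```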

marcelm commented Mar 31, 2025

I more closely looked into the runtime improvement.

First, I noticed that the runtimes for main I report in the plots above are actually without #489. That is, they are a bit optimistic because they show the combined improvement from both #498 and this PR.

Updated plots: ends-se-time.pdf

This PR is still an improvement, but now mainly for the short reads up to 100 bp (and also on maize).

I also did some profiling on sim5 maize 150 to see which parts get faster and which get slower.

  • Runtime for find_hits goes from 2.7% to 4% of the total, so this part is indeed slower than before, but because it is still a small percentage, any improvement we make here will not have a large impact.
  • 10% more alignments are computed, which also leads to a slowdown (I cannot tell exactly, but runtime seems to go from something like 37% of the total to ~43%).
  • merge_matches_into_nams goes from 12.3% to 7.3%, so this appears to be where some of the speedup comes from (which for this read length of 150 exactly offsets the slowdown).

ksahlin commented Mar 31, 2025

Alright! Seems like it is what we expect then, which is good to confirm.

As for the benchmark on maize sim5, the extension alignment time will be quite dominant since nearly all the mapping sites will try extension because of frequent indels (especially for longer reads, because we do a semi-global alignment). These are the datasets where minimap2 (due to its piece-wise extension) catches up with our runtime. As seen in the plots, the mapping-only and extension alignment time curves are very far from each other. However, when looking at sim0, the curves are quite close.

While any further big runtime gains will come from an advancement in extension alignment (on most datasets), the 4% spent in find_hits on sim5 might be 8% on sim0 (150nt) and may even be a bigger fraction for, e.g., 50nt reads.

So it might still be worth looking, at some point, into whether there is an easy way to speed up find_hits as discussed over email, but I don't think it's a high-priority task unless you want to investigate it.

@marcelm marcelm force-pushed the asymmetric-randstrobes branch from 1ce2252 to 8ae88ef Compare April 6, 2025 13:56
@marcelm marcelm merged commit 4fafa73 into main Apr 8, 2025
11 checks passed
rebjannik pushed a commit to rebjannik/strobealign that referenced this pull request May 17, 2025
@marcelm marcelm deleted the asymmetric-randstrobes branch May 27, 2025 11:08
