Conversation

@marcelm marcelm commented Mar 28, 2025

This makes strobealign use asymmetric randstrobe hashes (by always using the first syncmer as "base") instead of symmetric ones. The accuracy is identical to what we get with symmetric hashes, except for very short reads, where it becomes slightly higher. We also get better runtime and simplified code. Details below.
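For intuition, a toy sketch (not strobealign's actual hash functions) of the difference: a symmetric combination yields the same value regardless of which syncmer comes first, while the asymmetric variant always treats the first syncmer as the base, so strobe order matters.

```python
def symmetric_hash(s1: int, s2: int) -> int:
    # XOR is commutative: symmetric_hash(a, b) == symmetric_hash(b, a)
    return s1 ^ s2

def asymmetric_hash(s1: int, s2: int) -> int:
    # Order matters: the first syncmer is always the "base".
    # The multiplier is an arbitrary 64-bit odd constant for illustration.
    return ((s1 * 0x9E3779B97F4A7C15) ^ s2) & 0xFFFFFFFFFFFFFFFF

assert symmetric_hash(3, 5) == symmetric_hash(5, 3)
assert asymmetric_hash(3, 5) != asymmetric_hash(5, 3)
```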

Accuracy

We previously had the problem that switching to asymmetric hashes reduced accuracy somewhat. Commit 6f30807 solves this.

The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe or its reversed version occur in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the
highly repetitive chrY of CHM13).

Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count.
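The restored rule can be sketched as follows (a minimal illustration; `index_counts`, the hash values, and the cutoff are made up for the example, not strobealign's actual API):

```python
def should_filter(index_counts: dict, fw_hash: int, rc_hash: int,
                  filter_cutoff: int) -> bool:
    # Filter a hit only when the forward and reversed randstrobe
    # *together* occur more than filter_cutoff times in the reference,
    # so neither direction is privileged.
    total = index_counts.get(fw_hash, 0) + index_counts.get(rc_hash, 0)
    return total > filter_cutoff

counts = {0xAB: 90, 0xBA: 30}  # toy occurrence counts (fw, rc)
assert should_filter(counts, 0xAB, 0xBA, filter_cutoff=100)      # 120 > 100
assert not should_filter(counts, 0xAB, 0xBA, filter_cutoff=150)  # 120 <= 150
```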

See accuracy measurements in ends-se-accuracy.pdf (MCS-BR4 is this PR).

Speed

Counting how often the "reversed randstrobe" occurs involves additional index lookups, but overall, strobealign actually becomes faster: no to moderate improvements for CHM13 and Drosophila, and significant improvements for the other references (maximum speedup is ~30% for chrY). This is likely due to fewer spurious hits (as was the intention), but I haven’t measured which parts exactly get faster.

See ends-se-time.pdf

Code simplification

  • The QueryRandstrobe attributes first_strobe_is_main, partial_start and partial_end are gone because the first strobe is now always main (thus partial_start is always
    equal to start and partial_end is always equal to start + k).
  • The partial_queried vector and the PartialHit struct are gone. We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. But since a partial hit is now always the full hit reduced to its first syncmer, this cannot happen anymore.
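The invariant behind the first simplification can be illustrated with a Python stand-in for the C++ struct (field names follow the PR text; everything else is illustrative): since the first strobe is always the main one, the partial-hit span is derivable and need not be stored.

```python
from dataclasses import dataclass

@dataclass
class QueryRandstrobe:
    hash: int
    start: int  # position of the first (main) syncmer
    end: int

def partial_span(rs: QueryRandstrobe, k: int) -> tuple:
    # A partial hit is the full hit reduced to its first syncmer,
    # so it always spans [start, start + k).
    return rs.start, rs.start + k

rs = QueryRandstrobe(hash=0xDEADBEEF, start=10, end=42)
assert partial_span(rs, k=20) == (10, 30)
```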

marcelm added 5 commits March 28, 2025 06:17

  • By always using the first syncmer as base.
    Is-new-baseline: yes
  • ... as it is now always true. The results change because the hash contains one more bit.
    Is-new-baseline: yes
  • Since the first strobe is now always the main one, partial_start is always equal to start and partial_end is always equal to start + k.
  • We needed this to keep track of earlier hits because newer hits could involve the same syncmer, and we wanted to avoid outputting the hit twice. Since a partial hit is now always the full hit reduced to the first syncmer, this cannot happen anymore.
  • We lost some accuracy when switching to asymmetric randstrobes, and with this commit, we recover all of it. That is, use of asymmetric randstrobes becomes as accurate as using symmetric randstrobes.
    The lost accuracy was due to filtering working differently: With symmetric randstrobes, the decision of whether to filter a hit or not was based on how often the randstrobe *or its reversed version* occurs in the reference (which is the desired behavior). With randstrobes becoming asymmetric, the decision became based on only how often the forward version occurs. This leads to a directional bias and a significant loss in accuracy (very apparent on the highly repetitive chrY of CHM13).
    Here, we restore the old, less biased filtering behavior by explicitly adding up the number of occurrences of the forward and reversed randstrobe and basing the filtering decision on that total count.
    This involves some extra hash computations and additional index lookups, but overall, strobealign actually becomes faster (up to 30% on chrY for certain read lengths, more moderate improvements for the other references), likely because fewer "spurious" hits are produced.
    Is-new-baseline: yes
ksahlin commented Mar 28, 2025

Wow, impressive speed-ups AND accuracy improvement on SIM3 CHM13. Then we will probably observe even better accuracy improvements on SIM5.

Very nice work! Approved!!

Update:

I already had a look at the commits in this branch, as stated over email. One idea I had was to potentially speed up the filtering checks further by first checking whether a hit is not filtered using a prefix vector, like:

def is_not_filtered_using_prefix_vector(query_hash):
    top_N_hash = query_hash >> (64 - bits)
    count = randstrobe_start_indices[top_N_hash + 1] - randstrobe_start_indices[top_N_hash]
    return count < filter_cutoff

Then checking the full vector if needed.

One can also upper bound the FW and RC counts before the other calls as:

tot_upper_bound_count = (randstrobe_start_indices[top_N_hash + 1] - randstrobe_start_indices[top_N_hash])
                      + (randstrobe_start_indices[top_N_hash_RC + 1] - randstrobe_start_indices[top_N_hash_RC])

This does not have to be investigated/incorporated before merging this PR though.
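A self-contained toy version of the two ideas above (the bucket sizes, `BITS` value, and all names are illustrative; a real prefix vector would use many more bits):

```python
BITS = 2  # toy value: 4 buckets over the top 2 bits of a 64-bit hash

# Cumulative start index per bucket (4 buckets + sentinel);
# bucket sizes here are 3, 0, 5, 1.
randstrobe_start_indices = [0, 3, 3, 8, 9]

def bucket_count(query_hash: int, bits: int = BITS) -> int:
    # Number of index entries sharing the hash's top `bits` bits;
    # this upper-bounds the true count for any hash in the bucket.
    top = query_hash >> (64 - bits)
    return randstrobe_start_indices[top + 1] - randstrobe_start_indices[top]

def is_not_filtered_using_prefix_vector(query_hash: int, filter_cutoff: int) -> bool:
    # A small bucket guarantees the hit passes the filter
    # without consulting the full index.
    return bucket_count(query_hash) < filter_cutoff

def upper_bound_total(fw_hash: int, rc_hash: int) -> int:
    # Upper bound on combined FW + RC occurrences from the prefix vector alone
    return bucket_count(fw_hash) + bucket_count(rc_hash)

h0 = 0 << 62  # falls in bucket 0 (3 entries)
h2 = 2 << 62  # falls in bucket 2 (5 entries)
assert is_not_filtered_using_prefix_vector(h0, filter_cutoff=4)
assert not is_not_filtered_using_prefix_vector(h2, filter_cutoff=4)
assert upper_bound_total(h0, h2) == 8
```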

marcelm commented Mar 31, 2025

I more closely looked into the runtime improvement.

First, I noticed that the runtimes for main I report in the plots above are actually without #489. That is, they are a bit optimistic because they show the combined improvement from both #498 and this PR.

Updated plots: ends-se-time.pdf

This PR is still an improvement, but now mainly for the short reads up to 100 bp (and also on maize).

I also did some profiling on sim5 maize 150 to see which parts get faster and which get slower.

  • Runtime for find_hits goes from 2.7% to 4% of the total, so this part is indeed slower than before, but because it is still a small percentage, any improvement we make here will not have a large impact.
  • 10% more alignments are computed, which also leads to a slowdown (I cannot tell exactly, but runtime seems to go from something like 37% of the total to ~43%).
  • merge_matches_into_nams goes from 12.3% to 7.3%, so this appears to be where some of the speedup comes from (which for this read length of 150 exactly offsets the slowdown).

ksahlin commented Mar 31, 2025

Alright! Seems like it is what we expect then, which is good to confirm.

As for the benchmark on maize sim5, the extension alignment time will be quite dominant since nearly all the mapping sites will try extension because of frequent indels (especially for longer reads, because we do a semi-global alignment). These are the datasets where minimap2 (due to its piece-wise extension) catches up with our runtime. As seen in the plots, the mapping-only and extension alignment time curves are very far from each other. However, when looking at sim0, the curves are quite close.

While any further big runtime gains will come from an advancement in extension alignment (on most datasets), the 4% spent in find_hits on sim5 might be 8% on sim0 (150nt) and may even be a bigger fraction for, e.g., 50nt reads.

So it might still be worth looking, at some point, into whether there is an easy way to speed up find_hits as discussed over email, but I don't think it's a high-priority task unless you want to investigate it.

@marcelm marcelm force-pushed the asymmetric-randstrobes branch from 1ce2252 to 8ae88ef Compare April 6, 2025 13:56
@marcelm marcelm merged commit 4fafa73 into main Apr 8, 2025
11 checks passed
rebjannik pushed a commit to rebjannik/strobealign that referenced this pull request May 17, 2025
@marcelm marcelm deleted the asymmetric-randstrobes branch May 27, 2025 11:08
