
Switch to multi-context seeds hash function #447

Merged
marcelm merged 4 commits into main from mcs-hashfunc on Oct 2, 2024

Conversation

@marcelm (Collaborator) commented Oct 1, 2024

Again, this is split out from #426.

This adds the --aux-len command-line parameter and changes the hash function. Multi-context seeds are not used during lookup. That is, while the contents of the index change, there should be no changes in output (and there aren’t as far as I can tell).

Interestingly, this version of the hash function makes strobealign slightly faster, about 1-4% depending on the dataset. I’m not sure what is going on, but I’ll take it.

@ksahlin (Owner) commented Oct 1, 2024

If I understand correctly, the only change in this PR is to use

    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }
    return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));

instead of `return hash1 + hash2;` as the randstrobe hash?

And you observe a slight speedup from this change? I guess it could happen if `return hash1 + hash2` creates some collisions (should be very rare). It doesn't seem that `+` would be a more expensive operation than the other version by itself.

@ksahlin (Owner) commented Oct 1, 2024

approved, btw.

@marcelm (Collaborator, Author) commented Oct 2, 2024

> If I understand correctly, the only change in this PR is to use
>
>     if (hash1 > hash2) {
>         std::swap(hash1, hash2);
>     }
>     return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
>
> instead of `return hash1 + hash2;` as the randstrobe hash?

Yes, that’s the only relevant change.

> And you observe a slight speedup from this change? I guess it could happen if `return hash1 + hash2` creates some collisions (should be very rare). It doesn't seem that `+` would be a more expensive operation than the other version by itself.

Yes, the addition is a single machine instruction, whereas the new function takes roughly ten instructions.

I just realized that this is probably faster because the randstrobes in the index become ordered in a way that makes better use of the cache.

Let’s say you have a query with syncmers A, B, C and strobemers are paired up as A-C, B-C. If C has the lowest hash value of all three, it will be the main hash in both cases. Since entries in the index are primarily sorted by main hash, entries for A-C and B-C are very close in the index. So after A-C has been looked up, B-C is likely already in the cache.
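To make this concrete, here is the combine step quoted above as a standalone function, plus a check that two pairs sharing the main (smallest) hash also share all bits above `aux_len` and therefore land next to each other in a hash-sorted index. The example hash values and the `aux_len` value of 8 are hypothetical; in strobealign `aux_len` comes from `--aux-len`:

```cpp
#include <cstdint>
#include <utility>

// The combine step quoted above, as a standalone function.
// aux_len is a parameter here; in strobealign it comes from --aux-len.
uint64_t combine(uint64_t hash1, uint64_t hash2, int aux_len) {
    if (hash1 > hash2) {
        std::swap(hash1, hash2);  // the smaller hash becomes the main hash
    }
    // High (64 - aux_len) bits come from the main hash (low bits cleared);
    // the low aux_len bits come from the top of the other hash.
    return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
}

// True if two combined hashes share all bits above aux_len, i.e. they
// end up adjacent (or nearly so) in an index sorted by combined hash.
bool same_main_region(uint64_t x, uint64_t y, int aux_len) {
    return (x >> aux_len) == (y >> aux_len);
}
```

With syncmer hashes A, B, C where C is the smallest, `same_main_region(combine(A, C, 8), combine(B, C, 8), 8)` holds, matching the A-C/B-C adjacency described above; like `hash1 + hash2`, the result is also symmetric in its arguments.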

Here is what perf says when I measure the cache misses (and other stats).

Old hash function:

Performance counter stats for '...':

    14,694,060,241      cache-references                                                   
     1,670,029,387      cache-misses                     #   11.365 % of all cache refs    
   752,100,535,203      cycles                                                             
 1,863,007,791,894      instructions                     #    2.48  insn per cycle         
   260,485,816,961      branches                                                           

      43.124037916 seconds time elapsed

     167.170358000 seconds user
       0.343364000 seconds sys

New hash function:

Performance counter stats for '...':

    14,960,029,504      cache-references                                                   
     1,403,230,152      cache-misses                     #    9.380 % of all cache refs    
   733,275,948,339      cycles                                                             
 1,872,996,951,027      instructions                     #    2.55  insn per cycle         
   261,331,524,007      branches                                                           

      42.163663935 seconds time elapsed

     163.632210000 seconds user
       0.379490000 seconds sys

So the number of instructions, cycles and branches all go up with the new function, as expected, but the number of cache misses goes down, which in this case has the greater effect.

@marcelm merged commit e6808d2 into main on Oct 2, 2024
@marcelm deleted the mcs-hashfunc branch on Oct 2, 2024, 06:41
@ksahlin (Owner) commented Oct 2, 2024

I see, clever! perf is something I should make use of.

@ksahlin (Owner) commented Oct 2, 2024

Would it then make sense to sort the query seeds by hash before lookups, or will sorting and 'desorting' for find_nams (if needed) kill the gains?

@ksahlin (Owner) commented Oct 2, 2024

Each strobe will occur as the main strobe twice (in expectation) in each direction, ignoring singleton syncmers at the ends (so this is a better approximation for longer reads).

The number of times the same main is seen by chance (as currently) depends on the window parameters (and thus the read length). But I guess the fraction of the same main occurring twice in a row is very low, particularly for longer reads. For the shortest reads it may be more common, since the window starts immediately downstream. Did you see a higher speedup for the shorter reads?

Extending the thought: then we might as well sort FW and RC seeds together before lookup, increasing the expected repetitive stretch of the same main strobe to four.

I have ignored potential downstream complications (when merging matches) in the above reasoning for now, as find_nams requires matches sorted w.r.t. query coordinates.

@marcelm (Collaborator, Author) commented Oct 2, 2024

> I see, clever! perf is something I should make use of.

It’s a Linux-only tool, but there’s probably something similar for macOS.

> Would it then make sense to sort the query seeds by hash before lookups, or will sorting and 'desorting' for find_nams (if needed) kill the gains?

For an individual query, probably not so much. Since there are relatively few randstrobes per query, you will still read from very different parts of memory even if you sort by hash. Accessing monotonically increasing memory addresses per se doesn’t help; an advantage exists only when you access the same cache line more than once. (Cache lines are something like 64 bytes large.)
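As a back-of-the-envelope illustration of why a second lookup near the first one is almost free (the entry layout below is hypothetical, not strobealign's actual index entry type):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical index entry: a 64-bit hash plus a 32-bit packed position.
// This is an illustration only, not strobealign's actual layout.
struct IndexEntry {
    uint64_t hash;
    uint32_t packed_pos;
};

// Alignment pads the struct to 16 bytes, so a 64-byte cache line holds
// 4 entries: once one entry is fetched, its neighbors in a hash-sorted
// array come along for free.
constexpr std::size_t cache_line_size = 64;
constexpr std::size_t entries_per_line = cache_line_size / sizeof(IndexEntry);
```

So a lookup whose result lands in the same 16-byte neighborhood as a previous one costs no additional memory traffic, while sorting a handful of per-read hashes still leaves each lookup on its own cache line.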

Maybe obtaining all randstrobes for a large number of reads beforehand and then sorting them by hash could give an advantage.

We could also use prefetching (as suggested in #203).
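A minimal sketch of what that prefetching could look like, assuming a hypothetical flat bucket table addressed by the top bits of the hash (not strobealign's actual index layout; `__builtin_prefetch` is a GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of software prefetching during lookups. While query i is being
// processed, we prefetch the bucket that query i+1 will read, hiding
// some of the memory latency. The flat bucket table addressed by the
// top (64 - shift) bits of the hash is a stand-in for the real index.
int count_hits(const std::vector<uint32_t>& buckets, int shift,
               const std::vector<uint64_t>& queries) {
    int hits = 0;
    for (std::size_t i = 0; i < queries.size(); ++i) {
        if (i + 1 < queries.size()) {
            __builtin_prefetch(&buckets[queries[i + 1] >> shift]);
        }
        hits += buckets[queries[i] >> shift] != 0;  // stand-in for the real lookup
    }
    return hits;
}
```

The prefetch is a hint only; correctness does not depend on it, so the sketch behaves identically with the `__builtin_prefetch` line removed.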

@marcelm (Collaborator, Author) commented Oct 2, 2024

> The number of times the same main is seen by chance (as currently) depends on the window parameters (and thus the read length). But I guess the fraction of the same main occurring twice in a row is very low, particularly for longer reads.

Note that it doesn’t have to be consecutive: Even if the second access is somewhere downstream, it’s still likely in the cache.

> For the shortest reads it may be more common, since the window starts immediately downstream. Did you see a higher speedup for the shorter reads?

I cannot say; the speedup is just 1-4%, so differences like that are hard to measure.

> Extending the thought: then we might as well sort FW and RC seeds together before lookup, increasing the expected repetitive stretch of the same main strobe to four.

That could work, but it’s probably not necessary as I’m guessing that the entries in the index from the forward pass over the query are probably still in the cache during the reverse pass.

Also, sorting will probably eat up any benefits (I have a data point for this in my next PR).

@ksahlin (Owner) commented Oct 3, 2024

Adding here a note from email conversation:

Sorting on hash value would also mean we don't have to store a partial-lookup vector/set, which is a change that increases the runtime by about 6% (#426 (comment)). In that case, a variable holding the previous main hash would be sufficient.
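The "previous main hash" idea could look roughly like this (hypothetical loop, not the PR's actual code): once the query seeds are sorted by main hash, a single variable is enough to detect that the current seed hits the same bucket as the previous one, replacing a separate set of already-seen partial hashes:

```cpp
#include <cstdint>
#include <vector>

// Sketch: count how many actual index lookups are needed when query
// seeds arrive sorted by main hash. Equal hashes are adjacent after
// sorting, so one variable replaces a set of seen partial hashes.
int count_main_lookups(const std::vector<uint64_t>& sorted_main_hashes) {
    int lookups = 0;
    uint64_t prev = 0;
    bool have_prev = false;
    for (uint64_t h : sorted_main_hashes) {
        if (have_prev && h == prev) {
            continue;  // same bucket as the previous seed: reuse that result
        }
        prev = h;
        have_prev = true;
        ++lookups;  // the actual index lookup would happen here
    }
    return lookups;
}
```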

@marcelm (Collaborator, Author) commented Oct 3, 2024

I don’t think this will help cache efficiency, but not having to keep track of partial lookups would reduce runtime by a couple of percent.

I think we could even do this without any extra cost: We already need to sort the matches by query coordinate in merge_hits_into_nams. If we instead sort the hits in find_nams (as you suggest), we shouldn’t have to sort again.

@marcelm (Collaborator, Author) commented Oct 3, 2024

Hm, you meant sorting by hash value, but I meant something slightly different. I’ll think about it.

rebjannik pushed a commit to rebjannik/strobealign that referenced this pull request May 17, 2025
Switch to multi-context seeds hash function
