
Switch to multi-context seeds hash function #447

Merged
marcelm merged 4 commits into main from mcs-hashfunc on Oct 2, 2024

Conversation

@marcelm (Collaborator) commented Oct 1, 2024

Again, this is split out from #426.

This adds the --aux-len command-line parameter and changes the hash function. Multi-context seeds are not used during lookup. That is, while the contents of the index change, there should be no changes in output (and there aren’t as far as I can tell).

Interestingly, this version of the hash function makes strobealign slightly faster, about 1-4% depending on the dataset. I’m not sure what is going on, but I’ll take it.

@ksahlin (Owner) commented Oct 1, 2024

If I understand correctly, the only change in this PR is to use

    if (hash1 > hash2) {
        std::swap(hash1, hash2);
    }
    return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));

instead of `return hash1 + hash2;` as the randstrobe hash?

And you observe a slight speedup from this change? I guess it could happen if `return hash1 + hash2` creates some collisions (should be very rare). It doesn't seem that `+` would be a more expensive operation than the other version by itself.

@ksahlin (Owner) commented Oct 1, 2024

approved, btw.

@marcelm (Collaborator, Author) commented Oct 2, 2024

> If I understand correctly, the only change in this PR is to use
>
>     if (hash1 > hash2) {
>         std::swap(hash1, hash2);
>     }
>     return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
>
> instead of `return hash1 + hash2;` as the randstrobe hash?

Yes, that’s the only relevant change.

> And you observe a slight speedup from this change? I guess it could happen if `return hash1 + hash2` creates some collisions (should be very rare). It doesn't seem that `+` would be a more expensive operation than the other version by itself.

Yes, the addition is a single machine instruction, whereas the new function takes roughly ten instructions.

I just realized that this is probably faster because the randstrobes in the index become ordered in a way that makes better use of the cache.

Let’s say you have a query with syncmers A, B, C and strobemers are paired up as A-C, B-C. If C has the lowest hash value of all three, it will be the main hash in both cases. Since entries in the index are primarily sorted by main hash, entries for A-C and B-C are very close in the index. So after A-C has been looked up, B-C is likely already in the cache.
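To make this concrete, here is the combine step quoted above as a standalone function, plus a check that two pairs sharing the main (smallest) hash also share all bits above `aux_len` and therefore land next to each other in a hash-sorted index. The example hash values and the `aux_len` value of 8 are hypothetical; in strobealign `aux_len` comes from `--aux-len`:

```cpp
#include <cstdint>
#include <utility>

// The combine step quoted above, as a standalone function.
// aux_len is a parameter here; in strobealign it comes from --aux-len.
uint64_t combine(uint64_t hash1, uint64_t hash2, int aux_len) {
    if (hash1 > hash2) {
        std::swap(hash1, hash2);  // the smaller hash becomes the main hash
    }
    // High (64 - aux_len) bits come from the main hash (low bits cleared);
    // the low aux_len bits come from the top of the other hash.
    return ((hash1 >> aux_len) << aux_len) ^ (hash2 >> (64 - aux_len));
}

// True if two combined hashes share all bits above aux_len, i.e. they
// end up adjacent (or nearly so) in an index sorted by combined hash.
bool same_main_region(uint64_t x, uint64_t y, int aux_len) {
    return (x >> aux_len) == (y >> aux_len);
}
```

With syncmer hashes A, B, C where C is the smallest, `same_main_region(combine(A, C, 8), combine(B, C, 8), 8)` holds, matching the A-C/B-C adjacency described above; like `hash1 + hash2`, the result is also symmetric in its arguments.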

Here is what perf says when I measure the cache misses (and other stats).

Old hash function:

Performance counter stats for '...':

    14,694,060,241      cache-references                                                   
     1,670,029,387      cache-misses                     #   11.365 % of all cache refs    
   752,100,535,203      cycles                                                             
 1,863,007,791,894      instructions                     #    2.48  insn per cycle         
   260,485,816,961      branches                                                           

      43.124037916 seconds time elapsed

     167.170358000 seconds user
       0.343364000 seconds sys

New hash function:

Performance counter stats for '...':

    14,960,029,504      cache-references                                                   
     1,403,230,152      cache-misses                     #    9.380 % of all cache refs    
   733,275,948,339      cycles                                                             
 1,872,996,951,027      instructions                     #    2.55  insn per cycle         
   261,331,524,007      branches                                                           

      42.163663935 seconds time elapsed

     163.632210000 seconds user
       0.379490000 seconds sys

So the number of instructions, cycles and branches all go up with the new function, as expected, but the number of cache misses goes down, which in this case has the greater effect.

@marcelm merged commit e6808d2 into main on Oct 2, 2024
@marcelm deleted the mcs-hashfunc branch on Oct 2, 2024, 06:41
@ksahlin (Owner) commented Oct 2, 2024

I see, clever! perf is something I should make use of.

@ksahlin (Owner) commented Oct 2, 2024

Would it then make sense to sort the query seeds by hash before lookups, or will sorting and 'desorting' for find_nams (if needed) kill the gains?

@ksahlin (Owner) commented Oct 2, 2024

Each strobe will occur as the main strobe twice (in expectation) in each direction, ignoring singleton syncmers at the ends (so this is a better approximation for longer reads).

The number of times the same main is seen by chance (as currently) depends on the window parameters (and thus the read length). But I guess the fraction of the same main occurring twice in a row is very low, particularly for longer reads. For the shortest reads it may be more common, since the window starts immediately downstream. Did you see a higher speedup for the shorter reads?

Extending the thought: then we might as well sort FW and RC seeds together before lookup, increasing the expected repetitive stretch of the same main strobe to four.

I have ignored potential downstream complications (when merging matches) in the above reasoning for now, as find_nams requires matches sorted w.r.t. query coordinates.

@marcelm (Collaborator, Author) commented Oct 2, 2024

> I see, clever! perf is something I should make use of.

It’s a Linux-only tool, but there’s probably something similar for macOS.

> Would it then make sense to sort the query seeds by hash before lookups, or will sorting and 'desorting' for find_nams (if needed) kill the gains?

For an individual query, probably not so much. Since there are relatively few randstrobes per query, you will still read from very different parts of memory even if you sort by hash. Accessing monotonically increasing memory addresses per se doesn’t help; an advantage exists only when you access the same cache line more than once. (Cache lines are something like 64 bytes large.)
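As a back-of-the-envelope illustration of why a second lookup near the first one is almost free (the entry layout below is hypothetical, not strobealign's actual index entry type):

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical index entry: a 64-bit hash plus a 32-bit packed position.
// This is an illustration only, not strobealign's actual layout.
struct IndexEntry {
    uint64_t hash;
    uint32_t packed_pos;
};

// Alignment pads the struct to 16 bytes, so a 64-byte cache line holds
// 4 entries: once one entry is fetched, its neighbors in a hash-sorted
// array come along for free.
constexpr std::size_t cache_line_size = 64;
constexpr std::size_t entries_per_line = cache_line_size / sizeof(IndexEntry);
```

So a lookup whose result lands in the same 16-byte neighborhood as a previous one costs no additional memory traffic, while sorting a handful of per-read hashes still leaves each lookup on its own cache line.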

Maybe obtaining all randstrobes for a large number of reads beforehand and then sorting them by hash could give an advantage.

We could also use prefetching (as suggested in #203).
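A minimal sketch of what that prefetching could look like, assuming a hypothetical flat bucket table addressed by the top bits of the hash (not strobealign's actual index layout; `__builtin_prefetch` is a GCC/Clang builtin):

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of software prefetching during lookups. While query i is being
// processed, we prefetch the bucket that query i+1 will read, hiding
// some of the memory latency. The flat bucket table addressed by the
// top (64 - shift) bits of the hash is a stand-in for the real index.
int count_hits(const std::vector<uint32_t>& buckets, int shift,
               const std::vector<uint64_t>& queries) {
    int hits = 0;
    for (std::size_t i = 0; i < queries.size(); ++i) {
        if (i + 1 < queries.size()) {
            __builtin_prefetch(&buckets[queries[i + 1] >> shift]);
        }
        hits += buckets[queries[i] >> shift] != 0;  // stand-in for the real lookup
    }
    return hits;
}
```

The prefetch is a hint only; correctness does not depend on it, so the sketch behaves identically with the `__builtin_prefetch` line removed.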

@marcelm (Collaborator, Author) commented Oct 2, 2024

> The number of times the same main is seen by chance (as currently) depends on the window parameters (and thus the read length). But I guess the fraction of the same main occurring twice in a row is very low, particularly for longer reads.

Note that it doesn’t have to be consecutive: Even if the second access is somewhere downstream, it’s still likely in the cache.

> For the shortest reads it may be more common, since the window starts immediately downstream. Did you see a higher speedup for the shorter reads?

I cannot say; the speedup is just 1-4%, so differences like that are hard to measure.

> Extending the thought: then we might as well sort FW and RC seeds together before lookup, increasing the expected repetitive stretch of the same main strobe to four.

That could work, but it’s probably not necessary as I’m guessing that the entries in the index from the forward pass over the query are probably still in the cache during the reverse pass.

Also, sorting will probably eat up any benefits (I have a data point for this in my next PR).

@ksahlin (Owner) commented Oct 3, 2024

Adding here a note from email conversation:

Sorting on hash value would also mean we don't have to store a partial-lookup vector/set, which is a change that increases the runtime by about 6% (#426 (comment)). In that case, a variable holding the previous main hash would be sufficient.
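The "previous main hash" idea could look roughly like this (hypothetical loop, not the PR's actual code): once the query seeds are sorted by main hash, a single variable is enough to detect that the current seed hits the same bucket as the previous one, replacing a separate set of already-seen partial hashes:

```cpp
#include <cstdint>
#include <vector>

// Sketch: count how many actual index lookups are needed when query
// seeds arrive sorted by main hash. Equal hashes are adjacent after
// sorting, so one variable replaces a set of seen partial hashes.
int count_main_lookups(const std::vector<uint64_t>& sorted_main_hashes) {
    int lookups = 0;
    uint64_t prev = 0;
    bool have_prev = false;
    for (uint64_t h : sorted_main_hashes) {
        if (have_prev && h == prev) {
            continue;  // same bucket as the previous seed: reuse that result
        }
        prev = h;
        have_prev = true;
        ++lookups;  // the actual index lookup would happen here
    }
    return lookups;
}
```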

@marcelm (Collaborator, Author) commented Oct 3, 2024

I don’t think this will help cache efficiency, but not having to keep track of partial lookups would reduce runtime by a couple of percent.

I think we could even do this without any extra cost: We already need to sort the matches by query coordinate in merge_hits_into_nams. If we instead sort the hits in find_nams (as you suggest), we shouldn’t have to sort again.

@marcelm (Collaborator, Author) commented Oct 3, 2024

Hm, you meant sorting by hash value, but I meant something slightly different. I’ll think about it.

rebjannik pushed a commit to rebjannik/strobealign that referenced this pull request May 17, 2025
Switch to multi-context seeds hash function
