Conversation
This only changes the hash function. Multi-context seeds are not used during lookup.
If I understand correctly, the only change in this PR is to use … instead of …, and you observe a slight speedup from this change? I guess it could happen if …
approved, btw.
Yes, that’s the only relevant change.
Yes, the addition is a single machine instruction, whereas the new function has more than 10 instructions or so. I just realized that this is probably faster because the randstrobes in the index become ordered in a way that makes us use the cache better. Let’s say you have a query with syncmers A, B, C and the strobemers are paired up as A-C, B-C. If C has the lowest hash value of all three, it will be the main hash in both cases. Since entries in the index are primarily sorted by main hash, the entries for A-C and B-C are very close in the index. So after A-C has been looked up, B-C is likely already in the cache. Here is the profiler output for the old and new hash functions (output omitted). So the number of instructions, cycles and branches all go up with the new function, which is what we expect, but the number of cache misses goes down, which in this case has the greatest effect.
I see, clever!
Would it then make sense to sort the query seeds by hash before the lookups, or would sorting and “de-sorting” for find_nams (if needed) kill the gains?
Each strobe will occur as the main strobe two times (in expectation) in each direction, if we ignore singleton syncmers at the ends (i.e., this is a better approximation for longer reads). The number of times the same main strobe is seen by chance (as currently) depends on the window parameters (and thus the read length). But I guess the fraction of the same main strobe occurring twice in a row is very low, particularly for longer reads. For the shortest reads it may be more common, as the window starts at the immediate downstream window. Did you see a higher speedup for the shorter reads? Extending the thought: then we might as well sort forward and reverse-complement seeds together before lookup, increasing the expected repetitive stretch of the same main strobe to four. I have completely ignored eventual downstream complications (when merging matches) in the above reasoning for now, as find_nams requires the matches to be sorted with respect to query coordinates.
It’s a Linux-only tool, but there’s probably something similar for macOS.
For an individual query, probably not so much. Since there are relatively few randstrobes per query, you will still read from very different parts of memory even if you sort by hash. Accessing monotonically increasing memory addresses per se doesn’t help; an advantage exists only when you access the same cache line more than once. (Cache lines are typically 64 bytes.) Maybe obtaining all randstrobes for a large number of reads beforehand and then sorting them by hash could give an advantage. We could also use prefetching (as suggested in #203).
Note that it doesn’t have to be consecutive: Even if the second access is somewhere downstream, it’s still likely in the cache.
I cannot say because the speedup is just 1-4%, so it’s hard to measure.
That could work, but it’s probably not necessary, as I’m guessing that the entries in the index from the forward pass over the query are probably still in the cache during the reverse pass. Also, sorting will probably eat up any benefits (I have a data point for this in my next PR).
Adding a note here from an e-mail conversation: sorting on hash value would also mean we don’t have to store a partial-lookup vector/set, which is a change that increases the runtime by about 6% (#426 (comment)). In that case, a variable holding the previous main hash would be sufficient.
I don’t think this will help cache efficiency, but not having to keep track of partial lookups would reduce runtime by a couple of percent. I think we could even do this without any extra cost: we already need to sort the matches by query coordinate in …
Hm, you meant sorting by hash value, but I meant something slightly different. I’ll think about it. |
Switch to multi-context seeds hash function
Again, this is split out from #426.
This adds the `--aux-len` command-line parameter and changes the hash function. Multi-context seeds are not used during lookup. That is, while the contents of the index change, there should be no changes in output (and there aren’t, as far as I can tell).

Interestingly, this version of the hash function makes strobealign slightly faster, about 1–4% depending on the dataset. I’m not sure what is going on, but I’ll take it.