Conversation
src/randstrobes.cpp
Outdated
| static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) { | ||
| syncmer_hash_t main_hash = std::min(hash1, hash2); | ||
| syncmer_hash_t aux_hash = std::max(hash1, hash2); | ||
| return ((main_hash >> digest_size) << digest_size) ^ (aux_hash >> (64 - digest_size)); |
There was a problem hiding this comment.
Need to pass information of which strobe (1 or 2) that was saved as the base value. probably here
There was a problem hiding this comment.
Added this information to the last bit of the hash
| ) { | ||
| int min_diff = std::numeric_limits<int>::max(); | ||
| for (const auto hash = index.get_hash(position); index.get_hash(position) == hash; ++position) { | ||
| int ref_start = index.get_strobe1_position(position); |
There was a problem hiding this comment.
if using multi-context, we are not sure if it is index.get_strobe1_position(position) or position of the second strobe here. Will yield inconsistent NAMs
bf8abf0 to
8c3dfe4
Compare
0aea3fc to
c4a7f61
Compare
# Conflicts: # src/indexparameters.cpp
b89d123 to
77b5933
Compare
marcelm
left a comment
There was a problem hiding this comment.
This replaces randstrobes with multi-context seeds.
For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.
Where
hash1andhash2are the syncmer hashes, withhash1being the smaller hash.
Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?
The
64 - digest_sizeprefix of the multi-context hash corresponding tohash1is denoted as the main hash, and thedigest_sizesuffix corresponding tohash2is denoted as auxiliary hash.
This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.
- main hash
- auxiliary hash
- digest, digest size (inconsistency: command-line option is
--digest, but what is meant is actually "digest length"). What is a digest anyway? - subhash (used in a comment somewhere)
- multi-context hash
- digest hash
- partial hit
I guess that some of this will resolve itself if we switch to asymmetric randstrobe hashes, so it is not super important to fix this at the moment.
src/index.cpp
Outdated
| } | ||
| stats.tot_strobemer_count = total_randstrobes; | ||
|
|
||
| logger.info() << "Digest parameter is : " << parameters.randstrobe.digest; |
There was a problem hiding this comment.
| logger.info() << "Digest parameter is : " << parameters.randstrobe.digest; | |
| logger.info() << "Digest size: " << parameters.randstrobe.digest; |
There was a problem hiding this comment.
Replaced "digest" with "aux_len" here and in other places.
src/dumpstrobes.cpp
Outdated
| c_set ? c : IndexParameters::DEFAULT, | ||
| max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT | ||
| max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT, | ||
| digest ? digest: IndexParameters::DEFAULT |
There was a problem hiding this comment.
| digest ? digest: IndexParameters::DEFAULT | |
| digest ? digest : IndexParameters::DEFAULT |
src/index.hpp
Outdated
| return end(); | ||
| } | ||
|
|
||
| //Returns the first entry that matches the first strobe subhash (if using multi-context seeds) |
There was a problem hiding this comment.
Is using multi-context seeds optional with this PR? If not, then please remove the text in parentheses
There was a problem hiding this comment.
It is not optional, but I wanted to leave the possibility of changing between hashes open at some point. But it does not seem to have any benefits. Fixed the comment.
src/indexparameters.cpp
Outdated
| Profile{ 50, 70, 18, -4, -2, 1, 24}, | ||
| Profile{ 75, 90, 20, -4, -3, 2, 24}, | ||
| Profile{100, 110, 20, -4, -2, 2, 24}, | ||
| Profile{125, 135, 20, -4, -1, 4, 24}, | ||
| Profile{150, 175, 20, -4, 1, 7, 24}, | ||
| Profile{250, 375, 22, -4, 2, 12, 24}, | ||
| Profile{400, max, 23, -6, 2, 12, 24}, |
There was a problem hiding this comment.
Since it’s always 24, it doesn’t need to be part of the profile.
src/nam.cpp
Outdated
| if (diff <= min_diff) { | ||
| hits_per_ref[index.reference_index(position)].push_back(Hit{query_start, query_end, ref_start, ref_end}); | ||
| min_diff = diff; | ||
| if (not is_partial) { |
There was a problem hiding this comment.
Did you notice you were using a digraph?
| if (not is_partial) { | |
| if (!is_partial) { |
There was a problem hiding this comment.
Please split this up into two functions: One for the partial and one for the non-partial case. (A function that has completely different behavior based on a boolean parameter should be split up into two functions.)
There was a problem hiding this comment.
Good point! Splitted the function.
src/randstrobes.hpp
Outdated
|
|
||
| using syncmer_hash_t = uint64_t; | ||
| using randstrobe_hash_t = uint64_t; | ||
| using digest_hash_t = uint64_t; |
There was a problem hiding this comment.
I don’t think this typedef is necessary. digest_hash_t is only used once as the return type of the digest_hash function, but everywhere else, it is still called randstrobe_hash_t. If there was a reason to use a name other than randstrobe_hash_t, we would rename randstrobe_hash_t.
src/index.hpp
Outdated
| } | ||
|
|
||
| bool is_partial_filtered(bucket_index_t position) const { | ||
| uint shift = parameters.randstrobe.digest; |
There was a problem hiding this comment.
Prefer explicitly sized types, that is, don’t use uint, use whichever type the digest parameter has.
There was a problem hiding this comment.
Replaced all uint occurences
src/randstrobes.cpp
Outdated
| return hash1 + hash2; | ||
| } | ||
|
|
||
| static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) { |
There was a problem hiding this comment.
Instead of adding a digest_hash function, the randstrobe_hash function should be modified. (Currently, the randstrobe_hash function is unused.)
src/randstrobes.cpp
Outdated
| RandstrobeIterator randstrobe_fwd_iter{syncmers, parameters.randstrobe}; | ||
| while (randstrobe_fwd_iter.has_next()) { | ||
| auto randstrobe = randstrobe_fwd_iter.next(); | ||
| uint partial_start = randstrobe.is_first_main ? randstrobe.strobe1_pos : randstrobe.strobe2_pos; |
src/randstrobes.hpp
Outdated
| const unsigned w_max; | ||
| const uint64_t q; | ||
| const unsigned int max_dist; | ||
| uint digest; |
Sure, edited the PR description.
Yes, the only reason for selecting minimal hash is to keep the hash symmetric.
I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash". |
|
Is this PR ready for a larger benchmarking? |
Co-authored-by: Marcel Martin <mail@marcelm.net>
|
The randstrobe iterator for queries stops when there is still This is expected behaviour for our current seeds. For example, However, I think mcs can be boosted further by adding the remaining syncmers 'in the ends of reads' as seeds. This means 2*w_min more seeds for a read (fw and rc ends). I haven't thought carefully about the best way to change it in the code, but perhaps changing the |
|
I implemented and tested this briefly:
The results only very slightly improved/nearly unchanged on a 'no variation' drosophila genome in SE mapping mode, but they improve substantially for a high variation simulation (see below for read lengths 100 and 150). This also suggest that we may be overfitting our optimization to high quality simulations without much variation. Needs to be tested on larger genome(s) obv. |
… the -b parameter value
I am a bit worried about this. Doesn’t this fit the pattern that we often see worse variant detection rates for "optimized" mapping parameters? Shall I perhaps run the parameter optimization on data with higher variation? |
Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome. Below were the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning arouind variants properly. I would maybe even set |
Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.
Good, I’ll use that and would like to call it |
Yes, sounds good! |
|
This PR has been reworked and became #426, which now supersedes it. I’ve read through the comments here and think we have addressed all relevant ones, so I’m closing this PR. |
This replaces randstrobe hashes with multi-context hashes. Multi-context hashes are defined as follows
((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))Where
hash1andhash2are the syncmer hashes, withhash1being the smaller hash. The64 - digest_sizeprefix of the multi-context hash corresponding tohash1is denoted as the main hash, and thedigest_sizesuffix corresponding tohash2is denoted as auxiliary hash.The prefix of the size (
64 - digest_size) corresponding to the smaller syncmer is used for partial search. If a given full 64-bit query hash was not found in the index, the seeds with the same main hash are looked up instead and added as the matches. These partial matches are used for the NAM construction in the same way as the full matches.In order to calculate the reference range of the partial match correctly, we need to know which strobe was used as main. This information is stored in the
m_packedfield ofRefRandstrobe, replacing one of the bits previously reserved for the reference index.List of changes:
find_namsfunction if full matches were not found. Partial matches have a larger abundance threshold than full matches due to higher seed repetitiveness.add_to_hits_per_ref.partial_findfunction that looks for the seeds with the same main hash as the query was added to theStrobemerIndex.StrobemerIndex.m_packedfield of theRefRandstrobeis now set iff the first strobe was used as the main part of the multi-context seed.QueryRandstrobeclass.--digestparameter was added to regulate the auxiliary part of the multi-context hash with the default value of 24.