Add multi-context seeds by Itolstoganov · Pull Request #388 · ksahlin/strobealign

Itolstoganov · 2024-01-16T22:37:41Z

This replaces randstrobe hashes with multi-context hashes. Multi-context hashes are defined as follows:

((hash1 >> digest_size) << digest_size) ^ (hash2 >> (64 - digest_size))

Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash. The 64 - digest_size prefix of the multi-context hash corresponding to hash1 is denoted as the main hash, and the digest_size suffix corresponding to hash2 is denoted as auxiliary hash.
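A minimal sketch of the definition above, for illustration only. The function names multi_context_hash and main_part are hypothetical and not taken from the PR; the assert is there because shifting a 64-bit value by 64 bits is undefined, so the sketch assumes 0 < digest_size < 64.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using syncmer_hash_t = std::uint64_t;
using randstrobe_hash_t = std::uint64_t;

// Combine two syncmer hashes (hypothetical helper name).
// Upper (64 - digest_size) bits: taken from the smaller hash (main hash).
// Lower digest_size bits: top bits of the larger hash (auxiliary hash).
inline randstrobe_hash_t multi_context_hash(syncmer_hash_t hash1, syncmer_hash_t hash2,
                                            unsigned digest_size) {
    assert(digest_size > 0 && digest_size < 64);  // a shift by 64 would be undefined
    const syncmer_hash_t main_hash = std::min(hash1, hash2);
    const syncmer_hash_t aux_hash = std::max(hash1, hash2);
    return ((main_hash >> digest_size) << digest_size) ^ (aux_hash >> (64 - digest_size));
}

// Zero out the auxiliary bits, keeping only the main part (hypothetical helper name).
inline randstrobe_hash_t main_part(randstrobe_hash_t hash, unsigned digest_size) {
    assert(digest_size > 0 && digest_size < 64);
    return (hash >> digest_size) << digest_size;
}
```

Because the smaller hash is always chosen as the main hash, swapping the two arguments gives the same result, which is what keeps the hash symmetric.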

The prefix of size (64 - digest_size), which corresponds to the smaller syncmer hash, is used for partial search. If the full 64-bit query hash is not found in the index, the seeds with the same main hash are looked up instead and added as matches. These partial matches are used for NAM construction in the same way as full matches.
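The fallback lookup can be sketched as follows. This is a simplification that assumes the index is nothing more than a sorted vector of 64-bit hashes; the real StrobemerIndex layout may differ, and all names here (HitRange, find_full, find_partial, find_with_fallback, example_hashes) are hypothetical. Since the main hash occupies the upper bits, all entries sharing a main hash are contiguous in sorted order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using randstrobe_hash_t = std::uint64_t;
using HashVector = std::vector<randstrobe_hash_t>;
// Half-open index range [first, second) into the sorted hash vector.
using HitRange = std::pair<std::size_t, std::size_t>;

inline HitRange to_range(const HashVector& v, HashVector::const_iterator first,
                         HashVector::const_iterator last) {
    return HitRange(static_cast<std::size_t>(first - v.begin()),
                    static_cast<std::size_t>(last - v.begin()));
}

// Entries whose full 64-bit hash equals the query.
inline HitRange find_full(const HashVector& sorted_hashes, randstrobe_hash_t query) {
    auto range = std::equal_range(sorted_hashes.begin(), sorted_hashes.end(), query);
    return to_range(sorted_hashes, range.first, range.second);
}

// Entries that share only the main part (upper 64 - digest_size bits) with the query.
inline HitRange find_partial(const HashVector& sorted_hashes, randstrobe_hash_t query,
                             unsigned digest_size) {
    assert(digest_size > 0 && digest_size < 64);
    const randstrobe_hash_t low = (query >> digest_size) << digest_size;  // aux bits all 0
    const randstrobe_hash_t high = low | ((randstrobe_hash_t{1} << digest_size) - 1);  // aux bits all 1
    auto first = std::lower_bound(sorted_hashes.begin(), sorted_hashes.end(), low);
    auto last = std::upper_bound(first, sorted_hashes.end(), high);
    return to_range(sorted_hashes, first, last);
}

// Full matches if there are any, otherwise partial matches.
inline HitRange find_with_fallback(const HashVector& sorted_hashes, randstrobe_hash_t query,
                                   unsigned digest_size) {
    const HitRange full = find_full(sorted_hashes, query);
    if (full.first != full.second) {
        return full;
    }
    return find_partial(sorted_hashes, query, digest_size);
}

// Tiny sorted example: two entries share the main part 0xAA000000 (digest_size = 24).
inline HashVector example_hashes() {
    return HashVector{0xAA000001ULL, 0xAA000005ULL, 0xBB000003ULL};
}
```

With digest_size 24, a query of 0xAA000009 has no full match in example_hashes() but shares its main part with the first two entries, so the fallback returns both.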

In order to calculate the reference range of the partial match correctly, we need to know which strobe was used as main. This information is stored in the m_packed field of RefRandstrobe, replacing one of the bits previously reserved for the reference index.

List of changes:

The randstrobe hash was replaced with the multi-context hash.
The find_nams function now searches for partial matches when no full matches are found. Partial matches have a larger abundance threshold than full matches due to higher seed repetitiveness.
Partial hit processing was added to add_to_hits_per_ref.
The partial_find function that looks for the seeds with the same main hash as the query was added to the StrobemerIndex.
Some utility functions (for working with the main part of the hash) were added to the StrobemerIndex.
The 9th bit from the right of the m_packed field of RefRandstrobe is now set iff the first strobe was used as the main part of the multi-context seed.
Fields that store the start and end positions of the main strobe on the query, and an indicator of which strobe is the main one, were added to the QueryRandstrobe class.
The --digest parameter was added to regulate the auxiliary part of the multi-context hash with the default value of 24.
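To make the m_packed change concrete, here is a sketch of one possible bit layout. It assumes, purely for illustration, that the lowest 8 bits hold the offset from the first to the second strobe and that the reference index sits above the flag; the constants and function names are hypothetical and not taken from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout of a 32-bit packed field (illustrative only):
//   bits 0-7   offset from strobe 1 to strobe 2
//   bit  8     set iff the first strobe is the main one (the 9th bit from the right)
//   bits 9-31  reference index
const std::uint32_t OFFSET_BITS = 8;
const std::uint32_t OFFSET_MASK = (1u << OFFSET_BITS) - 1;  // 0xFF
const std::uint32_t MAIN_FLAG = 1u << OFFSET_BITS;          // 0x100
const std::uint32_t REF_SHIFT = OFFSET_BITS + 1;

inline std::uint32_t pack(std::uint32_t ref_index, std::uint32_t offset, bool first_is_main) {
    assert(offset <= OFFSET_MASK);
    assert(ref_index < (1u << (32 - REF_SHIFT)));  // one bit fewer than without the flag
    return (ref_index << REF_SHIFT) | (first_is_main ? MAIN_FLAG : 0u) | offset;
}

inline std::uint32_t get_reference_index(std::uint32_t packed) { return packed >> REF_SHIFT; }
inline std::uint32_t get_strobe2_offset(std::uint32_t packed) { return packed & OFFSET_MASK; }
inline bool get_first_is_main(std::uint32_t packed) { return (packed & MAIN_FLAG) != 0; }

// Reference position of the main strobe, as needed for a partial match.
inline std::uint32_t main_strobe_position(std::uint32_t strobe1_position, std::uint32_t packed) {
    return get_first_is_main(packed) ? strobe1_position
                                     : strobe1_position + get_strobe2_offset(packed);
}
```

Under this assumed layout, taking one bit from the reference index halves the number of reference sequences that can be addressed.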

ksahlin · 2024-01-19T11:07:54Z

src/randstrobes.cpp

+static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) {
+    syncmer_hash_t main_hash = std::min(hash1, hash2);
+    syncmer_hash_t aux_hash = std::max(hash1, hash2);
+    return ((main_hash >> digest_size) << digest_size) ^ (aux_hash >> (64 - digest_size));


We need to pass along which strobe (1 or 2) was saved as the base value, probably here.

Added this information to the last bit of the hash

ksahlin · 2024-01-19T11:09:10Z

src/nam.cpp

 ) {
    int min_diff = std::numeric_limits<int>::max();
    for (const auto hash = index.get_hash(position); index.get_hash(position) == hash; ++position) {
        int ref_start = index.get_strobe1_position(position);


When using multi-context seeds, we cannot be sure whether the relevant position here is index.get_strobe1_position(position) or the position of the second strobe. This will yield inconsistent NAMs.


marcelm

This replaces randstrobes with multi-context seeds.

For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash.

Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

The 64 - digest_size prefix of the multi-context hash corresponding to hash1 is denoted as the main hash, and the digest_size suffix corresponding to hash2 is denoted as auxiliary hash.

This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

main hash
auxiliary hash
digest, digest size (inconsistency: command-line option is --digest, but what is meant is actually "digest length"). What is a digest anyway?
subhash (used in a comment somewhere)
multi-context hash
digest hash
partial hit

I guess that some of this will resolve itself if we switch to asymmetric randstrobe hashes, so it is not super important to fix this at the moment.

marcelm · 2024-03-12T08:50:44Z

src/index.cpp

    }
    stats.tot_strobemer_count = total_randstrobes;

+    logger.info() << "Digest parameter is : " << parameters.randstrobe.digest;


Suggested change

-    logger.info() << "Digest parameter is : " << parameters.randstrobe.digest;
+    logger.info() << "Digest size: " << parameters.randstrobe.digest;

Replaced "digest" with "aux_len" here and in other places.

marcelm · 2024-03-12T08:51:04Z

src/dumpstrobes.cpp

        c_set ? c : IndexParameters::DEFAULT,
-        max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT
+        max_seed_len_set ? max_seed_len : IndexParameters::DEFAULT,
+        digest ? digest: IndexParameters::DEFAULT


Suggested change

-        digest ? digest: IndexParameters::DEFAULT
+        digest ? digest : IndexParameters::DEFAULT

marcelm · 2024-03-12T08:54:07Z

src/index.hpp

        return end();
    }

+    //Returns the first entry that matches the first strobe subhash (if using multi-context seeds)


Is using multi-context seeds optional with this PR? If not, then please remove the text in parentheses

It is not optional, but I wanted to leave the possibility of changing between hashes open at some point. But it does not seem to have any benefits. Fixed the comment.

marcelm · 2024-03-12T09:03:01Z

src/indexparameters.cpp

+        Profile{ 50,  70, 18, -4, -2,  1, 24},
+        Profile{ 75,  90, 20, -4, -3,  2, 24},
+        Profile{100, 110, 20, -4, -2,  2, 24},
+        Profile{125, 135, 20, -4, -1,  4, 24},
+        Profile{150, 175, 20, -4,  1,  7, 24},
+        Profile{250, 375, 22, -4,  2, 12, 24},
+        Profile{400, max, 23, -6,  2, 12, 24},


Since it’s always 24, it doesn’t need to be part of the profile.

marcelm · 2024-03-12T09:09:33Z

src/nam.cpp

-        if (diff <= min_diff) {
-            hits_per_ref[index.reference_index(position)].push_back(Hit{query_start, query_end, ref_start, ref_end});
-            min_diff = diff;
+    if (not is_partial) {


Did you notice you were using a digraph?

Suggested change

-    if (not is_partial) {
+    if (!is_partial) {

Please split this up into two functions: One for the partial and one for the non-partial case. (A function that has completely different behavior based on a boolean parameter should be split up into two functions.)

Good point! Split the function.
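A sketch of what such a split might look like, replacing one function with an is_partial flag by two dedicated ones. The types and names (Hit fields as in the diff above, SeedPositions, make_full_hit, make_partial_hit) and the way the end positions are computed from the strobe length k are assumptions for illustration, not the code from this PR.

```cpp
#include <cassert>

struct Hit {
    int query_start;
    int query_end;
    int ref_start;
    int ref_end;
};

// Start positions of both strobes, plus which one is the main strobe (hypothetical type).
struct SeedPositions {
    int strobe1_start;
    int strobe2_start;
    bool first_strobe_is_main;
};

inline int main_start(const SeedPositions& seed) {
    return seed.first_strobe_is_main ? seed.strobe1_start : seed.strobe2_start;
}

// Full match: the hit spans both strobes on the query and on the reference.
inline Hit make_full_hit(const SeedPositions& query, const SeedPositions& ref, int k) {
    assert(k > 0);
    return Hit{query.strobe1_start, query.strobe2_start + k,
               ref.strobe1_start, ref.strobe2_start + k};
}

// Partial match: only the main strobe is known to match, so the hit covers just that strobe.
inline Hit make_partial_hit(const SeedPositions& query, const SeedPositions& ref, int k) {
    assert(k > 0);
    const int query_start = main_start(query);
    const int ref_start = main_start(ref);
    return Hit{query_start, query_start + k, ref_start, ref_start + k};
}
```

Each caller then picks the function that matches the kind of lookup it performed, and neither function needs a branch on a boolean.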

marcelm · 2024-03-12T09:59:18Z

src/randstrobes.hpp


 using syncmer_hash_t = uint64_t;
 using randstrobe_hash_t = uint64_t;
+using digest_hash_t = uint64_t;


I don’t think this typedef is necessary. digest_hash_t is only used once as the return type of the digest_hash function, but everywhere else, it is still called randstrobe_hash_t. If there was a reason to use a name other than randstrobe_hash_t, we would rename randstrobe_hash_t.

marcelm · 2024-03-12T10:29:42Z

src/index.hpp

    }

+    bool is_partial_filtered(bucket_index_t position) const {
+        uint shift = parameters.randstrobe.digest;


Prefer explicitly sized types, that is, don’t use uint, use whichever type the digest parameter has.

Replaced all uint occurrences.

marcelm · 2024-03-12T10:33:41Z

src/randstrobes.cpp

    return hash1 + hash2;
 }

+static inline digest_hash_t digest_hash(syncmer_hash_t hash1, syncmer_hash_t hash2, size_t digest_size) {


Instead of adding a digest_hash function, the randstrobe_hash function should be modified. (Currently, the randstrobe_hash function is unused.)

marcelm · 2024-03-12T10:34:14Z

src/randstrobes.cpp

    RandstrobeIterator randstrobe_fwd_iter{syncmers, parameters.randstrobe};
    while (randstrobe_fwd_iter.has_next()) {
        auto randstrobe = randstrobe_fwd_iter.next();
+        uint partial_start = randstrobe.is_first_main ? randstrobe.strobe1_pos : randstrobe.strobe2_pos;


marcelm · 2024-03-12T10:34:34Z

src/randstrobes.hpp

    const unsigned w_max;
    const uint64_t q;
    const unsigned int max_dist;
+    uint digest;


Itolstoganov · 2024-03-14T13:40:09Z

For practical reasons, I’d like to argue that we still have randstrobes, we just compute their hash differently. If we didn’t have randstrobes anymore, we’d have to rename a lot of functions, files and variables. So I see it as "replaces randstrobe hashes with multi-context hashes" or something like that.

Sure, edited the PR description.

Where hash1 and hash2 are the syncmer hashes, with hash1 being the smaller hash.
Is this so that randstrobe hashes are symmetric as before? That is, would that be different when we switch to asymmetric hashes?

Yes, the only reason for selecting minimal hash is to keep the hash symmetric.

This PR introduces quite a bit of terminology that is sometimes overlapping and sometimes used inconsistently, and I wonder whether that could be simplified a bit.

I tried to alleviate that by removing all mentions of "digest" (by which I meant the same thing as the "auxiliary" part of the multi-context hash) and "subhash".


ksahlin · 2024-03-15T13:38:11Z

Is this PR ready for a larger benchmarking?


ksahlin · 2024-04-17T07:11:49Z

The randstrobe iterator for queries stops when there are still w_min syncmers left in the read by checking

    bool has_next() {
        return strobe1_index + w_min < syncmers.size();

This is expected behaviour for our current seeds. For example, w_min=1 for 50 and 75, and w_min=4 for 100.

However, I think mcs can be boosted further by adding the remaining syncmers 'in the ends of reads' as seeds. This means 2*w_min more seeds for a read (fw and rc ends).

I haven't thought carefully about the best way to change it in the code, but perhaps changing the bool has_next() to return strobe1_index < syncmers.size(); together with adding a case to return randstrobes with base/main hash as strobe1.hash and auxiliary hash 0 in Randstrobe RandstrobeIterator when i < w_start?
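As a rough illustration of the "auxiliary hash 0" idea (not the implementation tested later in this thread), a lone syncmer at the end of a read could be hashed by keeping only its main bits. The helper names below are hypothetical, and aux_len is assumed to satisfy 0 < aux_len < 64.

```cpp
#include <cassert>
#include <cstdint>

using syncmer_hash_t = std::uint64_t;
using randstrobe_hash_t = std::uint64_t;

// Hash for a syncmer without a partner strobe: main part from the syncmer,
// auxiliary part fixed to 0 (hypothetical helper).
inline randstrobe_hash_t single_strobe_hash(syncmer_hash_t hash, unsigned aux_len) {
    assert(aux_len > 0 && aux_len < 64);
    return (hash >> aux_len) << aux_len;
}

// True if two hashes agree on their main part, i.e. a partial lookup
// for one of them would find the other.
inline bool same_main_part(randstrobe_hash_t a, randstrobe_hash_t b, unsigned aux_len) {
    assert(aux_len > 0 && aux_len < 64);
    return (a >> aux_len) == (b >> aux_len);
}
```

Such a seed would normally only be found through the partial search, and only against indexed seeds in which the same syncmer supplied the main hash.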

ksahlin · 2024-04-17T07:45:57Z

I implemented and tested this briefly:

Changed to

    bool has_next() {
        return strobe1_index < syncmers.size();
    }

Added in RandstrobeIterator::get

    if (syncmers.size() < w_start) {
        return Randstrobe{
                randstrobe_hash(strobe1.hash, strobe2.hash, aux_len),
                static_cast<uint32_t>(strobe1.position),
                static_cast<uint32_t>(strobe2.position), true
        };
    }

The results improved only very slightly (nearly unchanged) on a 'no variation' drosophila genome in SE mapping mode, but they improve substantially for a high variation simulation (see below for read lengths 100 and 150). This also suggests that we may be overfitting our optimization to high quality simulations without much variation.

ref,read_length,tool,mode,%-aligned,accuracy,time(sec),Mem(MB)
droso_above10k_variation,100,strobealign-mcs_SE,align,48.815,44.2065,704,1.45,778.4
droso_above10k_variation,100,strobealign-mcs-more-seeds_SE,align,48.9885,44.2975,704,1.55,778.36

droso_above10k_variation,150,strobealign-mcs_SE,align,49.0685,44.9805,839,2.47,781.87
droso_above10k_variation,150,strobealign-mcs-more-seeds_SE,align,49.258,45.1925,839,2.8,787.6

Needs to be tested on larger genome(s) obv.


marcelm · 2024-04-30T15:12:04Z

This also suggests that we may be overfitting our optimization to high quality simulations without much variation.

I am a bit worried about this. Doesn’t this fit the pattern that we often see worse variant detection rates for "optimized" mapping parameters? Shall I perhaps run the parameter optimization on data with higher variation?

ksahlin · 2024-04-30T20:09:31Z

Shall I perhaps run the parameter optimization on data with higher variation?

Sounds like a good suggestion to me! From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

rule mason_simulate_variants:
    input:  ref = config["HG38"]
    output: sim_vcf =  config["ROOT_OUT"] + "/reads/{dataset}/variations.vcf",
    run:
        if wildcards.dataset == "SIM1":
            shell("mason_variator -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM2":
            shell("mason_variator --sv-indel-rate 0.000001 --snp-rate 0.001 --small-indel-rate 0.00001 --max-small-indel-size 20  -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM3":
            shell("mason_variator --sv-indel-rate 0.000005 --snp-rate 0.001 --small-indel-rate 0.0001 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")
        elif wildcards.dataset == "SIM4":
            shell("mason_variator --sv-indel-rate 0.00001 --snp-rate 0.005 --small-indel-rate 0.0005 --max-small-indel-size 50   -ir {input.ref} -ov {output.sim_vcf}")

marcelm · 2024-05-02T07:57:48Z

From what you told me, it should come with relatively little extra computation time since the same index can be used for read sets with different levels of variation w.r.t. the reference genome.

Sorry if you got the wrong impression, but since I don’t store the index on disk, everything has to be recomputed. I’ll run this on a cluster somewhere to get the results faster this time.

Below are the variant rates I used for SIM1-4. But I think SIM4 is the only setting that really tests aligning around variants properly. I would maybe even set --sv-indel-rate 0.00002 --snp-rate 0.005 --small-indel-rate 0.001 --max-small-indel-size 100 to make it really challenging.

Good, I’ll use that and would like to call it SIM5; is that ok?

ksahlin · 2024-05-04T18:25:08Z

Good, I’ll use that and would like to call it SIM5; is that ok?

Yes, sounds good!

marcelm · 2024-10-07T13:48:38Z

This PR has been reworked and became #426, which now supersedes it. I’ve read through the comments here and think we have addressed all relevant ones, so I’m closing this PR.

ksahlin reviewed Jan 19, 2024


Itolstoganov force-pushed the multi-context-seeds branch from bf8abf0 to 8c3dfe4 Compare January 22, 2024 17:25

Itolstoganov added 3 commits February 5, 2024 15:19

Add multi-context seeds

6663858

Adjust queries for partial matches

2438de6

Symmetrize digest hash function

c4a7f61

Itolstoganov force-pushed the multi-context-seeds branch from 0aea3fc to c4a7f61 Compare February 5, 2024 22:27

marcelm mentioned this pull request Feb 21, 2024

Optimize parameters #397

Closed

Itolstoganov added 3 commits March 5, 2024 16:55

Initialize partial filter cutoff during index population. Fix index e…

dfdd7a6

…xtension

Fix dumpstrobes

45911fa

Merge remote-tracking branch 'origin/main' into multi-context-seeds

77b5933

# Conflicts: # src/indexparameters.cpp

Itolstoganov force-pushed the multi-context-seeds branch from b89d123 to 77b5933 Compare March 11, 2024 13:50

marcelm reviewed Mar 12, 2024


Itolstoganov added 3 commits March 14, 2024 12:50

Rename digest

8dc4d5a

Refactor add_to_nams, fix comment

aa0c2de

Fix types, rename hash function

ae63376

Rename main strobe predicates, randstrobe cleanup

a903e49

Itolstoganov requested a review from marcelm March 14, 2024 14:05

marcelm reviewed Mar 14, 2024


Itolstoganov and others added 2 commits March 16, 2024 12:24

Fix parameter logging

f82bd54

Co-authored-by: Marcel Martin <mail@marcelm.net>

Possibly fixes redundant alignment sites for symmetric multi context …

5263222

…seeds.

Fixes that the mcs implementation gave different results depending on…

eee0cd8

… the -b parameter value

marcelm mentioned this pull request May 22, 2024

Multi-context seeds plus fixes and optimized parameters #426

Closed

8 tasks

marcelm closed this Oct 7, 2024
