Skip to content

Conversation

@tibvdm
Copy link
Contributor

@tibvdm tibvdm commented Aug 27, 2024

This PR introduces the "K-mer bounds table". This table is a small lookup that allows us to very easily match a short peptide sequence and provides us already with a better initial value for the minimum and maximum bounds when looking up a peptide in the suffix array.

See this detailed explanation for how this works exactly:

When searching for a peptide within the suffix array, we employ a binary search strategy across all the suffixes of the proteins in UniProtKB. In the 2024_04 release of UniProtKB, which contains more than 87 billion amino acids, we need to perform log2(87 * 10^9) ≈ 36,35 = 37 suffix comparisons, on average, per bound. Since we are identifying a range of suffixes that match with our peptide within the suffix array, two binary searches are required to determine the minimum and maximum bounds, resulting in a total of 74 comparisons per peptide. These comparisons are computationally costly and should be minimized. To reduce the number of searches, we precompute and store the initial bounds for all possible peptides of length k and shorter. This allows for a constant-time lookup, followed by a much smaller binary search step for each query.

For a k-mer table with factor k, there exist a total of 20k different k-mers. This can be seen as dividing the space of 87 billion amino acids into 20k compartments, where, on average, each compartment has size 2^37 / 20^k. By taking the logarithm log2(2^37 / 20^k) = 37 - k * log2(20), we can calculate how many text comparisons we still need to perform after consulting the k-mer bounds table. For a k of size 5, we only need 15 comparisons per bound instead of 37, resulting in significantly less work per search.

@pverscha pverscha changed the title Feature/sa bounds optimization Optimise initial bounds for maching peptides in Suffix Array Sep 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants