Revise ancestor generation algorithm to improve performance #1012
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (main vs. #1012):

|          | main   | #1012  | +/-    |
|----------|--------|--------|--------|
| Coverage | 93.39% | 93.41% | +0.01% |
| Files    | 18     | 18     |        |
| Lines    | 6483   | 6496   | +13    |
| Branches | 1103   | 1107   | +4     |
| Hits     | 6055   | 6068   | +13    |
| Misses   | 291    | 291    |        |
| Partials | 137    | 137    |        |

Flags with carried forward coverage won't be shown.
Force-pushed from 9d1535c to 4ec662c
Not sure why you're touching _tsinfermodule.c here @duncanMR, but that needs to be dropped from the diff. Your editor probably got overzealous there.
Apologies, my clang-format was configured incorrectly and changed the formatting of the whole file. I did have to modify …
Force-pushed from 88bccd1 to ef07e87
Generally looks great, but it's not obvious that we need the API changes to pass through the derived count (definitely a premature optimisation if it can be computed easily).
Force-pushed from ef07e87 to 7c48c88
@benjeffery I've simplified the changes based on Jerome's suggestions. I think the plan is to merge this in after 0.4.1 has been released, if that's okay with you?
Yep, makes sense to me! 0.4.1 should be tomorrow.
LGTM!
A trivial point, but perhaps you want to change the plot in the large_scale.md doc file to show the new matching profile, if it is much different. Here's what we currently have (I guess @benjeffery knows how it was generated): [figure: current matching profile plot]
That plot was from a GeL chromosome, but any similar one that illustrates the point will do.
Good to merge @benjeffery?
Merging! Thanks @duncanMR, this is awesome!
This pull request has been removed from the queue for the following reason: the pull request can't be updated. You should update or rebase your pull request manually. If you want to requeue this pull request, you can post a …
Force-pushed from 7c48c88 to 5354616
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 5354616 to f9a27ca
Force-pushed from f9a27ca to 40196b3
A major limitation of tsinfer 0.4 has been that ancestors with high-frequency focal sites are excessively long. This reduces parallelism and wastes computation time in ancestor matching. We've explored a number of solutions (e.g. #911), but we've found a simple one that seems to work really well.
It's easiest to explain with an example:

Constructing an ancestor with focal site 5 and moving to the left, we start with a sample set of C-F. Since the focal AC is 4, we can only use sites with an AC of 5 or higher as inference sites. In the current implementation, we would ignore sites 1-4 entirely, but we are then missing an informative signal. Since B also carries the derived allele at site 3, and is the only other non-carrier at site 2, it seems that C has recombined into a clade with B and should be excluded from the sample set.
In the new approach, we still only insert a mutation into the ancestral haplotype if it is older than the focal site, but we use sites of all frequencies for determining when to exclude samples from the sample set. However, we only count conflicts at sites where there are carriers outside of the sample set (i.e. the derived AC in the full population > AC in the sample set).
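To make the rule concrete, here is a minimal Python sketch of the exclusion logic as described above. This is illustrative only, not the actual tsinfer implementation: the `conflicting_samples` helper, the `genotypes`/`derived_count` structures, and the toy data are all made up for the example.

```python
# Minimal sketch of the new exclusion rule (not tsinfer's actual code).
# genotypes[site] maps sample id -> 0/1 derived state in the full population;
# derived_count[site] is the derived allele count in the full population.

def conflicting_samples(site, sample_set, genotypes, derived_count):
    """Samples in `sample_set` that disagree with the consensus at `site`.

    Conflicts are only counted when derived carriers exist outside the
    sample set, i.e. the full-population derived AC exceeds the derived AC
    within the sample set.
    """
    carriers_in_set = {s for s in sample_set if genotypes[site][s] == 1}
    if derived_count[site] <= len(carriers_in_set):
        return set()  # no carriers outside the sample set: ignore this site
    if len(carriers_in_set) * 2 > len(sample_set):
        return set(sample_set) - carriers_in_set  # majority carry derived
    return carriers_in_set  # majority carry ancestral

# Toy data loosely following the worked example above (hypothetical values).
sample_set = {"C", "D", "E", "F"}
genotypes = {
    2: {"A": 1, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1},
    3: {"A": 0, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0},
}
derived_count = {s: sum(g.values()) for s, g in genotypes.items()}
for site in (3, 2):
    print(site, conflicting_samples(site, sample_set, genotypes, derived_count))
    # C is flagged at both sites, so it is a candidate for exclusion.
```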
Initial validation of the new approach looks promising. I ran a small out-of-Africa sim of 1 Mbp with 200 samples. Using the old validation code, I can match the inferred ancestors to simulated ones and compute `max(true_left - inferred_left, inferred_right - true_right)`, which I call the max overshoot. While the old algorithm increasingly overestimates the length of the ancestors as the focal frequency increases, the new algorithm does not. We don't seem to make many ancestors shorter than their true length either.

I've done some testing on larger simulations with similar results, but there is still a lot more validation to do, e.g. with mispolarised alleles and sequencing errors. I did run @hyanwong's code for analysing sample-to-root edges from #903, and in simulations we don't seem to see any change. However, there are a lot more sample-to-root edges in real data with the new method, which is fixed by using a small mismatch ratio for sample matching.
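For reference, the overshoot metric is just the formula above applied to the matched intervals; this tiny helper (illustrative, not the validation code itself) shows the calculation:

```python
def max_overshoot(true_left, true_right, inferred_left, inferred_right):
    # How far the inferred ancestor extends beyond the true interval on its
    # worse side; negative values mean it lies strictly inside the truth.
    return max(true_left - inferred_left, inferred_right - true_right)

# E.g. an ancestor inferred as [1000, 5200) when the truth is [1200, 5000)
# overshoots by 200 bp on each side:
print(max_overshoot(1200, 5000, 1000, 5200))  # 200
```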
The new approach does require more work from the CPU for ancestor generation, but it seems negligible in comparison to the gains in parallelism and speed of ancestor matching. Comparing the methods on chr20q of the 1000 Genomes Project, we see that fewer ancestor groups are needed and there is a 5X decrease in CPU time needed for matching. The decrease in wall time is more modest: it seems that we aren't using the 126 threads of this EPYC system as effectively in the new approach.
I've implemented the new algorithm in C and Python: the biggest change required is that the derived allele count of a site needs to be stored in the ancestor builder.
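As a rough illustration of that change (a toy sketch, not tsinfer's actual ancestor builder API), the builder just needs to record the full-population derived count for each site when it is added, so the conflict rule can consult it later:

```python
import numpy as np

class ToyAncestorBuilder:
    """Toy stand-in that stores per-site genotypes plus the derived count."""

    def __init__(self):
        self.sites = []  # list of (time, genotypes, derived_count)

    def add_site(self, time, genotypes):
        genotypes = np.asarray(genotypes, dtype=np.int8)
        derived_count = int(np.sum(genotypes == 1))
        self.sites.append((time, genotypes, derived_count))

builder = ToyAncestorBuilder()
builder.add_site(time=0.5, genotypes=[0, 1, 1, 0, 1, 1])
print(builder.sites[0][2])  # 4
```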