Revise ancestor generation algorithm to improve performance #1012
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (main vs. #1012):

|          | main   | #1012  | +/-    |
|----------|--------|--------|--------|
| Coverage | 93.39% | 93.41% | +0.01% |
| Files    | 18     | 18     |        |
| Lines    | 6483   | 6496   | +13    |
| Branches | 1103   | 1107   | +4     |
| Hits     | 6055   | 6068   | +13    |
| Misses   | 291    | 291    |        |
| Partials | 137    | 137    |        |

Flags with carried forward coverage won't be shown.
Force-pushed from 9d1535c to 4ec662c
Not sure why you're touching _tsinfermodule.c here @duncanMR, but that needs to be dropped from the diff. Your editor probably got overzealous there.
Apologies, my clang-format was configured incorrectly and changed the formatting of the whole file. I did have to modify …
Force-pushed from 88bccd1 to ef07e87
Generally looks great, but it's not obvious that we need the API changes to pass through the derived count (definitely a premature optimisation if it can be computed easily).
Force-pushed from ef07e87 to 7c48c88
@benjeffery I've simplified the changes based on Jerome's suggestions. I think the plan is to merge this in after 0.4.1 has been released, if that's okay with you?
Yep, makes sense to me! 0.4.1 should be tomorrow.
LGTM!
A trivial point, but perhaps you want to change the plot in the large_scale.md doc file to show the new matching profile, if it is much different. Here's what we currently have (I guess @benjeffery knows how it was generated): [figure: current matching profile plot]
That plot was from a GeL chromosome, but any similar one that illustrates the point will do.
Good to merge @benjeffery?
Merging! Thanks @duncanMR, this is awesome!
This pull request has been removed from the queue for the following reason: the pull request can't be updated. You should update or rebase your pull request manually. If you want to requeue this pull request, you can post a …
Force-pushed from 7c48c88 to 5354616
@Mergifyio rebase
✅ Branch has been successfully rebased
Force-pushed from 5354616 to f9a27ca
Force-pushed from f9a27ca to 40196b3
A major limitation of tsinfer 0.4 has been that ancestors with high-frequency focal sites are excessively long. This reduces parallelism and wastes computation time in ancestor matching. We've explored a number of solutions (e.g. #911), but we've found a simple one that seems to work really well.
It's easiest to explain with an example:

Constructing an ancestor with focal site 5 and moving to the left, we start with a sample set of C-F. Since the focal AC is 4, we can only use sites with an AC of 5 or higher as inference sites. In the current implementation, we would ignore sites 1-4 entirely, but we are then missing an informative signal. Since B also carries the derived allele at site 3, and is the only other non-carrier at site 2, it seems that C has recombined into a clade with B and should be excluded from the sample set.
In the new approach, we still only insert a mutation into the ancestral haplotype if it is older than the focal site, but we use sites of all frequencies for determining when to exclude samples from the sample set. However, we only count conflicts at sites where there are carriers outside of the sample set (i.e. the derived AC in the full population > AC in the sample set).
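To make the rule concrete, here is a minimal Python sketch of the exclusion logic as described above. This is illustrative only, not the actual tsinfer implementation: the `conflicting_samples` helper, the `genotypes`/`derived_count` structures, and the toy data are all made up for the example.

```python
# Minimal sketch of the new exclusion rule (not tsinfer's actual code).
# genotypes[site] maps sample id -> 0/1 derived state in the full population;
# derived_count[site] is the derived allele count in the full population.

def conflicting_samples(site, sample_set, genotypes, derived_count):
    """Samples in `sample_set` that disagree with the consensus at `site`.

    Conflicts are only counted when derived carriers exist outside the
    sample set, i.e. the full-population derived AC exceeds the derived AC
    within the sample set.
    """
    carriers_in_set = {s for s in sample_set if genotypes[site][s] == 1}
    if derived_count[site] <= len(carriers_in_set):
        return set()  # no carriers outside the sample set: ignore this site
    if len(carriers_in_set) * 2 > len(sample_set):
        return set(sample_set) - carriers_in_set  # majority carry derived
    return carriers_in_set  # majority carry ancestral

# Toy data loosely following the worked example above (hypothetical values).
sample_set = {"C", "D", "E", "F"}
genotypes = {
    2: {"A": 1, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1},
    3: {"A": 0, "B": 1, "C": 1, "D": 0, "E": 0, "F": 0},
}
derived_count = {s: sum(g.values()) for s, g in genotypes.items()}
for site in (3, 2):
    print(site, conflicting_samples(site, sample_set, genotypes, derived_count))
    # C is flagged at both sites, so it is a candidate for exclusion.
```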
Initial validation of the new approach looks promising. I ran a small out-of-Africa sim of 1 Mbp with 200 samples. Using the old validation code, I can match the inferred ancestors to simulated ones and compute `max(true_left - inferred_left, inferred_right - true_right)`, which I call the max overshoot. While the old algorithm increasingly overestimates the length of the ancestors as the focal frequency increases, the new algorithm does not. We don't seem to make many ancestors shorter than their true length either.

I've done some testing on larger simulations with similar results, but there is still a lot more validation to do, e.g. with mispolarised alleles and sequencing errors. I did run @hyanwong's code for analysing sample-to-root edges from #903, and in simulations we don't seem to see any change. However, there are a lot more sample-to-root edges in real data with the new method, which is fixed by using a small mismatch ratio for sample matching.
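For reference, the overshoot metric is just the formula above applied to the matched intervals; this tiny helper (illustrative, not the validation code itself) shows the calculation:

```python
def max_overshoot(true_left, true_right, inferred_left, inferred_right):
    # How far the inferred ancestor extends beyond the true interval on its
    # worse side; negative values mean it lies strictly inside the truth.
    return max(true_left - inferred_left, inferred_right - true_right)

# E.g. an ancestor inferred as [1000, 5200) when the truth is [1200, 5000)
# overshoots by 200 bp on each side:
print(max_overshoot(1200, 5000, 1000, 5200))  # 200
```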
The new approach does require more work from the CPU for ancestor generation, but it seems negligible in comparison to the gains in parallelism and speed of ancestor matching. Comparing the methods on chr20q of the 1000 Genomes Project, we see that fewer ancestor groups are needed and there is a 5X decrease in CPU time needed for matching. The decrease in wall time is more modest: it seems that we aren't using the 126 threads of this EPYC system as effectively in the new approach.
I've implemented the new algorithm in C and Python: the biggest change required is that the derived allele count of a site needs to be stored in the ancestor builder.
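As a rough illustration of that change (a toy sketch, not tsinfer's actual ancestor builder API), the builder just needs to record the full-population derived count for each site when it is added, so the conflict rule can consult it later:

```python
import numpy as np

class ToyAncestorBuilder:
    """Toy stand-in that stores per-site genotypes plus the derived count."""

    def __init__(self):
        self.sites = []  # list of (time, genotypes, derived_count)

    def add_site(self, time, genotypes):
        genotypes = np.asarray(genotypes, dtype=np.int8)
        derived_count = int(np.sum(genotypes == 1))
        self.sites.append((time, genotypes, derived_count))

builder = ToyAncestorBuilder()
builder.add_site(time=0.5, genotypes=[0, 1, 1, 0, 1, 1])
print(builder.sites[0][2])  # 4
```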