POA for default variant harmonization. #261

ACEnglish · 2025-02-16T03:12:58Z

ACEnglish
Feb 16, 2025
Maintainer

As of Truvari v5.2, we've changed the default phab --align method from MAFFT to POA.

We've done this for a few reasons. First, POA is a Python library, whereas MAFFT is a call to an external tool. While this isn't inherently a bad thing, removing the requirement to install a second tool makes truvari phab|refine easier to adopt. Second, POA is faster than MAFFT. Third, in preparation for wider adoption of the latest, more comprehensive GIAB SV benchmarks, v5.1 of Truvari changed the refine default parameters to be friendlier to whole-genome SV benchmarking, whereas refine was originally designed for tandem repeat benchmarking. We've noticed that POA may perform better at harmonizing the types of SVs observed in whole-genome benchmarking. Which --align method is best overall is still up for debate (setting aside WFA, which isn't an MSA method), but below is some anecdotal evidence of the patterns we've seen from POA vs. MAFFT SV harmonization.

We ran Truvari bench+refine on a set of discovered SVs and an assembly-derived benchmark, once with --align mafft and once with --align poa. The resulting refine.variant_summary.json performance metrics were:

Metric	MAFFT	POA
TP-base	23,336	24,695
TP-comp	20,998	21,383
FP	3,167	2,782
FN	7,038	5,679
precision	0.8689	0.8848
recall	0.7682	0.8130
f1	0.8155	0.8474
base cnt	30,374	30,374
comp cnt	24,165	24,165
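
As a sanity check, the precision, recall, and f1 rows can be recomputed from the count rows. The sketch below assumes the usual definitions (precision = TP-comp / (TP-comp + FP), recall = TP-base / (TP-base + FN), f1 as their harmonic mean); the helper name `summarize` is hypothetical and not part of Truvari.

```python
def summarize(tp_base, tp_comp, fp, fn):
    """Recompute precision/recall/f1 from benchmark counts (hypothetical helper)."""
    precision = tp_comp / (tp_comp + fp)
    recall = tp_base / (tp_base + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts copied from the table above
counts = {
    "MAFFT": dict(tp_base=23336, tp_comp=20998, fp=3167, fn=7038),
    "POA": dict(tp_base=24695, tp_comp=21383, fp=2782, fn=5679),
}

results = {name: summarize(**c) for name, c in counts.items()}
for name, (p, r, f) in results.items():
    print(f"{name}\tprecision={p:.4f}\trecall={r:.4f}\tf1={f:.4f}")
```

The recomputed values agree with the reported rows to within one unit in the fourth decimal place, so small last-digit differences are a rounding matter rather than a discrepancy in the counts.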

Because we don't know the 'true' precision/recall of these variants, it's difficult to say which alignment method is more accurate. For example, these variants may truly have 0.76 recall, and POA might harmonize them in such a way that Truvari finds more matches, artificially inflating the recall. This unknown is one of the most difficult parts of creating a benchmarking tool: how does one benchmark the benchmarker? Still, POA finds more matches and yields higher recall, which developers of SV tools are generally happy to see.

This alone would be a pretty decent reason to move to POA. However, the following analysis was more convincing. The table below was constructed by finding regions in this benchmarking example where the input base/comp VCFs contained exclusively INS or exclusively DEL variants (i.e., never both INS and DEL). One would expect that after phab harmonization, the output variants would all be of the same SV type as the inputs. In practice, this doesn't always happen:

svtype	out	MAFFT	POA
!INS	0	400	419
	1	9	4
	2	6	1
	3	4	0
	4	1	0
	5	3	0
	13	1	0
!DEL	0	1005	1080
	1	50	10
	2	22	0
	3	7	0
	4	1	0
	5	2	0
	6	2	0
	18	1	0

To read this table, consider the second row of the !INS block (out = 1, MAFFT = 9, POA = 4): among regions with no insertion in the input VCFs but with one insertion in the output, MAFFT produced nine such regions where POA produced only four. Theoretically, the alignments from either --align method are correct; they're just choosing a different point in the alignment parameter space and creating different variants. If we were to reconstruct the haplotypes from either aligner's variants, they would be equal. However, because we prefer the harmonized variants to reflect the input variants, a region without insertions being harmonized into one with insertions is not optimal. If we were to dig into some of these regions, we'd find that the input deletions were shortened/split in a manner that required insertions to be added to the sets of variants.
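
To make the "same haplotype, different variants" point concrete, here is a toy sketch. The reference sequence, the two variant sets, and the helper `apply_variants` are all invented for illustration; they are not output from MAFFT, POA, or Truvari. It shows a single deletion and a resized deletion plus an insertion reconstructing the identical haplotype.

```python
def apply_variants(reference, variants):
    """Apply (pos, ref_allele, alt_allele) edits to a reference string.

    Positions are 0-based and edits must not overlap. Applying them from
    right to left keeps the coordinates of the remaining edits valid.
    """
    haplotype = reference
    for pos, ref_allele, alt_allele in sorted(variants, reverse=True):
        assert haplotype[pos:pos + len(ref_allele)] == ref_allele, "REF mismatch"
        haplotype = haplotype[:pos] + alt_allele + haplotype[pos + len(ref_allele):]
    return haplotype


reference = "AAACCCGGGTTT"

# Representation A: a single 6 bp deletion
one_deletion = [(3, "CCCGGG", "")]

# Representation B: a 7 bp deletion plus a 1 bp insertion
deletion_plus_insertion = [(3, "CCCGGGT", ""), (10, "", "T")]

hap_a = apply_variants(reference, one_deletion)
hap_b = apply_variants(reference, deletion_plus_insertion)
print(hap_a, hap_b, hap_a == hap_b)
```

Both representations are valid descriptions of the same sequence, but only the first contains deletions alone, which is the property we'd like harmonization to preserve for a deletion-only region.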

Another way to check what these aligners do to the variants is to count how many regions that went into phab came out as TN according to refine.regions.txt. Truvari refine only runs phab on regions that could benefit from variant harmonization (≥1 FP/FN). For such a region to end up annotated as TN, the phab harmonization must have resized/split the input ≥50bp SV to below 50bp, causing Truvari to no longer count the variant. This is not desirable. Again, we find fewer instances of POA creating these undesirable regions.

# True = regions put through phab harmonization
$ grep -w True mafft_result/refine.regions.txt | grep -c TN
317
$ grep -w True poa_result/refine.regions.txt | grep -c TN
16
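
For anyone who would rather do this count from Python, a rough equivalent of the grep pipeline is sketched below. It deliberately assumes nothing about the column layout of refine.regions.txt and only approximates the text matching that grep does; the function name and the example rows are hypothetical.

```python
import re


def count_refined_tn(lines):
    """Count lines containing the whole word 'True' and the substring 'TN'.

    Approximates `grep -w True FILE | grep -c TN`; no column names are assumed.
    """
    true_word = re.compile(r"\bTrue\b")
    return sum(1 for line in lines if true_word.search(line) and "TN" in line)


# Made-up rows for illustration only; real files have more columns
example = [
    "chr1\t100\t900\tTrue\tTN",
    "chr1\t2000\t2900\tTrue\tTP",
    "chr2\t50\t700\tFalse\tTN",
]
print(count_refined_tn(example))
```

Because an open file handle iterates over its lines, it can be passed in directly, e.g. `count_refined_tn(open("poa_result/refine.regions.txt"))`.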

Therefore, we have changed the default phab|refine --align method to poa. However, we still keep MAFFT as an option because there are almost certainly situations where it will perform better, and we'd see the above numbers flip in its favor.
