Skip to content

Commit 9964d5b

Browse files
committed
Merge branch 'release/v5.3.0'
2 parents 7ff8091 + ae021e5 commit 9964d5b

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

65 files changed

+1462
-645
lines changed

README.md

+2-2
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22
[![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
33
[![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
44
[![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
5-
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v5.1.1)](https://github.com/ACEnglish/truvari/compare/v5.1.1...develop)
5+
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v5.2.0)](https://github.com/ACEnglish/truvari/compare/v5.2.0...develop)
66
[![Downloads](https://static.pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)
77

88
![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)
@@ -29,7 +29,7 @@ The current most common Truvari use case is for structural variation benchmarkin
2929
truvari bench -b base.vcf.gz -c comp.vcf.gz -f reference.fa -o output_dir/
3030
```
3131

32-
Find more matches by harmonizing phased varians using refine:
32+
Find more matches by harmonizing phased variants using refine:
3333
```
3434
truvari refine output_dir/
3535
```

docs/Updates.md

+25-1
Original file line numberDiff line numberDiff line change
@@ -1,6 +1,30 @@
1-
# Truvari 5.1.1
1+
# Truvari 5.3
22
*in progress*
33

4+
* Fixed FP BNDs being dropped [details](https://github.com/ACEnglish/truvari/discussions/263).
5+
* Restore default `--sizemax` - Some callers make SVs that span the entire chromosome, which disrupts truvari's chunking strategy
6+
* `phab`
7+
* Can now harmonize samples' variants across any number of VCFs. This entails a UI change of no more `-b/-c`.
8+
* For a large jobs, a `--lowmem` flag:
9+
* turns on progress bars and process pool monitoring for better tracking of failed harmonization jobs
10+
* Much lower memory usage, and fewer failures
11+
* Api refactor to programmatically build haplotypes with `phab.VCFtoHaplotypes` and other phab functions
12+
* `bench`
13+
* Can now run on vcfs without SAMPLE columns (e.g. annotation files)
14+
* Fixed `anno grpaf` header tags. Some tools (e.g. IGV) don't link Type before Number.
15+
* Fixed edge case when iterating variants not inside regions
16+
17+
# Truvari 5.2.0
18+
*February 16, 2025*
19+
20+
* The default `--align` method for `phab` and `bench` switched to POA. See [discussion](https://github.com/ACEnglish/truvari/discussions/261) for details.
21+
* Fix bug in `--pick ac` where FN/FP variants were not being counted/output.
22+
* Fix `--dup-to-ins` Ticket [#258](https://github.com/ACEnglish/truvari/issues/258)
23+
* `ga4gh` now also writes a variant count summary json
24+
25+
# Truvari 5.1.1
26+
*February 5, 2025*
27+
428
* `bench`
529
* new automatic hook into the refine step via `truvari bench --refine`
630
* `refine`

docs/api/truvari.package.rst

-2
Original file line numberDiff line numberDiff line change
@@ -104,8 +104,6 @@ Extra Methods
104104

105105
.. autofunction:: performance_metrics
106106

107-
.. autofunction:: phab
108-
109107
.. autofunction:: reciprocal_overlap
110108

111109
.. autofunction:: restricted_float

docs/bench.md

+11-6
Original file line numberDiff line numberDiff line change
@@ -149,13 +149,18 @@ Refining bench output
149149
=====================
150150
As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run
151151
```bash
152-
truvari bench -b base.vcf.gz -c comp.vcf.gz -o result/
153-
truvari refine --regions result/candidate.refine.bed \
154-
--reference reference.fasta \
155-
--recount --use-region-coords \
156-
result/
152+
truvari bench -b base.vcf.gz -c comp.vcf.gz --reference reference.fasta -o result/
153+
truvari refine result/
154+
```
155+
156+
Refine has a few parameters detailed in the [[refine wiki|refine]]. But if you'd ike to run refinement automatically with defaults, you can simply use
157+
```bash
158+
truvari bench -b base.vcf.gz
159+
-c comp.vcf.gz
160+
--reference reference.fasta
161+
-o result/
162+
--refine
157163
```
158-
See [[refine wiki|refine]] for details.
159164

160165
Comparing Sequences of Variants
161166
===============================

docs/collapse.md

+13-1
Original file line numberDiff line numberDiff line change
@@ -76,7 +76,19 @@ For some results, one may not want to collapse variants with conflicting genotyp
7676

7777
--intra
7878
=======
79-
When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
79+
When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. This will also add a `FORMAT/SUPP` field which will indicates which columns had a present GT (e.g. 0/1 or 1/1). For example, these two calls would collapse:
80+
81+
```
82+
#Call FORMAT S1 S2
83+
Call1 GT:GQ:AD 1/1 ./.
84+
Call2 GT:GQ:AD ./. 1/1
85+
86+
# Into
87+
#Call FORMAT S1
88+
Call1 GT:GQ:AD:SUPP 1/1:3
89+
```
90+
91+
Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
8092
```
8193
FORMAT RESULT1 RESULT2
8294
GT:GQ:AD ./.:.:3,0 1/1:20:0,30

docs/phab.md

+38-32
Original file line numberDiff line numberDiff line change
@@ -5,17 +5,18 @@ Truvari's comparison engine can match variants using a wide range of thresholds.
55

66
This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology.
77

8-
So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab`
8+
How do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` (pronounced 'fab')
99

1010
`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison.
1111

12-
Requirements
13-
------------
14-
Since `truvari phab` uses mafft v7.505 via a command-line call, it expects it to be in the environment path. Download mafft and have its executable available in the `$PATH` [mafft](https://mafft.cbrc.jp/alignment/software/)
12+
Alignment Methods
13+
-----------------
14+
By default, `phab` will make the haplotypes and use `abpoa` to perform a multiple sequence alignment between them and the reference to harmonize variants. Note that `abpoa` may be non-deterministic in some cases.
15+
16+
Another available MSA algorithm is an external call `mafft` which can have its parameters customized via e.g. `--mafft-params "--maxiterate 1000"`. While `mafft` is often a more accurate alignment technique, it isn't fast. This makes an external call to mafft v7.505. Therefore, phab expects the executable to be in the environment path. Download mafft and place its executable available in the `$PATH` [mafft](https://mafft.cbrc.jp/alignment/software/). Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use.
1517

16-
Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use.
18+
If you're willing to sacrifice accuracy for a huge speed increase, you can use `--align wfa`. However, this isn't an MSA technique and instead independently aligns each haplotype, which will not produce the most parsimonious set of variants.
1719

18-
Also, you can use wave front aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonous aligments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft.
1920

2021
Example
2122
-------
@@ -31,13 +32,13 @@ To start, let's use `truvari bench` to see how similar the variant calls are in
3132
```bash
3233
truvari bench --base phab_base.vcf.gz \
3334
--comp phab_comp.vcf.gz \
34-
--sizemin 1 --sizefilt 1 \
35+
--sizemin 0 \
3536
--bSample HG002 \
3637
--cSample syndip \
3738
--no-ref a \
3839
--output initial_bench
3940
```
40-
This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
41+
This will compare all variants greater than 1bp ( `-s 0` includes SNPs, `-s 1` would be 1bp INDELs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
4142
```json
4243
{
4344
"TP-base": 5,
@@ -50,24 +51,23 @@ This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNP
5051
}
5152
```
5253

53-
These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region.
54+
These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window.
55+
56+
Let's use `truvari phab` to harmonize the variants in this region.
5457
```bash
55-
truvari phab --base phab_base.vcf.gz \
56-
--comp phab_comp.vcf.gz \
57-
--bSample HG002 \
58-
--cSample syndip \
59-
--reference phab_ref.fa \
58+
truvari phab --reference phab_ref.fa \
6059
--region chr1:700-900 \
61-
-o harmonized.vcf.gz
60+
--samples HG002,syndip \
61+
-o harmonized.vcf.gz \
62+
phab_base.vcf.gz phab_comp.vcf.gz
6263
```
6364

6465
In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized.
6566

6667
```bash
6768
truvari bench -b harmonized.vcf.gz \
6869
-c harmonized.vcf.gz \
69-
-S 1 -s 1 \
70-
--no-ref a \
70+
-s --no-ref a \
7171
--bSample HG002 \
7272
--cSample syndip \
7373
-o harmonized_bench/
@@ -95,15 +95,16 @@ $ bcftools query -f "[%GT ]\n" phab_result/output.vcf.gz | sort | uniq -c
9595
```
9696
(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order)
9797

98-
MSA
99-
---
98+
MSA as Merging
99+
--------------
100+
101+
If you read the `truvari phab --help` , you may have noticed that multiple VCFs can be provided and there's a `--samples` parameter. This is by design so that we can harmonize the variants across any number of VCFs. Even a single, multi-sample VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
100102

101-
If you read the `truvari phab --help` , you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
103+
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
102104
```bash
103-
truvari phab -b phab_base.vcf.gz \
104-
-f phab_ref.fa \
105+
truvari phab -f phab_ref.fa \
105106
-r chr1:700-900 \
106-
-o msa_example.vcf.gz
107+
-o msa_example.vcf.gz \ phab_base.vcf.gz
107108
```
108109

109110
As a simple check, we can count the number of variants before/after `phab`:
@@ -115,7 +116,6 @@ The `160` original variants given to `phab` became just `60`.
115116

116117
Better yet, these fewer variants occur on fewer positions:
117118
```bash
118-
119119
bcftools query -r chr1:700-900 -f "%POS\n" phab_base.vcf.gz | sort | uniq | wc -l
120120
bcftools query -r chr1:700-900 -f "%POS\n" msa_example.vcf.gz | sort | uniq | wc -l
121121
```
@@ -143,15 +143,21 @@ The allele-count (AC) shows a 15% reduction in singletons and removal of all var
143143
1 150 1 81
144144
```
145145

146-
(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.)
147-
148-
`--align`
149-
=========
150-
By default, `phab` will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa` which performs a partial order alignment which is faster than mafft but less accurate and slower than wfa but more accurate. However, `poa` appears to be non-deterministic which is not ideal for some benchmarking purposes.
151-
152146
Limitations
153147
-----------
154148
* Creating and aligning haplotypes is impractical for very long sequences and maybe practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions.
155-
* By giving the variants new representations, variant counts will likely change.
156-
* Early testing on `phab` is on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution.
149+
* By giving the variants new representations, variant counts will likely change and all original variant metadata (e.g. `FORMAT/DP`) is lost.
150+
* `phab` should only be used on phased variants unless you really know what you're doing. While it can run on unphased variants, it isn't recommended. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build an accurate haplotype from the variants, and in many cases the final constructed haplotype will not represent any sequence observed in a sample, which will create entirely false variants. So you can try out unphased variants, but proceed with caution.
151+
152+
Failed Jobs
153+
-----------
154+
Sometimes the haplotype alignment job can fail outside of python's control. For example, `abpoa` can log an error on jobs with a large number of haplotypes.
155+
```
156+
[SIMDMalloc] posix_memalign fail!
157+
Size: 274877906944, Error: ENOMEM
158+
```
159+
Phab will monitor these processes and catch their errors. Therefore, it is possible for phab to have an exit status of 0, even though some (or even all) of the harmonization tasks did not complete. The variants inside these failed regions will not be in the output vcf.
157160

161+
`--no-dedup`
162+
------------
163+
By default, phab will only harmonize one representation of each observed haplotype. This gives huge performance boosts and produces identical results for `--align wfa/poa`. However, `mafft` results are not identical between when deduplicating haplotypes because it considers the relative weight of sequences based on their frequency.

docs/requirements.txt

+1
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,4 @@
1+
psutil>=7.0.0
12
pywfa>=0.5.1
23
sphinx>=7.2
34
sphinx_rtd_theme>=2

imgs/coverage.svg

+2-2
Loading

pyproject.toml

+6-5
Original file line numberDiff line numberDiff line change
@@ -13,17 +13,18 @@ license = { text = "MIT" }
1313
dynamic = ["version"]
1414
requires-python = ">=3.8"
1515
dependencies = [
16-
"pywfa>=0.5.1",
17-
"rich>=12.5.1",
16+
"bwapy>=0.1.4",
1817
"edlib>=1.3.9",
19-
"pysam>=0.22",
2018
"intervaltree>=3.1",
2119
"joblib>=1.2.0",
2220
"numpy>=1.24.4",
23-
"pytabix>=0.1",
24-
"bwapy>=0.1.4",
21+
"rich>=12.5.1",
2522
"pandas>=1.5.3",
23+
"psutil>=7.0.0",
2624
"pyabpoa>=1.4.3",
25+
"pysam>=0.22",
26+
"pytabix>=0.1",
27+
"pywfa>=0.5.1",
2728
]
2829

2930
[project.scripts]

0 commit comments

Comments
 (0)