docs/Updates.md
+25-1
@@ -1,6 +1,30 @@
1
-
# Truvari 5.1.1
1
+
# Truvari 5.3
2
2
*in progress*
3
3
4
+
* Fixed FP BNDs being dropped [details](https://github.com/ACEnglish/truvari/discussions/263).
5
+
* Restore default `--sizemax` - Some callers make SVs that span the entire chromosome, which disrupts truvari's chunking strategy
6
+
* `phab`
7
+
* Can now harmonize samples' variants across any number of VCFs. This entails a UI change of no more `-b/-c`.
8
+
* For large jobs, a `--lowmem` flag:
9
+
* Turns on progress bars and process pool monitoring for better tracking of failed harmonization jobs
10
+
* Much lower memory usage, and fewer failures
11
+
* API refactor to programmatically build haplotypes with `phab.VCFtoHaplotypes` and other phab functions
12
+
* `bench`
13
+
* Can now run on vcfs without SAMPLE columns (e.g. annotation files)
14
+
* Fixed `anno grpaf` header tags. Some tools (e.g. IGV) don't like Type before Number.
15
+
* Fixed edge case when iterating variants not inside regions
16
+
17
+
# Truvari 5.2.0
18
+
*February 16, 2025*
19
+
20
+
* The default `--align` method for `phab` and `bench` switched to POA. See [discussion](https://github.com/ACEnglish/truvari/discussions/261) for details.
21
+
* Fix bug in `--pick ac` where FN/FP variants were not being counted/output.
docs/bench.md
+11-6
@@ -149,13 +149,18 @@ Refining bench output
149
149
=====================
150
150
As described in the [[refine wiki|refine]], a limitation of Truvari bench is 1-to-1 variant comparison. However, `truvari refine` can harmonize the variants to give them more consistent representations. A bed file named `candidate.refine.bed` is created by `truvari bench` and holds a set of regions which may benefit from refinement. To use it, simply run
docs/collapse.md
+13-1
@@ -76,7 +76,19 @@ For some results, one may not want to collapse variants with conflicting genotyp
76
76
77
77
--intra
78
78
=======
79
-
When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per-input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
79
+
When a single sample is run through multiple SV callers, one may wish to consolidate those results. After the `bcftools merge` of the VCFs, there will be one SAMPLE column per input. With `--intra`, collapse will consolidate the sample information so that only a single sample column is present in the output. This will also add a `FORMAT/SUPP` field which indicates which columns had a present GT (e.g. 0/1 or 1/1). For example, these two calls would collapse:
80
+
81
+
```
82
+
#Call FORMAT S1 S2
83
+
Call1 GT:GQ:AD 1/1 ./.
84
+
Call2 GT:GQ:AD ./. 1/1
85
+
86
+
# Into
87
+
#Call FORMAT S1
88
+
Call1 GT:GQ:AD:SUPP 1/1:3
89
+
```
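The `SUPP` value in the example above is consistent with a bit flag over the input columns. The following is a minimal Python sketch of that idea, not Truvari's implementation. It assumes that bit `i` is set when column `i` carries a called non-reference allele, and that the per-column merge keeps the first present genotype; the helper names `has_present_gt`, `merge_columns`, and `supp_flag` are hypothetical.

```python
def has_present_gt(gt):
    """True if the genotype string has at least one called non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in (".", "0", "") for a in alleles)


def merge_columns(call_a, call_b):
    """Per column, keep call_a's genotype if present, otherwise take call_b's (assumed rule)."""
    return [a if has_present_gt(a) else b for a, b in zip(call_a, call_b)]


def supp_flag(column_gts):
    """Set bit i when column i has a present genotype (assumed encoding)."""
    flag = 0
    for i, gt in enumerate(column_gts):
        if has_present_gt(gt):
            flag |= 1 << i
    return flag


call1 = ["1/1", "./."]  # S1, S2 genotypes of Call1
call2 = ["./.", "1/1"]  # S1, S2 genotypes of Call2
merged = merge_columns(call1, call2)
print(merged, supp_flag(merged))
```

Under these assumptions, the two calls together support both columns, which gives the flag `3` shown in the collapsed record.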
90
+
91
+
Since the multiple callers may have different genotypes or other FORMAT fields with conflicting information, `--intra` takes the first column from the VCF, then second, etc. For example, if we have an entry with:
docs/phab.md
+38-32
@@ -5,17 +5,18 @@ Truvari's comparison engine can match variants using a wide range of thresholds.
5
5
6
6
This problem is easiest to conceptualize in the case of 'split' variants: imagine a pipeline calls a single 100bp DEL that can also be represented as two 50bp DELs. To match these variants, we would need to loosen our thresholds to `--pick multi --pctsim 0.50 --pctsize 0.50`. Plus, these thresholds leave no margin for error. If the variant caller erroneously deleted an extra base to make a 101bp DEL we would have to lower our thresholds even further. These thresholds are already too low because there's plenty of distinct alleles with >= 50% homology.
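The arithmetic behind the 'split' variant example can be sketched in a few lines of Python. This assumes size similarity is the ratio of the smaller to the larger variant size, which is how `--pctsize` is commonly described; the function name `size_similarity` is hypothetical and this is not Truvari's code.

```python
def size_similarity(size_a, size_b):
    """Ratio of the smaller to the larger variant size (assumed definition)."""
    if size_a == 0 or size_b == 0:
        return 0.0
    return min(size_a, size_b) / max(size_a, size_b)


threshold = 0.50

# One 100bp DEL compared against one of the two 50bp DELs: exactly at the threshold.
exact = size_similarity(100, 50)
print(exact, exact >= threshold)

# An erroneous extra base (101bp DEL) falls just under the threshold.
off_by_one = size_similarity(101, 50)
print(round(off_by_one, 3), off_by_one >= threshold)
```

The first comparison sits exactly on the 0.50 threshold, so a single extra deleted base is enough to lose the match.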
7
7
8
-
So how do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab`
8
+
How do we deal with inconsistent representations? In an ideal world, we would simply get rid of them by harmonizing the variants. This is the aim of `truvari phab` (pronounced 'fab').
9
9
10
10
`truvari phab` is designed to remove variant representation inconsistencies through harmonization. By reconstructing haplotypes from variants, running multiple-sequence alignment of the haplotypes along with the reference, and then recalling variants, we expect to remove discordance between variant representations and simplify the work required to perform variant comparison.
11
11
12
-
Requirements
13
-
------------
14
-
Since `truvari phab` uses mafft v7.505 via a command-line call, it expects it to be in the environment path. Download mafft and have its executable available in the `$PATH`[mafft](https://mafft.cbrc.jp/alignment/software/)
12
+
Alignment Methods
13
+
-----------------
14
+
By default, `phab` will make the haplotypes and use `abpoa` to perform a multiple sequence alignment between them and the reference to harmonize variants. Note that `abpoa` may be non-deterministic in some cases.
15
+
16
+
Another available MSA algorithm is an external call to `mafft`, which can have its parameters customized via e.g. `--mafft-params "--maxiterate 1000"`. While `mafft` is often a more accurate alignment technique, it isn't fast. Because this makes an external call to mafft v7.505, phab expects the executable to be in the environment path. Download [mafft](https://mafft.cbrc.jp/alignment/software/) and make its executable available in the `$PATH`. Alternatively, you can use the Truvari [Docker container](Development#docker) which already has mafft ready for use.
15
17
16
-
Alternatively, you can use the Truvari [Docker container](Development#docker)which already has mafft ready for use.
18
+
If you're willing to sacrifice accuracy for a huge speed increase, you can use `--align wfa`. However, this isn't an MSA technique and instead independently aligns each haplotype, which will not produce the most parsimonious set of variants.
17
19
18
-
Also, you can use wave front aligner (pyWFA) or partial order alignment (pyabpoa). While wfa is the fastest approach, it will independently align haplotypes and therefore may produce less parsimonous aligments. And while poa is more accurate than wfa and faster than mafft, it is less accurate than mafft.
19
20
20
21
Example
21
22
-------
@@ -31,13 +32,13 @@ To start, let's use `truvari bench` to see how similar the variant calls are in
31
32
```bash
32
33
truvari bench --base phab_base.vcf.gz \
33
34
--comp phab_comp.vcf.gz \
34
-
--sizemin 1 --sizefilt 1 \
35
+
--sizemin 0 \
35
36
--bSample HG002 \
36
37
--cSample syndip \
37
38
--no-ref a \
38
39
--output initial_bench
39
40
```
40
-
This will compare all variants greater than 1bp ( `-S 1 -s 1`which includes SNPs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
41
+
This will compare all variants regardless of size (`-s 0` includes SNPs; `-s 1` would start at 1bp INDELs) from the `HG002` sample to the `syndip` sample. We're also excluding any uncalled or reference homozygous sites with `--no-ref a`. The report in `initial_bench/summary.txt` shows:
41
42
```json
42
43
{
43
44
"TP-base": 5,
@@ -50,24 +51,23 @@ This will compare all variants greater than 1bp ( `-S 1 -s 1` which includes SNP
50
51
}
51
52
```
52
53
53
-
These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window. Let's use `truvari phab` to harmonize the variants in this region.
54
+
These variants are pretty poorly matched, especially considering the `HG002` and `syndip` samples are using the same sequencing experiment. We can also inspect the `initial_bench/fn.vcf.gz` and see a lot of these discordant calls are concentrated in a 200bp window.
55
+
56
+
Let's use `truvari phab` to harmonize the variants in this region.
54
57
```bash
55
-
truvari phab --base phab_base.vcf.gz \
56
-
--comp phab_comp.vcf.gz \
57
-
--bSample HG002 \
58
-
--cSample syndip \
59
-
--reference phab_ref.fa \
58
+
truvari phab --reference phab_ref.fa \
60
59
--region chr1:700-900 \
61
-
-o harmonized.vcf.gz
60
+
--samples HG002,syndip \
61
+
-o harmonized.vcf.gz \
62
+
phab_base.vcf.gz phab_comp.vcf.gz
62
63
```
63
64
64
65
In our `harmonized.vcf.gz` we can see there are now only 9 variants. Let's run `truvari bench` again on the output to see how well the variants match after being harmonized.
(We can ignore the phasing differences (`0/1` vs. `1/0`). These pipelines reported the parental alleles in a different order)
97
97
98
-
MSA
99
-
---
98
+
MSA as Merging
99
+
--------------
100
+
101
+
If you read the `truvari phab --help`, you may have noticed that multiple VCFs can be provided and there's a `--samples` parameter. This is by design so that we can harmonize the variants across any number of VCFs, or even within a single, multi-sample VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population.
100
102
101
-
If you read the `truvari phab --help` , you may have noticed that the `--comp` VCF is optional. This is by design so that we can also harmonize the variants inside a single VCF. By performing a multiple-sequence alignment across samples, we can better represent variation across a population. To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
103
+
To see this in action, let's run `phab` on all 86 samples in the `repo_utils/test_files/phab_base.vcf.gz`
102
104
```bash
103
-
truvari phab -b phab_base.vcf.gz \
104
-
-f phab_ref.fa \
105
+
truvari phab -f phab_ref.fa \
105
106
-r chr1:700-900 \
106
-
-o msa_example.vcf.gz
107
+
-o msa_example.vcf.gz phab_base.vcf.gz
107
108
```
108
109
109
110
As a simple check, we can count the number of variants before/after `phab`:
@@ -115,7 +116,6 @@ The `160` original variants given to `phab` became just `60`.
115
116
116
117
Better yet, these fewer variants occur on fewer positions:
@@ -143,15 +143,21 @@ The allele-count (AC) shows a 15% reduction in singletons and removal of all var
143
143
1 150 1 81
144
144
```
145
145
146
-
(TODO: pull the adotto TR region annotations and run this example through `truvari anno trf`. I bet we'll get a nice spectrum of copy-diff of the same motif in the `phab` calls.)
147
-
148
-
`--align`
149
-
=========
150
-
By default, `phab` will make the haplotypes and use an external call `mafft` to perform a multiple sequence alignment between them and the reference to harmonize the variants. While this is the most accurate alignment technique, it isn't fast. If you're willing to sacrifice some accuracy for a huge speed increase, you can use `--align wfa`, which also doesn't require an external tool. Another option is `--align poa` which performs a partial order alignment which is faster than mafft but less accurate and slower than wfa but more accurate. However, `poa` appears to be non-deterministic which is not ideal for some benchmarking purposes.
151
-
152
146
Limitations
153
147
-----------
154
148
* Creating and aligning haplotypes is impractical for very long sequences and maybe practically impossible for entire human chromosomes. Therefore, `truvari phab` is recommended to only be run on sub-regions.
155
-
* By giving the variants new representations, variant counts will likely change.
156
-
* Early testing on `phab` is on phased variants. While it can run on unphased variants, we can't yet recommend it. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build a consensus sequence. So you can try out unphased variants, but proceed with caution.
149
+
* By giving the variants new representations, variant counts will likely change and all original variant metadata (e.g. `FORMAT/DP`) is lost.
150
+
* `phab` should only be used on phased variants unless you really know what you're doing. While it can run on unphased variants, it isn't recommended. If regions contain unphased Hets or overlapping variants, it becomes more difficult to build an accurate haplotype from the variants. In many cases the final constructed haplotype will not represent any sequence observed in a sample, which will create entirely false variants. So you can try out unphased variants, but proceed with caution.
151
+
152
+
Failed Jobs
153
+
-----------
154
+
Sometimes the haplotype alignment job can fail outside of python's control. For example, `abpoa` can log an error on jobs with a large number of haplotypes.
155
+
```
156
+
[SIMDMalloc] posix_memalign fail!
157
+
Size: 274877906944, Error: ENOMEM
158
+
```
159
+
Phab will monitor these processes and catch their errors. Therefore, it is possible for phab to have an exit status of 0, even though some (or even all) of the harmonization tasks did not complete. The variants inside these failed regions will not be in the output vcf.
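To illustrate why the overall exit status can still be 0, here is a minimal Python sketch of a parent process that runs each job in its own child process and records failures instead of crashing. It is an illustration only and does not reflect how phab is implemented; the function `run_jobs`, the second region name, and the use of `os._exit(1)` as a stand-in for an aligner dying abruptly are all assumptions.

```python
import subprocess
import sys


def run_jobs(jobs):
    """Run each (region, python_source) job in a child process.

    Returns two lists of region names: those that finished and those that failed.
    """
    done, failed = [], []
    for region, source in jobs:
        proc = subprocess.run([sys.executable, "-c", source], capture_output=True)
        # A non-zero return code means the child died or exited with an error.
        (done if proc.returncode == 0 else failed).append(region)
    return done, failed


jobs = [
    ("chr1:700-900", "print('harmonized')"),
    # Hypothetical region whose job exits abruptly, bypassing normal cleanup.
    ("chr1:5000-5200", "import os; os._exit(1)"),
]
done, failed = run_jobs(jobs)
print("completed:", done)
print("failed:", failed)
```

The parent finishes normally either way, so checking its exit status alone would not reveal that one region produced no output.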
157
160
161
+
`--no-dedup`
162
+
------------
163
+
By default, phab will only harmonize one representation of each observed haplotype. This gives huge performance boosts and produces identical results for `--align wfa/poa`. However, `mafft` results are not identical when deduplicating haplotypes because it considers the relative weight of sequences based on their frequency.
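The deduplication idea can be sketched in Python as follows: identical haplotype sequences are grouped, only one representative per group is aligned, and the result is copied back to every member. This is a sketch under assumptions and not phab's code; `dedup_haplotypes`, `expand_results`, the haplotype names, and the use of `len` as a stand-in for an alignment step are all hypothetical.

```python
from collections import defaultdict


def dedup_haplotypes(haplotypes):
    """Group haplotype names by identical sequence.

    Returns (unique, members): unique maps a representative name to its sequence,
    members maps that representative to every name sharing the sequence.
    """
    representative = {}
    members = defaultdict(list)
    for name, seq in haplotypes.items():
        rep = representative.setdefault(seq, name)
        members[rep].append(name)
    unique = {rep: haplotypes[rep] for rep in members}
    return unique, dict(members)


def expand_results(results, members):
    """Copy each representative's result back to all haplotypes it stands for."""
    return {name: results[rep] for rep, names in members.items() for name in names}


haps = {"HG002_1": "ACGT", "HG002_2": "ACGGT", "syndip_1": "ACGT"}
unique, members = dedup_haplotypes(haps)

# Stand-in for the expensive alignment step, run once per unique sequence.
results = {rep: len(seq) for rep, seq in unique.items()}
per_haplotype = expand_results(results, members)
print(len(haps), "haplotypes,", len(unique), "alignments")
```

An aligner that treats each sequence independently gives the same answer either way, whereas a method that weights sequences by how often they occur sees a different input after deduplication, which matches the `mafft` caveat above.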