docs/Updates.md
+32-1
@@ -1,6 +1,37 @@
1
-
# Truvari 5.0
1
+
# Truvari 5.1.1
2
2
*in progress*
3
3
4
+
*`bench`
5
+
* new automatic hook into the refine step via `truvari bench --refine`
6
+
*`refine`
7
+
* completely reworked UI in favor of easier whole-genome SV refinement. See wiki for details
8
+
* Now writes a consolidated `refine.base.vcf.gz` and `refine.comp.vcf.gz` for easier tracking of variants' final states.
9
+
* Default behavior is to count original variant representations instead of the `phab` variant representations
10
+
*`collapse`
11
+
* Add `--dup-to-ins`
12
+
* Fixed bug where regions with >100 variants would sometimes not have all variants compared
13
+
*`--chain` functionality now capped to do only 1 transitive match, preventing uncontrolled over-merging
14
+
*`ga4gh`
15
+
* New/renamed parameters as part of general improvement work
16
+
* Output suffixes are now `.base.vcf.gz` and `.comp.vcf.gz` for consistency.
17
+
*`stratify`
18
+
* `--complement` now outputs a single line of total variant counts outside of the regions instead of arbitrarily assigning variants to their nearest region
19
+
20
+
* misc
21
+
* Fix BND bugs
22
+
*`pysam.VariantFile.allele_variant_types` falsely identified some BNDs as INDELs, causing incorrect filtering by Truvari
23
+
* SVs Decomposed to BNDs strandedness flipped to be more representative of original SV
24
+
* unroll seqsim checks all directions
25
+
* Match sorting breaks seq/size ties with start/end distance
26
+
* Long SV roll limit speeds - ≥500bp, rolling is turned off
27
+
*`truvari.VariantRecord.within` edge case fix
28
+
29
+
30
+
31
+
32
+
# Truvari 5.0
33
+
*January 9, 2025*
34
+
4
35
* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
5
36
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
6
37
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
docs/bench.md
+20-4
@@ -178,17 +178,33 @@ Truvari can replace the symbolic alt of resolved SVs in the output VCF with the
178
178
179
179
BND Comparison
180
180
==============
181
-
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist` which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps( POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`
181
+
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist` which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`). Additionally, if the CIPOS and CIEND info tags are available in the entry, the position (e.g. POS) is further buffered by `-abs(CIPOS[0])` and `+abs(CIPOS[1])`.
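The buffered-boundary check described above can be sketched roughly as follows. This is only an illustration, not Truvari's code: the helper names `buffered_bounds` and `ranges_overlap` are hypothetical, and treating the boundaries as inclusive is an assumption.

```python
def buffered_bounds(pos, dist, ci=None):
    """Return (lower, upper) around pos, widened by dist and an optional CI tuple."""
    lower = pos - dist
    upper = pos + dist
    if ci is not None:
        # Widen further by the confidence interval, e.g. CIPOS=(-60, 60)
        lower -= abs(ci[0])
        upper += abs(ci[1])
    return lower, upper


def ranges_overlap(s1, e1, s2, e2):
    """True when the two ranges share at least one position (inclusive; an assumption)."""
    return max(s1, s2) <= min(e1, e2)


dist = 100
base = buffered_bounds(1000, dist)
near = buffered_bounds(1150, dist)
far = buffered_bounds(1250, dist)
base_with_ci = buffered_bounds(1000, dist, ci=(-60, 60))

near_is_candidate = ranges_overlap(*base, *near)
far_is_candidate = ranges_overlap(*base, *far)
far_is_candidate_with_ci = ranges_overlap(*base_with_ci, *far)
```

Because both variants are buffered by `dist`, two positions can be up to `2 * dist` apart and still overlap in this sketch.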
182
182
183
183
The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t).
184
184
185
185
BNDs are annotated in the truvari output with the fields StartDistance (baseline minus comparison POS), EndDistance (baseline minus comparison join position), and TruScore, which describes the percent of the allowed distance needed to find this match (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs whose start and end distances average 20bp, with a bnddist of 100, have a score of 90.
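As a worked instance of the TruScore formula above, here is a minimal sketch. The function name `bnd_truscore` is hypothetical and is not part of the Truvari API; only the arithmetic comes from the formula in the text.

```python
def bnd_truscore(start_distance, end_distance, bnddist):
    """Apply the documented formula to a pair of breakpoint distances."""
    if bnddist <= 0:
        raise ValueError("bnddist must be positive")
    mean_distance = (abs(start_distance) + abs(end_distance)) / 2
    return (1 - mean_distance / (bnddist * 2)) * 100


# Both breakpoints are 20bp away from their counterparts, bnddist of 100
score = bnd_truscore(20, 20, 100)
# Identical breakpoints give the maximum score
perfect = bnd_truscore(0, 0, 100)
```

The sign of each distance does not matter because the formula takes absolute values before averaging.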
186
186
187
-
Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCFs. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{self.chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<INV>`, `<DUP>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match kept.
187
+
BND comparison can be turned off by setting `--bnddist -1`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.
188
188
189
-
Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Note that with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV.
189
+
Cross-Representation Matching
190
+
=============================
190
191
191
-
BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.
192
+
Truvari considers there to be three possible representation styles of SVs.
193
+
194
+
1. Resolved: SVs with the full REF and ALT sequences, most frequently representing INS and DEL.
195
+
2. Symbolic: SVs without the REF or ALT sequences having an ALT of e.g. `<DEL>, <DUP>`, etc.
196
+
3. BNDs: SV breakends represented with the e.g. `t[p[` ALT field.
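To make the three styles concrete, the sketch below classifies a VCF ALT string by its syntax alone. It is an assumption-laden illustration based on the VCF ALT notation, not how Truvari determines representation; the function name `representation_style` is hypothetical.

```python
def representation_style(alt):
    """Guess the representation style of a single VCF ALT allele string."""
    if alt.startswith("<") and alt.endswith(">"):
        return "symbolic"  # e.g. <DEL>, <DUP>
    if "[" in alt or "]" in alt:
        return "bnd"  # e.g. N[chr1:100[
    if len(alt) > 1 and (alt.startswith(".") or alt.endswith(".")):
        return "single-end bnd"  # e.g. TTT.
    return "resolved"  # full ALT sequence


styles = [representation_style(a) for a in ("<DEL>", "N[chr1:100[", "TTT.", "ACGT")]
```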
197
+
198
+
Comparing SVs across these representation styles has the following caveats:
199
+
200
+
1. When comparing Resolved and Symbolic SVs, sequence similarity is turned off for thresholding matches. If a user provides a `--reference`, symbolic SVs shorter than the `--max-resolve` parameter (default 25kbp) can be turned into Resolved SVs ([details in API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.resolve)), and therefore the sequence similarity thresholds are still enforced.
201
+
2. When a BND is compared with a Resolved or Symbolic SV, the SV is 'decomposed' into a set of BNDs and each is compared with the original BND. If any of the decomposed BNDs matches the original BND, the Resolved/Symbolic SV and BND are considered matching. Details of SV decomposition are [in the API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.decompose).
202
+
203
+
Note that only Deletions (symbolic or resolved), INV (symbolic or resolved), and symbolic DUPs can be decomposed into BNDs. DUPs are always decomposed into DUP:TANDEM breakends.
204
+
205
+
Because SVs decompose into multiple BNDs (2 for DEL/DUP, 4 for INV), and because `--pick single` is the default, a decomposed SV will only match to one BND and that BND's 'mate' will be a FN. To enable all BNDs to match to a decomposed SV, specify `--pick multi`.
206
+
207
+
SV decomposition into BNDs can be turned off with `--no-decompose`.
WARNING! If you have symbolic variants, see [the below section](https://github.com/ACEnglish/truvari/wiki/collapse#symbolic-variants) on using bcftools.
11
12
12
13
This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt.
13
14
For example, consider two vcfs:
@@ -40,6 +41,26 @@ For example, if we collapsed our example merge.vcf by matching any calls within
40
41
>> truvari_collapsed.vcf
41
42
chr1 7 ... GT ./. 0/1
42
43
44
+
Symbolic Variants
45
+
=================
46
+
bcftools may not handle symbolic variants correctly since it doesn't consider their END position. To correct for this, ensure that every input variant has a unique ID and use `bcftools merge -m id`. For example:
47
+
```
48
+
# A.vcf
49
+
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
50
+
# B.vcf
51
+
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
52
+
53
+
# bcftools merge -m none A.vcf B.vcf
54
+
# Premature collapse
55
+
chr1 147022730 SV1;SV2 N <DEL> . PASS SVLEN=-570334;END=147593064
56
+
57
+
# bcftools merge -m id A.vcf B.vcf
58
+
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
59
+
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
60
+
```
61
+
62
+
This bug has been replicated with bcftools 1.18 and 1.21.
63
+
43
64
--choose behavior
44
65
=================
45
66
When collapsing, the default `--choose` behavior is to take the `first` variant by position from a cluster to
@@ -89,18 +110,22 @@ will become:
89
110
Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant only needs to match at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants whereas using `--chain` will produce one.
90
111
For example, if we have
91
112
92
-
chr1 5 ..
113
+
chr1 1 ..
114
+
chr1 4 ..
93
115
chr1 7 ..
94
-
chr1 9 ..
116
+
chr1 10 ..
95
117
96
-
When we collapse anything within 2bp of each other, without `--chain`, we output:
118
+
We take the `chr1 1` variant and find all its matches. When we collapse anything within 5bp of each other, without `--chain`, we output:
97
119
98
-
chr1 5 ..
99
-
chr1 9 ..
120
+
chr1 1 ..
121
+
chr1 7 ..
122
+
123
+
With `--chain`, we would allow one level of transitive matching. This means that after finding the `chr1 1 -> chr1 4` match, we check `chr1 4` against all the remaining variants and would output
100
124
101
-
With `--chain`, we would collapse `chr1 9` as well, producing
125
+
chr1 1 ..
126
+
chr1 10 ..
102
127
103
-
chr1 5 ..
128
+
Note that this leaves `chr1 10` because we don't do multiple levels of transitive matching, meaning we never compare `chr1 7` to `chr1 10`. This is preferred because otherwise variants which have a continuous range of similarity could all be collapsed into a single variant. For example, if the positions in this example were sizes, we wouldn't want the 1bp variant to be the kept representation for all the variants.
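The single level of transitive matching can be mimicked with a toy model that treats each variant as a bare position. This is a sketch under simplifying assumptions (unique positions, distance as the only matching criterion, the first variant by position kept); `collapse_positions` is a hypothetical helper, not Truvari's implementation.

```python
def collapse_positions(positions, max_dist, chain=False):
    """Toy collapse: return the positions kept as representatives."""
    remaining = sorted(positions)
    kept = []
    while remaining:
        keep = remaining.pop(0)
        kept.append(keep)
        # Direct matches to the kept variant are collapsed into it.
        matches = [p for p in remaining if abs(p - keep) <= max_dist]
        remaining = [p for p in remaining if p not in matches]
        if chain:
            # One level of transitive matching: compare each direct match
            # (never the matches' own matches) against what is left.
            chained = [p for p in remaining
                       if any(abs(p - m) <= max_dist for m in matches)]
            remaining = [p for p in remaining if p not in chained]
    return kept


positions = [1, 4, 7, 10]
without_chain = collapse_positions(positions, max_dist=5)
with_chain = collapse_positions(positions, max_dist=5, chain=True)
```

In this toy model, `without_chain` keeps positions 1 and 7, while `with_chain` keeps 1 and 10, mirroring the example above.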
104
129
105
130
Annotations
106
131
===========
@@ -111,55 +136,4 @@ The output file has only two annotations added to the `INFO`.
111
136
-`NumCollapsed` - Number of variants collapsed into this variant
112
137
-`NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes
113
138
114
-
The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
--hap Collapsing a single individual's haplotype resolved calls (False)
139
-
--chain Chain comparisons to extend possible collapsing (False)
140
-
--no-consolidate Skip consolidation of sample genotype fields (True)
141
-
--null-consolidate NULL_CONSOLIDATE
142
-
Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null
143
-
from all neighbors (None)
144
-
145
-
Comparison Threshold Arguments:
146
-
-r REFDIST, --refdist REFDIST
147
-
Max reference location distance (500)
148
-
-p PCTSIM, --pctsim PCTSIM
149
-
Min percent allele sequence similarity. Set to 0 to ignore. (0.95)
150
-
-B MINHAPLEN, --minhaplen MINHAPLEN
151
-
Minimum haplotype sequence length to create (50)
152
-
-P PCTSIZE, --pctsize PCTSIZE
153
-
Min pct allele size similarity (minvarsize/maxvarsize) (0.95)
154
-
-O PCTOVL, --pctovl PCTOVL
155
-
Min pct reciprocal overlap (0.0) for DEL events
156
-
-t, --typeignore Variant types don't need to match to compare (False)
157
-
--use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False)
158
-
159
-
Filtering Arguments:
160
-
-s SIZEMIN, --sizemin SIZEMIN
161
-
Minimum variant size to consider for comparison (50)
162
-
-S SIZEMAX, --sizemax SIZEMAX
163
-
Maximum variant size to consider for comparison (50000)
164
-
--passonly Only consider calls with FILTER == PASS
165
-
```
139
+
The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.