Commit 14fa7b6

Merge branch 'release/v5.1.1'
2 parents 7a45d98 + 61248ee commit 14fa7b6

153 files changed

+9103
-7463
lines changed

docs/Updates.md

+32-1
@@ -1,6 +1,37 @@
1-
# Truvari 5.0
1+
# Truvari 5.1.1
22
*in progress*
33

4+
* `bench`
5+
* new automatic hook into the refine step via `truvari bench --refine`
6+
* `refine`
7+
* completely reworked UI in favor of easier whole-genome SV refinement. See wiki for details
8+
* Now writes a consolidated `refine.base.vcf.gz` and `refine.comp.vcf.gz` for easier tracking of variants' final states.
9+
* Default behavior is to count the original variant representations instead of the `phab` variant representations
10+
* `collapse`
11+
* Add `--dup-to-ins`
12+
* Fixed bug where regions with >100 variants would sometimes not have all variants compared
13+
* `--chain` functionality now capped to do only 1 transitive match, preventing uncontrolled over-merging
14+
* `ga4gh`
15+
* New/renamed parameters as part of general improvement work
16+
* Output suffixes are now `.base.vcf.gz` and `.comp.vcf.gz` for consistency.
17+
* `stratify`
18+
* `--complement` now outputs a single line of total variant counts outside of the regions instead of arbitrarily assigning variants to their nearest region
19+
20+
* misc
21+
* Fix BND bugs
22+
* `pysam.VariantFile.allele_variant_types` falsely identified some BNDs as INDELs, causing incorrect filtering by Truvari
23+
* Strandedness of SVs decomposed to BNDs flipped to be more representative of the original SV
24+
* unroll seqsim checks all directions
25+
* Match sorting breaks seq/size ties with start/end distance
26+
* Roll limit for long SVs to improve speed: for SVs ≥500bp, rolling is turned off
27+
* `truvari.VariantRecord.within` edge case fix
28+
29+
30+
31+
32+
# Truvari 5.0
33+
*January 9, 2025*
34+
435
* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
536
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
637
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.

docs/bench.md

+20-4
@@ -178,17 +178,33 @@ Truvari can replace the symbolic alt of resolved SVs in the output VCF with the
178178

179179
BND Comparison
180180
==============
181-
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist` which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps( POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`
181+
Breakend (BND) variants are compared by checking a few conditions using a single threshold of `--bnddist`, which holds the maximum distance around a breakpoint position to search for a match. Similar to the `--refdist` parameter, truvari looks for overlaps between the `dist` 'buffered' boundaries (e.g. `overlaps(POS_base - dist, POS_base + dist, POS_comp - dist, POS_comp + dist)`). Additionally, if the CIPOS and CIEND info tags are available in the entry, the position (e.g. POS) is further buffered by `-abs(CIPOS[0])` and `+abs(CIPOS[1])`.
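The buffered overlap check described above can be sketched as follows. This is a minimal illustration and not Truvari's implementation: the function names `buffered_bounds`, `overlaps`, and `positions_match` are hypothetical, and treating the boundaries as closed (inclusive) intervals is an assumption.

```python
def buffered_bounds(pos, dist, cipos=(0, 0)):
    """Return (lower, upper) search bounds around a breakpoint position.

    `cipos` mimics the CIPOS INFO tag; (0, 0) means no confidence interval.
    """
    lower = pos - dist - abs(cipos[0])
    upper = pos + dist + abs(cipos[1])
    return lower, upper


def overlaps(s1, e1, s2, e2):
    """True if the closed intervals [s1, e1] and [s2, e2] share a position.

    Inclusive boundaries are an assumption made for this sketch.
    """
    return max(s1, s2) <= min(e1, e2)


def positions_match(pos_base, pos_comp, dist, cipos_base=(0, 0), cipos_comp=(0, 0)):
    """Check whether two breakpoint positions fall within the buffered distance."""
    return overlaps(*buffered_bounds(pos_base, dist, cipos_base),
                    *buffered_bounds(pos_comp, dist, cipos_comp))
```

With `dist=100`, positions 1000 and 1150 have overlapping buffers while 1000 and 1250 do not. Adding a confidence interval such as `cipos_comp=(-60, 60)` widens the comparison boundaries enough for the second pair to overlap.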
182182

183183
The baseline and comparison BNDs' POS and their joined position must both be within `--bnddist` to be a match candidate (i.e. no partial matches). Furthermore, the direction and strand of the two BNDs must match, for example `t[p[` (piece extending to the right of p is joined after t) only matches with `t[p[` and won't match to `[p[t` (reverse comp piece extending right of p is joined before t).
184184

185185
BNDs are annotated in the truvari output with the fields: StartDistance (baseline minus comparison POS); EndDistance (baseline minus comparison join position); and TruScore, which describes the percent of the allowed distance needed to find this match (`(1 - ((abs(StartDistance) + abs(EndDistance)) / 2) / (bnddist*2)) * 100`). For example, two BNDs 20bp apart with a bnddist of 100 have a score of 90.
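As a worked sketch of the TruScore formula above, the hypothetical helper `bnd_truscore` (not part of Truvari's API) evaluates it directly. It assumes that "20bp apart" means both StartDistance and EndDistance are 20.

```python
def bnd_truscore(start_distance, end_distance, bnddist):
    """Percent of the allowed distance left over after matching two BNDs.

    Direct transcription of the documented formula:
    (1 - ((|StartDistance| + |EndDistance|) / 2) / (bnddist * 2)) * 100
    """
    mean_distance = (abs(start_distance) + abs(end_distance)) / 2
    return (1 - mean_distance / (bnddist * 2)) * 100


# Both breakpoints offset by 20bp, with a --bnddist of 100
score = bnd_truscore(20, 20, 100)
```

Because absolute values are used, the sign of either distance does not change the score. Identical breakpoints give a score of 100.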
186186

187-
Another complication for matching BNDs is that they may represent an event which could be 'resolved' in another VCFs. For example, a tandem duplication between `start-end` could be represented as two BNDs of `start to N[{chrom}:{end}[` and `end to ]{self.chrom}:{start}]N`. Therefore, truvari also attempts to compare symbolic alt SVs (ALT = `<DEL>`, `<INV>`, `<DUP>`) to a BND by decomposing the symbolic alt into its breakpoints. These decomposed BNDs are then each checked against a comparison BND and the highest TruScore match kept.
187+
BND comparison can be turned off by setting `--bnddist -1`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.
188188

189-
Note that DUPs are always decomposed to DUP:TANDEM breakpoints. Note that with `--pick single`, a decomposed SV will only match to one BND, so `--pick multi` is recommended to ensure all BNDs will match to a single decomposed SV.
189+
Cross-Representation Matching
190+
=============================
190191

191-
BND comparison can be turned off by setting `--bnddist -1`. Symbolic ALT decomposition can be turned off with `--no-decompose`. Single-end BNDs (e.g. ALT=`TTT.`) are still ignored.
192+
Truvari considers there to be three possible representation styles of SVs.
193+
194+
1. Resolved: SVs with the full REF and ALT sequences, most frequently representing INS and DEL.
195+
2. Symbolic: SVs without the REF or ALT sequences, having an ALT of e.g. `<DEL>`, `<DUP>`, etc.
196+
3. BNDs: SV breakends represented with an ALT field such as `t[p[`.
197+
198+
Comparing SVs across these representation styles has the following caveats:
199+
200+
1. When comparing Resolved and Symbolic SVs, sequence similarity is turned off for thresholding matches. If a user provides a `--reference`, symbolic SVs shorter than the `--max-resolve` parameter (default 25kbp) can be turned into Resolved SVs ([details in API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.resolve)), and therefore the sequence similarity thresholds are still enforced.
201+
2. When a BND is compared with a Resolved or Symbolic SV, the SV is 'decomposed' into a set of BNDs and each is compared with the original BND. If any of the decomposed BNDs matches the original BND, the Resolved/Symbolic SV and BND are considered matching. Details of SV decomposition are [in the API docs](https://truvari.readthedocs.io/en/latest/truvari.package.html#truvari.VariantRecord.decompose).
202+
203+
Note that only DELs (symbolic or resolved), INVs (symbolic or resolved), and symbolic DUPs can be decomposed into BNDs. DUPs are always decomposed into DUP:TANDEM breakends.
204+
205+
Because SVs decompose into multiple BNDs (2 for DEL/DUP, 4 for INV), and because `--pick single` is the default, a decomposed SV will only match to one BND and the BND's 'mate' will be a FN. To enable all BNDs to match to a decomposed SV, specify `--pick multi`.
206+
207+
SV decomposition into BNDs can be turned off with `--no-decompose`.
192208

193209
Controlling the number of matches
194210
=================================

docs/collapse.md

+33-59
@@ -8,6 +8,7 @@ To start, we merge multiple VCFs (each with their own sample) and ensure there a
88
```bash
99
bcftools merge -m none one.vcf.gz two.vcf.gz | bgzip > merge.vcf.gz
1010
```
11+
WARNING! If you have symbolic variants, see [the below section](https://github.com/ACEnglish/truvari/wiki/collapse#symbolic-variants) on using bcftools.
1112

1213
This will `paste` SAMPLE information between vcfs when calls have the exact same chrom, pos, ref, and alt.
1314
For example, consider two vcfs:
@@ -40,6 +41,26 @@ For example, if we collapsed our example merge.vcf by matching any calls within
4041
>> truvari_collapsed.vcf
4142
chr1 7 ... GT ./. 0/1
4243

44+
Symbolic Variants
45+
=================
46+
bcftools may not handle symbolic variants correctly since it doesn't consider their END position. To correct for this, ensure that every input variant has a unique ID and use `bcftools merge -m id`. For example:
47+
```
48+
# A.vcf
49+
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
50+
# B.vcf
51+
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
52+
53+
# bcftools merge -m none A.vcf B.vcf
54+
# Premature collapse
55+
chr1 147022730 SV1;SV2 N <DEL> . PASS SVLEN=-570334;END=147593064
56+
57+
# bcftools merge -m id A.vcf B.vcf
58+
chr1 147022730 SV1 N <DEL> . PASS SVLEN=-570334;END=147593064
59+
chr1 147022730 SV2 N <DEL> . PASS SVLEN=-990414;END=148013144
60+
```
61+
62+
This bug has been replicated with bcftools 1.18 and 1.21.
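The premature collapse can be illustrated with a toy model of the two grouping keys. This is only a sketch of the idea and not how bcftools is implemented: the record dictionaries and the `group_records` helper are hypothetical, and the assumption is that merging without IDs effectively keys records on chrom, pos, ref, and alt, so END never participates.

```python
from collections import defaultdict

# The two symbolic deletions from A.vcf and B.vcf above
records = [
    {"chrom": "chr1", "pos": 147022730, "id": "SV1", "ref": "N",
     "alt": "<DEL>", "end": 147593064},
    {"chrom": "chr1", "pos": 147022730, "id": "SV2", "ref": "N",
     "alt": "<DEL>", "end": 148013144},
]


def group_records(records, key_fields):
    """Group record IDs by the tuple of values found in `key_fields`."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[field] for field in key_fields)].append(rec["id"])
    return dict(groups)


# Keyed on site and alleles only: both deletions share one key
by_site = group_records(records, ("chrom", "pos", "ref", "alt"))

# Keyed on unique IDs: each deletion keeps its own line
by_id = group_records(records, ("id",))
```

Here `by_site` holds a single group containing both `SV1` and `SV2` even though their END positions differ by hundreds of kilobases, while `by_id` keeps them apart.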
63+
4364
--choose behavior
4465
=================
4566
When collapsing, the default `--choose` behavior is to take the `first` variant by position from a cluster to
@@ -89,18 +110,22 @@ will become:
89110
Normally, every variant in a set of variants that are collapsed together matches every other variant in the set. However, when using `--chain` mode, we allow 'transitive matching'. This means that each variant only needs to match at least one other variant in the set. In situations where a 'middle' variant has two matches that don't match each other, without `--chain` the locus will produce two variants, whereas using `--chain` will produce one.
90111
For example, if we have
91112

92-
chr1 5 ..
113+
chr1 1 ..
114+
chr1 4 ..
93115
chr1 7 ..
94-
chr1 9 ..
116+
chr1 10 ..
95117

96-
When we collapse anything within 2bp of each other, without `--chain`, we output:
118+
We take the `chr1 1` variant and find all its matches. When we collapse anything within 5bp of each other, without `--chain`, we output:
97119

98-
chr1 5 ..
99-
chr1 9 ..
120+
chr1 1 ..
121+
chr1 7 ..
122+
123+
With `--chain`, we would allow one level of transitive matching. This means that after finding the `chr1 1 -> chr1 4` match, we check `chr1 4` against all the remaining variants and would output
100124

101-
With `--chain`, we would collapse `chr1 9` as well, producing
125+
chr1 1 ..
126+
chr1 10 ..
102127

103-
chr1 5 ..
128+
Note that this leaves `chr1 10` because we don't do multiple levels of transitive matching, meaning we never compare `chr1 7` to `chr1 10`. This is preferred because otherwise variants which have a continuous range of similarity could all be collapsed into a single variant. For example, if the positions in this example were sizes, we wouldn't want the 1bp variant to be the kept representation for all the variants.
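The example above can be reproduced with a toy sketch that treats each variant as a bare position and uses distance as the only matching criterion. This is not Truvari's implementation: `collapse_positions` is a hypothetical helper, and real collapsing also applies the size, sequence, and other thresholds.

```python
def collapse_positions(positions, dist, chain=False):
    """Greedy toy collapse; returns the positions that are kept.

    Each kept variant absorbs every remaining variant within `dist`.
    With chain=True, variants within `dist` of those direct matches
    are absorbed as well, but only for one transitive level.
    """
    remaining = sorted(positions)
    kept = []
    while remaining:
        keeper = remaining.pop(0)
        kept.append(keeper)
        # Direct matches to the kept variant
        direct = [p for p in remaining if abs(p - keeper) <= dist]
        remaining = [p for p in remaining if p not in direct]
        if chain:
            # One level of transitive matching through the direct matches
            transitive = [p for p in remaining
                          if any(abs(p - d) <= dist for d in direct)]
            remaining = [p for p in remaining if p not in transitive]
    return kept


without_chain = collapse_positions([1, 4, 7, 10], dist=5)
with_chain = collapse_positions([1, 4, 7, 10], dist=5, chain=True)
```

Without chaining, positions 1 and 7 are kept. With chaining, 7 is absorbed through 4, but 10 is never compared against 7 and so remains as its own variant.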
104129

105130
Annotations
106131
===========
@@ -111,55 +136,4 @@ The output file has only two annotations added to the `INFO`.
111136
- `NumCollapsed` - Number of variants collapsed into this variant
112137
- `NumConsolidated` - Number of samples' genotypes consolidated into this call's genotypes
113138

114-
The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
115-
116-
```
117-
usage: collapse [-h] -i INPUT [-o OUTPUT] [-c COLLAPSED_OUTPUT] [-f REFERENCE] [-k {first,maxqual,common}] [--debug]
118-
[-r REFDIST] [-p PCTSIM] [-B MINHAPLEN] [-P PCTSIZE] [-O PCTOVL] [-t] [--use-lev] [--hap] [--chain]
119-
[--no-consolidate] [--null-consolidate NULL_CONSOLIDATE] [-s SIZEMIN] [-S SIZEMAX] [--passonly]
120-
121-
Structural variant collapser
122-
123-
Will collapse all variants within sizemin/max that match over thresholds
124-
125-
options:
126-
-h, --help show this help message and exit
127-
-i INPUT, --input INPUT
128-
Comparison set of calls
129-
-o OUTPUT, --output OUTPUT
130-
Output vcf (stdout)
131-
-c COLLAPSED_OUTPUT, --collapsed-output COLLAPSED_OUTPUT
132-
Where collapsed variants are written (collapsed.vcf)
133-
-f REFERENCE, --reference REFERENCE
134-
Indexed fasta used to call variants
135-
-k {first,maxqual,common}, --keep {first,maxqual,common}
136-
When collapsing calls, which one to keep (first)
137-
--debug Verbose logging
138-
--hap Collapsing a single individual's haplotype resolved calls (False)
139-
--chain Chain comparisons to extend possible collapsing (False)
140-
--no-consolidate Skip consolidation of sample genotype fields (True)
141-
--null-consolidate NULL_CONSOLIDATE
142-
Comma separated list of FORMAT fields to consolidate into the kept entry by taking the first non-null
143-
from all neighbors (None)
144-
145-
Comparison Threshold Arguments:
146-
-r REFDIST, --refdist REFDIST
147-
Max reference location distance (500)
148-
-p PCTSIM, --pctsim PCTSIM
149-
Min percent allele sequence similarity. Set to 0 to ignore. (0.95)
150-
-B MINHAPLEN, --minhaplen MINHAPLEN
151-
Minimum haplotype sequence length to create (50)
152-
-P PCTSIZE, --pctsize PCTSIZE
153-
Min pct allele size similarity (minvarsize/maxvarsize) (0.95)
154-
-O PCTOVL, --pctovl PCTOVL
155-
Min pct reciprocal overlap (0.0) for DEL events
156-
-t, --typeignore Variant types don't need to match to compare (False)
157-
--use-lev Use the Levenshtein distance ratio instead of edlib editDistance ratio (False)
158-
159-
Filtering Arguments:
160-
-s SIZEMIN, --sizemin SIZEMIN
161-
Minimum variant size to consider for comparison (50)
162-
-S SIZEMAX, --sizemax SIZEMAX
163-
Maximum variant size to consider for comparison (50000)
164-
--passonly Only consider calls with FILTER == PASS
165-
```
139+
The collapsed file has all of the annotations added by [[bench|bench#definition-of-annotations-added-to-tp-vcfs]]. Note that `MatchId` is tied to the output file's `CollapseId`. See [MatchIds](https://github.com/spiralgenetics/truvari/wiki/MatchIds) for details.
