Skip to content

Commit f90f194

Browse files
committed
Merge branch 'release/v5.0.0'
2 parents e6a49e7 + 7380b25 commit f90f194

File tree

748 files changed

+42105
-32998
lines changed

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

748 files changed

+42105
-32998
lines changed

README.md

+3-2
Original file line numberDiff line numberDiff line change
@@ -2,13 +2,14 @@
22
[![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
33
[![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
44
[![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
5-
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.0)](https://github.com/ACEnglish/truvari/compare/v4.3.0...develop)
5+
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.3.1)](https://github.com/ACEnglish/truvari/compare/v4.3.1...develop)
66
[![Downloads](https://static.pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)
77

88
![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)
99
Toolkit for benchmarking, merging, and annotating Structural Variants
1010

11-
📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed documentation.
11+
📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed user documentation.
12+
🛠️ [Developer Docs](https://truvari.readthedocs.io/en/latest/) for the truvari API.
1213
📈 See [Updates](https://github.com/acenglish/truvari/wiki/Updates) on new versions.
1314
📝 Read our Papers ([#1](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02840-6), [#2](https://rdcu.be/dFQNN)) to learn more.
1415

File renamed without changes.
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,14 @@
1-
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.
1+
Some regions of the reference genome can give rise to a huge number of SVs. These regions are typically in telomeres or centromeres, especially on chm13. Furthermore, some multi-sample VCFs might contain samples with a high genetic distance from the reference which can also create an unreasonable number of SVs. These 'high-density' regions can cause difficulties for `truvari collapse` when there are too many comparisons that need to be made.
22

33
If you find `truvari collapse` 'freezing' during processing where it is no longer writing variants to the output VCFs, you should investigate if there are regions which should be excluded from analysis. To do this, first run:
44

55
```
66
truvari anno chunks input.vcf.gz > counts.bed
77
```
88

9-
The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 3rd column has the number of SVs inside the chunk while the 4th and 5th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 4th is after separating by svtype and svlen, the 5th is re-chunking the svtype/svlen separation by distance.
9+
The `counts.bed` will have the chromosome, start, and end position of each chunk in the VCF as well as three additional columns. The 4th column has the number of SVs inside the chunk while the 5th and 6th have a comma-separated counts of the number of variants in 'sub-chunks'. These sub-chunk counts correspond to advanced separation of the variants beyond just their window. The 5th is after separating by svtype and svlen, the 6th is re-chunking the svtype/svlen separation by distance.
1010

11-
If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset the regions of interest and then use `bedtools` to create a bed file that will skip them.
11+
If you find spans in the `counts.bed` with a huge number of SVs, these are prime candidates for exclusion to speed up `truvari collapse`. To exclude them, you first subset to the regions of interest and then use `bedtools` to create a bed file that will skip them.
1212

1313
```
1414
# exclude regions with more than 30k SVs. You should test this threshold.
@@ -18,6 +18,6 @@ bedtools complement -g genome.bed -i to_exclude.bed > to_analyize.bed
1818
truvari collapse --bed to_analyze.bed -i input.vcf.gz
1919
```
2020

21-
When considering what threshold you would like to use, just looking at the 3rd column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if there are 100k SVs in a window, but they all could collapse, it would only take O(N - 1) comparisons to perform the collapse. The problems arise when the 100k SVs have few redundant SVs and therefore requires O(N**2) comparisons.
21+
When considering what threshold you would like to use, just looking at the 4th column may not be sufficient as the 'sub-chunks' may be smaller and will therefore run faster. There also is no guarantee that a region with high SV density will be slow. For example, if all SVs in a chunk could collapse, it would only take O(N - 1) comparisons to complete the collapse. The problems arise when the SVs have few redundant SVs and therefore requires O(N**2) comparisons.
2222

23-
A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded an a collapsing could be repeated until success.
23+
A conservative workflow for figuring out which regions to exclude would run `truvari collapse` without a `--bed`, wait for regions to 'freeze', and then check if the last variant written out by `truvari collapse` is just before a high-density chunk in the `counts.bed`. That region could then be excluded and collapsing repeated until success.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.

docs/v4.3.1/Updates.md renamed to docs/Updates.md

+15
Original file line numberDiff line numberDiff line change
@@ -1,3 +1,18 @@
1+
# Truvari 5.0
2+
*in progress*
3+
4+
* Reference context sequence comparison is now deprecated and sequence similarity calculation improved by also checking lexicographically minimum rotation's similarity. [details](https://github.com/ACEnglish/truvari/wiki/bench#comparing-sequences-of-variants)
5+
* Symbolic variants (`<DEL>`, `<INV>`, `<DUP>`) can now be resolved for sequence comparison when a `--reference` is provided. The function for resolving the sequences is largely similar to [this discussion](https://github.com/ACEnglish/truvari/discussions/216)
6+
* Symbolic variants can now match to resolved variants, even with `--pctseq 0`, with or without the new sequence resolving procedure.
7+
* Symbolic variant sub-types are ignored e.g. `<DUP:TANDEM> == <DUP>`
8+
* `--sizemax` now default to `-1`, meaning all variant ≥ `--sizemin / --sizefilt` are compared
9+
* Redundant variants which are collapsed into kept (a.k.a. removed) variants now more clearly labeled (`--removed-output` instead of `--collapsed-output`)
10+
* Fixed 'Unknown error' caused by unset TMPDIR ([#229](https://github.com/ACEnglish/truvari/issues/229) and [#245](https://github.com/ACEnglish/truvari/issues/245))
11+
* Fixes to minor record keeping bugs in refine/ga4gh better ensure all variants are counted/preserved
12+
* BND variants are now compared by bench ([details](https://github.com/ACEnglish/truvari/wiki/bench#bnd-comparison))
13+
* Cleaner outputs by not writing matching annotations (e.g. `PctSeqSimilarity`) that are `None`
14+
* Major refactor of Truvari package API for easy reuse of SV comparison functions ([details](https://truvari.readthedocs.io/en/latest/))
15+
116
# Truvari 4.3.1
217
*September 9, 2024*
318
* `bench`

docs/v4.2.2/anno.md renamed to docs/anno.md

+7-1
Original file line numberDiff line numberDiff line change
@@ -74,7 +74,7 @@ Adds a tandem repeat annotation to sequence resolved Insertion/Deletion variants
7474
### Details
7575
Given a set of tandem repeat regions and a VCF, this annotate the tandem repeat motif and copy number change of insertions and deletions as expansions/contraction. The expected input catalog of tandem repeats is from a subset of columns in the Adotto TR catalog ([link](https://github.com/ACEnglish/adotto/blob/main/regions/DataDescription.md)). This file can be formatted for `truvari anno trf` via:
7676
```bash
77-
zcat adotto_TRregions_v1.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
77+
zcat adotto_TRregions_v1.2.1.bed.gz | cut -f1-3,18 | bgzip > anno.trf.bed.gz
7878
tabix anno.trf.bed.gz
7979
```
8080
For deletions, the tool simply finds the motif annotation with the highest overlap over the variant's boundaries. It then removes that sequence from the reference and calculates how many copies of the motif are removed with the formula `round(-(ovl_pct * svlen) / anno["period"], 1)`. If a deletion overlaps multiple motifs, the highest scoring motif is chosen based on higher reciprocal overlap percent first and TRF score second (see [code](https://github.com/ACEnglish/truvari/blob/2219f52850252c18dcd8c679da6644bb1cee5b68/truvari/annotations/trf.py#L29)].
@@ -116,6 +116,12 @@ options:
116116
--debug Verbose logging
117117
```
118118

119+
### Notes about the Adotto catalog
120+
If using an older version of Truvari (≤4.3.1) or adotto catalog v1.2, you may have incompatibility issues due to the 'repeat' field being named 'motif'. This can be fixed by running:
121+
```
122+
zcat adotto_TRregions_v1.2.bed.gz | cut -f1-3,18 | sed 's/"\[/[/g; s/"$//; s/""/"/g; s/"motif":/"repeat":/g' | bgzip > anno.trf.bed.gz
123+
```
124+
119125
# truvari anno grm
120126

121127
For every SV, we create a kmer over the the upstream and downstream reference and alternate breakpoints.

docs/api/conf.py

+1-1
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,7 @@
1717
# -- Project information -----------------------------------------------------
1818

1919
project = 'Truvari'
20-
copyright = '2023, Adam English'
20+
copyright = '2025, Adam English'
2121
author = 'Adam English'
2222

2323

docs/api/index.rst

+2-1
Original file line numberDiff line numberDiff line change
@@ -17,7 +17,8 @@ https://github.com/ACEnglish/truvari/wiki
1717
:maxdepth: 2
1818
:caption: Contents:
1919

20-
truvari
20+
truvari.examples
21+
truvari.package
2122

2223

2324
Indices and tables

docs/api/truvari.examples.rst

+112
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,112 @@
1+
Truvari API QuickStart
2+
======================
3+
4+
Truvari provides functionality to facilitate the comparison of structural variants (SVs) in VCF variant records. Developers can easily leverage this functionality by replacing calls to `pysam.VariantFile` with `truvari.VariantFile`. The `truvari.VariantFile` retains all `pysam` functionality.
5+
6+
.. code-block:: python
7+
8+
import truvari
9+
vcf = truvari.VariantFile("input.vcf.gz")
10+
for entry in vcf:
11+
# Access variant's INFO fields using pysam
12+
if 'SVTYPE' in entry.info and 'SVLEN' in entry.info:
13+
print(entry.info['SVTYPE'], entry.info['SVLEN'])
14+
15+
# But these INFOs aren't always available.
16+
# Access type/size properties of variants using truvari
17+
print(entry.var_type(), entry.var_size())
18+
19+
# Access genotype (GT)
20+
# using pysam ->
21+
if 'SAMPLE' in entry.samples and 'GT' in entry.samples['SAMPLE']:
22+
print(entry.samples['SAMPLE']['GT'])
23+
24+
# using truvari ->
25+
print(entry.gt('SAMPLE'))
26+
27+
# Calculate variant's allele frequency with truvari
28+
print(entry.allele_freq_annos())
29+
30+
Details of all available functions are in :ref:`package documentation <variant_handling>`.
31+
32+
Comparing Variants
33+
------------------
34+
35+
The `truvari.VariantRecord` simplifies comparing two VCF entries.
36+
37+
.. code-block:: python
38+
39+
# Given two `truvari.VariantRecords`, entry1 and entry2
40+
match = entry1.match(entry2)
41+
print("Entries' Sequence Similarity:", match.seqsim)
42+
print("Entries' Size Similarity:", match.sizesim)
43+
print("Is the match above thresholds:", match.state)
44+
45+
This returns a :ref:`truvari.MatchResult <match_result>`. You can customize matching thresholds by providing :ref:`truvari.VariantParams <variant_params>` to the `truvari.VariantFile`.
46+
47+
.. code-block:: python
48+
49+
# Disable sequence and size similarity; enable reciprocal overlap
50+
p = truvari.VariantParams(pctseq=0, pctsize=0, pctovl=0.5)
51+
vcf = truvari.VariantFile("input.vcf.gz", params=p)
52+
entry1 = next(vcf)
53+
entry2 = next(vcf)
54+
match = entry1.match(entry2)
55+
56+
Filtering Variants
57+
------------------
58+
59+
The `truvari.VariantParams` provides parameters for filtering variants, such as minimum or maximum SV sizes.
60+
61+
.. code-block:: python
62+
63+
p = truvari.VariantParams(sizemin=200, sizemax=500)
64+
vcf = truvari.VariantFile("input.vcf.gz", params=p)
65+
# Retrieve all variant records within sizemin and sizemax
66+
results = [entry for entry in vcf if not entry.filter_size()]
67+
68+
Additional filters, such as excluding monomorphic reference sites or single-end BNDs, can be applied using `entry.filter_call()`.
69+
70+
Subsetting to Regions
71+
---------------------
72+
73+
To subset a VCF to regions specified in a BED file, use:
74+
75+
.. code-block:: python
76+
77+
for entry in vcf.fetch_bed("regions.bed"):
78+
print("Entry's variant type:", entry.var_type())
79+
print("Entry's variant size:", entry.var_size())
80+
81+
If your regions of interest are stored in an in-memory object instead of a BED file, use the `.fetch_regions` method:
82+
83+
.. code-block:: python
84+
85+
from collections import defaultdict
86+
from pyintervaltree import IntervalTree
87+
tree = defaultdict(IntervalTree)
88+
tree['chr1'].addi(10, 100)
89+
tree['chr2'].addi(2000, 2200)
90+
count = 0
91+
for entry in vcf.fetch_regions(tree):
92+
count += 1
93+
print(f"Total of {count} variants")
94+
95+
To iterate over variants that are not within the regions, use `vcf.fetch_regions(tree, inside=False)`. Both of these
96+
fetch methods use heuristics to choose the more efficient fetching strategy of either seeking through the VCF file or
97+
streaming the entire file.
98+
99+
Parsing BND Information
100+
-----------------------
101+
102+
Truvari also simplifies parsing BND information from VCF entries:
103+
104+
.. code-block:: python
105+
106+
# Example entry:
107+
# chr1 23272628 SV_1 G G]chr5:52747359] . PASS SVTYPE=BND;EVENTTYPE=TRA:UNBALANCED;SUBCLONAL=n;COMPLEX=n;MATEID=SV_171 GT:PSL:PSO 0/1:.:.
108+
print(entry.bnd_position())
109+
# ('chr5', 52747359)
110+
print(entry.bnd_direction_strand())
111+
# ('right', 'direct')
112+

docs/api/truvari.package.rst

+164
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,164 @@
1+
Truvari package
2+
===============
3+
4+
Overview
5+
--------
6+
.. automodule:: truvari
7+
:members:
8+
:undoc-members:
9+
:show-inheritance:
10+
11+
.. _variant_handling:
12+
13+
Variant Handling
14+
----------------
15+
16+
.. autoclass:: VariantFile
17+
:members:
18+
19+
.. autoclass:: VariantRecord
20+
:members:
21+
22+
.. _variant_params:
23+
24+
.. autoclass:: VariantParams
25+
:members:
26+
27+
Objects
28+
-------
29+
30+
.. _match_result:
31+
32+
.. autoclass:: MatchResult
33+
:members:
34+
35+
.. autoclass:: GT
36+
:members:
37+
38+
.. autoclass:: SV
39+
:members:
40+
41+
.. autoclass:: Bench
42+
:members:
43+
44+
.. autoclass:: BenchOutput
45+
:members:
46+
47+
.. autoclass:: StatsBox
48+
:members:
49+
50+
.. autoclass:: LogFileStderr
51+
:members:
52+
53+
Extra Methods
54+
-------------
55+
.. autofunction:: bed_ranges
56+
57+
.. autofunction:: benchdir_count_entries
58+
59+
.. autofunction:: best_seqsim
60+
61+
.. autofunction:: build_region_tree
62+
63+
.. autofunction:: read_bed_tree
64+
65+
.. autofunction:: check_vcf_index
66+
67+
.. autofunction:: chunker
68+
69+
.. autofunction:: cmd_exe
70+
71+
.. autofunction:: compress_index_vcf
72+
73+
.. autofunction:: coords_within
74+
75+
.. autofunction:: count_entries
76+
77+
.. autofunction:: extend_region_tree
78+
79+
.. autofunction:: file_zipper
80+
81+
.. autofunction:: help_unknown_cmd
82+
83+
.. autofunction:: get_gt
84+
85+
.. autofunction:: get_scalebin
86+
87+
.. autofunction:: get_sizebin
88+
89+
.. autofunction:: get_svtype
90+
91+
.. autofunction:: make_temp_filename
92+
93+
.. autofunction:: merge_region_tree_overlaps
94+
95+
.. autofunction:: msa2vcf
96+
97+
.. autofunction:: opt_gz_open
98+
99+
.. autofunction:: optimize_df_memory
100+
101+
.. autofunction:: overlap_percent
102+
103+
.. autofunction:: overlaps
104+
105+
.. autofunction:: performance_metrics
106+
107+
.. autofunction:: phab
108+
109+
.. autofunction:: reciprocal_overlap
110+
111+
.. autofunction:: restricted_float
112+
113+
.. autofunction:: restricted_int
114+
115+
.. autofunction:: ref_ranges
116+
117+
.. autofunction:: roll_seqsim
118+
119+
.. autofunction:: seqsim
120+
121+
.. autofunction:: setup_logging
122+
123+
.. autofunction:: sizesim
124+
125+
.. autofunction:: unroll_seqsim
126+
127+
.. autofunction:: vcf_ranges
128+
129+
.. autofunction:: vcf_to_df
130+
131+
Data
132+
----
133+
HEADERMAT
134+
^^^^^^^^^
135+
regular expression of vcf header INFO/FORMAT fields with groups
136+
137+
.. autodata:: HEADERMAT
138+
139+
140+
QUALBINS
141+
^^^^^^^^
142+
0-100 quality score bin strings (step size 10)
143+
144+
.. autodata:: QUALBINS
145+
146+
SVTYTYPE
147+
^^^^^^^^
148+
:class:`pandas.CategoricalDtype` of :class:`truvari.SV`
149+
150+
SZBINMAX
151+
^^^^^^^^
152+
integer list of maximum size for size bins
153+
154+
.. autodata:: SZBINMAX
155+
156+
SZBINS
157+
^^^^^^
158+
string list of size bins
159+
160+
.. autodata:: SZBINS
161+
162+
SZBINTYPE
163+
^^^^^^^^^
164+
:class:`pandas.CategoricalDtype` of :data:`truvari.SZBINS`

0 commit comments

Comments
 (0)