Commit 42c5051

Merge branch 'release/v4.1.0'
2 parents 49fa01a + e55e788 commit 42c5051

File tree

2,369 files changed

+28438
-61855
lines changed

.github/workflows/func_tests.yml

+3-6
Original file line numberDiff line numberDiff line change
@@ -22,12 +22,9 @@ jobs:
2222
key: ${{ env.pythonLocation }}-${{ hashFiles('setup.py') }}
2323
- name: Install dependencies
2424
run: |
25-
python -m pip install --upgrade pip
26-
pip install pylint
27-
pip install anybadge
28-
pip install coverage
29-
sudo apt-get install bcftools samtools tabix
30-
python -m pip install .
25+
python3 -m pip install --upgrade pip setuptools
26+
python3 -m pip install Cython pylint anybadge coverage
27+
python3 -m pip install .
3128
- name: Running ssshtest
3229
run: |
3330
bash repo_utils/truvari_ssshtests.sh

.pylintrc

+1-1
@@ -324,4 +324,4 @@ exclude-protected=_asdict,_fields,_replace,_source,_make
324324

325325
# Exceptions that will emit a warning when being caught. Defaults to
326326
# "Exception"
327-
overgeneral-exceptions=Exception
327+
overgeneral-exceptions=builtins.Exception

Dockerfile

-4
@@ -2,13 +2,9 @@ FROM ubuntu:22.04
22

33
RUN apt-get -qq update \
44
&& DEBIAN_FRONTEND=noninteractive apt-get install -yq \
5-
bcftools \
65
curl \
76
python3-dev \
87
python3-pip \
9-
samtools \
10-
tabix \
11-
vcftools \
128
wget \
139
&& \
1410
rm -rf /var/lib/apt/lists/*

README.md

+2-2
@@ -2,11 +2,11 @@
22
[![pylint](imgs/pylint.svg)](https://github.com/acenglish/truvari/actions/workflows/pylint.yml)
33
[![FuncTests](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml/badge.svg?branch=develop&event=push)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
44
[![coverage](imgs/coverage.svg)](https://github.com/acenglish/truvari/actions/workflows/func_tests.yml)
5-
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v3.5.0)](https://github.com/ACEnglish/truvari/compare/v3.5.0...develop)
5+
[![develop](https://img.shields.io/github/commits-since/acenglish/truvari/v4.0.0)](https://github.com/ACEnglish/truvari/compare/v4.0.0...develop)
66
[![Downloads](https://pepy.tech/badge/truvari)](https://pepy.tech/project/truvari)
77

88
![Logo](https://raw.githubusercontent.com/ACEnglish/truvari/develop/imgs/BoxScale1_DarkBG.png)
9-
Toolkit for benchmarking, merging, and annotating Structrual Variants
9+
Toolkit for benchmarking, merging, and annotating Structural Variants
1010

1111
📚 [WIKI page](https://github.com/acenglish/truvari/wiki) has detailed documentation.
1212
📈 See [Updates](https://github.com/acenglish/truvari/wiki/Updates) on new versions.

docs/api/truvari.rst

+5-16
@@ -128,10 +128,6 @@ phab
128128
^^^^
129129
.. autofunction:: phab
130130

131-
phab_multi
132-
^^^^^^^^^^
133-
.. autofunction:: phab_multi
134-
135131
reciprocal_overlap
136132
^^^^^^^^^^^^^^^^^^
137133
.. autofunction:: reciprocal_overlap
@@ -170,18 +166,10 @@ cmd_exe
170166
^^^^^^^
171167
.. autofunction:: cmd_exe
172168

173-
consolidate_phab_vcfs
174-
^^^^^^^^^^^^^^^^^^^^^
175-
.. autofunction:: consolidate_phab_vcfs
176-
177169
count_entries
178170
^^^^^^^^^^^^^
179171
.. autofunction:: count_entries
180172

181-
fchain
182-
^^^^^^
183-
.. autofunction:: fchain
184-
185173
file_zipper
186174
^^^^^^^^^^^
187175
.. autofunction:: file_zipper
@@ -218,10 +206,6 @@ setup_logging
218206
^^^^^^^^^^^^^
219207
.. autofunction:: setup_logging
220208

221-
setup_progressbar
222-
^^^^^^^^^^^^^^^^^
223-
.. autofunction:: setup_progressbar
224-
225209
vcf_to_df
226210
^^^^^^^^^
227211
.. autofunction:: vcf_to_df
@@ -239,6 +223,11 @@ BenchOutput
239223
.. autoclass:: BenchOutput
240224
:members:
241225

226+
StatsBox
227+
^^^^^^^^
228+
.. autoclass:: StatsBox
229+
:members:
230+
242231
GT
243232
^^
244233
.. autoclass:: GT

docs/requirements.txt

+1
@@ -1,3 +1,4 @@
1+
pywfa>=0.5.1
12
sphinx==4.2.0
23
sphinx_rtd_theme==1.0.0
34
readthedocs-sphinx-search==0.1.1

docs/v4.1.0/Citations.md

+30
@@ -0,0 +1,30 @@
1+
# Citing Truvari
2+
3+
English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6
4+
5+
# Citations
6+
7+
List of publications using Truvari. Most of these are just pulled from a [Google Scholar Search](https://scholar.google.com/scholar?q=truvari). Please post in the [show-and-tell](https://github.com/spiralgenetics/truvari/discussions/categories/show-and-tell) to have your publication added to the list.
8+
* [A robust benchmark for detection of germline large deletions and insertions](https://www.nature.com/articles/s41587-020-0538-8)
9+
* [Leveraging a WGS compression and indexing format with dynamic graph references to call structural variants](https://www.biorxiv.org/content/10.1101/2020.04.24.060202v1.abstract)
10+
* [Duphold: scalable, depth-based annotation and curation of high-confidence structural variant calls](https://academic.oup.com/gigascience/article/8/4/giz040/5477467?login=true)
11+
* [Parliament2: Accurate structural variant calling at scale](https://academic.oup.com/gigascience/article/9/12/giaa145/6042728)
12+
* [Learning What a Good Structural Variant Looks Like](https://www.biorxiv.org/content/10.1101/2020.05.22.111260v1.full)
13+
* [Long-read trio sequencing of individuals with unsolved intellectual disability](https://www.nature.com/articles/s41431-020-00770-0)
14+
* [lra: A long read aligner for sequences and contigs](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1009078)
15+
* [Samplot: a platform for structural variant visual validation and automated filtering](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02380-5)
16+
* [AsmMix: A pipeline for high quality diploid de novo assembly](https://www.biorxiv.org/content/10.1101/2021.01.15.426893v1.abstract)
17+
* [Accurate chromosome-scale haplotype-resolved assembly of human genomes](https://www.nature.com/articles/s41587-020-0711-0)
18+
* [Accurate circular consensus long-read sequencing improves variant detection and assembly of a human genome](https://www.nature.com/articles/s41587-019-0217-9)
19+
* [NPSV: A simulation-driven approach to genotyping structural variants in whole-genome sequencing data](https://academic.oup.com/bioinformatics/article-abstract/37/11/1497/5466452)
20+
* [SVIM-asm: structural variant detection from haploid and diploid genome assemblies](https://academic.oup.com/bioinformatics/article/36/22-23/5519/6042701?login=true)
21+
* [Readfish enables targeted nanopore sequencing of gigabase-sized genomes](https://www.nature.com/articles/s41587-020-00746-x)
22+
* [stLFRsv: A Germline Structural Variant Analysis Pipeline Using Co-barcoded Reads](https://internal-journal.frontiersin.org/articles/10.3389/fgene.2021.636239/full)
23+
* [Long-read-based human genomic structural variation detection with cuteSV](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02107-y)
24+
* [An international virtual hackathon to build tools for the analysis of structural variants within species ranging from coronaviruses to vertebrates](https://f1000research.com/articles/10-246)
25+
* [Paragraph: a graph-based structural variant genotyper for short-read sequence data](https://link.springer.com/article/10.1186/s13059-019-1909-7)
26+
* [Genome-wide investigation identifies a rare copy-number variant burden associated with human spina bifida](https://www.nature.com/articles/s41436-021-01126-9)
27+
* [TT-Mars: Structural Variants Assessment Based on Haplotype-resolved Assemblies](https://www.biorxiv.org/content/10.1101/2021.09.27.462044v1.abstract)
28+
* [An ensemble deep learning framework to refine large deletions in linked-reads](https://www.biorxiv.org/content/10.1101/2021.09.27.462057v1.abstract)
29+
* [MAMnet: detecting and genotyping deletions and insertions based on long reads and a deep learning approach](https://academic.oup.com/bib/advance-article-abstract/doi/10.1093/bib/bbac195/6587170)
30+
* [Automated filtering of genome-wide large deletions through an ensemble deep learning framework](https://www.sciencedirect.com/science/article/pii/S1046202322001712#b0110)
+67
@@ -0,0 +1,67 @@
1+
A frequent application of comparing SVs is to perform a 'bakeoff' of performance
2+
between two SV programs against a single set of base calls.
3+
4+
Beyond looking at the Truvari results/report, you may like to investigate what calls
5+
are different between the programs.
6+
7+
Below is a set of scripts that may help you generate those results. For our examples,
8+
we'll be comparing arbitrary programs Asvs and Bsvs against base calls Gsvs.
9+
10+
*_Note_* - This assumes that each record in Gsvs has a unique ID in the vcf.
11+
12+
Generate the Truvari report for Asvs and Bsvs
13+
=============================================
14+
15+
```bash
16+
truvari bench -b Gsvs.vcf.gz -c Asvs.vcf.gz -o cmp_A/ ...
17+
truvari bench -b Gsvs.vcf.gz -c Bsvs.vcf.gz -o cmp_B/ ...
18+
```
19+
Consistency
20+
===========
21+
The simplest way to compare the programs is to get the intersection of TPbase calls from the two reports.
22+
```bash
23+
truvari consistency cmp_A/tp-base.vcf cmp_B/tp-base.vcf
24+
```
25+
See [[consistency wiki|consistency]] for details on the report created.
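
As a rough illustration of what "the intersection of TPbase calls" means, the sketch below collects the ID column from two sets of VCF lines and reports which IDs are shared and which are specific to one program. It assumes each record in Gsvs has a unique ID; the function names and the tiny records are hypothetical, and this is not a description of how `truvari consistency` is implemented.

```python
def vcf_ids(lines):
    """Collect the ID column (third field) of every non-header VCF line."""
    return {
        line.split("\t")[2]
        for line in lines
        if line.strip() and not line.startswith("#")
    }


def compare_tp_base(lines_a, lines_b):
    """Return (shared, only_a, only_b) ID sets for two tp-base call sets."""
    ids_a, ids_b = vcf_ids(lines_a), vcf_ids(lines_b)
    return ids_a & ids_b, ids_a - ids_b, ids_b - ids_a


# Made-up records standing in for cmp_A/tp-base.vcf and cmp_B/tp-base.vcf
tp_a = [
    "##fileformat=VCFv4.2",
    "chr1\t100\tsv1\tA\t<DEL>\t.\tPASS\t.",
    "chr1\t500\tsv2\tA\t<DEL>\t.\tPASS\t.",
]
tp_b = [
    "##fileformat=VCFv4.2",
    "chr1\t500\tsv2\tA\t<DEL>\t.\tPASS\t.",
    "chr2\t900\tsv3\tA\t<DEL>\t.\tPASS\t.",
]

shared, only_a, only_b = compare_tp_base(tp_a, tp_b)
print(sorted(shared), sorted(only_a), sorted(only_b))
```

With real data, the two lists would be replaced by the lines read from each report's `tp-base.vcf`.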
26+
27+
Below are older notes to manually create a similar report to what one can make using `truvari consistency`
28+
29+
Combine the TPs within each report
30+
==================================
31+
32+
```bash
33+
cd cmp_A/
34+
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt
35+
cd ../cmp_B/
36+
paste <(grep -v "#" tp-base.vcf) <(grep -v "#" tp-comp.vcf) > combined_tps.txt
37+
```
38+
39+
Grab the FNs missed by only one program
40+
=======================================
41+
42+
```bash
43+
(grep -v "#" cmp_A/fn.vcf && grep -v "#" cmp_B/fn.vcf) | cut -f3 | sort | uniq -c | grep "^ *1 " | cut -f2- -d1 > missed_names.txt
44+
```
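
For readers who prefer Python to the shell pipeline, here is a minimal sketch of the same idea: count how often each base-call ID appears across the two `fn.vcf` files and keep the IDs seen exactly once. It relies on the unique-ID assumption stated in the note above; the function name and the sample records are hypothetical.

```python
from collections import Counter


def ids_missed_by_one(fn_a_lines, fn_b_lines):
    """Return the sorted IDs that appear in exactly one of the two FN sets."""
    counts = Counter()
    for lines in (fn_a_lines, fn_b_lines):
        for line in lines:
            # Skip header lines and blank lines
            if not line.strip() or line.startswith("#"):
                continue
            counts[line.split("\t")[2]] += 1
    return sorted(sv_id for sv_id, n in counts.items() if n == 1)


# Made-up records standing in for cmp_A/fn.vcf and cmp_B/fn.vcf
fn_a = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\tsv1\tA\t<DEL>",
    "chr1\t700\tsv4\tA\t<DEL>",
]
fn_b = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t700\tsv4\tA\t<DEL>",
    "chr3\t250\tsv9\tA\t<DEL>",
]

missed_names = ids_missed_by_one(fn_a, fn_b)
print(missed_names)
```

An ID counted twice was missed by both programs, so it is dropped, mirroring the `uniq -c | grep "^ *1 "` step.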
45+
46+
Pull the TP sets' difference
47+
============================
48+
49+
```bash
50+
cat missed_names.txt | xargs -I {} grep -w {} cmp_A/combined_tps.txt > missed_by_B.txt
51+
cat missed_names.txt | xargs -I {} grep -w {} cmp_B/combined_tps.txt > missed_by_A.txt
52+
```
53+
54+
To look at the base-calls that Bsvs found, but Asvs didn't, run `cut -f1-12 missed_by_A.txt`.
55+
56+
To look at the Asvs that Bsvs didn't find, run `cut -f13- missed_by_B.txt`.
57+
58+
Shared FPs between the programs
59+
===============================
60+
61+
All of the work above has been about how to analyze the TruePositives. If you'd like to see which calls are shared between Asvs and Bsvs that aren't in Gsvs, simply run Truvari again.
62+
63+
```bash
64+
bgzip cmp_A/fp.vcf && tabix -p vcf cmp_A/fp.vcf.gz
65+
bgzip cmp_B/fp.vcf && tabix -p vcf cmp_B/fp.vcf.gz
66+
truvari bench -b cmp_A/fp.vcf.gz -c cmp_B/fp.vcf.gz -o shared_fps ...
67+
```

docs/v4.1.0/Development.md

+90
@@ -0,0 +1,90 @@
1+
# Truvari API
2+
Many of the helper methods/objects are documented such that developers can reuse truvari in their own code. To see developer documentation, visit [readthedocs](https://truvari.readthedocs.io/en/latest/).
3+
4+
Documentation can also be seen using
5+
```python
6+
import truvari
7+
help(truvari)
8+
```
9+
10+
# docker
11+
12+
A Dockerfile exists to build an image of Truvari. To make a Docker image, clone the repository and run
13+
```bash
14+
docker build -t truvari .
15+
```
16+
17+
You can then run Truvari through docker using
18+
```bash
19+
docker run -v `pwd`:/data -it truvari
20+
```
21+
Here, `pwd` can be replaced with whatever directory you'd like to mount in the container at the path `/data/`, which is the working directory for the Truvari run. You can provide parameters directly to the entry point.
22+
```bash
23+
docker run -v `pwd`:/data -it truvari anno svinfo -i example.vcf.gz
24+
```
25+
26+
If you'd like to interact within the docker container for things like running the CI/CD scripts, override the entry point:
27+
```bash
28+
docker run -v `pwd`:/data --entrypoint /bin/bash -it truvari
29+
```
30+
You'll now be inside the container and can run FuncTests or run Truvari directly
31+
```bash
32+
bash repo_utils/truvari_ssshtests.sh
33+
truvari anno svinfo -i example.vcf.gz
34+
```
35+
36+
# CI/CD
37+
38+
Scripts that help ensure the tool's quality. Extra dependencies need to be installed in order to run Truvari's CI/CD scripts.
39+
40+
```bash
41+
pip install pylint anybadge coverage
42+
```
43+
44+
Check code formatting with
45+
```bash
46+
python repo_utils/pylint_maker.py
47+
```
48+
We use [autopep8](https://pypi.org/project/autopep8/) (via [vim-autopep8](https://github.com/tell-k/vim-autopep8)) for formatting.
49+
50+
Test the code and generate a coverage report with
51+
```bash
52+
bash repo_utils/truvari_ssshtests.sh
53+
```
54+
55+
Truvari leverages GitHub Actions to perform these checks when new code is pushed to the repository. We've noticed that the actions sometimes hang through no fault of the code. If this happens, cancel and resubmit the job. Once FuncTests are successful, the workflow uploads an artifact of the `coverage html` report, which you can download to see a line-by-line accounting of test coverage.
56+
57+
# git flow
58+
59+
To organize the commits for the repository, we use [git-flow](https://danielkummer.github.io/git-flow-cheatsheet/). Therefore, `develop` is the default branch, the latest tagged release is on `master`, and new, in-development features are within `feature/<name>`
60+
61+
When contributing to the code, be sure you're working off of develop and have run `git flow init`.
62+
63+
# versioning
64+
65+
Truvari uses [Semantic Versioning](https://semver.org/) and tries to stay compliant with [PEP440](https://peps.python.org/pep-0440/). As of v3.0.0, a single version is kept in the code under `truvari/__init__.__version__`. We try to keep the suffix `-dev` on the version in the develop branch. When cutting a new release, we may replace the suffix with `-rc` if we've built a release candidate that may need more testing/development. Once we've committed to a full release that will be pushed to PyPI, no suffix is placed on the version. If you install Truvari from the develop branch, the git repo hash is appended to the installed version, as well as '.uc' if there are uncommitted changes in the repo.
66+
67+
# docs
68+
69+
The github wiki serves the documentation most relevant to the `develop/` branch. When cutting a new release, we freeze and version the wiki's documentation with the helper utility `docs/freeze_wiki.sh`.
70+
71+
# Creating a release
72+
Follow these steps to create a release
73+
74+
0) Bump release version
75+
1) Run tests locally
76+
2) Update API Docs
77+
3) Change Updates Wiki
78+
4) Freeze the Wiki
79+
5) Ensure all code is checked in
80+
6) Do a [git-flow release](https://danielkummer.github.io/git-flow-cheatsheet/)
81+
7) Use github action to make a testpypi release
82+
8) Check test release
83+
```bash
84+
python3 -m venv test_truvari && source test_truvari/bin/activate
85+
python3 -m pip install --index-url https://test.pypi.org/simple --extra-index-url https://pypi.org/simple/ truvari
86+
```
87+
9) Use GitHub action to make a pypi release
88+
10) Download release-tarball.zip from step #9’s action
89+
11) Create release (include #9) from the tag
90+
12) Check out develop, bump to the dev version, and update the README ‘commits since’ badge
@@ -0,0 +1,55 @@
1+
By default, Truvari uses [edlib](https://github.com/Martinsos/edlib) to calculate the edit distance between two SV calls. Optionally, the [Levenshtein edit distance ratio](https://en.wikipedia.org/wiki/Levenshtein_distance) can be used to compute the `--pctsim` between two variants. These measures are different than the sequence similarity calculated by [Smith-Waterman alignment](https://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm).
2+
3+
To show this difference, consider the following two sequences:
4+
5+
```
6+
AGATACAGGAGTACGAACAGTACAGTACGA
7+
|||||||||||||||*||||||||||||||
8+
ATCACAGATACAGGAGTACGTACAGTACAGTACGA
9+
10+
30bp Aligned
11+
1bp Mismatched (96% similarity)
12+
5bp Left-Trimmed (~14% of the bottom sequence)
13+
```
14+
15+
The code below runs swalign, Levenshtein, and edlib to compute the `--pctsim` between the two sequences.
16+
17+
18+
```python
19+
import swalign
20+
import Levenshtein
21+
import edlib
22+
23+
seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA"
24+
seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA"
25+
26+
scoring = swalign.NucleotideScoringMatrix(2, -1)
27+
alner = swalign.LocalAlignment(scoring, gap_penalty=-2, gap_extension_decay=0.5)
28+
aln = alner.align(seq1, seq2)
29+
mat_tot = aln.matches
30+
mis_tot = aln.mismatches
31+
denom = float(mis_tot + mat_tot)
32+
if denom == 0:
33+
ident = 0
34+
else:
35+
ident = mat_tot / denom
36+
scr = edlib.align(seq1, seq2)
37+
totlen = len(seq1) + len(seq2)
38+
39+
print('swalign', ident)
40+
# swalign 0.966666666667
41+
print('levedit', Levenshtein.ratio(seq1, seq2))
42+
# levedit 0.892307692308
43+
print('edlib', (totlen - scr["editDistance"]) / totlen)
44+
# edlib 0.9076923076923077
45+
```
46+
47+
Because the swalign procedure only considers the number of matches and mismatches, the `--pctsim` is higher than the edlib and Levenshtein ratio.
48+
49+
If we were to account for the 5 'trimmed' bases from the Smith-Waterman alignment when calculating the `--pctsim` by counting each trimmed base as a mismatch, we would see the similarity drop to ~83%.
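
The arithmetic behind these two percentages can be checked with a few lines of Python. This is only a worked calculation using the counts from the alignment shown earlier (29 matches, 1 mismatch, 5 trimmed bases); the variable names are illustrative.

```python
matches = 29      # aligned bases that agree
mismatches = 1    # aligned bases that differ
trimmed = 5       # bases of the longer sequence left out of the local alignment

# Similarity as computed from the local alignment alone
plain = matches / (matches + mismatches)

# Similarity when every trimmed base is counted as a mismatch
adjusted = matches / (matches + mismatches + trimmed)

print(f"alignment only: {plain:.1%}")
print(f"with trimmed bases: {adjusted:.1%}")
```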
50+
51+
[This post](https://stackoverflow.com/questions/14260126/how-python-levenshtein-ratio-is-computed) has a nice response describing exactly how the Levenshtein ratio is computed.
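
To make the two ratios less of a black box, the dependency-free sketch below recomputes them for the example sequences. It assumes the edlib-based value is `(total length - edit distance) / total length`, as in the snippet above, and that the Levenshtein ratio can be read as `2 * LCS / total length`, where LCS is the longest common subsequence length (equivalently, an edit distance in which a substitution costs 2). The helper functions are illustrative and are not part of Truvari, edlib, or Levenshtein.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (insert, delete, substitute) via row-wise DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # match or substitution
        prev = cur
    return prev[-1]


def lcs_length(a, b):
    """Length of the longest common subsequence via row-wise DP."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


seq1 = "AGATACAGGAGTACGAACAGTACAGTACGA"
seq2 = "ATCACAGATACAGGAGTACGTACAGTACAGTACGA"
totlen = len(seq1) + len(seq2)

edlib_like = (totlen - edit_distance(seq1, seq2)) / totlen
ratio_like = 2 * lcs_length(seq1, seq2) / totlen

print("edit distance", edit_distance(seq1, seq2), "->", edlib_like)
print("LCS length", lcs_length(seq1, seq2), "->", ratio_like)
```

The gap between the two values comes from the single mismatched base: it costs 1 under the unit-cost edit distance but effectively 2 under the ratio.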
52+
53+
The Smith-Waterman alignment is much more expensive to compute compared to the Levenshtein ratio, and does not account for 'trimmed' sequence difference.
54+
55+
However, edlib is the fastest comparison method and is used by default. Levenshtein can be specified with `--use-lev` in `bench` and `collapse`.

docs/v4.1.0/Home.md

+35
@@ -0,0 +1,35 @@
1+
The wiki holds documentation most relevant for develop. For information on a specific version of Truvari, see [`docs/`](https://github.com/spiralgenetics/truvari/tree/develop/docs)
2+
3+
Citation:
4+
English, A.C., Menon, V.K., Gibbs, R.A. et al. Truvari: refined structural variant comparison preserves allelic diversity. Genome Biol 23, 271 (2022). https://doi.org/10.1186/s13059-022-02840-6
5+
6+
# Before you start
7+
VCFs aren't always created with a strong adherence to the format's specification.
8+
9+
Truvari expects input VCFs to be valid so that it will only output valid VCFs.
10+
11+
We've developed a separate tool that runs multiple validation programs and standard VCF parsing libraries in order to validate a VCF.
12+
13+
Run [this program](https://github.com/acenglish/usable_vcf) over any VCFs that are giving Truvari trouble.
14+
15+
Furthermore, Truvari expects 'resolved' SVs (e.g. DEL/INS) and will not interpret BND signals across SVTYPEs (e.g. combining two BND lines to match a DEL call). A brief description of Truvari bench methodology is linked below.
16+
17+
Finally, Truvari does not handle multi-allelic VCF entries and as of v4.0 will throw an error if multi-allelics are encountered. Please use `bcftools norm` to split multi-allelic entries.
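
A quick way to see whether a VCF needs splitting is to look for records whose ALT column lists more than one allele, separated by commas. The sketch below is a hypothetical pre-check, not part of Truvari; it works on plain-text VCF lines and the sample records are made up.

```python
def find_multiallelic(lines):
    """Yield (chrom, pos, alt) for records whose ALT column holds several alleles."""
    for line in lines:
        # Skip header lines and blank lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, alt = fields[0], fields[1], fields[4]
        if "," in alt:
            yield chrom, int(pos), alt


# Made-up records: the second one is multi-allelic
records = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\tsv1\tA\tATTTT\t.\tPASS\t.",
    "chr1\t200\tsv2\tA\tAT,ATT\t.\tPASS\t.",
]

hits = list(find_multiallelic(records))
print(hits)
```

Any record reported here would need to be split before being given to Truvari.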
18+
19+
# Index
20+
21+
- [[Updates|Updates]]
22+
- [[Installation|Installation]]
23+
- Truvari Commands:
24+
- [[anno|anno]]
25+
- [[bench|bench]]
26+
- [[collapse|collapse]]
27+
- [[consistency|consistency]]
28+
- [[divide|divide]]
29+
- [[phab|phab]]
30+
- [[refine|refine]]
31+
- [[segment|segment]]
32+
- [[stratify|stratify]]
33+
- [[vcf2df|vcf2df]]
34+
- [[Development|Development]]
35+
- [[Citations|Citations]]
