SVDSS is a novel method for the discovery of structural variants (SVs) in accurate long reads (e.g., PacBio HiFi) using sample-specific strings (SFS).
SFS are the shortest substrings that are unique to one genome, called the target, w.r.t. another genome, called the reference. Our method uses SFS for a coarse-grained identification (anchoring) of potential SV sites and then performs local partial order alignment (POA) on clusters of SFS from such sites to produce accurate SV predictions. We refer the reader to our manuscript on SFS for more details on the concept.
To compile and use SVDSS, you need:
- a C++14-compliant compiler (GCC 8.2 or newer)
- make, automake, autoconf
- cmake (>=3.14)
- git
- some other development libraries: zlib, bz2, lzma
- samtools and bcftools
To install these dependencies:
# On a deb-based system (tested on Ubuntu 20.04 and Debian 11):
sudo apt install build-essential autoconf cmake git zlib1g-dev libbz2-dev liblzma-dev samtools bcftools
# On an rpm-based system (tested on Fedora 35):
sudo dnf install gcc gcc-c++ make automake autoconf cmake git zlib-devel bzip2-devel xz-devel samtools bcftools
The following libraries are also needed to build and run SVDSS, but they are downloaded and compiled automatically when SVDSS is built:
- htslib built with libdeflate for BAM processing.
- ksw2 for sequence alignment.
- ropebwt2 for FMD index creation and querying.
- abPOA for POA computation.
- parasail for local alignment of POA consensus.
- rapidfuzz for string similarity computation.
- interval-tree for variant overlap detection and clustering.
To download and install SVDSS (should take ~10 minutes):
git clone https://github.com/Parsoa/SVDSS.git
cd SVDSS
mkdir build ; cd build
cmake ..
make
This will create the SVDSS binary in the root of the repo.
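After make finishes, a quick sanity check is to confirm the binary is where it is expected; this only assumes the layout from the commands above (build/ inside the repository root):
# go back from build/ to the repository root
cd ..
# the SVDSS binary should now be here
ls -l ./SVDSS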
For user convenience, we also provide a static binary for x86_64 Linux systems (see Releases) - use it at your own risk. If it does not work, please let us know or build it yourself :)
Index genome:
SVDSS index --fastq/--fasta /path/to/genome/file --index /path/to/output/index/file
Optional arguments:
--binary output the index in binary format; this allows other indices to be appended to it later.
--append /path/ append to an existing binary index.
Extract SFS from BAM/FASTQ/FASTA files:
SVDSS search --index /path/to/index --fastq/--bam /path/to/input --workdir /output/directory
Optional arguments:
--assemble automatically assemble the extracted SFS (runs the assembler after the search step)
Assemble SFS into superstrings:
SVDSS assemble --workdir /path/to/.sfs/files --batches /number/of/SFS/batches
Smooth reads:
SVDSS smooth --workdir /output/directory --bam /path/to/input/bam/file --reference /path/to/reference/genome/fasta
Call SVs:
SVDSS call --workdir /path/to/assembled/.sfs/files --bam /path/to/input/bam/file --reference /path/to/reference/genome/fasta
Optional arguments:
--min-cluster-weight minimum number of supporting superstrings for a call to be reported.
--min-sv-length minimum length of reported SVs. Default is 25. Values < 25 are ignored.
General options:
--threads sets the number of threads (default: 4).
SVDSS requires as input the BAM file of the sample to be genotyped and a reference genome in FASTA format. To genotype a sample, we need to perform the following steps:
- Build the FMD index of the reference genome (SVDSS index)
- Smooth the input BAM file (SVDSS smooth)
- Extract SFS from the smoothed BAM file (SVDSS search)
- Assemble SFS into superstrings (SVDSS assemble)
- Genotype SVs from the assembled superstrings (SVDSS call)
In the guide below we assume we are using the reference genome file GRCh38.fa and the input BAM file sample.bam, and that both files are present in the working directory. All SVDSS steps must be run in the same directory, so we always pass --workdir $PWD to every command.
Note that you can reuse the index from step 1 for any number of samples genotyped against the same reference genome.
The figure below shows the full pipeline of commands that need to be run:
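In command form, the same pipeline looks roughly as follows (the thread count is illustrative and N stands for the number of SFS batches produced by the search step; every command is explained in detail below):
# 1. index the reference genome (reusable for other samples)
SVDSS index --fastq GRCh38.fa --index GRCh38.bwt
# 2. smooth the aligned reads and re-index the smoothed BAM
SVDSS smooth --bam sample.bam --workdir $PWD --reference GRCh38.fa --threads 16
samtools index smoothed.selective.bam
# 3. extract SFS from the smoothed BAM
SVDSS search --index GRCh38.bwt --bam smoothed.selective.bam --workdir $PWD
# 4. assemble SFS into superstrings (N = number of solution_batch_<i>.sfs files)
SVDSS assemble --workdir $PWD --batches N
# 5. call SVs
SVDSS call --reference GRCh38.fa --bam smoothed.selective.bam --workdir $PWD --batches N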
We will now explain each step in more detail:
The FMD index is the same as the one used by PingPong:
SVDSS index --fastq GRCh38.fa --index GRCh38.bwt
The --index option specifies the output file name.
Smoothing removes nearly all SNPs, small indels, and sequencing errors from the reads. This results in a smaller number of SFS being extracted and significantly increases the relevance of the extracted SFS to SV discovery. To smooth the sample run:
SVDSS smooth --bam sample.bam --workdir $PWD --reference GRCh38.fa --threads 16
This produces a file named smoothed.selective.bam. This file is sorted in the same order as the input file; however, it needs to be indexed again with samtools index. The command also produces two files in the working directory, smoothed_reads.txt and ignored_reads.txt, which contain the IDs of the reads that were smoothed and the IDs of the reads that did not have any large (> 20bp) indels in their alignments, respectively. This information is used by the next step.
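As noted above, re-indexing the smoothed BAM before the next step is a single samtools command:
samtools index smoothed.selective.bam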
To extract SFS run:
SVDSS search --index GRCh38.bwt --bam smoothed.selective.bam --workdir $PWD
This step produces a number of solution_batch_<i>.sfs files. These files contain the coordinates of the SFS relative to the reads they were extracted from.
To reduce redundancy, overlapping SFS on each read are merged. Simply run:
SVDSS assemble --workdir $PWD --batches N
Here N is the number of files produced by the previous step. Each .sfs file is processed independently and output as a solution_batch_<i>.assembled.sfs file.
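If you prefer not to count the batch files by hand, N can be derived from the working directory; this is a small sketch assuming the solution_batch_<i>.sfs naming shown above:
# count the SFS batches produced by SVDSS search, excluding already-assembled ones
N=$(ls solution_batch_*.sfs | grep -vc 'assembled')
SVDSS assemble --workdir $PWD --batches $N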
You can combine SFS extraction and assembly by passing --assemble to SVDSS search; this will automatically run the assembler.
We are now ready to call SVs. Run:
SVDSS call --reference GRCh38.fa --bam smoothed.selective.bam --workdir $PWD --batches N
You can filter the reported SVs by passing the --min-sv-length and --min-cluster-weight options. These options control the minimum length and the minimum number of supporting superstrings for the reported SVs. Higher values for --min-cluster-weight will increase precision at the cost of reducing recall. For a 30x coverage sample, --min-cluster-weight 4 produced the best results in our experiments.
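For example, a stricter run on a 30x sample could look like this (the 50bp length cutoff is only illustrative):
SVDSS call --reference GRCh38.fa --bam smoothed.selective.bam --workdir $PWD --batches N --min-cluster-weight 4 --min-sv-length 50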
The call step outputs two files: svs_poa.vcf, which contains the SV calls, and poa.sam, which contains the alignments of the POA contigs to the reference genome.
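The calls are written as a plain VCF; if downstream tools need them sorted, compressed, and indexed, a standard bcftools post-processing step (independent of SVDSS) is:
# sort, bgzip-compress and index the calls (optional post-processing)
bcftools sort svs_poa.vcf -Oz -o svs_poa.sorted.vcf.gz
bcftools index svs_poa.sorted.vcf.gz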
For user convenience, we distribute a Snakefile to run the entire pipeline, from reference + aligned reads to SVs:
# update config.yaml to suit your needs
# run:
snakemake [-n] -j 4
Download example data from here. The archive contains the input files required to run the SVDSS pipeline (i.e., reference and alignments), as well as the expected output (i.e., a VCF file with SV calls).
Set up the data to match the provided config.yaml:
cd /path/to/SVDSS-local-repo
mkdir example
cd example
mv /path/to/SVDSS-example.tar.gz .
tar xvfz SVDSS-example.tar.gz
cd ..
Then run (it should take less than 5 minutes):
snakemake -p -j 2
SVDSS was developed by Luca Denti and Parsoa Khorsand.
For inquiries on this software please open an issue or contact either Parsoa Khorsand or Luca Denti.
SVDSS is currently pending peer review. A pre-print is available on bioRxiv.
Instructions on how to reproduce the experiments described in the manuscript can be found here (also provided as submodule of this repository).