svaNUMT
contains functions for detecting NUMT events from structural
variant calls. It takes structural variant calls in GRanges of breakend
notation and identifies NUMTs by nuclear-mitochondrial breakend
junctions. The main function reports candidate NUMTs if there is a pair
of valid insertion sites found on the nuclear genome within a certain
distance threshold. The candidate NUMTs are reported by events.
This package uses a breakend-centric event notation adopted from the
StructuralVariantAnnotation
package. More information about VCF
objects and breakend-centric
GRanges object can be found by consulting the vignettes in the
corresponding packages with browseVignettes("VariantAnnotation")
and
browseVignettes("StructuralVariantAnnotation")
.
svaNUMT is currently available for download in Bioconductor (since BioC 3.14 & R 4.1):
# install.packages("BiocManager")
BiocManager::install("svaNUMT")
The development version can be installed from GitHub:
BiocManager::install("PapenfussLab/svaNUMT")
If you use svaNUMT, please cite svaNUMT
here.
@article {Dong2021.08.18.456578,
author = {Dong, Ruining and Cameron, Daniel and Bedo, Justin and Papenfuss, Anthony T},
title = {svaRetro and svaNUMT: Modular packages for annotation of retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data},
elocation-id = {2021.08.18.456578},
year = {2021},
doi = {10.1101/2021.08.18.456578},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Background The biological significance of structural variation is now more widely recognized. However, due to the lack of available tools for downstream analysis, including processing and annotating, interpretation of structural variant calls remains a challenge.Findings Here we present svaRetro and svaNUMT, R packages that provide functions for annotating novel genomic events such as non-reference retro-copied transcripts and nuclear integration of mitochondrial DNA. We evaluate the performance of these packages to detect events using simulations and public benchmarking datasets, and annotate processed transcripts in a public structural variant database.Conclusions svaRetro and svaNUMT provide efficient, modular tools for downstream identification and annotation of structural variant calls.Competing Interest StatementThe authors have declared no competing interest.SVstructural variantNUMTnuclear mitochondrial integrationRTretroposed transcriptTSDtarget site duplicationmtDNAmitochondrial DNA},
URL = {https://www.biorxiv.org/content/early/2021/08/19/2021.08.18.456578},
eprint = {https://www.biorxiv.org/content/early/2021/08/19/2021.08.18.456578.full.pdf},
journal = {bioRxiv}
}
Below is a workflow example for detecting NUMTs from a simulated human genome sample. This example is taken from the vignette of svaNUMT.
VCF data is parsed into a VCF
object using the readVCF
function from
the Bioconductor package VariantAnnotation
. The VCF
object is then
converted to a GRanges
object with breakend-centric notations using
StructuralVariantAnnotation
.
library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(svaNUMT)
vcf <- readVcf(system.file("extdata", "chr1_numt_pe_HS25.sv.vcf", package = "svaNUMT"))
gr <- breakpointRanges(vcf)
Note that StructuralVariantAnnotation
requires the GRanges
object to
be composed entirely of valid breakpoints. Please consult the vignette
of the StructuralVariantAnnotation
package for ensuring breakpoint
consistency.
Function svaNUMT
searches for NUMT events by identifying breakends
supporting the fusion of nuclear chromosome and mitochondrial genome.
svaNUMT
returns identified breakends supporting candidate NUMTs in 2
lists of list of GRanges, grouped by chromosome and insertion sites.
library(readr)
numtS <- read_table(system.file("extdata", "numtS.txt", package = "svaNUMT"),
col_names = FALSE)
colnames(numtS) <- c("bin", "seqnames", "start", "end", "name", "score", "strand")
numtS <- `seqlevelsStyle<-`(GRanges(numtS), "NCBI")
library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
genomeMT <- genome$chrMT
NUMT <- numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)
#> There is no MT sequence from known NUMT events detected.
The breakends supporting the insertion sites and the MT sequence are
arranged by the order of events. Below is an example of a detected NUMT
event, where MT sequence MT:15737-15836
followed by polyadenylation is
inserted between chr1:1688363-1688364
.
GRangesList(NU=NUMT$MT$NU$`1`[[1]], MT=NUMT$MT$MT$`1`[[1]])
#> GRangesList object of length 2:
#> $NU
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_4o 1 1688363 + | NA C C[MT:15737[ 3928.49 PASS gridss1fb_4o gridss1fb_4h BND NA 0 gridss1fb_4 0
#> gridss1bf_1o 1 1688364 - | NA C ]MT:15836]AAAAAAAAAA.. 3581.13 PASS gridss1bf_1o gridss1bf_1h BND NA AAAAAAAAAAAAA 13 gridss1bf_1 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#> $MT
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_4h MT 15737 - | NA G ]1:1688363]G 3928.49 PASS gridss1fb_4h gridss1fb_4o BND NA 0 gridss1fb_4 0
#> gridss1bf_1h MT 15836 + | NA A AAAAAAAAAAAAAA[1:168.. 3581.13 PASS gridss1bf_1h gridss1bf_1o BND NA AAAAAAAAAAAAA 13 gridss1bf_1 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
Below is an example to subset the detected NUMTs by a genomic region
given seqnames
, start
, and end
. For region chr1:1000000-3000000
,
there are 3 NUMTs detected.
seqnames = 1
start = 1000000
end = 3000000
i <- sapply(NUMT$MT$NU[[seqnames]], function(x)
sum(countOverlaps(x, GRanges(seqnames = seqnames, IRanges(start, end))))>0)
list(NU=NUMT$MT$NU[[seqnames]][i], MT=NUMT$MT$MT[[seqnames]][i])
#> $NU
#> $NU[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_4o 1 1688363 + | NA C C[MT:15737[ 3928.49 PASS gridss1fb_4o gridss1fb_4h BND NA 0 gridss1fb_4 0
#> gridss1bf_1o 1 1688364 - | NA C ]MT:15836]AAAAAAAAAA.. 3581.13 PASS gridss1bf_1o gridss1bf_1h BND NA AAAAAAAAAAAAA 13 gridss1bf_1 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#> $NU[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_5o 1 1791082-1791083 + | NA G G[MT:2592[ 1929.85 PASS gridss1fb_5o gridss1fb_5h BND NA 0 gridss1fb_5 1
#> gridss1bf_2o 1 1791084 - | NA A ]MT:3592]AAAAAAAAAAAA 2894.91 PASS gridss1bf_2o gridss1bf_2h BND NA AAAAAAAAAAA 11 gridss1bf_2 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#> $NU[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss2fb_3o 1 2869079 + | NA G G[MT:2786[ 2472.12 PASS gridss2fb_3o gridss2fb_3h BND NA 0 gridss2fb_3 0
#> gridss2bf_2o 1 2869080 - | NA A ]MT:2985]AAAAAAAAAAA.. 2456.81 PASS gridss2bf_2o gridss2bf_2h BND NA AAAAAAAAAAAAAAA 15 gridss2bf_2 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#>
#> $MT
#> $MT[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_4h MT 15737 - | NA G ]1:1688363]G 3928.49 PASS gridss1fb_4h gridss1fb_4o BND NA 0 gridss1fb_4 0
#> gridss1bf_1h MT 15836 + | NA A AAAAAAAAAAAAAA[1:168.. 3581.13 PASS gridss1bf_1h gridss1bf_1o BND NA AAAAAAAAAAAAA 13 gridss1bf_1 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#> $MT[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss1fb_5h MT 2592-2593 - | NA G ]1:1791082]G 1929.85 PASS gridss1fb_5h gridss1fb_5o BND NA 0 gridss1fb_5 1
#> gridss1bf_2h MT 3592 + | NA G GAAAAAAAAAAA[1:17910.. 2894.91 PASS gridss1bf_2h gridss1bf_2o BND NA AAAAAAAAAAA 11 gridss1bf_2 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
#>
#> $MT[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#> seqnames ranges strand | paramRangeID REF ALT QUAL FILTER sourceId partner svtype svLen insSeq insLen event HOMLEN
#> <Rle> <IRanges> <Rle> | <factor> <character> <character> <numeric> <character> <character> <character> <character> <numeric> <character> <integer> <character> <numeric>
#> gridss2fb_3h MT 2786 - | NA T ]1:2869079]T 2472.12 PASS gridss2fb_3h gridss2fb_3o BND NA 0 gridss2fb_3 0
#> gridss2bf_2h MT 2985 + | NA C CAAAAAAAAAAAAAAA[1:2.. 2456.81 PASS gridss2bf_2h gridss2bf_2o BND NA AAAAAAAAAAAAAAA 15 gridss2bf_2 0
#> -------
#> seqinfo: 86 sequences from an unspecified genome
One way of visualising paired breakpoints is by circos plots. Here we
use the package
circlize
to
demonstrate breakpoint visualisation. The bedpe2circos
function takes
BEDPE-formatted dataframes (see breakpointgr2bedpe()
) and plotting
parameters for the circos.initializeWithIdeogram()
and
circos.genomicLink()
functions from circlize
.
To generate a simple circos plot of one candidate NUMT event:
library(circlize)
numt_chr_prefix <- c(NUMT$MT$NU$`1`[[2]], NUMT$MT$MT$`1`[[2]])
GenomeInfoDb::seqlevelsStyle(numt_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(numt_chr_prefix)
pairs
To see supporting breakpoints clearly, we generate the circos plot according to the loci of event.
circos.initializeWithIdeogram(
data.frame(V1=c("chr1", "chrM"),
V2=c(1791073,1),
V3=c(1791093,16571),
V4=c("p15.4",NA),
V5=c("gpos50",NA)), sector.width = c(0.2, 0.8))
#circos.initializeWithIdeogram()
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)),
as.data.frame(S4Vectors::second(pairs)))
circos.clear()