Skip to content

PapenfussLab/svaNUMT

Repository files navigation

svaNUMT: R package for NUMT detection using structural variant calls

License: GPL v3

svaNUMT contains functions for detecting NUMT events from structural variant calls. It takes structural variant calls in GRanges of breakend notation and identifies NUMTs by nuclear-mitochondrial breakend junctions. The main function reports candidate NUMTs if there is a pair of valid insertion sites found on the nuclear genome within a certain distance threshold. The candidate NUMTs are reported by events.

This package uses a breakend-centric event notation adopted from the StructuralVariantAnnotation package. More information about VCF objects and breakend-centric GRanges object can be found by consulting the vignettes in the corresponding packages with browseVignettes("VariantAnnotation") and browseVignettes("StructuralVariantAnnotation").

Installation

svaNUMT is currently available for download in Bioconductor (since BioC 3.14 & R 4.1):

# install.packages("BiocManager")
BiocManager::install("svaNUMT")

The development version can be installed from GitHub:

BiocManager::install("PapenfussLab/svaNUMT")

How to cite

If you use svaNUMT, please cite svaNUMT here.

@article {Dong2021.08.18.456578,
    author = {Dong, Ruining and Cameron, Daniel and Bedo, Justin and Papenfuss, Anthony T},
    title = {svaRetro and svaNUMT: Modular packages for annotation of retrotransposed transcripts and nuclear integration of mitochondrial DNA in genome sequencing data},
    elocation-id = {2021.08.18.456578},
    year = {2021},
    doi = {10.1101/2021.08.18.456578},
    publisher = {Cold Spring Harbor Laboratory},
    abstract = {Background The biological significance of structural variation is now more widely recognized. However, due to the lack of available tools for downstream analysis, including processing and annotating, interpretation of structural variant calls remains a challenge.Findings Here we present svaRetro and svaNUMT, R packages that provide functions for annotating novel genomic events such as non-reference retro-copied transcripts and nuclear integration of mitochondrial DNA. We evaluate the performance of these packages to detect events using simulations and public benchmarking datasets, and annotate processed transcripts in a public structural variant database.Conclusions svaRetro and svaNUMT provide efficient, modular tools for downstream identification and annotation of structural variant calls.Competing Interest StatementThe authors have declared no competing interest.SVstructural variantNUMTnuclear mitochondrial integrationRTretroposed transcriptTSDtarget site duplicationmtDNAmitochondrial DNA},
    URL = {https://www.biorxiv.org/content/early/2021/08/19/2021.08.18.456578},
    eprint = {https://www.biorxiv.org/content/early/2021/08/19/2021.08.18.456578.full.pdf},
    journal = {bioRxiv}
}

Workflow

Below is a workflow example for detecting NUMTs from a simulated human genome sample. This example is taken from the vignette of svaNUMT.

Loading data from VCF

VCF data is parsed into a VCF object using the readVCF function from the Bioconductor package VariantAnnotation. The VCF object is then converted to a GRanges object with breakend-centric notations using StructuralVariantAnnotation.

library(StructuralVariantAnnotation)
library(VariantAnnotation)
library(svaNUMT)

vcf <- readVcf(system.file("extdata", "chr1_numt_pe_HS25.sv.vcf", package = "svaNUMT"))
gr <- breakpointRanges(vcf)

Note that StructuralVariantAnnotation requires the GRanges object to be composed entirely of valid breakpoints. Please consult the vignette of the StructuralVariantAnnotation package for ensuring breakpoint consistency.

Identifying Nuclear-mitochondrial Genome Fusion Events

Function svaNUMT searches for NUMT events by identifying breakends supporting the fusion of nuclear chromosome and mitochondrial genome. svaNUMT returns identified breakends supporting candidate NUMTs in 2 lists of list of GRanges, grouped by chromosome and insertion sites.

library(readr)
numtS <- read_table(system.file("extdata", "numtS.txt", package = "svaNUMT"), 
    col_names = FALSE)
colnames(numtS) <- c("bin", "seqnames", "start", "end", "name", "score", "strand")
numtS <- `seqlevelsStyle<-`(GRanges(numtS), "NCBI")

library(BSgenome.Hsapiens.UCSC.hg19)
genome <- BSgenome.Hsapiens.UCSC.hg19
genomeMT <- genome$chrMT
NUMT <- numtDetect(gr, numtS, genomeMT, max_ins_dist = 20)
#> There is no MT sequence from known NUMT events detected.

The breakends supporting the insertion sites and the MT sequence are arranged by the order of events. Below is an example of a detected NUMT event, where MT sequence MT:15737-15836 followed by polyadenylation is inserted between chr1:1688363-1688364.

GRangesList(NU=NUMT$MT$NU$`1`[[1]], MT=NUMT$MT$MT$`1`[[1]])
#> GRangesList object of length 2:
#> $NU
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen        insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>   <character> <integer> <character> <numeric>
#>   gridss1fb_4o        1   1688363      + |           NA           C            C[MT:15737[   3928.49        PASS gridss1fb_4o gridss1fb_4h         BND        NA                       0 gridss1fb_4         0
#>   gridss1bf_1o        1   1688364      - |           NA           C ]MT:15836]AAAAAAAAAA..   3581.13        PASS gridss1bf_1o gridss1bf_1h         BND        NA AAAAAAAAAAAAA        13 gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen        insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>   <character> <integer> <character> <numeric>
#>   gridss1fb_4h       MT     15737      - |           NA           G           ]1:1688363]G   3928.49        PASS gridss1fb_4h gridss1fb_4o         BND        NA                       0 gridss1fb_4         0
#>   gridss1bf_1h       MT     15836      + |           NA           A AAAAAAAAAAAAAA[1:168..   3581.13        PASS gridss1bf_1h gridss1bf_1o         BND        NA AAAAAAAAAAAAA        13 gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome

Below is an example to subset the detected NUMTs by a genomic region given seqnames, start, and end. For region chr1:1000000-3000000, there are 3 NUMTs detected.

seqnames = 1
start = 1000000
end = 3000000
i <- sapply(NUMT$MT$NU[[seqnames]], function(x) 
  sum(countOverlaps(x, GRanges(seqnames = seqnames, IRanges(start, end))))>0)
list(NU=NUMT$MT$NU[[seqnames]][i], MT=NUMT$MT$MT[[seqnames]][i])
#> $NU
#> $NU[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen        insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>   <character> <integer> <character> <numeric>
#>   gridss1fb_4o        1   1688363      + |           NA           C            C[MT:15737[   3928.49        PASS gridss1fb_4o gridss1fb_4h         BND        NA                       0 gridss1fb_4         0
#>   gridss1bf_1o        1   1688364      - |           NA           C ]MT:15836]AAAAAAAAAA..   3581.13        PASS gridss1bf_1o gridss1bf_1h         BND        NA AAAAAAAAAAAAA        13 gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $NU[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames          ranges strand | paramRangeID         REF                   ALT      QUAL      FILTER     sourceId      partner      svtype     svLen      insSeq    insLen       event    HOMLEN
#>                   <Rle>       <IRanges>  <Rle> |     <factor> <character>           <character> <numeric> <character>  <character>  <character> <character> <numeric> <character> <integer> <character> <numeric>
#>   gridss1fb_5o        1 1791082-1791083      + |           NA           G            G[MT:2592[   1929.85        PASS gridss1fb_5o gridss1fb_5h         BND        NA                     0 gridss1fb_5         1
#>   gridss1bf_2o        1         1791084      - |           NA           A ]MT:3592]AAAAAAAAAAAA   2894.91        PASS gridss1bf_2o gridss1bf_2h         BND        NA AAAAAAAAAAA        11 gridss1bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $NU[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen          insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>     <character> <integer> <character> <numeric>
#>   gridss2fb_3o        1   2869079      + |           NA           G             G[MT:2786[   2472.12        PASS gridss2fb_3o gridss2fb_3h         BND        NA                         0 gridss2fb_3         0
#>   gridss2bf_2o        1   2869080      - |           NA           A ]MT:2985]AAAAAAAAAAA..   2456.81        PASS gridss2bf_2o gridss2bf_2h         BND        NA AAAAAAAAAAAAAAA        15 gridss2bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> 
#> $MT
#> $MT[[1]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen        insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>   <character> <integer> <character> <numeric>
#>   gridss1fb_4h       MT     15737      - |           NA           G           ]1:1688363]G   3928.49        PASS gridss1fb_4h gridss1fb_4o         BND        NA                       0 gridss1fb_4         0
#>   gridss1bf_1h       MT     15836      + |           NA           A AAAAAAAAAAAAAA[1:168..   3581.13        PASS gridss1bf_1h gridss1bf_1o         BND        NA AAAAAAAAAAAAA        13 gridss1bf_1         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT[[2]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen      insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric> <character> <integer> <character> <numeric>
#>   gridss1fb_5h       MT 2592-2593      - |           NA           G           ]1:1791082]G   1929.85        PASS gridss1fb_5h gridss1fb_5o         BND        NA                     0 gridss1fb_5         1
#>   gridss1bf_2h       MT      3592      + |           NA           G GAAAAAAAAAAA[1:17910..   2894.91        PASS gridss1bf_2h gridss1bf_2o         BND        NA AAAAAAAAAAA        11 gridss1bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome
#> 
#> $MT[[3]]
#> GRanges object with 2 ranges and 13 metadata columns:
#>                seqnames    ranges strand | paramRangeID         REF                    ALT      QUAL      FILTER     sourceId      partner      svtype     svLen          insSeq    insLen       event    HOMLEN
#>                   <Rle> <IRanges>  <Rle> |     <factor> <character>            <character> <numeric> <character>  <character>  <character> <character> <numeric>     <character> <integer> <character> <numeric>
#>   gridss2fb_3h       MT      2786      - |           NA           T           ]1:2869079]T   2472.12        PASS gridss2fb_3h gridss2fb_3o         BND        NA                         0 gridss2fb_3         0
#>   gridss2bf_2h       MT      2985      + |           NA           C CAAAAAAAAAAAAAAA[1:2..   2456.81        PASS gridss2bf_2h gridss2bf_2o         BND        NA AAAAAAAAAAAAAAA        15 gridss2bf_2         0
#>   -------
#>   seqinfo: 86 sequences from an unspecified genome

Visualising breakpoint pairs via circos plots

One way of visualising paired breakpoints is by circos plots. Here we use the package circlize to demonstrate breakpoint visualisation. The bedpe2circos function takes BEDPE-formatted dataframes (see breakpointgr2bedpe()) and plotting parameters for the circos.initializeWithIdeogram() and circos.genomicLink() functions from circlize.

To generate a simple circos plot of one candidate NUMT event:

library(circlize)
numt_chr_prefix <- c(NUMT$MT$NU$`1`[[2]], NUMT$MT$MT$`1`[[2]])
GenomeInfoDb::seqlevelsStyle(numt_chr_prefix) <- "UCSC"
pairs <- breakpointgr2pairs(numt_chr_prefix)
pairs

To see supporting breakpoints clearly, we generate the circos plot according to the loci of event.

circos.initializeWithIdeogram(
    data.frame(V1=c("chr1", "chrM"),
               V2=c(1791073,1),
               V3=c(1791093,16571),
               V4=c("p15.4",NA),
               V5=c("gpos50",NA)),  sector.width = c(0.2, 0.8))
#circos.initializeWithIdeogram()
circos.genomicLink(as.data.frame(S4Vectors::first(pairs)), 
                   as.data.frame(S4Vectors::second(pairs)))

circos.clear()