Commit
Merge pull request #18 from sigven/vep_110
VEP 110 + SIF file
sigven authored Dec 29, 2023
2 parents c46e940 + f09a043 commit 72b9983
Showing 25 changed files with 2,467 additions and 1,617 deletions.
101 changes: 62 additions & 39 deletions README.md
@@ -12,11 +12,15 @@

### Overview

The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework.
The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is developed in Python, and it relies upon [Docker](https://www.docker.com) / [Singularity](https://sylabs.io/docs/) technology for encapsulation of software dependencies.

*gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**.

### News
### News
- December 29th 2023 - **1.7.0 release**
- Data updates: ClinVar, GENCODE, GWAS catalog
- Software updates: VEP
- Improved Singularity support

- April 27th 2023 - **1.6.0 release**

@@ -28,16 +32,16 @@ The generic variant annotator (*gvanno*) is a software package intended for simp

- Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`)

### Annotation resources (v1.6.0)
### Annotation resources (v1.7.0)

- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset)
- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v110 (GENCODE v44/v19 as the gene reference dataset)
- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.5, November 2023)
- [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
- [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023)
- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2023)
- [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023)
- [Mutation hotspots](https://www.cancerhotspots.org) - Database of mutation hotspots in cancer
- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023)
- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (November 2023)

### Getting started

@@ -49,11 +53,15 @@ The generic variant annotator (*gvanno*) is a software package intended for simp

- *Other utilities*

The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (i.e. `gzip`, `tar`)

**NOTE**: We strongly recommend that _gvanno_ is installed on a MacOS or Linux/UNIX operating system
The script that installs the reference data requires that the user has `bgzip` and `tabix` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (i.e. `gzip`, `tar`)

#### STEP 1: Installation of Docker
**NOTE**: _gvanno_ should be installed on a MacOS or Linux/UNIX operating system
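
Before running the reference-data installer, it can help to confirm that the required command-line tools are on the `PATH`. The following is a minimal sketch that assumes only the tool names listed above; the helper `missing_tools` is hypothetical and not part of gvanno.

```python
import shutil

# Tools that the README lists as prerequisites for the reference-data installer.
REQUIRED_TOOLS = ("bgzip", "tabix", "gzip", "tar")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be located on PATH (hypothetical helper)."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools: " + ", ".join(missing))
    else:
        print("All required tools found")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is sufficient.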

#### STEP 1: Installation of Docker/Singularity

- the _gvanno_ workflow can be executed with either _Docker_ or _Singularity_ container technology

##### Installation of Docker

1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform
- installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/)
@@ -65,33 +73,33 @@ The generic variant annotator (*gvanno*) is a software package intended for simp
- CPUs: minimum 4
- [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced)

##### 1.1: Installation of Singularity (_IN DEVELOPMENT_)

0. **Note: this works for Singularity version 3.0 and higher**.
##### Installation of Singularity

1. [Install Singularity](https://sylabs.io/docs/)

2. Test that singularity works by running `singularity --version`

3. If you are in the gvanno directory, build the singularity image like so:

`cd src`

`sudo ./buildSingularity.sh`

#### STEP 2: Download *gvanno* and data bundle

1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0)
1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.7.0)

2. Install the assembly-specific VEP cache, and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.:

- `python download_gvanno_refdata.py --download_dir <PATH_TO_DOWNLOAD_DIR> --genome_assembly grch38`

**NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 30Gb per assembly-specific bundle)
**NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 20Gb per assembly-specific bundle)


3. Pull container images

* Docker
* Pull the [gvanno Docker image (v1.7.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 3.8Gb):

* `docker pull sigven/gvanno:1.7.0` (gvanno annotation engine)

* Singularity
* Download the [gvanno SIF image (v1.7.0)](https://insilico.hpc.uio.no/pcgr/gvanno/gvanno_1.7.0.sif) (approx 1.2Gb) and use this as the argument for `--sif_file` in the `gvanno.py` run script.

3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 4.11Gb):

- `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine)

#### STEP 3: Input preprocessing

@@ -105,7 +113,7 @@ We **strongly** recommend that the input VCF is compressed and indexed using [bg

Run the workflow with **gvanno.py**, which takes the following arguments and options:

```
```
usage:
gvanno.py -h [options]
--query_vcf <QUERY_VCF>
@@ -121,7 +129,7 @@ Required arguments:
--query_vcf QUERY_VCF
VCF input file with germline query variants (SNVs/InDels).
--gvanno_dir GVANNO_DIR
Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0
Directory that contains the gvanno reference data, e.g. ~/gvanno-1.7.0
--output_dir OUTPUT_DIR
Output directory
--genome_assembly {grch37,grch38}
@@ -132,9 +140,9 @@ Required arguments:
Sample identifier - prefix for output files
VEP optional arguments:
--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
--vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno).
--vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False
--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP).
--vep_gencode_basic Consider only basic GENCODE transcripts with Variant Effect Predictor (VEP).
--vep_lof_prediction Predict loss-of-function variants with the LOFTEE plugin in Variant Effect Predictor (VEP), default: False
--vep_n_forks VEP_N_FORKS
Number of forks for Variant Effect Predictor (VEP) processing, default: 4
--vep_buffer_size VEP_BUFFER_SIZE
@@ -143,7 +151,7 @@ VEP optional arguments:
--vep_pick_order VEP_PICK_ORDER
Comma-separated string of ordered transcript properties for primary variant pick in
Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
--vep_skip_intergenic
--vep_no_intergenic
Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False
--vep_coding_only
Only report variants falling into coding regions of transcripts (VEP), default: False
@@ -160,25 +168,40 @@ Other optional arguments:
--oncogenicity_annotation
Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)
--debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.
--sif_file gvanno SIF image file for usage of gvanno workflow with option '--container singularity'
```

The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command (Docker-based):

```
python ~/gvanno-1.6.0/gvanno.py
--query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.6.0
--output_dir ~/gvanno-1.6.0
```
python ~/gvanno-1.7.0/gvanno.py
--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.7.0
--output_dir ~/gvanno-1.7.0
--sample_id example
--genome_assembly grch37
--container docker
--force_overwrite
```

or Singularity-based

```
python ~/gvanno-1.7.0/gvanno.py
--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz
--gvanno_dir ~/gvanno-1.7.0
--output_dir ~/gvanno-1.7.0
--sample_id example
--genome_assembly grch37
--container singularity
--sif_file gvanno_1.7.0.sif
--force_overwrite
```

Either command runs the *gvanno* workflow and produces the following output files in the directory given by `--output_dir`:

1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations
2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations
1. **example_gvanno_grch37.pass.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical variant and gene annotations
2. **example_gvanno_grch37.pass.tsv.gz** - Compressed TSV file with rich set of functional/clinical variant and gene annotations

Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column.
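
The two invocations above differ only in the `--container` value and the extra `--sif_file` argument. As a rough sketch, the argument list could be assembled programmatically as shown below. The helper `build_gvanno_command` is hypothetical (not part of gvanno), and treating `--sif_file` as mandatory for Singularity runs is an assumption based on the option description, not confirmed behavior of `gvanno.py`.

```python
import shlex


def build_gvanno_command(gvanno_dir, query_vcf, output_dir, sample_id,
                         genome_assembly, container="docker", sif_file=None):
    """Assemble the argument list for gvanno.py (hypothetical helper)."""
    if container not in ("docker", "singularity"):
        raise ValueError("container must be 'docker' or 'singularity'")
    # Assumption: a SIF image is always supplied for Singularity-based runs.
    if container == "singularity" and sif_file is None:
        raise ValueError("--sif_file is needed with --container singularity")

    cmd = [
        "python", f"{gvanno_dir}/gvanno.py",
        "--query_vcf", query_vcf,
        "--gvanno_dir", gvanno_dir,
        "--output_dir", output_dir,
        "--sample_id", sample_id,
        "--genome_assembly", genome_assembly,
        "--container", container,
    ]
    if container == "singularity":
        cmd += ["--sif_file", sif_file]
    cmd.append("--force_overwrite")
    return cmd


if __name__ == "__main__":
    command = build_gvanno_command(
        gvanno_dir="~/gvanno-1.7.0",
        query_vcf="~/gvanno-1.7.0/examples/example.grch37.vcf.gz",
        output_dir="~/gvanno-1.7.0",
        sample_id="example",
        genome_assembly="grch37",
        container="singularity",
        sif_file="gvanno_1.7.0.sif",
    )
    # Print a shell-quoted command line; nothing is executed here.
    print(shlex.join(command))
```

Returning a list rather than a single string keeps the arguments safe to pass to `subprocess.run` without shell quoting issues. Note that `~` is not expanded unless the command goes through a shell or the paths are passed through `os.path.expanduser` first.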

50 changes: 37 additions & 13 deletions download_gvanno_refdata.py
@@ -8,16 +8,16 @@
import sys
import locale
import errno
#import wget
import urllib.request as urllib2
from argparse import RawTextHelpFormatter

GVANNO_VERSION = '1.6.0'
REFDATA_VERSION = '20230425'
ENSEMBL_VERSION = '109'
GENCODE_VERSION = 'v43'
GVANNO_VERSION = '1.7.0'
REFDATA_VERSION = '20231224'
ENSEMBL_VERSION = '110'
GENCODE_VERSION = 'v44'
VEP_ASSEMBLY = "GRCh38"
HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/"
HOST_HUMAN_ANCESTOR = "https://personal.broadinstitute.org/konradk/loftee_data/"

def __main__():

@@ -35,7 +35,7 @@ def __main__():
'download directory already exist.\nYou can force the overwrite of existing download directory by using this flag, default: %(default)s')
optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION))
optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. VEP) after download and unzip + untar has been conducted successfully")
optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.")
optional.add_argument('--debug', action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.")
required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True)
required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True)

@@ -65,7 +65,6 @@ def __main__():
os.mkdir(arg_dict['db_assembly_dir'])
os.mkdir(arg_dict['vep_assembly_dir'])


download_gvanno_ref_data(arg_dict = arg_dict)


@@ -190,7 +189,7 @@ def download_gvanno_ref_data(arg_dict):
vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep')

datasets = {}
for db in ['vep_cache','vep_fasta','gvanno_custom']:
for db in ['vep_cache','vep_fasta','gvanno_custom','human_ancestor']:
datasets[db] = {}
datasets[db]['remote_url'] = 'NA'
datasets[db]['local_path'] = 'NA'
@@ -208,8 +207,10 @@ def download_gvanno_ref_data(arg_dict):
)

logger = getlogger('download-vep-cache')
datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz")
datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz")
datasets['vep_cache']['local_path'] = os.path.join(
arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz")
datasets['vep_fasta']['local_path'] = os.path.join(
arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz")
datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'\.gz$','',datasets['vep_fasta']['local_path'])

vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger)
@@ -273,15 +274,38 @@ def download_gvanno_ref_data(arg_dict):
check_subprocess(command = command_unzip_fasta, logger = logger)
command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}"
check_subprocess(command = command_bgzip_fasta, logger = logger)



logger = getlogger('download-human-ancestor')
logger.info("Downloading human ancestor FASTA files")
for postfix in ['gz', 'gz.fai', 'gz.gzi']:
    datasets['human_ancestor']['remote_url'] = f'{HOST_HUMAN_ANCESTOR}{VEP_ASSEMBLY}/human_ancestor.fa.{postfix}'
    datasets['human_ancestor']['local_path'] = os.path.join(
        arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f'human_ancestor.fa.{postfix}')

    logger = getlogger('download-human-ancestor')
    custom_cache_bytes_remote = get_url_num_bytes(url = datasets['human_ancestor']['remote_url'], logger = logger)
    logger.info('Human ancestor FASTA - remote target file ' + str(datasets['human_ancestor']['remote_url']))
    logger.info('Human ancestor FASTA - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger))
    logger.info('Human ancestor FASTA - local destination file: ' + str(datasets['human_ancestor']['local_path']))


    if os.path.exists(datasets['human_ancestor']['local_path']):
        if os.path.getsize(datasets['human_ancestor']['local_path']) == custom_cache_bytes_remote:
            logger.info('Human ancestor FASTA already downloaded')
        else:
            logger.info('Human ancestor FASTA - download in progress - this can take a while ... ')
            urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path'])
    else:
        logger.info('Human ancestor FASTA - download in progress - this can take a while ... ')
        urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path'])


datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz'
datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz')

logger = getlogger('download-gvanno-custom')
custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger)
logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++")
logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / GWAS catalog / cancerhotspots ++")
logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url']))
logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger))
logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path']))
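
The human ancestor hunk above follows a simple pattern: build one remote URL per file suffix, then download only when no local file of the same byte size exists. The sketch below isolates that decision so it can be read on its own. The function names `needs_download` and `human_ancestor_urls` are hypothetical and do not appear in `download_gvanno_refdata.py`; the remote size is assumed to have been obtained elsewhere (the script uses its own `get_url_num_bytes`).

```python
import os


def needs_download(local_path, remote_num_bytes):
    """Return True unless a local file of exactly the remote size exists."""
    if not os.path.exists(local_path):
        return True
    return os.path.getsize(local_path) != remote_num_bytes


def human_ancestor_urls(host, assembly):
    """Build the three remote file URLs (FASTA plus its two index files)."""
    return [
        f"{host}{assembly}/human_ancestor.fa.{postfix}"
        for postfix in ("gz", "gz.fai", "gz.gzi")
    ]


if __name__ == "__main__":
    host = "https://personal.broadinstitute.org/konradk/loftee_data/"
    for url in human_ancestor_urls(host, "GRCh38"):
        print(url)
```

Comparing byte sizes is cheap and catches interrupted downloads, but it is a weak integrity check: a corrupted file of the right length would pass. A checksum comparison would be stricter, assuming the remote host publishes checksums.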
Binary file added examples/example_small.grch37.vcf.gz
Binary file not shown.
Binary file added examples/example_small.grch37.vcf.gz.tbi
Binary file not shown.
File renamed without changes.
File renamed without changes.