Commit 2640ad8

0.9.0 release
1 parent 8b5e122 commit 2640ad8

11 files changed: +782 -171 lines changed

README.md

Lines changed: 33 additions & 36 deletions
@@ -6,20 +6,24 @@ The germline variant annotator (*gvanno*) is a simple, Docker-based software pac

 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record.

-#### Annotation resources included in _gvanno_ - 0.8.0
+#### Annotation resources included in _gvanno_ - 0.9.0

-* [VEP v95](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor (GENCODE v29/v19 as the gene reference dataset)
-* [dBNSFP v4.0](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (February 2019)
-* [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP
-* [dbSNP build 151](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (October 2017) - from VEP
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v96 (GENCODE v30/v19 as the gene reference dataset)
+* [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.0, May 2019)
+* [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
+* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 151, October 2017) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar 20190305](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (March 2019)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (May 2019)
 * [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v6.0, January 2019)
-* [UniProt/SwissProt KnowledgeBase 2019_02](http://www.uniprot.org) - Resource on protein sequence and functional information (February 2019)
-* [Pfam v32](http://pfam.xfam.org) - Database of protein families and domains (Sept 2018)
+* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2019_04, May 2019)
+* [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v32, Sept 2018)
 * [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 13th 2019)

 ### News
+* May 21st 2019 - **0.9.0 release**
+  * Data bundle updates: ClinVar, UniProt
+  * Adding gene-disease associations from [Open Targets Platform](https://targetvalidation.org),([Carvalho-Silva et. al, NAR, 2019](https://www.ncbi.nlm.nih.gov/pubmed/30462303))
+  * Moved *vcf-validation* configuration to command-line option
 * March 21st 2019 - **0.8.0 release**
   * Data bundle updates: ClinVar, UniProt, GWAS catalog
   * Bundle bug: Missing VEP FASTA file for grch38
@@ -73,15 +77,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha

 #### STEP 2: Download *gvanno* and data bundle

-1. Download and unpack the [latest software release (0.8.0)](https://github.com/sigven/gvanno/releases/tag/v0.8.0)
+1. Download and unpack the [latest software release (0.9.0)](https://github.com/sigven/gvanno/releases/tag/v0.9.0)
 2. Download and unpack the assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](https://drive.google.com/file/d/1cJRaSD_UgeG34CnE3PHj3vxXSAAMN9Jl) (approx 14Gb)
-   * [grch38 data bundle](https://drive.google.com/file/d/1uZw5iEibKJV_9SmCusHcpzKBVzTu2pcH) (approx 15Gb)
+   * [grch37 data bundle](https://drive.google.com/open?id=1rqkzHTmPpBsVY3MvzCQdJurKuCDNf09D) (approx 14Gb)
+   * [grch38 data bundle](https://drive.google.com/open?id=13pn59FpLU7Tta7X16H2GKkOfbsclPi9I) (approx 15Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`

    A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (0.8.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
-   * `docker pull sigven/gvanno:0.8.0` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (0.9.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2Gb):
+   * `docker pull sigven/gvanno:0.9.0` (gvanno annotation engine)

 #### STEP 3: Input preprocessing

@@ -95,42 +99,35 @@ We __strongly__ recommend that the input VCF is compressed and indexed using [bg

 A few elements of the workflow can be figured using the *gvanno* configuration file (i.e. **gvanno.toml**), encoded in [TOML](https://github.com/toml-lang/toml) (an easy to read file format).

-* The initial step of the workflow performs [VCF validation](https://github.com/EBIvariation/vcf-validator) on the input VCF file. This procedure is very strict, and often causes the workflow to return an error due to various violations of the VCF specification. If the user trusts that the most critical parts of the input VCF is properly encoded, a setting in the configuration file (`vcf_validation = false`) can be used to turn off VCF validation.
-
 * Prediction of loss-of-function variants using VEP's LOFTEE plugin can be turned on in the configuration file (`lof_prediction = true`). Do note that this frequently increases the run time for VEP significantly.

 #### STEP 5: Run example

 Run the workflow with **gvanno.py**, which takes the following arguments and options:

-usage: gvanno.py [-h] [--force_overwrite] [--version]
-query_vcf gvanno_dir output_dir {grch37,grch38} configuration_file
-sample_id
+usage: gvanno.py [options] <QUERY_VCF> <GVANNO_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY> <CONFIG_FILE> <SAMPLE_ID>

-Germline variant annotation (gvanno) workflow for clinical and functional
-interpretation of germline nucleotide variants
+Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants

-positional arguments:
-query_vcf VCF input file with germline variants (SNVs/InDels)
-gvanno_dir gvanno base directory with accompanying data
-directory, e.g. ~/gvanno-0.8.0
-output_dir Output directory
-{grch37,grch38} grch37 or grch38
-configuration_file gvanno configuration file (TOML format)
-sample_id Sample identifier - prefix for output files
+positional arguments:
+query_vcf VCF input file with germline query variants (SNVs/InDels)
+gvanno_dir gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.9.0
+output_dir Output directory
+{grch37,grch38} grch37 or grch38
+configuration_file gvanno configuration file (TOML format)
+sample_id Sample identifier - prefix for output files

-optional arguments:
--h, --help show this help message and exit
---force_overwrite The script will fail with an error if the output file
-already exists. Force the overwrite of existing result
-files by using this flag (default: False)
---version show program's version number and exit
+optional arguments:
+-h, --help show this help message and exit
+--force_overwrite The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag
+--version show program's version number and exit
+--no_vcf_validate Skip validation of input VCF with Ensembl's vcf-validator


 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:

-`python ~/gvanno-0.8.0/gvanno.py ~/gvanno-0.8.0/examples/example_grch37.vcf.gz`
-` ~/gvanno-0.8.0 ~/gvanno-0.8.0/examples grch37 ~/gvanno-0.8.0/gvanno.toml example`
+`python ~/gvanno-0.9.0/gvanno.py ~/gvanno-0.9.0/examples/example_grch37.vcf.gz`
+` ~/gvanno-0.9.0 ~/gvanno-0.9.0/examples grch37 ~/gvanno-0.9.0/gvanno.toml example`


 This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder:

gvanno.py

Lines changed: 34 additions & 14 deletions
@@ -10,20 +10,23 @@
 import getpass
 import platform
 import toml
+from argparse import RawTextHelpFormatter


-gvanno_version = '0.8.0'
-db_version = 'GVANNO_DB_VERSION = 20190320'
-vep_version = '95'
+
+gvanno_version = '0.9.0'
+db_version = 'GVANNO_DB_VERSION = 20190521'
+vep_version = '96'
 global vep_assembly

 def __main__():

-    parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=argparse.ArgumentDefaultsHelpFormatter)
+    parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=RawTextHelpFormatter, usage="%(prog)s [options] <QUERY_VCF> <GVANNO_DIR> <OUTPUT_DIR> <GENOME_ASSEMBLY> <CONFIG_FILE> <SAMPLE_ID>")
     parser.add_argument('--force_overwrite', action = "store_true", help='The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag')
     parser.add_argument('--version', action='version', version='%(prog)s ' + str(gvanno_version))
+    parser.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator")
     parser.add_argument('query_vcf', help='VCF input file with germline query variants (SNVs/InDels)')
-    parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.8.0')
+    parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g. ~/gvanno-0.9.0')
     parser.add_argument('output_dir',help='Output directory')
     parser.add_argument('genome_assembly',choices = ['grch37','grch38'], help='grch37 or grch38')
     parser.add_argument('configuration_file',help='gvanno configuration file (TOML format)')
@@ -53,7 +56,7 @@ def __main__():
         gvanno_error_message(err_msg,logger)
     host_directories = verify_input_files(args.query_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger)

-    run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, gvanno_version)
+    run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.no_vcf_validate, args.genome_assembly, gvanno_version)


 def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger):
@@ -78,7 +81,7 @@ def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger)
         gvanno_error_message(err_msg, logger)


-    boolean_tags = ['vep_skip_intergenic', 'vcf_validation', 'lof_prediction']
+    boolean_tags = ['vep_skip_intergenic', 'lof_prediction']
     integer_tags = ['n_vcfanno_proc','n_vep_forks','buffer_size']
     for section in ['other']:
         if section in user_options:
@@ -246,7 +249,7 @@ def getlogger(logger_name):

     return logger

-def run_gvanno(host_directories, docker_image_version, config_options, sample_id, genome_assembly, gvanno_version):
+def run_gvanno(host_directories, docker_image_version, config_options, sample_id, no_vcf_validate, genome_assembly, gvanno_version):
     """
     Main function to run the gvanno workflow using Docker
     """
@@ -256,7 +259,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
     output_pass_vcf = 'None'
     uid = ''
     vep_assembly = 'GRCh38'
-    gencode_version = 'release 29'
+    gencode_version = 'release 30'
     if genome_assembly == 'grch37':
         gencode_version = 'release 19'
         vep_assembly = 'GRCh37'
@@ -272,6 +275,9 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
         uid = 'root'

     vepdb_dir_host = os.path.join(str(host_directories['db_dir_host']),'.vep')
+    vcf_validation = 1
+    if no_vcf_validate:
+        vcf_validation = 0
     data_dir = '/data'
     output_dir = '/workdir/output'
     vep_dir = '/usr/local/share/vep/data'
@@ -284,17 +290,31 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id
     if host_directories['input_conf_basename_host'] != 'NA':
         input_conf_docker = '/workdir/input_conf/' + str(host_directories['input_conf_basename_host'])

-    docker_command_run1 = 'NA'
+    vep_volume_mapping = str(vepdb_dir_host) + ":/usr/local/share/vep/data"
+    databundle_volume_mapping = str(host_directories['base_dir_host']) + ":/data"
+    input_vcf_volume_mapping = str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf"
+    input_conf_volume_mapping = str(host_directories['input_conf_dir_host']) + ":/workdir/input_conf"
+    output_volume_mapping = str(host_directories['output_dir_host']) + ":/workdir/output"
+
+    docker_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(vep_volume_mapping) + " -v=" + str(input_conf_volume_mapping) + " -v=" + str(output_volume_mapping)
     if host_directories['input_vcf_dir_host'] != 'NA':
-        docker_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(vepdb_dir_host) + ":/usr/local/share/vep/data -v=" + str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf -v=" + str(host_directories['input_conf_dir_host']) + ":/workdir/input_conf -v=" + str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir/output " + str(docker_image_version) + " sh -c \""
-    docker_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir " + str(docker_image_version) + " sh -c \""
+        docker_command_run1 = docker_command_run1 + " -v=" + str(input_vcf_volume_mapping)
+
+    docker_command_run1 = docker_command_run1 + " -w=/workdir/output " + str(docker_image_version) + " sh -c \""
+    docker_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(output_volume_mapping)
+    docker_command_run2 = docker_command_run2 + " -w=/workdir/output " + str(docker_image_version) + " sh -c \""
     docker_command_run_end = '\"'

-
+    logger = getlogger("gvanno-start")
+    logger.info("--- germline variant annotation (gvanno) workflow ----")
+    logger.info("Sample name: " + str(sample_id))
+    logger.info("Genome assembly: " + str(genome_assembly))
+    print()
+
     ## verify VCF and CNA segment file
     logger = getlogger('gvanno-validate-input')
     logger.info("STEP 0: Validate input data")
-    vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(genome_assembly) + docker_command_run_end
+    vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(vcf_validation) + " " + str(genome_assembly) + docker_command_run_end

     check_subprocess(vcf_validate_command)
     logger.info('Finished')
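
Note on the hunk above: the `docker run` prefix is now assembled from named host:container volume mappings, and the input VCF directory is mounted only when one was supplied. The following is a minimal standalone sketch of that pattern, not the code from this commit. The helper name `build_docker_command` and all example paths are hypothetical, and it returns an argument list rather than the concatenated string used in `run_gvanno`.

```python
import os
import shlex


def build_docker_command(host_dirs, uid, image, workdir="/workdir/output"):
    """Hypothetical helper (not part of gvanno): build a `docker run` prefix
    from host:container volume mappings, mounting the input VCF directory
    only when one was supplied."""
    vepdb_dir_host = os.path.join(host_dirs["db_dir_host"], ".vep")
    mappings = [
        (host_dirs["base_dir_host"], "/data"),
        (vepdb_dir_host, "/usr/local/share/vep/data"),
        (host_dirs["input_conf_dir_host"], "/workdir/input_conf"),
        (host_dirs["output_dir_host"], "/workdir/output"),
    ]
    # 'NA' is the placeholder the script uses when no VCF directory is given.
    if host_dirs["input_vcf_dir_host"] != "NA":
        mappings.append((host_dirs["input_vcf_dir_host"], "/workdir/input_vcf"))

    cmd = ["docker", "run", "--rm", "-t", "-u", str(uid)]
    cmd += ["-v=" + host + ":" + container for host, container in mappings]
    cmd += ["-w=" + workdir, image]
    return cmd


# Example paths are made up for illustration only.
example_dirs = {
    "base_dir_host": "/home/user/gvanno-0.9.0/data",
    "db_dir_host": "/home/user/gvanno-0.9.0/data/grch37",
    "input_conf_dir_host": "/home/user/gvanno-0.9.0",
    "output_dir_host": "/home/user/out",
    "input_vcf_dir_host": "/home/user/vcf",
}
cmd = build_docker_command(example_dirs, uid=1000, image="sigven/gvanno:0.9.0")
print(" ".join(shlex.quote(part) for part in cmd))
```

Keeping the mappings in one list makes the conditional mount a single `append`, which is the same effect the commit achieves by extending `docker_command_run1` inside the `if` block.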

gvanno.toml

Lines changed: 0 additions & 5 deletions
@@ -1,11 +1,6 @@
 # gvanno configuration options (TOML).

 [other]
-## Keep/skip VCF validation by https://github.com/EBIvariation/vcf-validator. The vcf-validator checks
-## that the input VCF is properly encoded. Since the vcf-validator is strict, and with error messages
-## that is not always self-explanatory, the users can skip validation if they are confident that the
-## most critical parts of the VCF are properly encoded
-vcf_validation = true
 ## Number of processes for vcfanno
 n_vcfanno_proc = 4
 ## Number of forks for VEP
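
Note on the hunk above: with `vcf_validation` removed from gvanno.toml, validation is toggled by the `--no_vcf_validate` command-line flag, which `run_gvanno` turns into a 1/0 value passed to `gvanno_validate_input.py`. The sketch below illustrates that flow under stated assumptions. The function names `parse_args` and `vcf_validation_flag` are hypothetical, and the parser is trimmed to the arguments shown in this commit.

```python
import argparse


def parse_args(argv):
    """Trimmed-down parser with the arguments shown in this commit."""
    parser = argparse.ArgumentParser(prog="gvanno.py")
    parser.add_argument("--no_vcf_validate", action="store_true",
                        help="Skip validation of input VCF with Ensembl's vcf-validator")
    parser.add_argument("query_vcf")
    parser.add_argument("gvanno_dir")
    parser.add_argument("output_dir")
    parser.add_argument("genome_assembly", choices=["grch37", "grch38"])
    parser.add_argument("configuration_file")
    parser.add_argument("sample_id")
    return parser.parse_args(argv)


def vcf_validation_flag(no_vcf_validate):
    """Hypothetical helper: 1 means validate, 0 means skip (same mapping as run_gvanno)."""
    return 0 if no_vcf_validate else 1


# Example invocation; the paths are placeholders.
args = parse_args(["--no_vcf_validate", "example_grch37.vcf.gz", "gvanno-0.9.0",
                   "examples", "grch37", "gvanno.toml", "example"])
print(vcf_validation_flag(args.no_vcf_validate))
```

Because the flag uses `store_true`, validation stays on by default, which matches the former `vcf_validation = true` default in the configuration file.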
