diff --git a/README.md b/README.md index c9fb1d2..8f07d70 100755 --- a/README.md +++ b/README.md @@ -4,22 +4,26 @@ The germline variant annotator (*gvanno*) is a simple, Docker-based software package intended for analysis and interpretation of human DNA variants of germline origin. It accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow is largely based on [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. -#### Annotation resources included in _gvanno_ - 0.3.1 +#### Annotation resources included in _gvanno_ - 0.4.0 -* [VEP v92](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor release 92 (GENCODE v19/v28 as the gene reference dataset) +* [VEP v93](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor (GENCODE v19/v28 as the gene reference dataset) * [dBNSFP v3.5](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (August 2017) * [gnomAD r2](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (February 2017) - from VEP * [dbSNP b150](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (February 2017) - from VEP * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP -* [ClinVar 20180603](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (June 2018) +* [ClinVar 20180906](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of clinically related variants (September 2018) * [DisGeNET](http://www.disgenet.org) - Database of gene-disease associations (v5.0, May 2017) -* [UniProt/SwissProt KnowledgeBase 2018_06](http://www.uniprot.org) - Resource on protein sequence and functional information (June 2018) +* [UniProt/SwissProt KnowledgeBase 2018_08](http://www.uniprot.org) - Resource on protein sequence and functional information (September 2018) * [Pfam v31](http://pfam.xfam.org) - Database of protein families and domains (March 2017) * [TSGene v2.0](http://bioinfo.mc.vanderbilt.edu/TSGene/) - Tumor suppressor/oncogene database (November 2015) ### News - +* September 15th - **0.4.0 release** + * VEP upgrade (v93) + * Data bundle update (ClinVar 20180906) + * Code restructuring + * Running of LofTee can be configured * July 5th 2018 - **0.3.1 release** * Data bundle updates (ClinVar, UniProt) * Addition of [VEP LofTee plugin](https://github.com/konradjk/loftee) - predicts loss-of-function variants @@ -51,15 +55,15 @@ An installation of Python (version _3.6_) is required to run *gvanno*. Check tha #### STEP 2: Download *gvanno* and data bundle -1. Download and unpack the [latest software release (0.3.1)](https://github.com/sigven/gvanno/releases/tag/v0.3.1) +1. Download and unpack the [latest software release (0.4.0)](https://github.com/sigven/gvanno/releases/tag/v0.4.0) 2. 
Download and unpack the assembly-specific data bundle in the PCGR directory * [grch37 data bundle](https://drive.google.com/file/d/15NbYwwnb8J5IGhL6-RJXpAeQ-xqzjc5F/) (approx 9Gb) * [grch38 data bundle](https://drive.google.com/file/d/1hr4MShsEh2Xf-_bBgDPi7t-vj32XrWJ0/) (approx 9Gb) * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -` A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced -3. Pull the [gvanno Docker image (0.3.1)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.5Gb): - * `docker pull sigven/gvanno:0.3.1` (gvanno annotation engine) +3. Pull the [gvanno Docker image (0.4.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.5Gb): + * `docker pull sigven/gvanno:0.4.0` (gvanno annotation engine) #### STEP 3: Input preprocessing @@ -75,6 +79,8 @@ A few elements of the workflow can be figured using the *gvanno* configuration f The initial step of the workflow performs [VCF validation](https://github.com/EBIvariation/vcf-validator) on the input VCF file. This procedure is very strict, and often causes the workflow to return an error due to various violations of the VCF specification. If the user trusts that the most critical parts of the input VCF is properly encoded, a setting in the configuration file (`vcf_validation = false`) can be used to turn off VCF validation. +Prediction of loss-of-function variants using LofTee can be turned on in the configuration file (`lof_prediction = true`). Do note that this frequently increases the run time for VEP significantly. + #### STEP 5: Run example Run the workflow with **gvanno.py**, which takes the following arguments and options: @@ -88,7 +94,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt positional arguments: gvanno_dir gvanno base directory with accompanying data - directory, e.g. ~/gvanno-0.3.1 + directory, e.g. ~/gvanno-0.4.0 output_dir Output directory {grch37,grch38} grch37 or grch38 configuration_file gvanno configuration file (TOML format) @@ -107,8 +113,8 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt The _examples_ folder contains an example VCF file. It also contains a *gvanno* configuration file. Analysis of the example VCF can be performed by the following command: -`python ~/gvanno-0.3.1/gvanno.py --input_vcf ~/gvanno-0.3.1/examples/example.vcf.gz` -` ~/gvanno-0.3.1 ~/gvanno-0.3.1/examples grch37 ~/gvanno-0.3.1/examples/gvanno_config.toml example` +`python ~/gvanno-0.4.0/gvanno.py --input_vcf ~/gvanno-0.4.0/examples/example.vcf.gz` +` ~/gvanno-0.4.0 ~/gvanno-0.4.0/examples grch37 ~/gvanno-0.4.0/examples/gvanno_config.toml example` This command will run the Docker-based *gvanno* workflow and produce the following output files in the _examples_ folder: @@ -118,9 +124,9 @@ This command will run the Docker-based *gvanno* workflow and produce the followi Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column. -Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file. - +### Documentation +Documentation of the various variant and gene annotations should be interrogated from the header of the annotated VCF file. The column names of the tab-separated values (TSV) file will be identical to the INFO tags that are documented in the VCF file. 
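As a quick way to inspect this correspondence, the minimal sketch below lists the INFO tags documented in the annotated VCF header next to the TSV column names. It is an illustration only (not part of gvanno): the file names assume the example run above (sample identifier `example`, assembly grch37), and the `cyvcf2` package is assumed to be installed on the host.

```python
#!/usr/bin/env python
# Sketch: compare INFO tags documented in the annotated VCF header with the
# column names of the corresponding TSV. File names are assumptions based on
# the example run (sample_id 'example', assembly grch37) - adjust as needed.
import csv
import gzip
from cyvcf2 import VCF

annotated_vcf = 'examples/example_gvanno_grch37.vcf.gz'
annotated_tsv = 'examples/example_gvanno_grch37.tsv.gz'  # vcf2tsv.py is run with --compress

# INFO tags documented in the annotated VCF header
info_tags = set()
vcf = VCF(annotated_vcf)
for header_element in vcf.header_iter():
    if header_element['HeaderType'] == 'INFO':
        info_tags.add(header_element['ID'])
vcf.close()

# Column names of the TSV (skip any leading comment lines)
with gzip.open(annotated_tsv, 'rt') as tsv_file:
    reader = csv.reader(tsv_file, delimiter='\t')
    columns = next(row for row in reader if row and not row[0].startswith('#'))

# Note: the fixed VCF columns (CHROM, POS, ID, REF, ALT, QUAL, FILTER) also
# appear as TSV columns in addition to the INFO-tag columns
print(str(len(info_tags)) + ' INFO tags documented in the VCF header')
print(str(len(columns)) + ' columns in the TSV')
```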
### Contact diff --git a/examples/gvanno_config.toml b/examples/gvanno_config.toml index 6cefd67..020b9bf 100644 --- a/examples/gvanno_config.toml +++ b/examples/gvanno_config.toml @@ -1,7 +1,17 @@ # gvanno configuration options (TOML). [other] +## Keep/skip VCF validation by https://github.com/EBIvariation/vcf-validator. The vcf-validator checks +## that the input VCF is properly encoded. Since the vcf-validator is strict, and with error messages +## that are not always self-explanatory, users can skip validation if they are confident that the +## most critical parts of the VCF are properly encoded vcf_validation = true +## Number of processes for vcfanno n_vcfanno_proc = 4 +## Number of forks for VEP n_vep_forks = 4 +## Ignore/skip intergenic variants vep_skip_intergenic = false +## Predict loss-of-function variants using VEP's LofTee plugin +## Note that turning this on (true) is likely to increase VEP's run time substantially lof_prediction = true diff --git a/gvanno.py b/gvanno.py index d069c62..f8d92cb 100755 --- a/gvanno.py +++ b/gvanno.py @@ -11,21 +11,25 @@ import platform import toml -version = '0.3.1' + +gvanno_version = '0.4.0' +db_version = 'GVANNO_DB_VERSION = 20180915' +vep_version = '93' +global vep_assembly def __main__(): parser = argparse.ArgumentParser(description='Germline variant annotation (gvanno) workflow for clinical and functional interpretation of germline nucleotide variants',formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument('--input_vcf', dest = "input_vcf", help='VCF input file with somatic query variants (SNVs/InDels)') parser.add_argument('--force_overwrite', action = "store_true", help='The script will fail with an error if the output file already exists. Force the overwrite of existing result files by using this flag') - parser.add_argument('--version', action='version', version='%(prog)s ' + str(version)) + parser.add_argument('--version', action='version', version='%(prog)s ' + str(gvanno_version)) parser.add_argument('gvanno_dir',help='gvanno base directory with accompanying data directory, e.g.
~/gvanno-0.2.0') parser.add_argument('output_dir',help='Output directory') parser.add_argument('genome_assembly',choices = ['grch37','grch38'], help='grch37 or grch38') parser.add_argument('configuration_file',help='gvanno configuration file (TOML format)') parser.add_argument('sample_id',help="Sample identifier - prefix for output files") - docker_image_version = 'sigven/gvanno:' + str(version) + docker_image_version = 'sigven/gvanno:' + str(gvanno_version) args = parser.parse_args() overwrite = 0 @@ -49,7 +53,7 @@ def __main__(): gvanno_error_message(err_msg,logger) host_directories = verify_input_files(args.input_vcf, args.configuration_file, config_options, args.gvanno_dir, args.output_dir, args.sample_id, args.genome_assembly, overwrite, logger) - run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, version) + run_gvanno(host_directories, docker_image_version, config_options, args.sample_id, args.genome_assembly, gvanno_version) def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger): @@ -74,7 +78,7 @@ def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger) gvanno_error_message(err_msg, logger) - boolean_tags = ['vep_skip_intergenic', 'vcf_validation'] + boolean_tags = ['vep_skip_intergenic', 'vcf_validation', 'lof_prediction'] integer_tags = ['n_vcfanno_proc','n_vep_forks'] for section in ['other']: if section in user_options: @@ -192,8 +196,7 @@ def verify_input_files(input_vcf, configuration_file, gvanno_config_options, bas f_rel_not = open(rel_notes_file,'r') compliant_data_bundle = 0 for line in f_rel_not: - version_check = 'GVANNO_DB_VERSION = 20180629' - if version_check in line: + if db_version in line: compliant_data_bundle = 1 f_rel_not.close() @@ -243,7 +246,7 @@ def getlogger(logger_name): return logger -def run_gvanno(host_directories, docker_image_version, config_options, sample_id, genome_assembly, version): +def run_gvanno(host_directories, docker_image_version, config_options, sample_id, genome_assembly, gvanno_version): """ Main function to run the gvanno workflow using Docker """ @@ -252,10 +255,11 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id output_vcf = 'None' output_pass_vcf = 'None' uid = '' - vep_version = '92' + vep_assembly = 'GRCh38' gencode_version = 'release 28' if genome_assembly == 'grch37': gencode_version = 'release 19' + vep_assembly = 'GRCh37' logger = getlogger('gvanno-get-OS') if platform.system() == 'Linux' or platform.system() == 'Darwin' or sys.platform == 'darwin' or sys.platform == 'linux2' or sys.platform == 'linux': uid = os.getuid() @@ -268,6 +272,9 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id uid = 'root' vepdb_dir_host = os.path.join(str(host_directories['db_dir_host']),'.vep') + data_dir = '/data' + output_dir = '/workdir/output' + vep_dir = '/usr/local/share/vep/data' input_vcf_docker = 'None' input_conf_docker = 'None' @@ -280,47 +287,49 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id docker_command_run1 = 'NA' if host_directories['input_vcf_dir_host'] != 'NA': docker_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(vepdb_dir_host) + ":/usr/local/share/vep/data -v=" + str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf -v=" + str(host_directories['input_conf_dir_host']) + ":/workdir/input_conf -v=" + 
str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir/output " + str(docker_image_version) + " sh -c \"" - docker_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(host_directories['base_dir_host']) + ":/data -v=" + str(host_directories['output_dir_host']) + ":/workdir/output -w=/workdir " + str(docker_image_version) + " sh -c \"" - + docker_command_run_end = '\"' + ## verify VCF and CNA segment file logger = getlogger('gvanno-validate-input') logger.info("STEP 0: Validate input data") - vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py /data " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(genome_assembly) + "\"" + vcf_validate_command = str(docker_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + str(input_conf_docker) + " " + str(genome_assembly) + docker_command_run_end + check_subprocess(vcf_validate_command) logger.info('Finished') if not input_vcf_docker == 'None': ## Define input, output and temporary file names - output_vcf = '/workdir/output/' + str(sample_id) + '_gvanno_' + str(genome_assembly) + '.vcf.gz' - output_tsv = '/workdir/output/' + str(sample_id) + '_gvanno_' + str(genome_assembly) + '.tsv' - output_pass_vcf = '/workdir/output/' + str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.vcf.gz' - output_pass_tsv = '/workdir/output/' + str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.tsv' - input_vcf_gvanno_ready = '/workdir/output/' + re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf.gz',host_directories['input_vcf_basename_host']) + output_vcf = os.path.join(output_dir, str(sample_id) + '_gvanno_' + str(genome_assembly) + '.vcf.gz') + output_tsv = os.path.join(output_dir, str(sample_id) + '_gvanno_' + str(genome_assembly) + '.tsv') + output_pass_vcf = os.path.join(output_dir, str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.vcf.gz') + output_pass_tsv = os.path.join(output_dir, str(sample_id) + '_gvanno_pass_' + str(genome_assembly) + '.tsv') + input_vcf_gvanno_ready = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf.gz',host_directories['input_vcf_basename_host'])) vep_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_vep.vcf',input_vcf_gvanno_ready) vep_vcfanno_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_vep.vcfanno.vcf',input_vcf_gvanno_ready) vep_tmp_vcf = vep_vcf + '.tmp' vep_vcfanno_annotated_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated',vep_vcfanno_vcf) + '.gz' vep_vcfanno_annotated_pass_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated.pass',vep_vcfanno_vcf) + '.gz' - fasta_assembly = "/usr/local/share/vep/data/homo_sapiens/92_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz" - vep_assembly = 'GRCh37' - if genome_assembly == 'grch38': - vep_assembly = 'GRCh38' - fasta_assembly = "/usr/local/share/vep/data/homo_sapiens/92_GRCh38/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz" - vep_options = "--vcf --check_ref --flag_pick_allele --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " --hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --plugin LoF --dir /usr/local/share/vep/data" + fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(vep_version) + "_" + str(vep_assembly), "Homo_sapiens." 
+ str(vep_assembly) + ".dna.primary_assembly.fa.gz") + vep_options = "--vcf --quiet --check_ref --flag_pick_allele --force_overwrite --species homo_sapiens --assembly " + str(vep_assembly) + " --offline --fork " + str(config_options['other']['n_vep_forks']) + " --hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomad --variant_class --regulatory --domains --symbol --protein --ccds --uniprot --appris --biotype --canonical --gencode_basic --cache --numbers --total_length --allele_number --no_escape --xref_refseq --dir /usr/local/share/vep/data" if config_options['other']['vep_skip_intergenic'] == 1: vep_options = vep_options + " --no_intergenic" - vep_main_command = str(docker_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_tmp_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + "\"" - vep_sed_command = str(docker_command_run1) + "sed -r 's/:p\.[A-Z]{1}[a-z]{2}[0-9]+=//g' " + str(vep_tmp_vcf) + " > " + str(vep_vcf) + "\"" + if config_options['other']['lof_prediction'] == 1: + vep_options = vep_options + " --plugin LoF" + vep_main_command = str(docker_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_tmp_vcf) + " " + str(vep_options) + " --fasta " + str(fasta_assembly) + docker_command_run_end + vep_sed_command = str(docker_command_run1) + "sed -r 's/:p\.[A-Z]{1}[a-z]{2}[0-9]+=//g' " + str(vep_tmp_vcf) + " > " + str(vep_vcf) + docker_command_run_end vep_bgzip_command = str(docker_command_run1) + "bgzip -f " + str(vep_vcf) + "\"" vep_tabix_command = str(docker_command_run1) + "tabix -f -p vcf " + str(vep_vcf) + ".gz" + "\"" logger = getlogger('gvanno-vep') print() - logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")") + if config_options['other']['lof_prediction'] == 1: + logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ") including loss-of-function prediction") + else: + logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (" + str(vep_version) + ", GENCODE " + str(gencode_version) + ", " + str(genome_assembly) + ")") check_subprocess(vep_main_command) check_subprocess(vep_sed_command) check_subprocess(vep_bgzip_command) @@ -331,7 +340,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id print() logger = getlogger('gvanno-vcfanno') logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, dbNSFP, UniProtKB)") - gvanno_vcfanno_command = str(docker_command_run2) + "gvanno_vcfanno.py --num_processes " + str(config_options['other']['n_vcfanno_proc']) + " --dbnsfp --clinvar --uniprot --pcgr_onco_xref " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + " /data/data/" + str(genome_assembly) + "\"" + gvanno_vcfanno_command = str(docker_command_run2) + "gvanno_vcfanno.py --num_processes " + str(config_options['other']['n_vcfanno_proc']) + " --dbnsfp --clinvar --uniprot --gvanno_xref " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + " " + os.path.join(data_dir, "data", str(genome_assembly)) + docker_command_run_end check_subprocess(gvanno_vcfanno_command) logger.info("Finished") @@ -339,7 +348,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id print() logger = getlogger("gvanno-summarise") logger.info("STEP 3: Gene annotations
with gvanno-summarise") - gvanno_summarise_command = str(docker_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz /data/data/" + str(genome_assembly) + "\"" + gvanno_summarise_command = str(docker_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + os.path.join(data_dir, "data", str(genome_assembly)) + docker_command_run_end check_subprocess(gvanno_summarise_command) logger.info("Finished") @@ -347,7 +356,7 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id create_output_vcf_command2 = str(docker_command_run2) + 'mv ' + str(vep_vcfanno_annotated_vcf) + '.tbi ' + str(output_vcf) + '.tbi' + "\"" create_output_vcf_command3 = str(docker_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + ' ' + str(output_pass_vcf) + "\"" create_output_vcf_command4 = str(docker_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + '.tbi ' + str(output_pass_vcf) + '.tbi' + "\"" - clean_command = str(docker_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_annotated_vcf) + ' ' + str(vep_vcfanno_annotated_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_gvanno_ready) + "* " + "\"" + clean_command = str(docker_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_annotated_vcf) + ' ' + str(vep_vcfanno_annotated_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_gvanno_ready) + "* " + docker_command_run_end check_subprocess(create_output_vcf_command1) check_subprocess(create_output_vcf_command2) check_subprocess(create_output_vcf_command3) @@ -357,8 +366,8 @@ def run_gvanno(host_directories, docker_image_version, config_options, sample_id print() logger = getlogger("gvanno-vcf2tsv") logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsv") - gvanno_vcf2tsv_command_pass = str(docker_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + "\"" - gvanno_vcf2tsv_command_all = str(docker_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + "\"" + gvanno_vcf2tsv_command_pass = str(docker_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end + gvanno_vcf2tsv_command_all = str(docker_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only") check_subprocess(gvanno_vcf2tsv_command_pass) logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants)") diff --git a/src/Dockerfile b/src/Dockerfile index 7c62ead..6d69e4b 100755 --- a/src/Dockerfile +++ b/src/Dockerfile @@ -1,197 +1,13 @@ ############################################################ # Dockerfile to build gvanno (germline variant annotator) # Main software components: -# 1. Variant Effect Predictor (VEP 92) +# 1. Variant Effect Predictor (VEP 93) # 2. vcfanno (0.2.9) # 3. EBI-variation's vcf_validator -# 4. vt (https://github.com/atks/vt) - tool set for short variant discovery in genetic sequence data +# 4. 
vt (https://github.com/atks/vt) - tool set for short +# variant discovery in genetic sequence data ############################################################ - -# FROM ubuntu:latest - -# RUN apt-get update \ -# && apt-get install -y python3-pip python3-dev \ -# && cd /usr/local/bin \ -# && ln -s /usr/bin/python3 python \ -# && pip3 install --upgrade pip - -# # RUN apt-get update \ -# # && apt-get install -y software-properties-common curl \ -# # && add-apt-repository ppa:jonathonf/python-3.6 \ -# # && apt-get remove -y software-properties-common \ -# # && apt autoremove -y \ -# # && apt-get update \ -# # && apt-get install -y python3.6 \ -# # && apt-get install -y python3.6-dev \ -# # && curl -o /tmp/get-pip.py "https://bootstrap.pypa.io/get-pip.py" \ -# # && python3.6 /tmp/get-pip.py \ -# # && apt-get remove -y curl \ -# # && apt autoremove -y \ -# # && rm -rf /var/lib/apt/lists/* - -# RUN apt-get update && apt-get -y install apache2 build-essential cpanminus curl git libmysqlclient-dev libpng-dev libssl-dev manpages mysql-client openssl perl perl-base unzip vim wget sudo -# # install ensembl dependencies -# RUN cpanm Test::Object PPI::Document Task::Weaken Test::SubCalls Test::Object DBI DBD::mysql Archive::Zip Perl::Critic - -# RUN sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 -# RUN sudo apt-get update \ -# && sudo apt-get -y install software-properties-common -# RUN sudo add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/linux/ubuntu xenial/' - -# USER root -# WORKDIR / - -# RUN apt-get install apt-transport-https - -# RUN apt-get update && apt-get install -y --no-install-recommends \ -# littler \ -# r-base \ -# r-base-dev \ -# r-recommended - - -# ENV PACKAGE_BIO="samtools libhts1 bedtools" -# ENV PACKAGE_DEV="gfortran gcc-multilib autoconf zlib1g-dev liblzma-dev libncurses5-dev libblas-dev liblapack-dev libcurl4-gnutls-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev" -# #numpy cython scipy transvar bx-python pyvcf cyvcf cyvcf2 biopython crossmap pandas" -# ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2 toml" - - -# RUN apt-get update \ -# && apt-get install -y --no-install-recommends \ -# apt-utils nano ca-certificates ed less locales vim-tiny fonts-texgyre \ -# $PACKAGE_DEV $PACKAGE_BIO \ -# && rm -rf /var/lib/apt/lists/* - - -# ## Install vcfanno version 0.2.8 -# RUN wget https://github.com/brentp/vcfanno/releases/download/v0.2.9/vcfanno_linux64 && \ -# mv vcfanno_linux64 vcfanno && \ -# mv vcfanno /usr/local/bin && \ -# chmod 755 /usr/local/bin/vcfanno - -# ## Install Ensembl's Vcf-validator -# RUN wget https://github.com/EBIvariation/vcf-validator/releases/download/v0.6/vcf_validator && \ -# mv vcf_validator /usr/local/bin && \ -# chmod 755 /usr/local/bin/vcf_validator - - -# ## Install tools used for compilation -# RUN sudo -H pip install --upgrade pip -# RUN sudo -H pip install -U setuptools -# RUN sudo -H pip install $PYTHON_MODULES -# # -# RUN wget http://ab-initio.mit.edu/nlopt/nlopt-2.4.2.tar.gz \ -# && gzip -dc nlopt-2.4.2.tar.gz | tar xvf - \ -# && cd nlopt-2.4.2 \ -# && ./configure \ -# && make \ -# && make install - - -# RUN apt-get update \ -# && apt-get install -y --no-install-recommends libpq-dev libxt-dev libudunits2-dev - -# ###### INSTALL VARIANT EFFECT PREDICTOR (VEP) -# # create vep user -# RUN useradd -r -m -U -d /home/vep -s /bin/bash -c "VEP User" -p '' vep -# RUN usermod -a -G sudo vep -# USER vep -# ENV HOME /home/vep -# WORKDIR $HOME - -# # clone git 
repositories -# RUN mkdir -p src -# WORKDIR $HOME/src -# RUN git clone https://github.com/Ensembl/ensembl.git -# RUN git clone https://github.com/Ensembl/ensembl-vep.git - -# # get VEP dependencies -# WORKDIR $HOME/src -# RUN ensembl-vep/travisci/get_dependencies.sh -# ENV PERL5LIB $PERL5LIB:$HOME/src/bioperl-live-release-1-6-924 -# ENV KENT_SRC $HOME/src/kent-335_base/src -# ENV HTSLIB_DIR $HOME/src/htslib -# ENV MACHTYPE x86_64 -# ENV CFLAGS "-fPIC" -# ENV DEPS $HOME/src - -# # and run the complilation/install as root -# USER root -# RUN ensembl-vep/travisci/build_c.sh - -# # install htslib binaries (need bgzip, tabix) -# WORKDIR $HTSLIB_DIR -# RUN make install - -# # install bioperl-ext, faster alignments for haplo -# WORKDIR $HOME/src -# RUN git clone https://github.com/bioperl/bioperl-ext.git -# WORKDIR bioperl-ext/Bio/Ext/Align/ -# RUN perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL -# RUN perl Makefile.PL -# RUN make -# RUN make install - -# # install perl dependencies -# WORKDIR $HOME/src -# RUN cpanm --installdeps --with-recommends --notest --cpanfile ensembl/cpanfile . -# RUN cpanm --installdeps --with-recommends --notest --cpanfile ensembl-vep/cpanfile . - -# # switch back to vep user -# USER vep - -# # update bash profile -# RUN echo >> $HOME/.profile && \ -# echo PATH=$HOME/src/ensembl-vep:\$PATH >> $HOME/.profile && \ -# echo export PATH >> $HOME/.profile - -# # setup environment -# ENV PATH $HOME/src/ensembl-vep:$PATH - -# # run INSTALL.pl -# WORKDIR $HOME/src/ensembl-vep -# RUN chmod u+x *.pl -# RUN ./INSTALL.pl -a ap -g dbNSFP,dbscSNV,miRNA -l - -# ##### END INSTALL VEP - -# USER root -# WORKDIR / -# RUN git clone https://github.com/atks/vt.git -# WORKDIR vt -# RUN make -# RUN make test -# RUN cp vt /usr/local/bin - -# # Add local gvanno Python scripts/libraries -# ADD gvanno.tgz / -# ENV PATH=$PATH:/gvanno -# ENV PYTHONPATH=:/gvanno/lib:${PYTHONPATH} -# ENV VCFANNO_DATA_DOCKER="/data" - -# ## Clean Up -# RUN apt-get clean autoclean -# RUN rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/* -# RUN rm -rf /var/lib/{dpkg,cache,log} - -# VOLUME /workdir -# WORKDIR /workdir/ -# USER root -# RUN mkdir /data && chmod 777 /data -# WORKDIR /data -# VOLUME /data - -# USER root -# WORKDIR / -# RUN rm -f nlopt-2.4.2.tar.gz -# RUN rm -rf $HOME/src/ensembl-vep/t/ -# RUN rm -f $HOME/src/v335_base.tar.gz -# RUN rm -f $HOME/src/release-1-6-924.zip -# RUN rm -rf /vt - - - FROM ubuntu:xenial RUN apt-get update \ @@ -200,13 +16,11 @@ RUN apt-get update \ && ln -s /usr/bin/python3 python \ && pip3 install --upgrade pip -RUN apt-get update && apt-get -y install apache2 apt-utils build-essential cpanminus curl git libmysqlclient-dev libpng-dev libssl-dev manpages mysql-client openssl perl perl-base unzip vim wget sudo +RUN apt-get update && apt-get -y install apache2 apt-utils build-essential cpanminus curl git libmysqlclient-dev libpng-dev libssl-dev manpages mysql-client openssl perl perl-base unzip vim wget sudo # install ensembl dependencies -RUN cpanm Test::Object PPI::Document Task::Weaken Test::SubCalls Test::Object DBI DBD::mysql Archive::Zip Perl::Critic - +RUN cpanm Test::Object PPI::Document Task::Weaken Test::SubCalls Test::Object DBI DBD::mysql Archive::Zip Perl::Critic Set::IntervalTree RUN apt-get update && apt-get install apt-transport-https - RUN sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys E298A3A825C0D65DFD57CBB651716619E084DAB9 RUN sudo apt-get update \ && sudo apt-get -y install software-properties-common @@ -215,10 +29,8 @@ RUN sudo 
add-apt-repository 'deb [arch=amd64,i386] https://cran.rstudio.com/bin/ USER root WORKDIR / - -ENV PACKAGE_BIO="samtools libhts1 bedtools" +ENV PACKAGE_BIO="libhts1 bedtools" ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev" -#numpy cython scipy transvar bx-python pyvcf cyvcf cyvcf2 biopython crossmap pandas" ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2 toml" RUN apt-get update \ && apt-get install -y --no-install-recommends \ @@ -236,12 +48,17 @@ RUN wget https://github.com/brentp/vcfanno/releases/download/v0.2.9/vcfanno_linu ## Install Ensembl's Vcf-validator RUN wget https://github.com/EBIvariation/vcf-validator/releases/download/v0.6/vcf_validator && \ -mv vcf_validator /usr/local/bin && \ +mv vcf_validator /usr/local/bin/ && \ chmod 755 /usr/local/bin/vcf_validator USER root WORKDIR / +RUN wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 +RUN bunzip2 -dc samtools-1.9.tar.bz2 | tar xvf - +RUN cd samtools-1.9 && ./configure --prefix=/usr/local/bin && make && make install + +WORKDIR / ## Install tools used for compilation RUN sudo -H pip install --upgrade pip @@ -255,34 +72,54 @@ RUN wget http://ab-initio.mit.edu/nlopt/nlopt-2.4.2.tar.gz \ && make \ && make install - RUN apt-get update \ && apt-get install -y --no-install-recommends libpq-dev libxt-dev libudunits2-dev - ###### INSTALL VARIANT EFFECT PREDICTOR (VEP) +RUN apt-get update && apt-get -y install \ + apache2 \ + build-essential \ + cpanminus \ + curl \ + git \ + libmysqlclient-dev \ + libpng-dev \ + libssl-dev \ + locales \ + manpages \ + mysql-client \ + openssl \ + perl \ + perl-base \ + unzip \ + vim \ + wget + +# install ensembl dependencies +RUN cpanm DBI DBD::mysql + # create vep user -RUN useradd -r -m -U -d /home/vep -s /bin/bash -c "VEP User" -p '' vep +RUN useradd -r -m -U -d /opt/vep -s /bin/bash -c "VEP User" -p '' vep RUN usermod -a -G sudo vep USER vep -ENV HOME /home/vep -WORKDIR $HOME +ENV OPT /opt/vep +WORKDIR $OPT # clone git repositories RUN mkdir -p src -WORKDIR $HOME/src +WORKDIR $OPT/src RUN git clone https://github.com/Ensembl/ensembl.git RUN git clone https://github.com/Ensembl/ensembl-vep.git - # get VEP dependencies -WORKDIR $HOME/src + +WORKDIR $OPT/src RUN ensembl-vep/travisci/get_dependencies.sh -ENV PERL5LIB $PERL5LIB:$HOME/src/bioperl-live-release-1-6-924 -ENV KENT_SRC $HOME/src/kent-335_base/src -ENV HTSLIB_DIR $HOME/src/htslib +ENV PERL5LIB $PERL5LIB:$OPT/src/bioperl-live-release-1-6-924 +ENV KENT_SRC $OPT/src/kent-335_base/src +ENV HTSLIB_DIR $OPT/src/htslib ENV MACHTYPE x86_64 ENV CFLAGS "-fPIC" -ENV DEPS $HOME/src +ENV DEPS $OPT/src # and run the complilation/install as root USER root @@ -293,7 +130,7 @@ WORKDIR $HTSLIB_DIR RUN make install # install bioperl-ext, faster alignments for haplo -WORKDIR $HOME/src +WORKDIR $OPT/src RUN git clone https://github.com/bioperl/bioperl-ext.git WORKDIR bioperl-ext/Bio/Ext/Align/ RUN perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL @@ -301,31 +138,50 @@ RUN perl Makefile.PL RUN make RUN make install +# install ensembl-xs, faster run using re-implementation in C of some of the Perl subroutines +WORKDIR $OPT/src +RUN wget https://github.com/Ensembl/ensembl-xs/archive/2.3.2.zip -O ensembl-xs.zip +RUN unzip -q ensembl-xs.zip +RUN mv ensembl-xs-2.3.2 ensembl-xs +RUN rm ensembl-xs.zip +WORKDIR ensembl-xs +RUN perl Makefile.PL +RUN make +RUN make install + # 
install perl dependencies -WORKDIR $HOME/src +WORKDIR $OPT/src RUN cpanm --installdeps --with-recommends --notest --cpanfile ensembl/cpanfile . RUN cpanm --installdeps --with-recommends --notest --cpanfile ensembl-vep/cpanfile . +# configure locale, see https://github.com/rocker-org/rocker/issues/19 +RUN echo "en_US.UTF-8 UTF-8" >> /etc/locale.gen && \ +locale-gen en_US.utf8 && \ +/usr/sbin/update-locale LANG=en_US.UTF-8 + +ENV LC_ALL en_US.UTF-8 +ENV LANG en_US.UTF-8 + # switch back to vep user USER vep # update bash profile -RUN echo >> $HOME/.profile && \ -echo PATH=$HOME/src/ensembl-vep:\$PATH >> $HOME/.profile && \ -echo export PATH >> $HOME/.profile +RUN echo >> $OPT/.profile && \ +echo PATH=$OPT/src/ensembl-vep:\$PATH >> $OPT/.profile && \ +echo export PATH >> $OPT/.profile # setup environment -ENV PATH $HOME/src/ensembl-vep:$PATH +ENV PATH $OPT/src/ensembl-vep:$PATH # run INSTALL.pl -WORKDIR $HOME/src/ensembl-vep +WORKDIR $OPT/src/ensembl-vep RUN chmod u+x *.pl RUN ./INSTALL.pl -a ap -g miRNA,LoF -l WORKDIR / -ADD loftee_custom.tgz $HOME/src/ensembl-vep/modules +ADD loftee.tgz $OPT/src/ensembl-vep/modules -##### END INSTALL VEP +## END INSTALL VEP ADD gvanno.tgz / @@ -363,3 +219,5 @@ RUN rm -rf $HOME/src/ensembl-vep/t/ RUN rm -f $HOME/src/v335_base.tar.gz RUN rm -f $HOME/src/release-1-6-924.zip RUN rm -rf /vt + + diff --git a/src/build.sh b/src/build.sh index 19ce6e1..f590cbf 100755 --- a/src/build.sh +++ b/src/build.sh @@ -1,5 +1,7 @@ +cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/ +cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/ tar czvfh gvanno.tgz gvanno/ echo "Build the Docker Image" TAG=`date "+%Y%m%d"` -docker build -t sigven/gvanno:$TAG --rm=true . +docker build --no-cache -t sigven/gvanno:$TAG --rm=true . 
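The net effect of the configuration-related changes above (the new `lof_prediction` option in `examples/gvanno_config.toml` and the corresponding handling in `gvanno.py`) is that the LofTee plugin is only passed to VEP when explicitly enabled. A condensed, illustrative sketch of that logic follows; it is not gvanno source code, the helper name `build_vep_options` is invented for the example, and the VEP option list is abbreviated.

```python
#!/usr/bin/env python
# Illustrative sketch (not part of gvanno): how options in examples/gvanno_config.toml
# are turned into VEP command-line options in the 0.4.0 workflow. The helper name
# build_vep_options is invented here; the option list is abbreviated.
import toml

def build_vep_options(configuration_file, genome_assembly):
    config_options = toml.load(configuration_file)
    other = config_options['other']
    vep_assembly = 'GRCh37' if genome_assembly == 'grch37' else 'GRCh38'
    vep_options = ('--vcf --quiet --check_ref --flag_pick_allele --force_overwrite '
                   '--species homo_sapiens --assembly ' + vep_assembly +
                   ' --offline --fork ' + str(other['n_vep_forks']) +
                   ' --hgvs --af --af_1kg --af_gnomad --variant_class '
                   '--cache --dir /usr/local/share/vep/data')
    if other['vep_skip_intergenic']:
        vep_options += ' --no_intergenic'
    if other['lof_prediction']:
        # LofTee-based loss-of-function prediction is optional in 0.4.0;
        # enabling it typically increases VEP's run time substantially
        vep_options += ' --plugin LoF'
    return vep_options

if __name__ == '__main__':
    # Example usage with the bundled example configuration
    print(build_vep_options('examples/gvanno_config.toml', 'grch37'))
```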
diff --git a/src/gvanno.tgz b/src/gvanno.tgz index 1aca021..fe753a2 100644 Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py index 133cf65..a7a1138 100755 --- a/src/gvanno/gvanno_summarise.py +++ b/src/gvanno/gvanno_summarise.py @@ -5,11 +5,10 @@ import argparse from cyvcf2 import VCF, Writer import gzip -import dbnsfp import os -import gvannoutils +import annoutils -logger = gvannoutils.getlogger('gvanno-gene-annotate') +logger = annoutils.getlogger('gvanno-gene-annotate') csv.field_size_limit(500 * 1024 * 1024) threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'X'} @@ -23,131 +22,17 @@ def __main__(): extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir) -def threeToOneAA(aa_change): - - for three_letter_aa in threeLettertoOneLetterAA.keys(): - aa_change = aa_change.replace(three_letter_aa,threeLettertoOneLetterAA[three_letter_aa]) - - aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)','fs',aa_change) - return aa_change - -def map_variant_effect_predictors(rec, algorithms): - - dbnsfp_predictions = dbnsfp.map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms) - if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None: - return - gene_id = str(rec.INFO.get('Gene')) - consequence = str(rec.INFO.get('Consequence')) - - dbnsfp_key = '' - - if not rec.INFO.get('HGVSp_short') is None: - aa_change = str(rec.INFO.get('HGVSp_short')) - dbnsfp_key = gene_id + ':' + str(aa_change) - else: - if re.search('splice_site',consequence): - dbnsfp_key = gene_id - - if dbnsfp_key != '': - if dbnsfp_key in dbnsfp_predictions: - rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key] - -def set_coding_change(rec): - - for m in ['HGVSp_short','CDS_CHANGE']: - rec.INFO[m] = '.' - if not rec.INFO.get('HGVSc') is None: - if rec.INFO.get('HGVSc') != '.': - if 'splice_acceptor_variant' in rec.INFO.get('Consequence') or 'splice_donor_variant' in rec.INFO.get('Consequence'): - key = str(rec.INFO.get('Consequence')) + ':' + str(rec.INFO.get('HGVSc')) - rec.INFO['CDS_CHANGE'] = key - if rec.INFO.get('Amino_acids') is None or rec.INFO.get('Protein_position') is None or rec.INFO.get('Consequence') is None: - return - if not rec.INFO.get('Protein_position') is None: - if rec.INFO.get('Protein_position').startswith('-'): - return - - protein_change = '.' - if '/' in rec.INFO.get('Protein_position'): - protein_position = str(rec.INFO.get('Protein_position').split('/')[0]) - if '-' in protein_position: - if protein_position.split('-')[0].isdigit(): - rec.INFO['AMINO_ACID_START'] = protein_position.split('-')[0] - if protein_position.split('-')[1].isdigit(): - rec.INFO['AMINO_ACID_END'] = protein_position.split('-')[1] - else: - if protein_position.isdigit(): - rec.INFO['AMINO_ACID_START'] = protein_position - rec.INFO['AMINO_ACID_END'] = protein_position - - if not rec.INFO.get('HGVSp') is None: - if rec.INFO.get('HGVSp') != '.': - if ':' in rec.INFO.get('HGVSp'): - protein_identifier = str(rec.INFO.get('HGVSp').split(':')[0]) - if protein_identifier.startswith('ENSP'): - protein_change_VEP = str(rec.INFO.get('HGVSp').split(':')[1]) - protein_change = threeToOneAA(protein_change_VEP) - - if 'synonymous_variant' in rec.INFO.get('Consequence'): - protein_change = 'p.' 
+ str(rec.INFO.get('Amino_acids')) + str(protein_position) + str(rec.INFO.get('Amino_acids')) - if 'stop_lost' in str(rec.INFO.get('Consequence')) and '/' in str(rec.INFO.get('Amino_acids')): - protein_change = 'p.X' + str(protein_position) + str(rec.INFO.get('Amino_acids')).split('/')[1] - - rec.INFO['HGVSp_short'] = protein_change - exon_number = 'NA' - if not rec.INFO.get('EXON') is None: - if rec.INFO.get('EXON') != '.': - if '/' in rec.INFO.get('EXON'): - exon_number = str(rec.INFO.get('EXON')).split('/')[0] - - if not rec.INFO.get('HGVSc') is None: - if rec.INFO.get('HGVSc') != '.': - if protein_change != '.': - key = str(rec.INFO.get('Consequence')) + ':' + str(rec.INFO.get('HGVSc')) + ':exon' + str(exon_number) + ':' + str(protein_change) - rec.INFO['CDS_CHANGE'] = key - - return - -def write_pass_vcf(annotated_vcf): - - out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf) - vcf = VCF(annotated_vcf) - w = Writer(out_vcf, vcf) - - num_rejected = 0 - num_pass = 0 - for rec in vcf: - if rec.FILTER is None or rec.FILTER == 'None': - w.write_record(rec) - num_pass += 1 - else: - num_rejected +=1 - - vcf.close() - w.close() - - logger.info('Number of non-PASS/REJECTED variant calls: ' + str(num_rejected)) - logger.info('Number of PASSed variant calls: ' + str(num_pass)) - if num_pass == 0: - logger.warning('There are zero variants with a \'PASS\' filter in the VCF file') - os.system('bgzip -dc ' + str(annotated_vcf) + ' egrep \'^#\' > ' + str(out_vcf)) - #else: - os.system('bgzip -f ' + str(out_vcf)) - os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') - - return - def extend_vcf_annotations(query_vcf, gvanno_db_directory): """ Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from 1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc. - 2. Gene annotations, e.g. known oncogenes/tumor suppressors, MIM phenotype associations etc + 2. Gene annotations, e.g. known oncogenes/tumor suppressors, curated disease associations (DisGenet), MIM phenotype associations etc 3. Protein-relevant annotations, e.g. c functional protein features etc. 4. Variant effect predictions """ ## read VEP and PCGR tags to be appended to VCF file - gvanno_vcf_infotags_meta = gvannoutils.read_infotag_file(os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) + gvanno_vcf_infotags_meta = annoutils.read_infotag_file(os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf) vep_to_gvanno_af = {'gnomAD_AMR_AF':'AMR_AF_GNOMAD','gnomAD_AFR_AF':'AFR_AF_GNOMAD','gnomAD_EAS_AF':'EAS_AF_GNOMAD','gnomAD_NFE_AF':'NFE_AF_GNOMAD','gnomAD_AF':'GLOBAL_AF_GNOMAD', @@ -189,7 +74,7 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): w = Writer(out_vcf, vcf) current_chrom = None num_chromosome_records_processed = 0 - pcgr_onco_xref_map = {'SYMBOL':1, 'ENTREZ_ID':2, 'UNIPROT_ID':3, 'APPRIS':4,'UNIPROT_ACC':5,'CHORUM_ID':6,'TUMOR_SUPPRESSOR':7,'ONCOGENE':8,'DISGENET_CUI':10,'MIM_PHENOTYPE_ID':14} + pcgr_onco_xref_map = {'SYMBOL':1, 'ENTREZ_ID':2, 'UNIPROT_ID':3, 'APPRIS':4,'UNIPROT_ACC':5,'CHORUM_ID':6,'TUMOR_SUPPRESSOR':7,'ONCOGENE':8,'DISGENET_CUI':9,'MIM_PHENOTYPE_ID':10} for rec in vcf: all_transcript_consequences = [] if current_chrom is None: @@ -206,19 +91,19 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): variant_id = 'g.' 
+ str(rec.CHROM) + ':' + str(pos) + str(rec.REF) + '>' + alt_allele logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (vep_skip_intergenic in config set to true?) - variant will be skipped') continue - pcgr_onco_xref = {} + gvanno_xref = {} num_chromosome_records_processed += 1 - if not rec.INFO.get('PCGR_ONCO_XREF') is None: - for transcript_onco_xref in rec.INFO.get('PCGR_ONCO_XREF').split(','): + if not rec.INFO.get('gvanno_xref') is None: + for transcript_onco_xref in rec.INFO.get('GVANNO_XREF').split(','): xrefs = transcript_onco_xref.split('|') ensembl_transcript_id = str(xrefs[0]) - pcgr_onco_xref[ensembl_transcript_id] = {} + gvanno_xref[ensembl_transcript_id] = {} for annotation in pcgr_onco_xref_map.keys(): annotation_index = pcgr_onco_xref_map[annotation] if annotation_index > (len(xrefs) - 1): continue if xrefs[annotation_index] != '': - pcgr_onco_xref[ensembl_transcript_id][annotation] = xrefs[annotation_index] + gvanno_xref[ensembl_transcript_id][annotation] = xrefs[annotation_index] for identifier in ['CSQ','DBNSFP']: if identifier == 'CSQ': num_picks = 0 @@ -234,15 +119,15 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): rec.INFO[vep_csq_index2fields[j]] = str(csq_fields[j]) if vep_csq_index2fields[j] == 'Feature': ensembl_transcript_id = str(csq_fields[j]) - if ensembl_transcript_id in pcgr_onco_xref: + if ensembl_transcript_id in gvanno_xref: for annotation in pcgr_onco_xref_map.keys(): if annotation == 'UNIPROT_ACC': continue - if annotation in pcgr_onco_xref[ensembl_transcript_id]: + if annotation in gvanno_xref[ensembl_transcript_id]: if annotation == 'TUMOR_SUPPRESSOR' or annotation == 'ONCOGENE': rec.INFO[annotation] = True else: - rec.INFO[annotation] = pcgr_onco_xref[ensembl_transcript_id][annotation] + rec.INFO[annotation] = gvanno_xref[ensembl_transcript_id][annotation] if vep_csq_index2fields[j] == 'DOMAINS': domain_identifiers = str(csq_fields[j]).split('&') for v in domain_identifiers: @@ -257,14 +142,14 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): if v.startswith('COSM'): cosmic_identifiers.append(v) if v.startswith('rs'): - dbsnp_identifiers.append(re.sub('^rs','',v)) + dbsnp_identifiers.append(v) if len(cosmic_identifiers) > 0: rec.INFO['COSMIC_MUTATION_ID'] = '&'.join(cosmic_identifiers) if len(dbsnp_identifiers) > 0: rec.INFO['DBSNPRSID'] = '&'.join(dbsnp_identifiers) j = j + 1 - set_coding_change(rec) + annoutils.set_coding_change(rec) symbol = '.' 
if csq_fields[vep_csq_fields2index['SYMBOL']] != "": symbol = str(csq_fields[vep_csq_fields2index['SYMBOL']]) @@ -273,7 +158,7 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): if identifier == 'DBNSFP': if not rec.INFO.get('DBNSFP') is None: - map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) + annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) rec.INFO['VEP_ALL_CONSEQUENCE'] = ','.join(all_transcript_consequences) w.write_record(rec) w.close() @@ -285,11 +170,11 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory): os.system('bgzip -f ' + str(out_vcf)) os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') annotated_vcf = out_vcf + '.gz' - write_pass_vcf(annotated_vcf) + annoutils.write_pass_vcf(annotated_vcf, logger) else: - gvannoutils.gvanno_error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) + annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) else: - gvannoutils.gvanno_error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) + annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) if __name__=="__main__": __main__() diff --git a/src/gvanno/gvanno_validate_input.py b/src/gvanno/gvanno_validate_input.py index 3d6dfcb..264d40a 100755 --- a/src/gvanno/gvanno_validate_input.py +++ b/src/gvanno/gvanno_validate_input.py @@ -7,7 +7,7 @@ import subprocess import logging import sys -import gvannoutils +import annoutils import pandas as np import toml from cyvcf2 import VCF @@ -26,12 +26,6 @@ def __main__(): if ret != 0: sys.exit(-1) -def gvanno_error_message(message, logger): - logger.error('') - logger.error(message) - logger.error('') - return -1 - def is_valid_vcf(input_vcf, logger): """ Function that reads the output file of EBIvariation/vcf-validator and reports potential errors and validation status @@ -61,13 +55,13 @@ def is_valid_vcf(input_vcf, logger): os.system('rm -f ' + str(vcf_validation_output_file)) else: err_msg = str(vcf_validation_output_file) + ' does not exist' - return gvanno_error_message(err_msg, logger) + return annoutils.error_message(err_msg, logger) if validation_results['validation_status'] == 0: error_string_42 = '\n'.join(validation_results['error_messages']) validation_status = 'According to the VCF specification, the VCF file (' + str(input_vcf) + ') is NOT valid' err_msg = validation_status + ':\n' + str(error_string_42) - return gvanno_error_message(err_msg, logger) + return annoutils.error_message(err_msg, logger) else: validation_status = 'According to the VCF specification, the VCF file ' + str(input_vcf) + ' is valid' logger.info(validation_status) @@ -80,7 +74,7 @@ def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, l If any coinciding tags, an error will be returned """ - gvanno_infotags_desc = gvannoutils.read_infotag_file(os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) + gvanno_infotags_desc = annoutils.read_infotag_file(os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) vcf = VCF(input_vcf) logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') @@ -91,7 +85,7 @@ def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, l if header_element['HeaderType'] == 'INFO': if 
header_element['ID'] in gvanno_infotags_desc.keys(): err_msg = 'INFO tag ' + str(header_element['ID']) + ' in the query VCF coincides with a VCF annotation tag produced by gvanno - please remove or rename this tag in your query VCF' - return gvanno_error_message(err_msg, logger) + return annoutils.error_message(err_msg, logger) logger.info('No query VCF INFO tags coincide with gvanno INFO tags') return ret @@ -150,8 +144,8 @@ def validate_gvanno_input(gvanno_directory, input_vcf, configuration_file, genom 3. Check that if VCF have variants with multiple alternative alleles (e.g. 'A,T') run vt decompose 4. Any genotype data from VCF input file is stripped, and the resulting VCF file is sorted and indexed (bgzip + tabix) """ - logger = gvannoutils.getlogger('gvanno-validate-input') - config_options = gvannoutils.read_config_options(configuration_file, gvanno_directory, genome_assembly, logger) + logger = annoutils.getlogger('gvanno-validate-input') + config_options = annoutils.read_config_options(configuration_file, gvanno_directory, genome_assembly, logger, wflow = 'gvanno') if not input_vcf == 'None': diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index 2353ef4..7695bf9 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -3,12 +3,12 @@ import argparse from cyvcf2 import VCF import random -import gvannoutils +import annoutils import os import re import sys -logger = gvannoutils.getlogger('gvanno-vcfanno') +logger = annoutils.getlogger('gvanno-vcfanno') def __main__(): @@ -20,14 +20,14 @@ def __main__(): parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar") parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions") parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from the UniProt Knowledgebase") - parser.add_argument("--pcgr_onco_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, cancer gene associations, etc)") + parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)") args = parser.parse_args() query_info_tags = get_vcf_info_tags(args.query_vcf) vcfheader_file = args.out_vcf + '.tmp.' 
+ str(random.randrange(0,10000000)) + '.header.txt' conf_fname = args.out_vcf + '.tmp.conf.toml' print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False) - run_vcfanno(args.num_processes, args.query_vcf, query_info_tags, vcfheader_file, args.gvanno_db_dir, conf_fname, args.out_vcf, args.clinvar, args.dbnsfp, args.uniprot, args.pcgr_onco_xref) + run_vcfanno(args.num_processes, args.query_vcf, query_info_tags, vcfheader_file, args.gvanno_db_dir, conf_fname, args.out_vcf, args.clinvar, args.dbnsfp, args.uniprot, args.gvanno_xref) def prepare_vcfanno_configuration(vcfanno_data_directory, conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource): @@ -37,14 +37,14 @@ def prepare_vcfanno_configuration(vcfanno_data_directory, conf_fname, vcfheader_ append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, conf_fname) append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file) -def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvanno_db_directory, conf_fname, output_vcf, clinvar, dbnsfp, uniprot, pcgr_onco_xref): +def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvanno_db_directory, conf_fname, output_vcf, clinvar, dbnsfp, uniprot, gvanno_xref): """ Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files """ - clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMIDS","CLINVAR_SIG","CLINVAR_VARIANT_ORIGIN", "CLINVAR_MEDGEN_CUI"] + clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_MEDGEN_CUI","CLINVAR_MEDGEN_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_HGSVP"] dbnsfp_info_tags = ["DBNSFP"] uniprot_info_tags = ["UNIPROT_FEATURE"] - pcgr_onco_xref_info_tags = ["PCGR_ONCO_XREF"] + gvanno_xref_info_tags = ["GVANNO_XREF"] if clinvar is True: prepare_vcfanno_configuration(gvanno_db_directory, conf_fname, vcfheader_file, logger, clinvar_info_tags, query_info_tags, "clinvar") @@ -52,8 +52,8 @@ def run_vcfanno(num_processes, query_vcf, query_info_tags, vcfheader_file, gvann prepare_vcfanno_configuration(gvanno_db_directory, conf_fname, vcfheader_file, logger, dbnsfp_info_tags, query_info_tags, "dbnsfp") if uniprot is True: prepare_vcfanno_configuration(gvanno_db_directory, conf_fname, vcfheader_file, logger, uniprot_info_tags, query_info_tags, "uniprot") - if pcgr_onco_xref is True: - prepare_vcfanno_configuration(gvanno_db_directory, conf_fname, vcfheader_file, logger, pcgr_onco_xref_info_tags, query_info_tags, "pcgr_onco_xref") + if gvanno_xref is True: + prepare_vcfanno_configuration(gvanno_db_directory, conf_fname, vcfheader_file, logger, gvanno_xref_info_tags, query_info_tags, "gvanno_xref") out_vcf_vcfanno_unsorted1 = output_vcf + '.tmp.unsorted.1' query_prefix = re.sub('\.vcf.gz$','',query_vcf) @@ -81,7 +81,7 @@ def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, c Function that appends data to a vcfanno conf file ('conf_fname') according to user-defined ('datasource'). 
The datasource defines the set of tags that will be appended during annotation """ fh = open(conf_fname,'a') - if datasource != 'uniprot' and datasource != 'pcgr_onco_xref': + if datasource != 'uniprot' and datasource != 'gvanno_xref': fh.write('[[annotation]]\n') fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcf.gz"\n') fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]' @@ -90,7 +90,7 @@ def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, c fh.write(fields_string + '\n') fh.write(ops_string + '\n\n') else: - if datasource == 'uniprot' or datasource == 'pcgr_onco_xref': + if datasource == 'uniprot' or datasource == 'gvanno_xref': fh.write('[[annotation]]\n') fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') fh.write('columns=[4]\n') diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py new file mode 100755 index 0000000..33c63b3 --- /dev/null +++ b/src/gvanno/lib/annoutils.py @@ -0,0 +1,407 @@ +#!/usr/bin/env python + +import os,re,sys +import csv +import logging +import gzip +import toml +from cyvcf2 import VCF, Writer + + +csv.field_size_limit(500 * 1024 * 1024) +threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H','Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W','Tyr':'Y','Val':'V','Ter':'X'} + + +def read_infotag_file(vcf_info_tags_tsv): + """ + Function that reads a VCF info tag file that denotes annotation tags produced by PCGR/CPSR/gvanno. + An example of the VCF info tag file is the following: + + tag number type description category + Consequence . String "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)." 
vep + + A dictionary is returned, with the tag as the key, and the full dictionary record as the value + """ + info_tag_xref = {} ##dictionary returned + if not os.path.exists(vcf_info_tags_tsv): + return info_tag_xref + tsvfile = open(vcf_info_tags_tsv, 'r') + reader = csv.DictReader(tsvfile, delimiter='\t') + for row in reader: + if not row['tag'] in info_tag_xref: + info_tag_xref[row['tag']] = row + + return info_tag_xref + + +def error_message(message, logger): + logger.error('') + logger.error(message) + logger.error('') + exit(1) + +def write_pass_vcf(annotated_vcf, logger): + + out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf) + vcf = VCF(annotated_vcf) + w = Writer(out_vcf, vcf) + + num_rejected = 0 + num_pass = 0 + for rec in vcf: + if rec.FILTER is None or rec.FILTER == 'None': + w.write_record(rec) + num_pass += 1 + else: + num_rejected +=1 + + vcf.close() + w.close() + + logger.info('Number of non-PASS/REJECTED variant calls: ' + str(num_rejected)) + logger.info('Number of PASSed variant calls: ' + str(num_pass)) + if num_pass == 0: + logger.warning('There are zero variants with a \'PASS\' filter in the VCF file') + os.system('bgzip -dc ' + str(annotated_vcf) + ' egrep \'^#\' > ' + str(out_vcf)) + #else: + os.system('bgzip -f ' + str(out_vcf)) + os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') + + return + +def warn_message(message, logger): + logger.warning(message) + +def getlogger(logger_name): + logger = logging.getLogger(logger_name) + logger.setLevel(logging.DEBUG) + + # create console handler and set level to debug + ch = logging.StreamHandler(sys.stdout) + ch.setLevel(logging.DEBUG) + + # add ch to logger + logger.addHandler(ch) + + # create formatter + formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S") + + #add formatter to ch + ch.setFormatter(formatter) + + return logger + + +def read_config_options(configuration_file, base_dir, genome_assembly, logger, wflow = 'pcgr'): + + ## read default options + config_options = {} + configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'pcgr_configuration_default.toml') + if wflow == 'cpsr': + configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'cpsr_configuration_default.toml') + if wflow == 'gvanno': + configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'gvanno_configuration_default.toml') + if not os.path.exists(configuration_file_default): + err_msg = "Default configuration file " + str(configuration_file_default) + " does not exist - exiting" + error_message(err_msg,logger) + try: + config_options = toml.load(configuration_file_default) + except (IndexError,TypeError): + err_msg = 'Default configuration file ' + str(configuration_file_default) + ' is not formatted correctly' + error_message(err_msg, logger) + + ## override with options set by the users + try: + user_options = toml.load(configuration_file) + except (IndexError,TypeError): + err_msg = 'Configuration file ' + str(configuration_file) + ' is not formatted correctly' + error_message(err_msg, logger) + + tumor_types = [] + for section in config_options: + if section in user_options: + for var in config_options[section]: + if not var in user_options[section]: + continue + #print(str(section) + '\t' + str(var)) + if isinstance(config_options[section][var],bool) and not isinstance(user_options[section][var],bool): + err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + 
+
+
+def error_message(message, logger):
+   logger.error('')
+   logger.error(message)
+   logger.error('')
+   exit(1)
+
+def write_pass_vcf(annotated_vcf, logger):
+
+   out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf)
+   vcf = VCF(annotated_vcf)
+   w = Writer(out_vcf, vcf)
+
+   num_rejected = 0
+   num_pass = 0
+   for rec in vcf:
+      if rec.FILTER is None or rec.FILTER == 'None':
+         w.write_record(rec)
+         num_pass += 1
+      else:
+         num_rejected += 1
+
+   vcf.close()
+   w.close()
+
+   logger.info('Number of non-PASS/REJECTED variant calls: ' + str(num_rejected))
+   logger.info('Number of PASSed variant calls: ' + str(num_pass))
+   if num_pass == 0:
+      logger.warning('There are zero variants with a \'PASS\' filter in the VCF file')
+      os.system('bgzip -dc ' + str(annotated_vcf) + ' | egrep \'^#\' > ' + str(out_vcf))
+   os.system('bgzip -f ' + str(out_vcf))
+   os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz')
+
+   return
+
+def warn_message(message, logger):
+   logger.warning(message)
+
+def getlogger(logger_name):
+   logger = logging.getLogger(logger_name)
+   logger.setLevel(logging.DEBUG)
+
+   # create console handler and set level to debug
+   ch = logging.StreamHandler(sys.stdout)
+   ch.setLevel(logging.DEBUG)
+
+   # add ch to logger
+   logger.addHandler(ch)
+
+   # create formatter
+   formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S")
+
+   # add formatter to ch
+   ch.setFormatter(formatter)
+
+   return logger
+
+
+def read_config_options(configuration_file, base_dir, genome_assembly, logger, wflow = 'pcgr'):
+
+   ## read default options
+   config_options = {}
+   configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'pcgr_configuration_default.toml')
+   if wflow == 'cpsr':
+      configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'cpsr_configuration_default.toml')
+   if wflow == 'gvanno':
+      configuration_file_default = os.path.join(base_dir,'data',str(genome_assembly),'gvanno_configuration_default.toml')
+   if not os.path.exists(configuration_file_default):
+      err_msg = "Default configuration file " + str(configuration_file_default) + " does not exist - exiting"
+      error_message(err_msg,logger)
+   try:
+      config_options = toml.load(configuration_file_default)
+   except (IndexError,TypeError):
+      err_msg = 'Default configuration file ' + str(configuration_file_default) + ' is not formatted correctly'
+      error_message(err_msg, logger)
+
+   ## override with options set by the user
+   try:
+      user_options = toml.load(configuration_file)
+   except (IndexError,TypeError):
+      err_msg = 'Configuration file ' + str(configuration_file) + ' is not formatted correctly'
+      error_message(err_msg, logger)
+
+   tumor_types = []
+   for section in config_options:
+      if section in user_options:
+         for var in config_options[section]:
+            if not var in user_options[section]:
+               continue
+            #print(str(section) + '\t' + str(var))
+            if isinstance(config_options[section][var],bool) and not isinstance(user_options[section][var],bool):
+               err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting boolean)'
+               error_message(err_msg, logger)
+            if isinstance(config_options[section][var],int) and not isinstance(user_options[section][var],int):
+               err_msg = 'Configuration value \"' + str(user_options[section][var]) + '\" for ' + str(var) + ' cannot be parsed properly (expecting integer)'
+               error_message(err_msg, logger)
+            if isinstance(config_options[section][var],float) and (not isinstance(user_options[section][var],float) and not isinstance(user_options[section][var],int)):
+               err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting float)'
+               error_message(err_msg, logger)
+            if isinstance(config_options[section][var],str) and not isinstance(user_options[section][var],str):
+               err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting string)'
+               error_message(err_msg, logger)
+            if section == 'tumor_type':
+               if user_options[section][var]:
+                  tumor_types.append(str(var))
+            tier_options = ['pcgr','pcgr_acmg']
+            normalization_options = ['default','exome','genome','exome2genome']
+            populations_tgp = ['afr','amr','eas','sas','eur','global']
+            populations_gnomad = ['afr','amr','eas','sas','nfe','oth','fin','asj','global']
+            theme_options = ['default', 'cerulean', 'journal', 'flatly', 'readable', 'spacelab', 'united', 'cosmo', 'lumen', 'paper', 'sandstone', 'simplex','yeti']
+            if var == 'mutsignatures_normalization' and not str(user_options[section][var]) in normalization_options:
+               err_msg = 'Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) + ' cannot be parsed properly (expecting \'default\', \'exome\', \'genome\', or \'exome2genome\')'
+               error_message(err_msg, logger)
+            if var == 'mutsignatures_cutoff' and (float(user_options[section][var]) > 1 or float(user_options[section][var]) < 0):
+               err_msg = 'Configuration value ' + str(user_options[section][var]) + " must be within the [0,1] range"
+               error_message(err_msg, logger)
+            if var == 'mutsignatures_signature_limit':
+               if int(user_options[section][var]) < 1 or int(user_options[section][var]) > 30:
+                  err_msg = "Number of mutational signatures in search space ('mutsignatures_signature_limit') must be positive and not more than 30 (retrieved value: " + str(user_options[section][var]) + ")"
+                  error_message(err_msg,logger)
+            if var == 'tier_model' and not str(user_options[section][var]) in tier_options:
+               err_msg = 'Configuration value \'' + str(user_options[section][var]) + '\' for ' + str(var) + ' cannot be parsed properly (expecting \'pcgr\', or \'pcgr_acmg\')'
+               error_message(err_msg, logger)
+            if var == 'pop_gnomad' and not str(user_options[section][var]) in populations_gnomad:
+               err_msg = 'Configuration value \'' + str(user_options[section][var]) + '\' for ' + str(var) + ' cannot be parsed properly (expecting \'afr\', \'amr\', \'asj\', \'eas\', \'fin\', \'global\', \'nfe\', \'oth\', or \'sas\')'
+               error_message(err_msg, logger)
+            if var == 'pop_tgp' and not str(user_options[section][var]) in populations_tgp:
+               err_msg = 'Configuration value \'' + str(user_options[section][var]) + '\' for ' + str(var) + ' cannot be parsed properly (expecting \'afr\', \'amr\', \'eas\', \'eur\', \'global\', or \'sas\')'
+               error_message(err_msg, logger)
+            if var == 'report_theme' and not str(user_options[section][var]) in theme_options:
+               err_msg = 'Configuration value \'' + str(user_options[section][var]) + '\' for ' + str(var) + ' cannot be parsed properly (expecting \'default\', \'cerulean\', \'journal\', \'flatly\', \'readable\', \'spacelab\', \'united\', \'cosmo\', \'lumen\', \'paper\', \'sandstone\', \'simplex\', or \'yeti\')'
+               error_message(err_msg, logger)
+            if var.startswith('maf_'):
+               if user_options[section][var] < 0 or user_options[section][var] > 1:
+                  err_msg = "MAF value: " + str(var) + " must be within the [0,1] range (current value is " + str(user_options[section][var]) + ")"
+                  error_message(err_msg,logger)
+            if var == 'min_af_tumor':
+               if user_options[section][var] < 0 or user_options[section][var] > 1:
+                  err_msg = "Minimum AF tumor: " + str(var) + " must be within the [0,1] range (current value is " + str(user_options[section][var]) + ")"
+                  error_message(err_msg,logger)
+            if var == 'max_af_normal':
+               if user_options[section][var] < 0 or user_options[section][var] > 1:
+                  err_msg = "Maximum AF normal: " + str(var) + " must be within the [0,1] range (current value is " + str(user_options[section][var]) + ")"
+                  error_message(err_msg,logger)
+            if var == 'target_size_mb':
+               if user_options[section][var] < 0 or user_options[section][var] > 50:
+                  err_msg = "Target size region in Mb (" + str(user_options[section][var]) + ") is not positive or is larger than the likely maximum size of the coding human genome (50 Mb)"
+                  error_message(err_msg,logger)
+               if user_options[section][var] < 1:
+                  warn_msg = "Target size region in Mb (" + str(user_options[section][var]) + ") must be greater than 1 for the mutational burden estimate to be robust (ignoring TMB calculation)"
+                  warn_message(warn_msg,logger)
+                  config_options[section]['mutational_burden'] = False
+            if var == 'cna_overlap_pct':
+               if user_options['cna'][var] > 100 or user_options['cna'][var] <= 0:
+                  err_msg = "Minimum percent overlap between copy number segment and gene transcript (" + str(user_options[section][var]) + ") should be greater than zero and less than 100"
+                  error_message(err_msg,logger)
+            if var == 'logR_homdel':
+               if user_options['cna'][var] > 0:
+                  err_msg = "Log ratio for homozygous deletions (" + str(user_options[section][var]) + ") should be less than zero"
+                  error_message(err_msg,logger)
+            if var == 'logR_gain':
+               if user_options['cna'][var] < 0:
+                  err_msg = "Log ratio for copy number amplifications (" + str(user_options[section][var]) + ") should be greater than zero"
+                  error_message(err_msg,logger)
+            if var == 'min_majority' and section == 'dbnsfp':
+               if user_options['dbnsfp'][var] > 8 or user_options['dbnsfp'][var] < 3:
+                  err_msg = "Minimum number of majority votes for consensus calls among dbNSFP predictions should not exceed 8 and should not be less than 3"
+                  error_message(err_msg,logger)
+            if var == 'max_minority' and section == 'dbnsfp':
+               if user_options['dbnsfp'][var] >= config_options['dbnsfp']['min_majority'] or user_options['dbnsfp'][var] > 3 or user_options['dbnsfp'][var] < 0 or (user_options['dbnsfp'][var] + config_options['dbnsfp']['min_majority'] > 8):
+                  err_msg = "Maximum number of minority votes for consensus calls among dbNSFP predictions should not exceed 3 (8 algorithms in total) and should be less than min_majority (" + str(user_options[section][var]) + ")"
+                  error_message(err_msg,logger)
+
+            config_options[section][var] = user_options[section][var]
+
+   if len(tumor_types) > 2:
+      err_msg = "Too many tumor types (" + str(",".join(tumor_types)) + ") set to True - limit is set to two"
+      error_message(err_msg,logger)
+   if wflow == 'pcgr':
+      if 'msi' in config_options.keys() and 'mutational_burden' in config_options.keys():
+         if config_options['msi']['msi'] == 1 and config_options['mutational_burden']['mutational_burden'] == 0:
+            err_msg = "Prediction of MSI status (msi = true) requires mutational burden/target_size input (mutational_burden = true) - this is currently set as false"
+            error_message(err_msg,logger)
+
+   return config_options
+
+
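The generic loop above replaces the hard-coded tag lists used previously (see the deleted `gvannoutils.py` further down): every user-supplied value is type-checked against the corresponding entry in the workflow's default TOML. A minimal sketch of that idea, assuming an `[other]` section with a boolean `vcf_validation` and an integer `n_vep_forks`; the section and key names follow the previous gvanno defaults, and the values are made up.

```python
import toml

# Assumed default configuration (illustrative values).
default_options = toml.loads("""
[other]
vcf_validation = true
n_vep_forks = 4
""")

# User configuration with a type error: a string where a boolean is expected.
user_options = toml.loads("""
[other]
vcf_validation = "false"
""")

for section in default_options:
    if section not in user_options:
        continue
    for var in default_options[section]:
        if var not in user_options[section]:
            continue
        if isinstance(default_options[section][var], bool) and not isinstance(user_options[section][var], bool):
            print('Configuration value ' + str(user_options[section][var]) + ' for ' + str(var) +
                  ' cannot be parsed properly (expecting boolean)')
# -> Configuration value false for vcf_validation cannot be parsed properly (expecting boolean)
```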
+def threeToOneAA(aa_change):
+
+   for three_letter_aa in threeLettertoOneLetterAA.keys():
+      aa_change = aa_change.replace(three_letter_aa,threeLettertoOneLetterAA[three_letter_aa])
+
+   aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)','fs',aa_change)
+   return aa_change
+
+def map_variant_effect_predictors(rec, algorithms):
+
+   dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms)
+   if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None:
+      return
+   gene_id = str(rec.INFO.get('Gene'))
+   consequence = str(rec.INFO.get('Consequence'))
+
+   dbnsfp_key = ''
+
+   if not rec.INFO.get('HGVSp_short') is None:
+      aa_change = str(rec.INFO.get('HGVSp_short'))
+      dbnsfp_key = gene_id + ':' + str(aa_change)
+   else:
+      if re.search('splice_site',consequence):
+         dbnsfp_key = gene_id
+
+   if dbnsfp_key != '':
+      if dbnsfp_key in dbnsfp_predictions:
+         rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key]
+
+         for algo_pred in rec.INFO['EFFECT_PREDICTIONS'].split('&'):
+            if algo_pred.startswith('sift:'):
+               rec.INFO['SIFT_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('provean:'):
+               rec.INFO['PROVEAN_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('m-cap:'):
+               rec.INFO['M_CAP_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('mutpred:'):
+               rec.INFO['MUTPRED_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('metalr:'):
+               rec.INFO['META_LR_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('fathmm:'):
+               rec.INFO['FATHMM_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('fathmm_mkl_coding:'):
+               rec.INFO['FATHMM_MKL_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('mutationtaster:'):
+               rec.INFO['MUTATIONTASTER_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('mutationassessor:'):
+               rec.INFO['MUTATIONASSESSOR_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('splice_site_rf:'):
+               rec.INFO['SPLICE_SITE_RF_DBNSFP'] = str(algo_pred.split(':')[1])
+            if algo_pred.startswith('splice_site_ada:'):
+               rec.INFO['SPLICE_SITE_ADA_DBNSFP'] = str(algo_pred.split(':')[1])
+
+
+def detect_reserved_info_tag(tag, tag_name, logger):
+   reserved_tags = ['AA','AC','AF','AN','BQ','CIGAR','DB','DP','END','H2','H3','MQ','MQ0','NS','SB','SOMATIC','VALIDATED','1000G']
+   if tag in reserved_tags:
+      err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (INFO)'
+      return error_message(err_msg, logger)
+
+   reserved_format_tags = ['GT','DP','FT','GL','GLE','GQ','PL','HQ','PS','PQ','EC','MQ']
+   if tag in reserved_format_tags:
+      err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (FORMAT)'
+      return error_message(err_msg, logger)
+
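`threeToOneAA` above is what later turns VEP's three-letter HGVS protein notation into the short form stored in `HGVSp_short`. A few illustrative conversions, assuming `src/gvanno/lib` is importable:

```python
from annoutils import threeToOneAA  # assumes src/gvanno/lib is on PYTHONPATH

print(threeToOneAA('p.Val600Glu'))      # -> p.V600E
print(threeToOneAA('p.Ter214Trp'))      # -> p.X214W   ('Ter' maps to 'X')
print(threeToOneAA('p.Gln39GlufsX72'))  # -> p.Q39fs   (frameshift suffix collapsed by the regex)
```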
+def set_coding_change(rec):
+
+   for m in ['HGVSp_short','CDS_CHANGE']:
+      rec.INFO[m] = '.'
+   if not rec.INFO.get('HGVSc') is None:
+      if rec.INFO.get('HGVSc') != '.':
+         if 'splice_acceptor_variant' in rec.INFO.get('Consequence') or 'splice_donor_variant' in rec.INFO.get('Consequence'):
+            key = str(rec.INFO.get('Consequence')) + ':' + str(rec.INFO.get('HGVSc'))
+            rec.INFO['CDS_CHANGE'] = key
+   if rec.INFO.get('Amino_acids') is None or rec.INFO.get('Protein_position') is None or rec.INFO.get('Consequence') is None:
+      return
+   if not rec.INFO.get('Protein_position') is None:
+      if rec.INFO.get('Protein_position').startswith('-'):
+         return
+
+   protein_change = '.'
+   if '/' in rec.INFO.get('Protein_position'):
+      protein_position = str(rec.INFO.get('Protein_position').split('/')[0])
+      if '-' in protein_position:
+         if protein_position.split('-')[0].isdigit():
+            rec.INFO['AMINO_ACID_START'] = protein_position.split('-')[0]
+         if protein_position.split('-')[1].isdigit():
+            rec.INFO['AMINO_ACID_END'] = protein_position.split('-')[1]
+      else:
+         if protein_position.isdigit():
+            rec.INFO['AMINO_ACID_START'] = protein_position
+            rec.INFO['AMINO_ACID_END'] = protein_position
+
+   if not rec.INFO.get('HGVSp') is None:
+      if rec.INFO.get('HGVSp') != '.':
+         if ':' in rec.INFO.get('HGVSp'):
+            protein_identifier = str(rec.INFO.get('HGVSp').split(':')[0])
+            if protein_identifier.startswith('ENSP'):
+               protein_change_VEP = str(rec.INFO.get('HGVSp').split(':')[1])
+               protein_change = threeToOneAA(protein_change_VEP)
+
+   if 'synonymous_variant' in rec.INFO.get('Consequence'):
+      protein_change = 'p.' + str(rec.INFO.get('Amino_acids')) + str(protein_position) + str(rec.INFO.get('Amino_acids'))
+   if 'stop_lost' in str(rec.INFO.get('Consequence')) and '/' in str(rec.INFO.get('Amino_acids')):
+      protein_change = 'p.X' + str(protein_position) + str(rec.INFO.get('Amino_acids')).split('/')[1]
+
+   rec.INFO['HGVSp_short'] = protein_change
+   exon_number = 'NA'
+   if not rec.INFO.get('EXON') is None:
+      if rec.INFO.get('EXON') != '.':
+         if '/' in rec.INFO.get('EXON'):
+            exon_number = str(rec.INFO.get('EXON')).split('/')[0]
+
+   if not rec.INFO.get('HGVSc') is None:
+      if rec.INFO.get('HGVSc') != '.':
+         if protein_change != '.':
+            key = str(rec.INFO.get('Consequence')) + ':' + str(rec.INFO.get('HGVSc')) + ':exon' + str(exon_number) + ':' + str(protein_change)
+            rec.INFO['CDS_CHANGE'] = key
+
+   return
+
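For reference, the `CDS_CHANGE` key assembled at the end of `set_coding_change` concatenates consequence, HGVSc, exon number and the shortened protein change. An illustrative example with hypothetical VEP annotations for a single record:

```python
# Hypothetical VEP values for one variant record (BRAF V600E, purely for illustration).
consequence = 'missense_variant'
hgvsc = 'ENST00000288602.6:c.1799T>A'
exon_number = '15'
protein_change = 'p.V600E'

cds_change = str(consequence) + ':' + str(hgvsc) + ':exon' + str(exon_number) + ':' + str(protein_change)
print(cds_change)
# -> missense_variant:ENST00000288602.6:c.1799T>A:exon15:p.V600E
```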
+def map_dbnsfp_predictions(dbnsfp_tag, algorithms):
+
+   effect_predictions = {}
+
+   for v in dbnsfp_tag.split(','):
+
+      dbnsfp_info = v.split('|')
+      if len(dbnsfp_info) == 1:
+         return effect_predictions
+      ref_aa = dbnsfp_info[0]
+      alt_aa = dbnsfp_info[1]
+      all_ids = dbnsfp_info[4].split('&')
+      unique_ids = {}
+      for s in all_ids:
+         unique_ids[s] = 1
+
+      isoform_aa_keys = []
+      if ref_aa != '.' and alt_aa != '.' and ref_aa != '' and alt_aa != '':
+         aa_pos = dbnsfp_info[6].split('&')
+         for pos in aa_pos:
+            for gene_id in unique_ids:
+               k = str(gene_id) + ':p.' + str(ref_aa) + pos + str(alt_aa)
+               isoform_aa_keys.append(k)
+      else:
+         #continue
+         for gene_id in unique_ids:
+            isoform_aa_keys.append(gene_id)
+
+      algorithm_raw_predictions = {}
+
+      i = 7
+      v = 0
+
+      if len(algorithms) != len(dbnsfp_info[7:]):
+         return effect_predictions
+
+      while i < len(dbnsfp_info):
+         algorithm_raw_predictions[str(algorithms[v]).lower()] = dbnsfp_info[i].split('&')
+         i = i + 1
+         v = v + 1
+      dbnsfp_predictions = {}
+
+      for k in isoform_aa_keys:
+         if not k in dbnsfp_predictions:
+            dbnsfp_predictions[k] = {}
+         all_preds = []
+         for algo in algorithm_raw_predictions.keys():
+            unique_algo_predictions = {}
+            for pred in algorithm_raw_predictions[algo]:
+               if pred != '':
+                  if not pred in unique_algo_predictions:
+                     unique_algo_predictions[pred] = 1
+               else:
+                  unique_algo_predictions['.'] = 1
+            if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys():
+               del unique_algo_predictions['.']
+            dbnsfp_predictions[k][algo] = str(algo) + ':' + '|'.join(unique_algo_predictions.keys())
+            all_preds.append(dbnsfp_predictions[k][algo])
+         effect_predictions[k] = '&'.join(all_preds)
+
+
+
+   return effect_predictions
\ No newline at end of file
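The dbNSFP mapping above expects a pipe-delimited string per transcript block: reference and alternate amino acid (fields 0 and 1), '&'-separated gene identifiers (field 4), '&'-separated protein positions (field 6), and one '&'-separated prediction list per algorithm from field 7 onwards. A small, entirely made-up example (the placeholder `x` fields stand in for positions the function ignores), assuming `src/gvanno/lib` is importable:

```python
from annoutils import map_dbnsfp_predictions  # assumes src/gvanno/lib is on PYTHONPATH

# Made-up DBNSFP string with two prediction algorithms.
dbnsfp_tag = 'V|E|x|x|ENSG00000157764|x|600|D&D|T'
algorithms = ['SIFT', 'PROVEAN']  # order must match the prediction columns

print(map_dbnsfp_predictions(dbnsfp_tag, algorithms))
# -> {'ENSG00000157764:p.V600E': 'sift:D&provean:T'}
```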
diff --git a/src/gvanno/lib/dbnsfp.py b/src/gvanno/lib/dbnsfp.py
deleted file mode 100644
index f54b9aa..0000000
--- a/src/gvanno/lib/dbnsfp.py
+++ /dev/null
@@ -1,67 +0,0 @@
-#!/usr/bin/env python
-
-import os,re,sys
-import logging
-import csv
-
-def map_dbnsfp_predictions(dbnsfp_tag, algorithms):
-
-   effect_predictions = {}
-
-   for v in dbnsfp_tag.split(','):
-
-      dbnsfp_info = v.split('|')
-      if len(dbnsfp_info) == 1:
-         dbnsfp_info = v.split('#')
-      ref_aa = dbnsfp_info[0]
-      alt_aa = dbnsfp_info[1]
-      all_ids = dbnsfp_info[4].split('&')
-      unique_ids = {}
-      for s in all_ids:
-         unique_ids[s] = 1
-
-      isoform_aa_keys = []
-      if ref_aa != '.' and alt_aa != '.':
-         aa_pos = dbnsfp_info[6].split('&')
-         for pos in aa_pos:
-            for gene_id in unique_ids:
-               k = str(gene_id) + ':p.' + str(ref_aa) + pos + str(alt_aa)
-               isoform_aa_keys.append(k)
-      else:
-         #continue
-         for gene_id in unique_ids:
-            isoform_aa_keys.append(gene_id)
-
-      algorithm_raw_predictions = {}
-
-      i = 7
-      v = 0
-
-      if len(algorithms) != len(dbnsfp_info[7:]):
-         return effect_predictions
-
-      while i < len(dbnsfp_info):
-         algorithm_raw_predictions[str(algorithms[v]).lower()] = dbnsfp_info[i].split('&')
-         i = i + 1
-         v = v + 1
-      dbnsfp_predictions = {}
-
-      for k in isoform_aa_keys:
-         if not k in dbnsfp_predictions:
-            dbnsfp_predictions[k] = {}
-         all_preds = []
-         for algo in algorithm_raw_predictions.keys():
-            unique_algo_predictions = {}
-            for pred in algorithm_raw_predictions[algo]:
-               if pred != '':
-                  if not pred in unique_algo_predictions:
-                     unique_algo_predictions[pred] = 1
-               else:
-                  unique_algo_predictions['.'] = 1
-            if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys():
-               del unique_algo_predictions['.']
-            dbnsfp_predictions[k][algo] = str(algo) + ':' + '|'.join(unique_algo_predictions.keys())
-            all_preds.append(dbnsfp_predictions[k][algo])
-         effect_predictions[k] = '&'.join(all_preds)
-
-   return effect_predictions
diff --git a/src/gvanno/lib/dbnsfp.pyc b/src/gvanno/lib/dbnsfp.pyc
deleted file mode 100644
index 274aac3..0000000
Binary files a/src/gvanno/lib/dbnsfp.pyc and /dev/null differ
diff --git a/src/gvanno/lib/gvannoutils.py b/src/gvanno/lib/gvannoutils.py
deleted file mode 100755
index 4d20278..0000000
--- a/src/gvanno/lib/gvannoutils.py
+++ /dev/null
@@ -1,100 +0,0 @@
-#!/usr/bin/env python
-
-import os,re,sys
-import csv
-import logging
-import gzip
-import toml
-
-csv.field_size_limit(500 * 1024 * 1024)
-
-def read_infotag_file(vcf_info_tags_tsv):
-   """
-   Function that reads a VCF info tag file that denotes annotation tags produced by PCGR.
-   An example of the VCF info tag file is the following:
-
-   tag number type description
-   Consequence . String "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)."
-
-   A dictionary is returned, with the tag as the key, and the full dictionary record as the value
-   """
-   info_tag_xref = {} ##dictionary returned
-   if not os.path.exists(vcf_info_tags_tsv):
-      return info_tag_xref
-   with open(vcf_info_tags_tsv, 'r') as tsvfile:
-      reader = csv.DictReader(tsvfile, delimiter='\t')
-      for rec in reader:
-         if not rec['tag'] in info_tag_xref:
-            info_tag_xref[rec['tag']] = rec
-
-   return info_tag_xref
-
-
-def gvanno_error_message(message, logger):
-   logger.error('')
-   logger.error(message)
-   logger.error('')
-   exit(1)
-
-def read_config_options(configuration_file, gvanno_dir, genome_assembly, logger):
-
-   ## read default options
-   gvanno_config_options = {}
-   gvanno_configuration_file_default = os.path.join(gvanno_dir,'data',str(genome_assembly),'gvanno_configuration_default.toml')
-   if not os.path.exists(gvanno_configuration_file_default):
-      err_msg = "Default gvanno configuration file " + str(gvanno_configuration_file_default) + " does not exist - exiting"
-      gvanno_error_message(err_msg,logger)
-   try:
-      gvanno_config_options = toml.load(gvanno_configuration_file_default)
-   except(IndexError,TypeError):
-      err_msg = 'Configuration file ' + str(configuration_file) + ' is not formatted correctly'
-      gvanno_error_message(err_msg, logger)
-
-   ## override with options set by the users
-   try:
-      user_options = toml.load(configuration_file)
-   except(IndexError,TypeError):
-      err_msg = 'Configuration file ' + str(configuration_file) + ' is not formatted correctly'
-      gvanno_error_message(err_msg, logger)
-
-
-   boolean_tags = ['vep_skip_intergenic', 'vcf_validation']
-   integer_tags = ['n_vcfanno_proc','n_vep_forks']
-   for section in ['other']:
-      if section in user_options:
-         for t in boolean_tags:
-            if t in user_options[section]:
-               if not isinstance(user_options[section][t],bool):
-                  err_msg = 'Configuration value ' + str(user_options[section][t]) + ' for ' + str(t) + ' cannot be parsed properly (expecting true/false)'
-                  gvanno_error_message(err_msg, logger)
-               gvanno_config_options[section][t] = int(user_options[section][t])
-         for t in integer_tags:
-            if t in user_options[section]:
-               if not isinstance(user_options[section][t],int):
-                  err_msg = 'Configuration value ' + str(user_options[section][t]) + ' for ' + str(t) + ' cannot be parsed properly (expecting integer)'
-                  gvanno_error_message(err_msg, logger)
-               gvanno_config_options[section][t] = user_options[section][t]
-
-   return gvanno_config_options
-
-
-
-
-def getlogger(logger_name):
-   logger = logging.getLogger(logger_name)
-   logger.setLevel(logging.DEBUG)
-
-   # create console handler and set level to debug
-   ch = logging.StreamHandler(sys.stdout)
-   ch.setLevel(logging.DEBUG)
-
-   # add ch to logger
-   logger.addHandler(ch)
-
-   # create formatter
-   formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S")
-
-   #add formatter to ch
-   ch.setFormatter(formatter)
-
-   return logger
\ No newline at end of file
diff --git a/src/gvanno/lib/gvannoutils.pyc b/src/gvanno/lib/gvannoutils.pyc
deleted file mode 100644
index cde0a15..0000000
Binary files a/src/gvanno/lib/gvannoutils.pyc and /dev/null differ
diff --git a/src/gvanno/stdin.errors_summary.1530688436836.txt b/src/gvanno/stdin.errors_summary.1530688436836.txt
deleted file mode 100644
index e69de29..0000000
diff --git a/src/loftee.tgz b/src/loftee.tgz
new file mode 100644
index 0000000..6dc7712
Binary files /dev/null and b/src/loftee.tgz differ
diff --git a/src/loftee_custom.tgz b/src/loftee_custom.tgz
deleted file mode 100644
index 5410809..0000000
Binary files a/src/loftee_custom.tgz and /dev/null differ