diff --git a/README.md b/README.md index 287bcfc..a23ce3f 100755 --- a/README.md +++ b/README.md @@ -12,11 +12,15 @@ ### Overview -The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework. +The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is developed in Python, and it relies upon [Docker](https://www.docker.com) / [Singularity](https://sylabs.io/docs/) technology for encapsulation of software dependencies. *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**. 
-### News +### News +- December 29th 2023 - **1.7.0 release** + - Data updates: ClinVar, GENCODE, GWAS catalog + - Software updates: VEP + - Improved Singularity support - April 27th 2023 - **1.6.0 release** @@ -28,16 +32,16 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`) -### Annotation resources (v1.6.0) +### Annotation resources (v1.7.0) -- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset) -- [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021) +- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v110 (GENCODE v44/v19 as the gene reference dataset) +- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.5, November 2023) - [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP - [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP -- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023) +- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2023) - [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023) - [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer -- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023) +- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - 
Catalog of published genome-wide association studies (November 2023) ### Getting started @@ -49,11 +53,15 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - *Other utilities* - The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (i.e. `gzip`, `tar`) - - **NOTE**: We strongly recommend that _gvanno_ is installed on a MacOS or Linux/UNIX operating system + The script that installs the reference data requires that the user has `bgzip` and `tabix` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (e.g. `gzip`, `tar`) -#### STEP 1: Installation of Docker + **NOTE**: _gvanno_ should be installed on a macOS or Linux/UNIX operating system + +#### STEP 1: Installation of Docker/Singularity + +- The _gvanno_ workflow can be executed with either _Docker_ or _Singularity_ container technology + +##### Installation of Docker 1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/) @@ -65,33 +73,33 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - CPUs: minimum 4 - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) -##### 1.1: Installation of Singularity (_IN DEVELOPMENT_) - -0. **Note: this works for Singularity version 3.0 and higher**. +##### Installation of Singularity 1. [Install Singularity](https://sylabs.io/docs/) -2. Test that singularity works by running `singularity --version` - -3. If you are in the gvanno directory, build the singularity image like so: - - `cd src` - - `sudo ./buildSingularity.sh` #### STEP 2: Download *gvanno* and data bundle -1. 
[Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0) +1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.7.0) 2. Install the assembly-specific VEP cache, and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.: - `python download_gvanno_refdata.py --download_dir --genome_assembly grch38` - **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 30Gb pr. assembly-specific bundle) + **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 20Gb per assembly-specific bundle) + + +3. Pull container images + + * Docker + * Pull the [gvanno Docker image (v1.7.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 3.8Gb): + + * `docker pull sigven/gvanno:1.7.0` (gvanno annotation engine) + + * Singularity + * Download the [gvanno SIF image (v1.7.0)](https://insilico.hpc.uio.no/pcgr/gvanno/gvanno_1.7.0.sif) (approx 1.2Gb) and use this as the argument for `--sif_file` in the `gvanno.py` run script. -3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 4.11Gb): - - `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine) #### STEP 3: Input preprocessing @@ -105,7 +113,7 @@ We **strongly** recommend that the input VCF is compressed and indexed using [bg Run the workflow with **gvanno.py**, which takes the following arguments and options: -``` +``` usage: gvanno.py -h [options] --query_vcf @@ -121,7 +129,7 @@ Required arguments: --query_vcf QUERY_VCF VCF input file with germline query variants (SNVs/InDels). --gvanno_dir GVANNO_DIR - Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0 + Directory that contains the gvanno reference data, e.g. 
~/gvanno-1.7.0 --output_dir OUTPUT_DIR Output directory --genome_assembly {grch37,grch38} @@ -132,9 +140,9 @@ Required arguments: Sample identifier - prefix for output files VEP optional arguments: ---vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). ---vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno). ---vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False +--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). +--vep_gencode_basic Consider only basic GENCODE transcripts with Variant Effect Predictor (VEP). +--vep_lof_prediction Predict loss-of-function variants with the LOFTEE plugin in Variant Effect Predictor (VEP), default: False --vep_n_forks VEP_N_FORKS Number of forks for Variant Effect Predictor (VEP) processing, default: 4 --vep_buffer_size VEP_BUFFER_SIZE @@ -143,7 +151,7 @@ VEP optional arguments: --vep_pick_order VEP_PICK_ORDER Comma-separated string of ordered transcript properties for primary variant pick in Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane ---vep_skip_intergenic +--vep_no_intergenic Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False --vep_coding_only Only report variants falling into coding regions of transcripts (VEP), default: False @@ -160,25 +168,40 @@ Other optional arguments: --oncogenicity_annotation Classify variants according to oncogenicity (Horak et al., Genet Med, 2022) --debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc. 
+--sif_file gvanno SIF image file for usage of gvanno workflow with option '--container singularity' ``` -The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command: +The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command (Docker-based): -``` -python ~/gvanno-1.6.0/gvanno.py ---query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz ---gvanno_dir ~/gvanno-1.6.0 ---output_dir ~/gvanno-1.6.0 +``` +python ~/gvanno-1.7.0/gvanno.py +--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz +--gvanno_dir ~/gvanno-1.7.0 +--output_dir ~/gvanno-1.7.0 --sample_id example --genome_assembly grch37 --container docker --force_overwrite ``` +or Singularity-based: + +``` +python ~/gvanno-1.7.0/gvanno.py +--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz +--gvanno_dir ~/gvanno-1.7.0 +--output_dir ~/gvanno-1.7.0 +--sample_id example +--genome_assembly grch37 +--container singularity +--sif_file gvanno_1.7.0.sif +--force_overwrite +``` + -This command will run the Docker-based *gvanno* workflow and produce the following output files in the *examples* folder: +Either command will run the *gvanno* workflow and produce the following output files in the *examples* folder: -1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations -2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations +1. **example_gvanno_grch37.pass.vcf.gz (.tbi)** - Bgzipped VCF file with a rich set of functional/clinical variant and gene annotations +2. **example_gvanno_grch37.pass.tsv.gz** - Compressed TSV file with a rich set of functional/clinical variant and gene annotations Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column. 
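The README examples above differ only in the `--container` choice and, for Singularity, the extra `--sif_file` argument. A minimal sketch of assembling such an invocation programmatically — `build_gvanno_command` is a hypothetical helper, not part of gvanno; it only mirrors the documented rule that `--container singularity` requires `--sif_file`:

```python
def build_gvanno_command(query_vcf, gvanno_dir, output_dir, sample_id,
                         genome_assembly, container, sif_file=None):
    """Assemble a gvanno.py invocation as an argument list (illustrative only)."""
    if container == "singularity" and sif_file is None:
        # gvanno.py itself exits with an error in this case
        raise ValueError("--container singularity requires --sif_file")
    cmd = [
        "python", f"{gvanno_dir}/gvanno.py",
        "--query_vcf", query_vcf,
        "--gvanno_dir", gvanno_dir,
        "--output_dir", output_dir,
        "--sample_id", sample_id,
        "--genome_assembly", genome_assembly,
        "--container", container,
        "--force_overwrite",
    ]
    if sif_file is not None:
        cmd += ["--sif_file", sif_file]
    return cmd
```

The resulting list could be passed to `subprocess.run` on a host where gvanno and its data bundle are installed.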
diff --git a/download_gvanno_refdata.py b/download_gvanno_refdata.py index f0b94bb..457d158 100755 --- a/download_gvanno_refdata.py +++ b/download_gvanno_refdata.py @@ -8,16 +8,16 @@ import sys import locale import errno -#import wget import urllib.request as urllib2 from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.6.0' -REFDATA_VERSION = '20230425' -ENSEMBL_VERSION = '109' -GENCODE_VERSION = 'v43' +GVANNO_VERSION = '1.7.0' +REFDATA_VERSION = '20231224' +ENSEMBL_VERSION = '110' +GENCODE_VERSION = 'v44' VEP_ASSEMBLY = "GRCh38" HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/" +HOST_HUMAN_ANCESTOR = "https://personal.broadinstitute.org/konradk/loftee_data/" def __main__(): @@ -35,7 +35,7 @@ def __main__(): 'download directory already exist.\nYou can force the overwrite of existing download directory by using this flag, default: %(default)s') optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION)) optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. 
VEP) after download and unzip + untar has been conducted successfully") - optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.") + optional.add_argument('--debug', action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.") required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True) required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True) @@ -65,7 +65,6 @@ def __main__(): os.mkdir(arg_dict['db_assembly_dir']) os.mkdir(arg_dict['vep_assembly_dir']) - download_gvanno_ref_data(arg_dict = arg_dict) @@ -190,7 +189,7 @@ def download_gvanno_ref_data(arg_dict): vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep') datasets = {} - for db in ['vep_cache','vep_fasta','gvanno_custom']: + for db in ['vep_cache','vep_fasta','gvanno_custom','human_ancestor']: datasets[db] = {} datasets[db]['remote_url'] = 'NA' datasets[db]['local_path'] = 'NA' @@ -208,8 +207,10 @@ def download_gvanno_ref_data(arg_dict): ) logger = getlogger('download-vep-cache') - datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz") - datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz") + datasets['vep_cache']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz") + datasets['vep_fasta']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz") 
datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'.gz','',datasets['vep_fasta']['local_path']) vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger) @@ -273,15 +274,38 @@ def download_gvanno_ref_data(arg_dict): check_subprocess(command = command_unzip_fasta, logger = logger) command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}" check_subprocess(command = command_bgzip_fasta, logger = logger) - - + logger = getlogger('download-human-ancestor') + logger.info("Downloading human ancestor FASTA files") + for postfix in ['gz', 'gz.fai', 'gz.gzi']: + datasets['human_ancestor']['remote_url'] = f'{HOST_HUMAN_ANCESTOR}{VEP_ASSEMBLY}/human_ancestor.fa.{postfix}' + datasets['human_ancestor']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f'human_ancestor.fa.{postfix}') + + logger = getlogger('download-human-ancestor') + custom_cache_bytes_remote = get_url_num_bytes(url = datasets['human_ancestor']['remote_url'], logger = logger) + logger.info('Human ancestor FASTA - remote target file ' + str(datasets['human_ancestor']['remote_url'])) + logger.info('Human ancestor FASTA - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger)) + logger.info('Human ancestor FASTA - local destination file: ' + str(datasets['human_ancestor']['local_path'])) + + + if os.path.exists(datasets['human_ancestor']['local_path']): + if os.path.getsize(datasets['human_ancestor']['local_path']) == custom_cache_bytes_remote: + logger.info('Human ancestor FASTA already downloaded') + else: + logger.info('Human ancestor FASTA - download in progress - this can take a while ... ') + urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path']) + else: + logger.info('Human ancestor FASTA - download in progress - this can take a while ... 
') + urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path']) + + datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz' datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz') logger = getlogger('download-gvanno-custom') custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger) - logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++") + logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / GWAS catalog / cancerhotspots ++") logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url'])) logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger)) logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path'])) diff --git a/examples/example_small.grch37.vcf.gz b/examples/example_small.grch37.vcf.gz new file mode 100644 index 0000000..429b8b5 Binary files /dev/null and b/examples/example_small.grch37.vcf.gz differ diff --git a/examples/example_small.grch37.vcf.gz.tbi b/examples/example_small.grch37.vcf.gz.tbi new file mode 100644 index 0000000..542e4b2 Binary files /dev/null and b/examples/example_small.grch37.vcf.gz.tbi differ diff --git a/examples/example.grch38.vcf.gz b/examples/example_small.grch38.vcf.gz similarity index 100% rename from examples/example.grch38.vcf.gz rename to examples/example_small.grch38.vcf.gz diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example_small.grch38.vcf.gz.tbi old mode 100755 new mode 100644 similarity index 100% rename from examples/example.grch38.vcf.gz.tbi rename to examples/example_small.grch38.vcf.gz.tbi diff --git a/gvanno.py 
b/gvanno.py index 34ee0f4..8e2eba5 100755 --- a/gvanno.py +++ b/gvanno.py @@ -1,7 +1,5 @@ #!/usr/bin/env python -import csv -import re import argparse import os import subprocess @@ -11,10 +9,10 @@ import platform from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.6.0' -DB_VERSION = 'GVANNO_DB_VERSION = 20230425' -VEP_VERSION = '109' -GENCODE_VERSION = 'v43' +GVANNO_VERSION = '1.7.0' +DB_VERSION = 'GVANNO_DB_VERSION = 20231224' +VEP_VERSION = '110' +GENCODE_VERSION = 'v44' VEP_ASSEMBLY = "GRCh38" DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION) @@ -35,10 +33,10 @@ def __main__(): optional.add_argument('--force_overwrite', action = "store_true", help='By default, the script will fail with an error if any ' + \ 'output file already exists.\nYou can force the overwrite of existing result files by using this flag, default: %(default)s') optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION)) - optional.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator, default: %(default)s") - optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)') + optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. default is the host system user ID. 
' + \ + 'If you are experiencing permission errors, try setting it to root (`--docker_uid root`)') optional_vep.add_argument('--vep_regulatory', action='store_true', help = 'Enable VEP to look for overlap with regulatory regions (option --regulatory in VEP).') - optional_vep.add_argument('--vep_gencode_all', action='store_true', help = 'Consider all GENCODE transcripts with VEP (option --gencode_basic in VEP is used by default in gvanno).') + optional_vep.add_argument('--vep_gencode_basic', action='store_true', help = 'Consider only basic GENCODE transcripts with VEP (option --gencode_basic in VEP).') optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \ "in VEP, default: %(default)s") optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s") @@ -46,18 +44,19 @@ "for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s") optional_vep.add_argument('--vep_pick_order', default = "mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", help="Comma-separated string " + \ "of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s") - optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s") + optional_vep.add_argument('--vep_no_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s") optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s") optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \ "processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s") 
optional.add_argument('--oncogenicity_annotation', action ='store_true', help = 'Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)') optional.add_argument("--debug", action="store_true", help="Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.") + optional.add_argument("--sif_file", help="gvanno SIF file for usage of gvanno workflow with option '--container singularity'", default = None) required.add_argument('--query_vcf', help='VCF input file with query variants (SNVs/InDels).', required = True) required.add_argument('--gvanno_dir',help='Directory that contains the gvanno data bundle, e.g. ~/gvanno-' + str(GVANNO_VERSION), required = True) required.add_argument('--output_dir',help='Output directory', required = True) required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Genome assembly build: grch37 or grch38', required = True) - required.add_argument('--container', choices = ['docker', 'singularity'], action = "store",help="Run gvanno with docker or singularity") + required.add_argument('--container', choices = ['docker', 'singularity'], action = "store",help="Run gvanno with docker or singularity (the latter requires --sif_file)") required.add_argument('--sample_id',help="Sample identifier - prefix for output files", required = True) args = parser.parse_args() @@ -133,6 +132,15 @@ def verify_input_files(arg_dict, logger): err_msg = "Output directory (" + str(output_dir_full) + ") does not exist" gvanno_error_message(err_msg,logger) + if arg_dict['container'] == 'singularity': + if arg_dict['sif_file'] is None: + err_msg = "Please specify a singularity image when running in singularity mode (option '--sif_file')" + gvanno_error_message(err_msg,logger) + else: + if not os.path.exists(os.path.abspath(arg_dict['sif_file'])): + err_msg = "Singularity image file ('--sif_file' " + str(arg_dict['sif_file']) + ") does not exist" + gvanno_error_message(err_msg,logger) + 
## check if input vcf exist if not arg_dict['query_vcf'] is None: if not os.path.exists(os.path.abspath(arg_dict['query_vcf'])): @@ -147,7 +155,8 @@ def verify_input_files(arg_dict, logger): if os.path.abspath(arg_dict['query_vcf']).endswith('.vcf.gz'): tabix_file = arg_dict['query_vcf'] + '.tbi' if not os.path.exists(os.path.abspath(tabix_file)): - err_msg = "Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file (" + os.path.abspath(arg_dict['query_vcf']) + "). Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" + err_msg = "Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file (" + \ + os.path.abspath(arg_dict['query_vcf']) + "). Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" gvanno_error_message(err_msg,logger) input_vcf_basename = os.path.basename(str(arg_dict['query_vcf'])) @@ -192,7 +201,8 @@ def verify_input_files(arg_dict, logger): f_rel_not.close() if compliant_data_bundle == 0: - err_msg = 'The gvanno data bundle is not compliant with the software version - please download the latest software and data bundle (see https://github.com/sigven/gvanno for instructions)' + err_msg = 'The gvanno data bundle is not compliant with the software version - please download the ' + \ + 'latest software and data bundle (see https://github.com/sigven/gvanno for instructions)' gvanno_error_message(err_msg,logger) host_directories = {} @@ -242,6 +252,7 @@ def run_gvanno(arg_dict, host_directories): output_pass_vcf = 'None' uid = '' docker_user_id = arg_dict['docker_user_id'] + debug = arg_dict['debug'] global GENCODE_VERSION, VEP_ASSEMBLY if arg_dict['genome_assembly'] == 'grch37': @@ -258,30 +269,48 @@ def run_gvanno(arg_dict, host_directories): uid = getpass.getuser() if uid == '': - logger.warning('Was not able to get user id/username for logged-in user on the underlying platform (platform.system(): ' + str(platform.system()) + ', sys.platform: ' + 
str(sys.platform) + '), now running gvanno as root') + logger.warning('Was not able to get user id/username for logged-in user on the underlying platform (platform.system(): ' + \ + str(platform.system()) + ', sys.platform: ' + str(sys.platform) + '), now running gvanno as root') uid = 'root' vepdb_dir_host = os.path.join(str(host_directories['db_dir_host']),'.vep') - vcf_validation = 1 - if arg_dict['no_vcf_validate']: - vcf_validation = 0 data_dir = '/data' output_dir = '/workdir/output' vep_dir = '/usr/local/share/vep/data' input_vcf_docker = 'None' + data_dir_assembly = os.path.join(data_dir, "data", str(arg_dict["genome_assembly"])) + + #genome_assembly = arg_dict['genome_assembly'] + + conf_options = {} + conf_options['sample_id'] = arg_dict['sample_id'] + conf_options['genome_assembly'] = arg_dict['genome_assembly'] + conf_options['conf'] = {} + conf_options['conf']['vep'] = {} + conf_options['conf']['vep']['vep_n_forks'] = arg_dict['vep_n_forks'] + conf_options['conf']['vep']['vep_pick_order'] = arg_dict['vep_pick_order'] + conf_options['conf']['vep']['vep_buffer_size'] = arg_dict['vep_buffer_size'] + conf_options['conf']['vep']['vep_gencode_basic'] = arg_dict['vep_gencode_basic'] + conf_options['conf']['vep']['vep_regulatory'] = arg_dict['vep_regulatory'] + conf_options['conf']['vep']['vep_lof_prediction'] = arg_dict['vep_lof_prediction'] + conf_options['conf']['vep']['vep_no_intergenic'] = arg_dict['vep_no_intergenic'] + conf_options['conf']['vep']['vep_coding_only'] = arg_dict['vep_coding_only'] + if host_directories['input_vcf_basename_host'] != 'NA': input_vcf_docker = '/workdir/input_vcf/' + str(host_directories['input_vcf_basename_host']) - vep_volume_mapping = str(vepdb_dir_host) + ":/usr/local/share/vep/data" - databundle_volume_mapping = str(host_directories['base_dir_host']) + ":/data" + vep_volume_mapping = str(vepdb_dir_host) + ":" + str(vep_dir) + databundle_volume_mapping = str(host_directories['base_dir_host']) + ":" + str(data_dir) 
input_vcf_volume_mapping = str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf" - output_volume_mapping = str(host_directories['output_dir_host']) + ":/workdir/output" + output_volume_mapping = str(host_directories['output_dir_host']) + ":" + str(output_dir) if arg_dict['container'] == 'docker': - container_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(vep_volume_mapping) + " -v=" + str(output_volume_mapping) + container_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + \ + " -v=" + str(vep_volume_mapping) + " -v=" + str(output_volume_mapping) elif arg_dict['container'] == 'singularity': - container_command_run1 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + str(vep_volume_mapping) + " -B " + str(output_volume_mapping) + container_command_run1 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + \ + str(vep_volume_mapping) + " -B " + str(output_volume_mapping) if host_directories['input_vcf_dir_host'] != 'NA' and arg_dict['container'] == 'docker': container_command_run1 = container_command_run1 + " -v=" + str(input_vcf_volume_mapping) @@ -291,7 +320,7 @@ def run_gvanno(arg_dict, host_directories): if arg_dict['container'] == 'docker': container_command_run1 = container_command_run1 + " -w=/workdir/output " + str(DOCKER_IMAGE_VERSION) + " sh -c \"" elif arg_dict['container'] == 'singularity': - container_command_run1 = container_command_run1 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" + container_command_run1 = container_command_run1 + " -W /workdir/output " + arg_dict['sif_file'] + " sh -c \"" if arg_dict['container'] == 'docker': container_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(output_volume_mapping) @@ -299,7 +328,7 @@ def run_gvanno(arg_dict, host_directories): docker_command_run_end = '\"' elif 
arg_dict['container'] == 'singularity': container_command_run2 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + str(output_volume_mapping) - container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" + container_command_run2 = container_command_run2 + " -W /workdir/output " + arg_dict['sif_file'] + " sh -c \"" docker_command_run_end = '\"' if arg_dict['debug']: @@ -311,96 +340,71 @@ def run_gvanno(arg_dict, host_directories): logger.info("--- Generic variant annotation (gvanno) workflow ----") logger.info("Sample name: " + str(arg_dict['sample_id'])) logger.info("Genome assembly: " + str(arg_dict['genome_assembly'])) - print() - - ## GVANNO|validate - verify input file (contents/format) - logger = getlogger('gvanno-validate-input') - logger.info("STEP 0: Validate input data") - vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \ - str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end - if arg_dict['debug']: - logger.info(vcf_validate_command) - - check_subprocess(vcf_validate_command) - logger.info('Finished') if not input_vcf_docker == 'None': - ## Define input, output and temporary file names - output_vcf = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_' + str(arg_dict['genome_assembly']) + '.vcf.gz') - output_tsv = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_' + str(arg_dict['genome_assembly']) + '.tsv') - output_pass_vcf = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_pass_' + str(arg_dict['genome_assembly']) + '.vcf.gz') - output_pass_tsv = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_pass_' + str(arg_dict['genome_assembly']) + '.tsv') - input_vcf_gvanno_ready = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf.gz',host_directories['input_vcf_basename_host'])) - vep_vcf = 
re.sub(r'(\.vcf$|\.vcf\.gz$)','.vep.vcf',input_vcf_gvanno_ready) - vep_vcfanno_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.vep.vcfanno.vcf',input_vcf_gvanno_ready) - vep_vcfanno_annotated_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated',vep_vcfanno_vcf) + '.gz' - vep_vcfanno_annotated_pass_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated.pass',vep_vcfanno_vcf) + '.gz' + # Define temporary output file names + prefix = os.path.join(output_dir, f'{conf_options["sample_id"]}_gvanno_{conf_options["genome_assembly"]}') - ## Path for human genome assembly and human ancestor (FASTA) - fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(VEP_VERSION) + "_" + str(VEP_ASSEMBLY), "Homo_sapiens." + str(VEP_ASSEMBLY) + ".dna.primary_assembly.fa.gz") - ancestor_assembly = os.path.join(vep_dir, "homo_sapiens", str(VEP_VERSION) + "_" + str(VEP_ASSEMBLY), "human_ancestor.fa.gz") - - ## List all VEP flags used when calling VEP - loftee_dir = '/opt/vep/src/ensembl-vep/modules' - plugins_in_use = "NearestExonJB" - vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomade --af_gnomadg --variant_class --domains --symbol --protein --ccds " + \ - "--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \ - "--xref_refseq --plugin NearestExonJB,max_range=50000" - vep_options = "--vcf --quiet --check_ref --flag_pick_allele_gene --pick_order " + str(arg_dict['vep_pick_order']) + \ - " --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \ - str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data" - - gencode_set_in_use = "GENCODE - all transcripts" - if arg_dict['vep_gencode_all'] == 0: - vep_options = vep_options + " --gencode_basic" - gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)" - if arg_dict['vep_skip_intergenic'] == 1: - vep_options = vep_options + " --no_intergenic" - if arg_dict['vep_coding_only'] == 1: - 
vep_options = vep_options + " --coding_only" - if arg_dict['vep_regulatory'] == 1: - vep_options = vep_options + " --regulatory" - if arg_dict['vep_lof_prediction'] == 1: - plugins_in_use = plugins_in_use + ", LoF" - vep_options += " --plugin LoF,loftee_path:" + loftee_dir + ",human_ancestor_fa:" + str(ancestor_assembly) + ",use_gerp_end_trunc:0 --dir_plugins " + loftee_dir - - ## Compose full VEP command - vep_main_command = str(container_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_vcf) + \ - " " + str(vep_options) + " --buffer_size " + str(arg_dict['vep_buffer_size']) + " --fasta " + str(fasta_assembly) + docker_command_run_end - vep_bgzip_command = container_command_run1 + "bgzip -f -c " + str(vep_vcf) + " > " + str(vep_vcf) + ".gz" + docker_command_run_end - vep_tabix_command = str(container_command_run1) + "tabix -f -p vcf " + str(vep_vcf) + ".gz" + docker_command_run_end - - ## GVANNO|VEP - run consequence annotation with Variant Effect Predictor - logger = getlogger('gvanno-vep') - print() - logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (v" + str(VEP_VERSION) + ", GENCODE " + str(GENCODE_VERSION) + ", " + str(arg_dict['genome_assembly']) + ")") - logger.info("VEP configuration - one primary consequence block pr. 
alternative allele (--flag_pick_allele)") - logger.info("VEP configuration - transcript pick order: " + str(arg_dict['vep_pick_order'])) - logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options") - logger.info("VEP configuration - GENCODE set: " + str(gencode_set_in_use)) - logger.info("VEP configuration - buffer size: " + str(arg_dict['vep_buffer_size'])) - logger.info("VEP configuration - skip intergenic: " + str(arg_dict['vep_skip_intergenic'])) - logger.info("VEP configuration - coding only: " + str(arg_dict['vep_coding_only'])) - logger.info("VEP configuration - look for overlap with regulatory regions: " + str(arg_dict['vep_regulatory'])) - logger.info("VEP configuration - number of forks: " + str(arg_dict['vep_n_forks'])) - logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction'])) - logger.info("VEP configuration - plugins in use: " + str(plugins_in_use)) - - if arg_dict['debug']: - logger.info(vep_main_command) - check_subprocess(vep_main_command) - check_subprocess(vep_bgzip_command) - check_subprocess(vep_tabix_command) - logger.info("Finished") - - ## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources - print() + input_vcf_validated = f'{prefix}.gvanno_ready.vcf' + vep_vcf = f'{prefix}.vep.vcf' + vep_vcfanno_vcf = f'{prefix}.vep.vcfanno.vcf' + vep_vcfanno_summarised_vcf = f'{prefix}.vep.vcfanno.summarised.vcf' + vep_vcfanno_summarised_pass_vcf = f'{prefix}.vep.vcfanno.summarised.pass.vcf' + output_vcf = f'{prefix}.vcf.gz' + output_pass_vcf = f'{prefix}.pass.vcf.gz' + output_vcf2tsv = f'{prefix}.vcf2tsv.tsv' + output_pass_vcf2tsv = f'{prefix}.pass.vcf2tsv.tsv' + output_pass_tsv = f'{prefix}.pass.tsv.gz' + + # gvanno|validate_input - verify that VCF is of appropriate format + logger = getlogger("gvanno-validate-input") + print('') + logger.info("gvanno - STEP 0: Validate input data and 
options") + + vcf_validate_command = ( + f'{container_command_run1}' + f'gvanno_validate_input.py ' + f'{data_dir} ' + f'{input_vcf_docker} ' + f'{input_vcf_validated} ' + f'{conf_options["genome_assembly"]} ' + f'{conf_options["sample_id"]} ' + f'--output_dir {output_dir} ' + f'{"--debug " if debug else ""}' + f'{docker_command_run_end}' + ) + check_subprocess(vcf_validate_command) + logger.info('Finished gvanno-validate-input') + print('----') + logger = getlogger("gvanno-run-vep") + logger.info("gvanno - STEP 1: Variant Effect Predictor (VEP)") + gvanno_vep_command = ( + f'{container_command_run1}' + f'gvanno_vep.py ' + f'{vep_dir} ' + f'{input_vcf_validated}.gz ' + f'{vep_vcf} ' + f'{conf_options["genome_assembly"]} ' + f'{conf_options["conf"]["vep"]["vep_pick_order"]} ' + f'{"--vep_regulatory" if conf_options["conf"]["vep"]["vep_regulatory"] else ""} ' + f'--vep_buffer_size {int(conf_options["conf"]["vep"]["vep_buffer_size"])} ' + f'--vep_n_forks {int(conf_options["conf"]["vep"]["vep_n_forks"])} ' + f'{"--vep_gencode_basic" if conf_options["conf"]["vep"]["vep_gencode_basic"] else ""} ' + f'{"--vep_lof_prediction" if conf_options["conf"]["vep"]["vep_lof_prediction"] else ""} ' + f'{"--vep_no_intergenic" if conf_options["conf"]["vep"]["vep_no_intergenic"] else ""} ' + f'{"--debug " if debug else ""}' + f'{docker_command_run_end}' + ) + check_subprocess(gvanno_vep_command) + + ## gvanno|vcfanno - annotate VCF against a number of variant annotation resources + print("----") logger = getlogger('gvanno-vcfanno') logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (Clinvar, ncER, dbNSFP, GWAS catalog)") logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes'])) gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \ - " --dbnsfp --clinvar --ncer --gvanno_xref --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + 
\ + " --dbnsfp --gene_transcript_xref --clinvar --ncer --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ " " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end if arg_dict['debug']: @@ -408,28 +412,37 @@ def run_gvanno(arg_dict, host_directories): check_subprocess(gvanno_vcfanno_command) logger.info("Finished") - ## GVANNO|summarise - expand annotations in VEP and vcfanno-annotated VCF file - print() + ## gvanno|summarise - expand annotations in VEP and vcfanno-annotated VCF file + print("----") logger = getlogger("gvanno-summarise") logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise") logger.info("Configuration - oncogenicity classification: " + str(int(arg_dict['oncogenicity_annotation']))) - gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \ - os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \ - " " + str(int(arg_dict['oncogenicity_annotation'])) + " " + str(int(arg_dict['vep_regulatory'])) + " " + \ - str(int(arg_dict['debug'])) + docker_command_run_end + gvanno_summarise_command = ( + f'{container_command_run2}' + f'gvanno_summarise.py ' + f'{vep_vcfanno_vcf}.gz ' + f'{vep_vcfanno_summarised_vcf} ' + f'{int(arg_dict["vep_regulatory"])} ' + f'{int(arg_dict["oncogenicity_annotation"])} ' + f'{conf_options["conf"]["vep"]["vep_pick_order"]} ' + f'{data_dir_assembly} ' + f'{"--debug " if debug else ""}' + f'--compress_output_vcf ' + f'{docker_command_run_end}' + ) if arg_dict['debug']: logger.info(gvanno_summarise_command) check_subprocess(gvanno_summarise_command) logger.info("Finished") - ## GVANNO|clean - move output files and clean up temporary files - create_output_vcf_command1 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_vcf) + ' ' + str(output_vcf) + "\"" - create_output_vcf_command2 = str(container_command_run2) + 'mv ' + 
str(vep_vcfanno_annotated_vcf) + '.tbi ' + str(output_vcf) + '.tbi' + "\"" - create_output_vcf_command3 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + ' ' + str(output_pass_vcf) + "\"" - create_output_vcf_command4 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + '.tbi ' + str(output_pass_vcf) + '.tbi' + "\"" - clean_command = str(container_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_annotated_vcf) + ' ' + \ - str(vep_vcfanno_annotated_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_gvanno_ready) + "* " + docker_command_run_end + ## gvanno|clean - move output files and clean up temporary files + create_output_vcf_command1 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_vcf) + '.gz ' + str(output_vcf) + "\"" + create_output_vcf_command2 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_vcf) + '.gz.tbi ' + str(output_vcf) + '.tbi' + "\"" + create_output_vcf_command3 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_pass_vcf) + '.gz ' + str(output_pass_vcf) + "\"" + create_output_vcf_command4 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_pass_vcf) + '.gz.tbi ' + str(output_pass_vcf) + '.tbi' + "\"" + clean_command = str(container_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_summarised_vcf) + ' ' + \ + str(vep_vcfanno_summarised_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_validated) + "* " + docker_command_run_end check_subprocess(create_output_vcf_command1) check_subprocess(create_output_vcf_command2) check_subprocess(create_output_vcf_command3) @@ -437,21 +450,41 @@ def run_gvanno(arg_dict, host_directories): if not arg_dict['debug']: check_subprocess(clean_command) - print() - ## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv + print("----") + ## gvanno|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv logger = 
getlogger("gvanno-vcf2tsv") logger.info("STEP 4: Converting genomic VCF to TSV with https://github.com/sigven/vcf2tsvpy") - gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_tsv) + docker_command_run_end - gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_tsv) + docker_command_run_end + gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_vcf2tsv) + docker_command_run_end + gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_vcf2tsv) + docker_command_run_end logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only") check_subprocess(gvanno_vcf2tsv_command_pass) logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants") check_subprocess(gvanno_vcf2tsv_command_all) logger.info("Finished") + + print("----") + ## gvanno|finalize - append ClinVar traits, gene names and protein domain annotations to the final TSV + logger = getlogger("gvanno-finalize") + logger.info("STEP 5: Appending ClinVar traits, official gene names, and protein domain annotations") + gvanno_append_tsv_command = ( + f'{container_command_run2}' + f'gvanno_finalize.py ' + f'{data_dir_assembly} ' + f'{output_pass_vcf2tsv}.gz ' + f'{output_pass_tsv} ' + f'{arg_dict["genome_assembly"]} ' + f'{arg_dict["sample_id"]}' + f'{docker_command_run_end}' + ) + check_subprocess(gvanno_append_tsv_command) + logger.info("Finished") + if not debug: + clean_command2 = str(container_command_run2) + 'rm -f ' + str(output_pass_vcf2tsv) + '* ' + \ + str(output_vcf2tsv) + "* " + docker_command_run_end + check_subprocess(clean_command2) 
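The refactored workflow steps above all follow the same pattern: a container-wrapped command string is assembled with f-strings, where optional flags are spliced in via conditional fragments so a disabled option contributes an empty string rather than a dangling token. A minimal stand-alone sketch of that pattern (the function name and example flags are illustrative, not part of gvanno):

```python
def compose_step_command(container_prefix, script, positionals, toggles, debug=False):
    """Build a container-wrapped command string from positional args and boolean toggles.

    container_prefix is assumed to end with the opening quote of an `sh -c "` wrapper,
    mirroring container_command_run1/2 in the workflow; the final '"' closes it.
    """
    # Only enabled toggles contribute a flag; disabled ones vanish entirely
    flag_part = ' '.join(flag for flag, enabled in toggles.items() if enabled)
    return (
        f'{container_prefix}'
        f'{script} '
        f'{" ".join(str(p) for p in positionals)} '
        f'{flag_part + " " if flag_part else ""}'
        f'{"--debug " if debug else ""}'
        '"'  # closes the sh -c "..." opened by the container prefix
    )

cmd = compose_step_command(
    'singularity exec gvanno.sif sh -c "',
    'gvanno_vep.py',
    ['input.vcf.gz', 'output.vep.vcf', 'grch38'],
    {'--vep_regulatory': True, '--vep_lof_prediction': False},
    debug=True,
)
```

The payoff of this style over the older `str(x) + " " + str(y)` concatenation is that each flag's presence is decided inline, in one readable expression per step.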
+ - #return - print if __name__=="__main__": __main__() diff --git a/src/Dockerfile b/src/Dockerfile index 1a0a2e2..c0a3f36 100644 --- a/src/Dockerfile +++ b/src/Dockerfile @@ -1,3 +1,4 @@ +ARG BRANCH=release/110 ################################################### # Stage 1 - docker container to build ensembl-vep # @@ -23,7 +24,7 @@ RUN apt-get update && apt-get -y install \ ENV OPT /opt/vep ENV OPT_SRC $OPT/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV BRANCH=release/109 +ARG BRANCH # Working directory WORKDIR $OPT_SRC @@ -100,7 +101,6 @@ RUN apt-get update && apt-get -y install \ perl \ perl-base \ unzip \ - wget \ vim && \ apt-get -y purge manpages-dev && \ apt-get clean && \ @@ -113,10 +113,10 @@ ENV PERL5LIB_TMP $PERL5LIB:$OPT_SRC/ensembl-vep:$OPT_SRC/ensembl-vep/modules ENV PERL5LIB $PERL5LIB_TMP:$OPT_SRC/bioperl-live ENV KENT_SRC $OPT/src/kent-335_base/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV MACHTYPE x86_64 ENV DEPS $OPT_SRC ENV PATH $OPT_SRC/ensembl-vep:$OPT_SRC/var_c_code:$PATH ENV LANG_VAR en_US.UTF-8 +ARG BRANCH # Create vep user RUN useradd -r -m -U -d "$OPT" -s /bin/bash -c "VEP User" -p '' vep && usermod -a -G sudo vep && mkdir -p $OPT_SRC @@ -139,7 +139,8 @@ RUN perl Makefile.PL && make && make install && rm -f Makefile* cpanfile WORKDIR $OPT_SRC # Install/compile more libraries -RUN ensembl-vep/travisci/build_c.sh && \ +RUN export MACHTYPE=$(uname -m) &&\ + ensembl-vep/travisci/build_c.sh && \ # Remove unused Bio-DB-HTS files rm -rf Bio-HTS/cpanfile Bio-HTS/Build.PL Bio-HTS/Build Bio-HTS/_build Bio-HTS/INSTALL.pl && \ # Install ensembl perl dependencies (cpanm) @@ -151,7 +152,9 @@ RUN ensembl-vep/travisci/build_c.sh && \ echo "$LANG_VAR UTF-8" >> /etc/locale.gen && locale-gen en_US.utf8 && \ /usr/sbin/update-locale LANG=$LANG_VAR && \ # Copy htslib executables. 
It also requires the packages 'zlib1g-dev', 'libbz2-dev' and 'liblzma-dev' - cp $HTSLIB_DIR/bgzip $HTSLIB_DIR/tabix $HTSLIB_DIR/htsfile /usr/local/bin/ + cp $HTSLIB_DIR/bgzip $HTSLIB_DIR/tabix $HTSLIB_DIR/htsfile /usr/local/bin/ && \ + # Remove CPAN cache + rm -rf /root/.cpanm ENV LC_ALL $LANG_VAR ENV LANG $LANG_VAR @@ -160,43 +163,32 @@ ENV LANG $LANG_VAR USER vep ENV PERL5LIB $PERL5LIB_TMP - -USER vep -ENV PERL5LIB $PERL5LIB_TMP +# Setup Docker environment for when users run VEP and INSTALL.pl in Docker image: +# - skip VEP updates in INSTALL.pl +ENV VEP_NO_UPDATE 1 +# - avoid Faidx/HTSLIB installation in INSTALL.pl +ENV VEP_NO_HTSLIB 1 +# - skip plugin installation in INSTALL.pl +ENV VEP_NO_PLUGINS 1 +# - set plugins directory for VEP and INSTALL.pl +ENV VEP_DIR_PLUGINS /plugins +ENV VEP_PLUGINSDIR $VEP_DIR_PLUGINS +WORKDIR $VEP_DIR_PLUGINS WORKDIR / ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules ADD NearestExonJB.pm $OPT/src/ensembl-vep/modules -#RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm - -# Final steps -WORKDIR $OPT_SRC/ensembl-vep # Update bash profile +WORKDIR $OPT_SRC/ensembl-vep RUN echo >> $OPT/.profile && \ echo PATH=$PATH:\$PATH >> $OPT/.profile && \ echo export PATH >> $OPT/.profile && \ - # Run INSTALL.pl and remove the ensemb-vep tests and travis - ./INSTALL.pl -a a -l -n && rm -rf t travisci .travis.yml - - -# Install dependencies for VEP plugins: -#USER root -#ENV PLUGIN_DEPS "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/config" -## - Ubuntu packages -#RUN curl -O "$PLUGIN_DEPS/ubuntu-packages.txt" && \ -# apt-get update && apt-get install -y --no-install-recommends \ -# $(sed s/\#.*//g ubuntu-packages.txt) && \ -# rm -rf /var/lib/apt/lists/* ubuntu-packages.txt -## - Perl modules -#RUN curl -O "$PLUGIN_DEPS/cpanfile" && \ -# cpanm --installdeps --with-recommends . 
&& \ -# rm -rf /root/.cpanm cpanfile -## - Python packages -#RUN curl -O "$PLUGIN_DEPS/requirements.txt" && \ -# python -m pip install --no-cache-dir -r requirements.txt && \ -# rm requirements.txt + # Install Ensembl API and plugins + ./INSTALL.pl --auto ap --plugins all --pluginsdir $VEP_DIR_PLUGINS --no_update --no_htslib && \ + # Remove the ensembl-vep tests and travis + rm -rf t travisci .travis.yml ###################################################### # Stage 3 - adding dependencies for gvanno analysis # @@ -210,17 +202,15 @@ RUN apt-get update \ && ln -s /usr/bin/python3 python \ && pip3 install --upgrade pip -RUN apt-get update && apt-get -y install apache2 apt-utils build-essential cpanminus curl git libmysqlclient-dev libpng-dev libssl-dev manpages mysql-client openssl perl perl-base unzip vim wget sudo +RUN apt-get update && apt-get -y install apache2 apt-utils build-essential curl git libpng-dev libssl-dev manpages openssl unzip vim wget sudo # install ensembl dependencies -RUN cpanm Test::Object PPI::Document Task::Weaken Test::SubCalls Test::Object DBI DBD::mysql Archive::Zip Perl::Critic Set::IntervalTree -RUN apt-get update && apt-get install apt-transport-https USER root WORKDIR / ENV PACKAGE_BIO="libhts2 bedtools" ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev" -ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9" +ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2" RUN apt-get update \ && apt-get install -y --no-install-recommends \ nano ed locales vim-tiny fonts-texgyre \ @@ -229,17 +219,12 @@ RUN apt-get update \ RUN apt-get autoremove -## Install vcfanno version v0.3.2 -RUN wget https://github.com/brentp/vcfanno/releases/download/v0.3.2/vcfanno_linux64 && \ +## Install vcfanno version v0.3.5 +RUN wget 
https://github.com/brentp/vcfanno/releases/download/v0.3.5/vcfanno_linux64 && \ mv vcfanno_linux64 vcfanno && \ mv vcfanno /usr/local/bin && \ chmod 755 /usr/local/bin/vcfanno -## Install Ensembl's Vcf-validator -RUN wget https://github.com/EBIvariation/vcf-validator/releases/download/v0.9.3/vcf_validator_linux && \ -mv vcf_validator_linux /usr/local/bin/vcf_validator && \ -chmod 755 /usr/local/bin/vcf_validator - USER root WORKDIR / @@ -259,20 +244,22 @@ RUN apt-get update \ USER root WORKDIR / +USER root ## vt - variant tool set + vcf2tsvpy - install using Conda ## primary use of vt in GVANNO: decomposition of multiallelic variants in a VCF file # install Conda -RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \ +RUN wget http://repo.continuum.io/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh -O miniconda.sh \ && chmod 0755 miniconda.sh RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"] RUN rm miniconda.sh # update conda & install vt + vcf2tsvpy -RUN /conda/bin/conda update conda -RUN /conda/bin/conda update python -RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy python=3.6 +#RUN /conda/bin/conda update conda +#RUN /conda/bin/conda update python +#RUN /conda/bin/conda install python=3.6 +RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy bcftools ## Clean Up RUN apt-get clean autoclean diff --git a/src/buildDocker.sh b/src/buildDocker.sh index 8962fd6..5e2522b 100755 --- a/src/buildDocker.sh +++ b/src/buildDocker.sh @@ -3,5 +3,5 @@ tar czvfh gvanno.tgz gvanno/ echo "Build the Docker Image" TAG=`date "+%Y%m%d"` -docker build -t sigven/gvanno:$TAG --rm=true . +docker build --no-cache -t sigven/gvanno:$TAG --rm=true . 
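The buildDocker.sh change above keeps the date-based tagging scheme (`TAG=$(date "+%Y%m%d")`) while adding `--no-cache`. For completeness, the same tagging logic could be driven from Python, consistent with the rest of the workflow; `build_image` and its `dry_run` switch are illustrative, not part of gvanno:

```python
import datetime
import subprocess

def build_image(repo='sigven/gvanno', context='.', dry_run=True):
    """Build the Docker image with a YYYYMMDD date tag, as buildDocker.sh does."""
    tag = datetime.date.today().strftime('%Y%m%d')
    cmd = ['docker', 'build', '--no-cache', '-t', f'{repo}:{tag}', '--rm=true', context]
    if dry_run:
        # Return the command string instead of invoking docker
        return ' '.join(cmd)
    subprocess.run(cmd, check=True)
    return f'{repo}:{tag}'
```

Date tags make it easy to line up an image with the data bundle release it was built against, at the cost of not being semantic versions.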
diff --git a/src/gvanno/gvanno_finalize.py b/src/gvanno/gvanno_finalize.py new file mode 100755 index 0000000..fa27cc6 --- /dev/null +++ b/src/gvanno/gvanno_finalize.py @@ -0,0 +1,32 @@ +#!/usr/bin/env python + +import os,re +import argparse + +from lib.gvanno.variant import append_annotations, clean_annotations +from lib.gvanno.utils import getlogger + + +def __main__(): + parser = argparse.ArgumentParser(description='Finalize gvanno-annotated TSV file - append final annotations') + parser.add_argument('gvanno_db_dir', help='Assembly-specific directory with gvanno data bundle files') + parser.add_argument('tsv_file_in', help='vcf2tsv-converted VCF file with gvanno-annotated variants (SNVs/InDels)') + parser.add_argument('tsv_file_out', help='TSV file with cleaned gvanno-annotated variants (SNVs/InDels)') + parser.add_argument('genome_assembly', help='Genome assembly') + parser.add_argument('sample_id', help='Sample identifier') + parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s") + args = parser.parse_args() + + logger = getlogger('gvanno-finalize') + + arg_dict = vars(args) + + variant_set = \ + append_annotations( + arg_dict['tsv_file_in'], gvanno_db_dir = arg_dict['gvanno_db_dir'], logger = logger) + variant_set = clean_annotations(variant_set, arg_dict['sample_id'], arg_dict['genome_assembly'], logger = logger) + variant_set.fillna('.').to_csv(arg_dict['tsv_file_out'], sep="\t", compression="gzip", index=False) + + +if __name__=="__main__": __main__() + diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py index bd57c2e..7f550d4 100755 --- a/src/gvanno/gvanno_summarise.py +++ b/src/gvanno/gvanno_summarise.py @@ -3,155 +3,200 @@ import csv import re import argparse -from cyvcf2 import VCF, Writer -import gzip +import cyvcf2 import os -import annoutils -import oncogenicity -logger = annoutils.getlogger('gvanno-gene-annotate') -csv.field_size_limit(500 * 1024 * 1024) +from 
lib.gvanno.annoutils import read_infotag_file, make_transcript_xref_map, read_genexref_namemap, write_pass_vcf, map_regulatory_variant_annotations +from lib.gvanno.vep import parse_vep_csq +from lib.gvanno.dbnsfp import vep_dbnsfp_meta_vcf, map_variant_effect_predictors +from lib.gvanno.oncogenicity import assign_oncogenicity_evidence +from lib.gvanno.mutation_hotspot import load_mutation_hotspots, match_csq_mutation_hotspot +from lib.gvanno.utils import error_message, check_subprocess, getlogger +from lib.gvanno.vep import parse_vep_csq +csv.field_size_limit(500 * 1024 * 1024) def __main__(): - - parser = argparse.ArgumentParser(description='Gene annotations from gvanno pipeline (SNVs/InDels)') - parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)') - parser.add_argument('gvanno_db_dir',help='gvanno data directory') - parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)') - parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)') - parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)') - parser.add_argument('debug',default=0,type=int,help='Output extensive debugging information') - args = parser.parse_args() - - extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.oncogenicity_annotation, args.regulatory_annotation, args.debug) - -def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, oncogenicity_annotation = 0, regulatory_annotation = 0, debug = 0): - """ - Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from - 1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc. - 2. Gene annotations, e.g. known oncogenes/tumor suppressors - 3. Cancer hotspot mutation sites - 4. 
Variant effect predictions - """ - - ## read VEP and PCGR tags to be appended to VCF file - vcf_infotags_meta = annoutils.read_infotag_file(logger, os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) - gvanno_xref_map = annoutils.read_genexref_namemap(logger, os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref_namemap.tsv')) - out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf) - - meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta) - dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms'] - vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap'] - vcf = VCF(query_vcf) - for tag in vcf_infotags_meta: - if lof_prediction == 0 and regulatory_annotation == 0: - if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - elif lof_prediction == 1 and regulatory_annotation == 0: - if not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - elif lof_prediction == 0 and regulatory_annotation == 1: - if not tag.startswith('LoF'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - else: - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - - - w = Writer(out_vcf, vcf) - current_chrom = None - num_chromosome_records_processed = 0 - num_records_filtered = 0 - - cancer_hotspots = annoutils.read_cancer_hotspots(logger, 
os.path.join(gvanno_db_directory,'cancer_hotspots', 'cancer_hotspots.tsv')) - - vcf_info_element_types = {} - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element: - identifier = str(header_element['ID']) - fieldtype = str(header_element['Type']) - vcf_info_element_types[identifier] = fieldtype - - for rec in vcf: - if current_chrom is None: - current_chrom = str(rec.CHROM) - num_chromosome_records_processed = 0 - else: - if str(rec.CHROM) != current_chrom: - logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom)) + parser = argparse.ArgumentParser(description='Summarise VEP annotations (gene/variant) from gvanno pipeline (SNVs/InDels)') + parser.add_argument('vcf_file_in', help='Bgzipped VCF file with VEP-annotated query variants (SNVs/InDels)') + parser.add_argument('vcf_file_out', help='Bgzipped VCF file with extended VEP-annotated query variants (SNVs/InDels)') + parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)') + parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)') + parser.add_argument('vep_pick_order', default="mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", + help=f"Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence") + parser.add_argument('gvanno_db_dir',help='gvanno data directory') + parser.add_argument('--compress_output_vcf', action="store_true", default=False, help="Compress output VCF file") + + parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s") + args = parser.parse_args() + + logger = getlogger('gvanno-gene-annotate') + + arg_dict = vars(args) + + 
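Both the old and the rewritten record loop track per-chromosome progress the same way: remember the current chromosome, and emit a summary each time it changes (plus once at the end). A minimal stand-alone sketch of that bookkeeping, using plain `(chrom, pos)` tuples in place of cyvcf2 records:

```python
def summarise_per_chromosome(records, log=print):
    """Log a per-chromosome variant count while streaming records in chromosome order."""
    current_chrom, n = None, 0
    for chrom, _pos in records:
        if current_chrom is not None and chrom != current_chrom:
            # Chromosome changed: flush the count for the one just finished
            log(f'Completed summary of functional annotations for {n} variants on chr{current_chrom}')
            n = 0
        current_chrom = chrom
        n += 1
    if current_chrom is not None:
        # Flush the final chromosome
        log(f'Completed summary of functional annotations for {n} variants on chr{current_chrom}')
```

This only works because VCF records arrive sorted by chromosome; an unsorted stream would log the same chromosome multiple times.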
extend_vcf_annotations(arg_dict, logger) + +def extend_vcf_annotations(arg_dict, logger): + """ + Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from + 1. CSQ elements for the primary (i.e. "picked") gene transcript consequence from VEP, e.g. SYMBOL, Feature, Gene, Consequence etc. + 2. Gene annotations (GENE_TRANSCRIPT_XREF), + 3. Variant effect predictions - dbNSFP + + Moreover, it performs two important matching procedures, using + 4. Information from VEP's CSQ information (HGVSp/HGVSc) to match known mutation hotspots in cancer + + Finally, it assesses somatic variant oncogenicity, using + 5. Gene annotations (tumor suppressor, oncogene) and variant annotations (loss-of-function, gnomAD variant frequencies, variant effect predictions). + Variant oncogenicity levels are provided for all variants using a recommended five-level scheme ("Oncogenic", "Likely oncogenic", "VUS", "Likely Benign", "Benign") + - Recommended scoring scheme for variant oncogenicity classification outlined by VICC/ClinGen consortia (Horak et al., Genet Med, 2022) + + List of VCF INFO tags appended by this procedure is defined by the 'infotags' files in the gvanno_db_dir + """ + + vcf_infotags = {} + + vcf_infotags['other'] = read_infotag_file(os.path.join(arg_dict['gvanno_db_dir'], 'vcf_infotags_gvanno.tsv'), scope = "gvanno") + vcf_infotags['vep'] = read_infotag_file(os.path.join(arg_dict['gvanno_db_dir'], 'vcf_infotags_vep.tsv'), scope = "vep") + vcf_infotags['other'].update(vcf_infotags['vep']) + vcf_info_metadata = vcf_infotags['other'] + + gene_transcript_xref_map = read_genexref_namemap( + os.path.join(arg_dict['gvanno_db_dir'], 'gene','tsv','gene_transcript_xref', 'gene_transcript_xref_bedmap.tsv.gz'), logger) + cancer_hotspots = load_mutation_hotspots( + os.path.join(arg_dict['gvanno_db_dir'], 'misc','tsv','hotspot', 'hotspot.tsv.gz'), logger) + + out_vcf = re.sub(r'(\.gz)$','',arg_dict['vcf_file_out']) + + meta_vep_dbnsfp_info = 
vep_dbnsfp_meta_vcf(arg_dict['vcf_file_in'], vcf_info_metadata) + dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms'] + vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap'] + + vcf = cyvcf2.VCF(arg_dict['vcf_file_in']) + for tag in sorted(vcf_info_metadata): + if arg_dict['regulatory_annotation'] == 0: + if not tag.startswith('REGULATORY_'): + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_info_metadata[tag]['description']),'Type':str(vcf_info_metadata[tag]['type']), 'Number': str(vcf_info_metadata[tag]['number'])}) + else: + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_info_metadata[tag]['description']),'Type':str(vcf_info_metadata[tag]['type']), 'Number': str(vcf_info_metadata[tag]['number'])}) + + w = cyvcf2.Writer(arg_dict['vcf_file_out'], vcf) + current_chrom = None + num_chromosome_records_processed = 0 + + vcf_info_element_types = {} + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element: + identifier = str(header_element['ID']) + fieldtype = str(header_element['Type']) + vcf_info_element_types[identifier] = fieldtype + + vars_no_csq = list() + for rec in vcf: + alt_allele = ','.join(rec.ALT) + pos = rec.start + 1 + variant_id = f"g.{rec.CHROM}:{pos}{rec.REF}>{alt_allele}" + if current_chrom is None: current_chrom = str(rec.CHROM) num_chromosome_records_processed = 0 - if rec.INFO.get('CSQ') is None: - num_records_filtered = num_records_filtered + 1 - #logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) 
- variant will be skipped') - continue - num_chromosome_records_processed += 1 - gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF") - - if regulatory_annotation == 1: - csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ') - - if 'picked_gene_csq' in csq_record_results_all: - vep_csq_records_all = csq_record_results_all['picked_gene_csq'] - rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all) - - vep_csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = debug) - - - principal_hgvsp = '.' - principal_hgvsc = '.' - if 'picked_csq' in vep_csq_record_results: - csq_record = vep_csq_record_results['picked_csq'] - for k in csq_record: - if k in vcf_info_element_types: - if vcf_info_element_types[k] == "Flag": - #rec.INFO[k] = False - if csq_record[k] == "1": - rec.INFO[k] = True - else: - if not csq_record[k] is None: - rec.INFO[k] = csq_record[k] - - if k == 'HGVSp_short': - principal_hgvsp = csq_record[k] + else: + if str(rec.CHROM) != current_chrom: + if not current_chrom is None: + logger.info(f"Completed summary of functional annotations for {num_chromosome_records_processed} variants on chr{current_chrom}") + current_chrom = str(rec.CHROM) + num_chromosome_records_processed = 0 + if rec.INFO.get('CSQ') is None: + + vars_no_csq.append(variant_id) + continue + + num_chromosome_records_processed += 1 + transcript_xref_map = make_transcript_xref_map(rec, gene_transcript_xref_map, xref_tag = "GENE_TRANSCRIPT_XREF") + + vep_csq_record_results = {} + vep_csq_record_results = \ + parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, arg_dict['vep_pick_order'], + logger, pick_only = False, csq_identifier = 'CSQ') + + if 'picked_gene_csq' in vep_csq_record_results and bool(arg_dict['regulatory_annotation']) 
is True: + rec.INFO['REGULATORY_ANNOTATION'] = map_regulatory_variant_annotations( + vep_csq_record_results['picked_gene_csq']) + + principal_csq_properties = {} + principal_csq_properties['hgvsp'] = '.' + principal_csq_properties['hgvsc'] = '.' + principal_csq_properties['entrezgene'] = '.' + principal_csq_properties['exon'] = '.' + principal_csq_properties['codon'] = '.' + principal_csq_properties['lof'] = '.' + + if 'picked_csq' in vep_csq_record_results: + csq_record = vep_csq_record_results['picked_csq'] + for k in csq_record: + if k in vcf_info_element_types: + if vcf_info_element_types[k] == "Flag" and csq_record[k] == "1": + rec.INFO[k] = True + else: + if not csq_record[k] is None: + rec.INFO[k] = csq_record[k] + + if k == 'HGVSp_short': + principal_csq_properties['hgvsp'] = csq_record[k] + if re.match(r'^(p.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_csq_properties['hgvsp']): + codon_match = re.findall(r'[A-Z][0-9]{1,}', principal_csq_properties['hgvsp']) + if len(codon_match) == 1: + principal_csq_properties['codon'] = 'p.' 
+ codon_match[0] - if k == 'HGVSc': - principal_hgvsc = csq_record[k].split(':')[1] - #else: - # print("missing\t" + str(k)) - - if 'all_csq' in vep_csq_record_results: - rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq']) - annoutils.map_cancer_hotspots(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_hgvsp, principal_hgvsc) - - if not rec.INFO.get('DBNSFP') is None: - annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) - - if oncogenicity_annotation == 1: - oncogenicity.assign_oncogenicity_evidence(rec, tumortype = "Any") - w.write_record(rec) - w.close() - logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom)) - vcf.close() - logger.info("Number of variant calls filtered by VEP (No CSQ tag, '--vep_coding_only' / '--vep_skip_intergenic'): " + str(num_records_filtered)) - - if os.path.exists(out_vcf): - if os.path.getsize(out_vcf) > 0: - os.system('bgzip -f ' + str(out_vcf)) - os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') - annotated_vcf = out_vcf + '.gz' - annoutils.write_pass_vcf(annotated_vcf, logger) - else: - annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) - else: - annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) - -if __name__=="__main__": __main__() - - - + if k == 'HGVSc': + principal_csq_properties['hgvsc'] = csq_record[k].split(':')[1] + + if k == 'ENTREZGENE': + principal_csq_properties['entrezgene'] = csq_record[k] + + if k == 'LOSS_OF_FUNCTION': + principal_csq_properties['lof'] = csq_record[k] + + if k == 'EXON': + if "/" in csq_record[k]: + principal_csq_properties['exon'] = csq_record[k].split('/')[0] + + if 'all_csq' in vep_csq_record_results: + rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq']) + 
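As an aside, the codon extraction from the picked consequence's short HGVSp notation (seen just above) can be sketched as a standalone helper. This is a minimal, hypothetical re-implementation for illustration only (the regexes mirror the snippet, with the dot escaped; `extract_codon` is not a name from the codebase):

```python
import re

def extract_codon(hgvsp_short: str) -> str:
    """Pull the reference amino acid + position (e.g. 'p.V600') out of a
    short HGVS protein notation (e.g. 'p.V600E'). Returns '.' when the
    string does not look like a simple single-codon substitution."""
    if re.match(r'^(p\.[A-Z][0-9]{1,}[A-Za-z]{1,})', hgvsp_short):
        codon_match = re.findall(r'[A-Z][0-9]{1,}', hgvsp_short)
        # exactly one codon match -> unambiguous substitution site
        if len(codon_match) == 1:
            return 'p.' + codon_match[0]
    return '.'
```

Note that multi-codon events such as `p.L747_T751del` yield more than one match and therefore fall through to `'.'`, which mirrors the behavior of the hunk above.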
match_csq_mutation_hotspot(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_csq_properties) + + if not rec.INFO.get('DBNSFP') is None: + map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) + + if arg_dict['oncogenicity_annotation'] == 1: + assign_oncogenicity_evidence(rec, tumortype = "Any") + + if "GENE_TRANSCRIPT_XREF" in vcf_info_element_types: + gene_xref_tag = rec.INFO.get('GENE_TRANSCRIPT_XREF') + if not gene_xref_tag is None: + del rec.INFO['GENE_TRANSCRIPT_XREF'] + w.write_record(rec) + if vars_no_csq: + logger.warning(f"There were {len(vars_no_csq)} records with no CSQ tag from VEP (was --vep_no_intergenic flag set?). Skipping them and showing (up to) the first 100:") + print('----') + print(', '.join(vars_no_csq[:100])) + print('----') + w.close() + if current_chrom is not None: + logger.info(f"Completed summary of functional annotations for {num_chromosome_records_processed} variants on chr{current_chrom}") + vcf.close() + + if os.path.exists(arg_dict['vcf_file_out']): + if os.path.getsize(arg_dict['vcf_file_out']) > 0: + if arg_dict['compress_output_vcf'] is True: + check_subprocess(logger, f'bgzip -f {out_vcf}', debug=arg_dict['debug']) + check_subprocess(logger, f'tabix -f -p vcf {out_vcf}.gz', debug=arg_dict['debug']) + summarized_vcf = f'{arg_dict["vcf_file_out"]}.gz' + write_pass_vcf(summarized_vcf, logger) + else: + error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4', logger) + else: + error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4', logger) + +if __name__=="__main__": + __main__() diff --git a/src/gvanno/gvanno_validate_input.py b/src/gvanno/gvanno_validate_input.py index 8701a64..1d77835 100755 --- a/src/gvanno/gvanno_validate_input.py +++ b/src/gvanno/gvanno_validate_input.py @@ -7,118 +7,145 @@ import subprocess import logging import sys -import annoutils import pandas as np from cyvcf2 import VCF +from lib.gvanno.vcf import 
check_existing_vcf_info_tags +from lib.gvanno.annoutils import read_infotag_file +from lib.gvanno.utils import getlogger, random_id_generator, check_subprocess, remove_file + + def __main__(): parser = argparse.ArgumentParser(description='Verify input data for gvanno') parser.add_argument('gvanno_dir',help='Docker location of gvanno base directory with accompanying data directory, e.g. /data') parser.add_argument('input_vcf', help='VCF input file with query variants (SNVs/InDels)') - parser.add_argument('vcf_validation',type=int, help="Perform VCF validation with Ensembl's vcf-validator") + parser.add_argument('validated_vcf', help="Validated VCF file with query variants (SNVs/InDels)") parser.add_argument('genome_assembly',help='grch37 or grch38') + parser.add_argument('sample_id',help='Sample identifier') parser.add_argument('--output_dir', dest='output_dir', help='Output directory', default='/workdir/output') - - + parser.add_argument('--debug', action='store_true', help="Print debug messages") args = parser.parse_args() - ret = validate_gvanno_input(args.gvanno_dir, args.input_vcf, args.vcf_validation, args.genome_assembly, args.output_dir) + ret = validate_gvanno_input(args.gvanno_dir, args.input_vcf, args.validated_vcf, args.sample_id, args.genome_assembly, args.output_dir, args.debug) if ret != 0: sys.exit(-1) -def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, logger): - - """ - Function that compares the INFO tags in the query VCF and the INFO tags generated by gvanno - If any coinciding tags, an error will be returned - """ - - gvanno_infotags_desc = annoutils.read_infotag_file(logger, os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) - - vcf = VCF(input_vcf) - logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') - ret = 1 - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): - 
if header_element['HeaderType'] == 'INFO': - if header_element['ID'] in gvanno_infotags_desc.keys(): - err_msg = 'INFO tag ' + str(header_element['ID']) + ' in the query VCF coincides with a VCF annotation tag produced by gvanno - please remove or rename this tag in your query VCF' - return annoutils.error_message(err_msg, logger) - - logger.info('No query VCF INFO tags coincide with gvanno INFO tags') - return ret -def simplify_vcf(input_vcf, vcf, output_dir, logger): - - """ - Function that performs tre things on the validated input VCF: - 1. Strip of any genotype data - 2. If VCF have variants with multiple alternative alleles ("multiallelic", e.g. 'A,T'), these are decomposed into variants with a single alternative allele - 3. Final VCF file is sorted and indexed (bgzip + tabix) - """ - - input_vcf_gvanno_ready = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.tmp.vcf',os.path.basename(input_vcf))) - input_vcf_gvanno_ready_decomposed = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf',os.path.basename(input_vcf))) - - multiallelic_alt = 0 - for rec in vcf: - POS = rec.start + 1 - alt = ",".join(str(n) for n in rec.ALT) - if len(rec.ALT) > 1: - logger.warning("Multiallelic site detected:" + str(rec.CHROM) + '\t' + str(POS) + '\t' + str(rec.REF) + '\t' + str(alt)) - multiallelic_alt = 1 - - command_vcf_sample_free1 = 'egrep \'^#\' ' + str(input_vcf) + ' > ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free2 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep \'^[0-9]\' | sort -k1,1n -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free3 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free4 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep -v \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' 
+ str(input_vcf_gvanno_ready) - if input_vcf.endswith('.gz'): - command_vcf_sample_free1 = 'bgzip -dc ' + str(input_vcf) + ' | egrep \'^#\' > ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free2 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep \'^[0-9]\' | sort -k1,1n -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free3 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free4 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep -v \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - - os.system(command_vcf_sample_free1) - os.system(command_vcf_sample_free2) - os.system(command_vcf_sample_free3) - os.system(command_vcf_sample_free4) - - if multiallelic_alt == 1: - logger.info('Decomposing multi-allelic sites in input VCF file using \'vt decompose\'') - command_decompose = 'vt decompose -s ' + str(input_vcf_gvanno_ready) + ' > ' + str(input_vcf_gvanno_ready_decomposed) + ' 2> ' + os.path.join(output_dir, 'decompose.log') - os.system(command_decompose) - else: - command_copy = 'cp ' + str(input_vcf_gvanno_ready) + ' ' + str(input_vcf_gvanno_ready_decomposed) - os.system(command_copy) - os.system('bgzip -f ' + str(input_vcf_gvanno_ready_decomposed)) - os.system('tabix -p vcf ' + str(input_vcf_gvanno_ready_decomposed) + '.gz') - os.system('rm -f ' + str(input_vcf_gvanno_ready) + ' /workdir/output/decompose.log') - -def validate_gvanno_input(gvanno_directory, input_vcf, vcf_validation, genome_assembly, output_dir): + +def simplify_vcf(input_vcf, validated_vcf, vcf, output_dir, sample_id, logger, debug): + """ + input_vcf: path to input VCF + validated_vcf: path to validated VCF + vcf: parsed cyvcf2 object + Function that performs the following on the validated input VCF: + 1. 
Strip off any genotype data + 2. If VCF has variants with multiple alternative alleles ("multiallelic", e.g. 'A,T'), + these are decomposed into variants with a single alternative allele + 3. Final VCF file is sorted and indexed (bgzip + tabix) + """ + + random_id = random_id_generator(15) + + temp_files = {} + temp_files['vcf_1'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_1.vcf') + temp_files['vcf_2'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_2.vcf.gz') + temp_files['vcf_3'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_3.vcf.gz') + bcftools_simplify_log = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}.log') + vt_decompose_log = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.vt_decompose.{random_id}.log') + + multiallelic_list = list() + for rec in vcf: + POS = rec.start + 1 + alt = ",".join(str(n) for n in rec.ALT) + if len(rec.ALT) > 1: + variant_id = f"{rec.CHROM}:{POS}_{rec.REF}->{alt}" + multiallelic_list.append(variant_id) + + logger.info('Extracting variants on autosomal/sex/mito chromosomes only (1-22, X, Y, M/MT) with bcftools') + # bgzip + tabix required for sorting + cmd_vcf1 = f'bcftools view {input_vcf} | bgzip -cf > {temp_files["vcf_2"]} && tabix -p vcf {temp_files["vcf_2"]} && ' + \ + f'bcftools sort --temp-dir {output_dir} -Oz {temp_files["vcf_2"]} > {temp_files["vcf_3"]} 2> {bcftools_simplify_log} && ' + \ + f'tabix -p vcf {temp_files["vcf_3"]}' + # Keep only autosomal/sex/mito chrom (handle hg38 and hg19), remove FORMAT metadata lines, keep cols 1-8, sub chr prefix + chrom_to_keep = [str(x) for x in [*range(1,23), 'X', 'Y', 'M', 'MT']] + chrom_to_keep = ','.join([*['chr' + chrom for chrom in chrom_to_keep], *[chrom for chrom in chrom_to_keep]]) + cmd_vcf2 = f'bcftools view --regions {chrom_to_keep} {temp_files["vcf_3"]} | egrep -v \'^##FORMAT=\' ' + \ + f'| cut -f1-8 | sed \'s/^chr//\' > 
{temp_files["vcf_1"]}' + + check_subprocess(logger, cmd_vcf1, debug) + check_subprocess(logger, cmd_vcf2, debug) + + if multiallelic_list: + logger.warning(f"There were {len(multiallelic_list)} multiallelic sites detected. Showing (up to) the first 100:") + print('----') + print(', '.join(multiallelic_list[:100])) + print('----') + logger.info('Decomposing multi-allelic sites in input VCF file using \'vt decompose\'') + command_decompose = f'vt decompose -s {temp_files["vcf_1"]} > {validated_vcf} 2> {vt_decompose_log}' + check_subprocess(logger, command_decompose, debug) + else: + logger.info('All sites seem to be decomposed - skipping decomposition!') + check_subprocess(logger, f'cp {temp_files["vcf_1"]} {validated_vcf}', debug) + + keep_uncompressed = False + # need to keep uncompressed copy for vcf2maf.pl if selected + bgzip_cmd = f"bgzip -cf {validated_vcf} > {validated_vcf}.gz" if keep_uncompressed else f"bgzip -f {validated_vcf}" + check_subprocess(logger, bgzip_cmd, debug) + check_subprocess(logger, f'tabix -p vcf {validated_vcf}.gz', debug) + + if os.path.exists(f'{validated_vcf}.gz') and os.path.getsize(f'{validated_vcf}.gz') > 0: + vcf = VCF(f'{validated_vcf}.gz') + i = 0 + for rec in vcf: + i = i + 1 + if len(vcf.seqnames) == 0 or i == 0: + logger.info('') + logger.info("Input VCF contains NO valid variants after VCF cleaning - quitting workflow") + logger.info('') + exit(1) + + if not debug: + remove_file(temp_files["vcf_1"]) + remove_file(temp_files["vcf_2"]) + remove_file(temp_files["vcf_3"]) + remove_file(temp_files["vcf_2"] + str('.tbi')) + remove_file(temp_files["vcf_3"] + str('.tbi')) + remove_file(bcftools_simplify_log) + remove_file(vt_decompose_log) + +def validate_gvanno_input(gvanno_directory, input_vcf, validated_vcf, sample_id, genome_assembly, output_dir, debug): """ Function that reads the input file to gvanno (VCF file) and performs the following checks: - 1. 
Check that VCF file is properly formatted (according to EBIvariation/vcf-validator - VCF v4.2) - 2. Check that no INFO annotation tags in the query VCF coincides with those generated by gvanno - 3. Check that if VCF have variants with multiple alternative alleles (e.g. 'A,T') run vt decompose - 4. Any genotype data from VCF input file is stripped, and the resulting VCF file is sorted and indexed (bgzip + tabix) + 1. Check that no INFO annotation tags in the query VCF coincide with those generated by gvanno + 2. Check that, if the VCF has variants with multiple alternative alleles (e.g. 'A,T'), vt decompose is run + 3. Any genotype data from the VCF input file is stripped, and the resulting VCF file is sorted and indexed (bgzip + tabix) """ - logger = annoutils.getlogger('gvanno-validate-input') + logger = getlogger('gvanno-validate-input') if not input_vcf == 'None': - if vcf_validation == 1: - valid_vcf = annoutils.is_valid_vcf(input_vcf, output_dir, logger, False) - if valid_vcf == -1: - return -1 - else: - logger.info('Skipping validation of VCF file - as provided by option --no_vcf_validate') - tag_check = check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, logger) + vcf_object = VCF(input_vcf) + + ## Check that VCF does not contain INFO tags that will be appended with gvanno annotation + vcf_infotags = {} + vcf_infotags['gvanno'] = read_infotag_file( + os.path.join(gvanno_directory,'data',genome_assembly, 'vcf_infotags_gvanno.tsv'), scope = "gvanno") + vcf_infotags['vep'] = read_infotag_file( + os.path.join(gvanno_directory,'data',genome_assembly, 'vcf_infotags_vep.tsv'), scope = "vep") + vcf_infotags['gvanno'].update(vcf_infotags['vep']) + vcf_tags_gvanno = vcf_infotags['gvanno'] + + tag_check = check_existing_vcf_info_tags(vcf_object, vcf_tags_gvanno, logger) if tag_check == -1: - return -1 + return -1 vcf = VCF(input_vcf) - simplify_vcf(input_vcf, vcf, output_dir, logger) + simplify_vcf(input_vcf, validated_vcf, vcf, output_dir, sample_id, 
logger, debug) return 0 diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index d972a5d..1b5bfc7 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -1,138 +1,167 @@ #!/usr/bin/env python import argparse -from cyvcf2 import VCF import random -import annoutils -import os -import re -import sys +import re, os +import glob -logger = annoutils.getlogger('gvanno-vcfanno') +from lib.gvanno.vcf import get_vcf_info_tags, print_vcf_header +from lib.gvanno.utils import check_subprocess, random_id_generator, getlogger, remove_file +from lib.gvanno.annoutils import read_vcfanno_tag_file def __main__(): - parser = argparse.ArgumentParser(description='Run brentp/vcfanno - annotate a VCF file against multiple VCF files in parallel', formatter_class=argparse.ArgumentDefaultsHelpFormatter) - parser.add_argument('query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)') - parser.add_argument('out_vcf', help='Output VCF file with appended annotations from multiple VCF files') - parser.add_argument('gvanno_db_dir', help='gvanno data directory') - parser.add_argument('--num_processes', help="Number of processes vcfanno can use during annotation", default=4) - parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") - parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar") - parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions") - parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)") - parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)") - - - args = 
parser.parse_args() - query_info_tags = get_vcf_info_tags(args.query_vcf) - vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt' - vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml' - print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False) - run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas) - - -def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource): - for t in datasource_info_tags: - if t in query_info_tags: - logger.warning("Query VCF has INFO tag " + str(t) + ' - this is also present in the ' + str(datasource) + ' VCF/BED annotation file. This tag will be overwritten if not renamed in the query VCF') - append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname) - append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file) - -def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref, gwas): - """ - Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files - """ - clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP", - "CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT", - "CLINVAR_REVIEW_STATUS_STARS","CLINVAR_CLASSIFICATION"] - dbnsfp_info_tags = ["DBNSFP"] - gvanno_xref_info_tags = ["GVANNO_XREF"] - gwas_info_tags = ["GWAS_HIT"] - ncer_info_tags = ["NCER_PERCENTILE"] - - if clinvar is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, 
clinvar_info_tags, query_info_tags, "clinvar") - if dbnsfp is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, dbnsfp_info_tags, query_info_tags, "dbnsfp") - if gvanno_xref is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, gvanno_xref_info_tags, query_info_tags, "gvanno_xref") - if gwas is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, gwas_info_tags, query_info_tags, "gwas") - if ncer is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, ncer_info_tags, query_info_tags, "ncer") - - out_vcf_vcfanno_unsorted1 = output_vcf + '.tmp.unsorted.1' - query_prefix = re.sub('\.vcf.gz$','',query_vcf) - print_vcf_header(query_vcf, vcfheader_file, chromline_only = True) - command1 = "vcfanno -p=" + str(num_processes) + " " + str(vcfanno_conf_fname) + " " + str(query_vcf) + " > " + str(out_vcf_vcfanno_unsorted1) + " 2> " + str(query_prefix) + '.vcfanno.log' - os.system(command1) - - os.system('cat ' + str(vcfheader_file) + ' > ' + str(output_vcf)) - os.system('cat ' + str(out_vcf_vcfanno_unsorted1) + ' | grep -v \'^#\' >> ' + str(output_vcf)) - os.system('rm -f ' + str(output_vcf) + '.tmp*') - os.system('bgzip -f ' + str(output_vcf)) - os.system('tabix -f -p vcf ' + str(output_vcf) + '.gz') - return 0 - -def append_to_vcf_header(gvanno_db_directory, datasource, vcfheader_file): - """ - Function that appends the VCF header information for a given 'datasource' (containing INFO tag formats/descriptions, and datasource version) - """ - vcf_info_tags_file = str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcfanno.vcf_info_tags.txt' - os.system('cat ' + str(vcf_info_tags_file) + ' >> ' + str(vcfheader_file)) - - -def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, vcfanno_conf_fname): - """ - Function that appends 
data to a vcfanno conf file ('vcfanno_conf_fname') according to user-defined ('datasource'). - The datasource defines the set of tags that will be appended during annotation - """ - fh = open(vcfanno_conf_fname,'a') - if datasource != 'gvanno_xref' and datasource != 'ncer': - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcf.gz"\n') - fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]' - ops = ['concat'] * len(datasource_info_tags) - ops_string = 'ops=["' + '","'.join(ops) + '"]' - fh.write(fields_string + '\n') - fh.write(ops_string + '\n\n') - else: - if datasource == 'gvanno_xref' or datasource == 'ncer': - if datasource == 'gvanno_xref': - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') - fh.write('columns=[4]\n') - names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]' - fh.write(names_string +'\n') - fh.write('ops=["concat"]\n\n') - - else: ## ncER - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') - fh.write('columns=[4]\n') - names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]' - fh.write(names_string +'\n') - fh.write('ops=["mean"]\n\n') - fh.close() - return - -def get_vcf_info_tags(vcffile): - vcf = VCF(vcffile) - info_tags = {} - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): - if header_element['HeaderType'] == 'INFO': - info_tags[str(header_element['ID'])] = 1 - - return info_tags - - -def print_vcf_header(query_vcf, vcfheader_file, chromline_only = False): - if chromline_only == True: - os.system('bgzip -dc ' + str(query_vcf) + ' | egrep \'^#\' | egrep \'^#CHROM\' >> ' + str(vcfheader_file)) - else: - os.system('bgzip -dc ' + str(query_vcf) + ' | egrep 
\'^#\' | egrep -v \'^#CHROM\' > ' + str(vcfheader_file)) - -if __name__=="__main__": __main__() + parser = argparse.ArgumentParser(description='Run brentp/vcfanno - annotate a VCF file against multiple VCF files in parallel', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + parser.add_argument( + 'query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)') + parser.add_argument( + 'out_vcf', help='Output VCF file with appended annotations from multiple VCF files') + parser.add_argument( + 'gvanno_db_dir', help='gvanno assembly-specific data directory') + parser.add_argument( + '--num_processes', help="Number of processes vcfanno can use during annotation", default=4) + parser.add_argument("--clinvar", action="store_true", + help="Annotate VCF with annotations from ClinVar") + parser.add_argument("--ncer", action="store_true", + help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") + parser.add_argument("--dbnsfp", action="store_true", + help="Annotate VCF with annotations from database of non-synonymous functional predictions") + parser.add_argument("--gene_transcript_xref", action="store_true", + help="Annotate VCF with transcript annotations from PCGR (drug targets, actionable genes, cancer gene roles, etc)") + parser.add_argument("--gwas", action="store_true", + help="Annotate VCF against moderate-to-low cancer risk variants, as identified from genome-wide association studies (GWAS)") + parser.add_argument("--debug", action="store_true", default=False, + help="Print full commands to log, keep temporary and log files, default: %(default)s") + + args = parser.parse_args() + + logger = getlogger('gvanno-vcfanno') + + query_info_tags = get_vcf_info_tags(args.query_vcf) + vcfheader_file = args.out_vcf + '.tmp.' 
+ \ + str(random.randrange(0, 10000000)) + '.header.txt' + conf_fname = args.out_vcf + '.tmp.conf.toml' + print_vcf_header(args.query_vcf, vcfheader_file, + logger, chromline_only=False) + + vcfanno_tracks = {} + ## BED + vcfanno_tracks['gene_transcript_xref'] = args.gene_transcript_xref + + ## VCF + vcfanno_tracks['gwas'] = args.gwas + vcfanno_tracks['dbnsfp'] = args.dbnsfp + vcfanno_tracks['clinvar'] = args.clinvar + vcfanno_tracks['ncer'] = args.ncer + vcfanno_tracks['gene_transcript_xref'] = args.gene_transcript_xref + + run_vcfanno(args.num_processes, args.query_vcf, vcfanno_tracks, query_info_tags, vcfheader_file, + args.gvanno_db_dir, conf_fname, args.out_vcf, args.debug, logger) + + +def run_vcfanno(num_processes, query_vcf, vcfanno_tracks, query_info_tags, vcfheader_file, gvanno_db_dir, conf_fname, + output_vcf, debug, logger): + + """ + Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files + """ + + ## Collect metadata (VCF INFO tags) for annotations populated with vcfanno + metadata_vcf_infotags = {} + infotags = {} + + track_file_info = {} + + ## INFO tags used for vcfanno annotation + track_file_info['tags_fname'] = {} + + ## Source file (VCF/BED) used for vcfanno annotation + track_file_info['track_fname'] = {} + + for variant_track in ['clinvar','gwas','dbnsfp']: + track_file_info['tags_fname'][variant_track] = os.path.join(gvanno_db_dir,'variant','vcf', variant_track, f'{variant_track}.vcfanno.vcf_info_tags.txt') + track_file_info['track_fname'][variant_track] = os.path.join(gvanno_db_dir,'variant','vcf', variant_track, f'{variant_track}.vcf.gz') + + track_file_info['tags_fname']['gene_transcript_xref'] = os.path.join(gvanno_db_dir,'gene','bed', 'gene_transcript_xref', 'gene_transcript_xref.vcfanno.vcf_info_tags.txt') + track_file_info['track_fname']['gene_transcript_xref'] = os.path.join(gvanno_db_dir,'gene','bed', 'gene_transcript_xref', 'gene_transcript_xref.bed.gz') + 
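The per-track path bookkeeping being built here (a tags file and a source VCF per variant track under the assembly-specific data directory) can be sketched as a standalone helper. A minimal sketch, assuming the same directory layout as the hunk above; `build_track_paths` is a hypothetical name, not part of the codebase:

```python
import os

def build_track_paths(db_dir, variant_tracks=('clinvar', 'gwas', 'dbnsfp')):
    """Map each VCF-based annotation track to its vcfanno source file and to
    the sidecar file listing the INFO tags that track contributes."""
    paths = {'tags_fname': {}, 'track_fname': {}}
    for track in variant_tracks:
        base = os.path.join(db_dir, 'variant', 'vcf', track)
        # sidecar with the INFO tag definitions appended to the VCF header
        paths['tags_fname'][track] = os.path.join(base, f'{track}.vcfanno.vcf_info_tags.txt')
        # bgzipped VCF that vcfanno reads annotations from
        paths['track_fname'][track] = os.path.join(base, f'{track}.vcf.gz')
    return paths
```

BED-based tracks (`gene_transcript_xref`, `ncer`) live under different subdirectories and are added separately, as in the hunk above.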
track_file_info['tags_fname']['ncer'] = os.path.join(gvanno_db_dir,'misc','bed', 'ncer', 'ncer.vcfanno.vcf_info_tags.txt')
+    track_file_info['track_fname']['ncer'] = os.path.join(gvanno_db_dir,'misc','bed', 'ncer', 'ncer.bed.gz')
+
+
+    for track in track_file_info['tags_fname']:
+
+        if not vcfanno_tracks[track] is True:
+            continue
+
+        infotags_vcfanno = read_vcfanno_tag_file(track_file_info['tags_fname'][track], logger)
+        infotags[track] = infotags_vcfanno.keys()
+        for tag in infotags_vcfanno:
+            if tag in query_info_tags:
+                logger.warning("Query VCF has INFO tag " + str(tag) + ' - this is also present in the ' + str(
+                    track) + ' VCF/BED annotation file. This tag will be overwritten if not renamed in the query VCF')
+            metadata_vcf_infotags[tag] = infotags_vcfanno[tag]
+
+        ## append track to vcfanno configuration file
+        append_to_conf_file(track, infotags[track], track_file_info['track_fname'][track], conf_fname)
+
+        ## Append VCF INFO tags to VCF header file
+        check_subprocess(logger, f'cat {track_file_info["tags_fname"][track]} >> {vcfheader_file}', debug=False)
+
+    random_id = random_id_generator(10)
+    query_prefix = re.sub(r'\.vcf.gz$', '', query_vcf)
+    print_vcf_header(query_vcf, vcfheader_file, logger, chromline_only=True)
+
+    vcfanno_command = (
+        f"vcfanno -p={num_processes} {conf_fname} {query_vcf} > {query_prefix}.{random_id}.tmp.vcfanno.unsorted.vcf 2> "
+        f"{query_prefix}.{random_id}.tmp.vcfanno.log"
+    )
+
+    if debug:
+        logger.info(f"vcfanno command: {vcfanno_command}")
+    check_subprocess(logger, vcfanno_command, debug)
+
+    check_subprocess(
+        logger, f'cat {vcfheader_file} > {output_vcf}', debug=False)
+    check_subprocess(
+        logger, f"cat {query_prefix}.{random_id}.tmp.vcfanno.unsorted.vcf | grep -v '^#' >> {output_vcf}", debug=False)
+    check_subprocess(logger, f'bgzip -f {output_vcf}', debug)
+    check_subprocess(logger, f'tabix -f -p vcf {output_vcf}.gz', debug)
+    if not debug:
+        for intermediate_file in glob.glob(f"{query_prefix}.{random_id}.tmp.vcfanno*"):
+            remove_file(intermediate_file)
+
+    return
+
+
+def append_to_conf_file(datasource, datasource_info_tags, datasource_track_fname, conf_fname):
+    """
+    Function that appends an annotation block to a vcfanno configuration file ('conf_fname') for a user-defined data source ('datasource').
+    The datasource defines the set of tags that will be appended during annotation
+    """
+    fh = open(conf_fname, 'a')
+    fh.write('[[annotation]]\n')
+    fh.write('file="' + str(datasource_track_fname) + '"\n')
+    if datasource == 'ncer' or datasource == 'gerp':
+        fh.write('columns=[4]\n')
+        names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]'
+        fh.write(names_string + '\n')
+        fh.write('ops=["mean"]\n\n')
+    elif datasource == 'gene_transcript_xref' or datasource == 'rmsk':
+        fh.write('columns=[4]\n')
+        names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]'
+        fh.write(names_string + '\n')
+        fh.write('ops=["concat"]\n\n')
+    else:
+        fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]'
+        ops = ['concat'] * len(datasource_info_tags)
+        ops_string = 'ops=["' + '","'.join(ops) + '"]'
+        fh.write(fields_string + '\n')
+        fh.write(ops_string + '\n\n')
+    fh.close()
+    return
+
+
+
+if __name__ == "__main__":
+    __main__()
diff --git a/src/gvanno/gvanno_vep.py b/src/gvanno/gvanno_vep.py
new file mode 100755
index 0000000..ea85f58
--- /dev/null
+++ b/src/gvanno/gvanno_vep.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+
+import os,re
+import argparse
+
+from lib.gvanno.utils import get_loftee_dir, getlogger, check_subprocess
+from lib.gvanno import gvanno_vars
+
+
+def __main__():
+    parser = argparse.ArgumentParser(description='Run VEP on a VCF file (SNVs/InDels)')
+    parser.add_argument('vep_cache_dir', help='Directory with VEP cache files')
+    parser.add_argument('vcf_file_in', help='Bgzipped VCF file with validated query variants (SNVs/InDels)')
+    parser.add_argument('vcf_file_out', help='Bgzipped VCF file with VEP-annotated query variants (SNVs/InDels)')
+    parser.add_argument('genome_assembly', help='Genome assembly')
+    parser.add_argument('vep_pick_order', default="mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length",
+                        help="Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence")
+    parser.add_argument('--vep_regulatory', action="store_true", help='Inclusion of VEP regulatory annotations')
+
+    parser.add_argument('--vep_buffer_size', default=1000, type=int, help='Buffer size for VEP')
+    parser.add_argument('--vep_gencode_basic', action="store_true", help='Only consider basic GENCODE transcripts')
+    parser.add_argument('--vep_n_forks', default=4, type=int, help='Number of forks for VEP processing')
+    parser.add_argument('--vep_lof_prediction', action="store_true", help='Perform LoF prediction with the LOFTEE plugin in VEP')
+    parser.add_argument('--vep_coding_only', action="store_true", help="Only consider coding variants")
+    parser.add_argument('--vep_no_intergenic', action="store_true", help="Skip intergenic variants")
+    parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s")
+    args = parser.parse_args()
+
+    logger = getlogger('gvanno-vep')
+
+    arg_dict = vars(args)
+    conf_options = {}
+    conf_options['genome_assembly'] = arg_dict['genome_assembly']
+    conf_options['conf'] = {}
+    conf_options['conf']['vep'] = {}
+    conf_options['conf']['vep']['vep_n_forks'] = arg_dict['vep_n_forks']
+    conf_options['conf']['vep']['vep_pick_order'] = arg_dict['vep_pick_order']
+    conf_options['conf']['vep']['vep_buffer_size'] = arg_dict['vep_buffer_size']
+    conf_options['conf']['vep']['vep_gencode_basic'] = arg_dict['vep_gencode_basic']
+    conf_options['conf']['vep']['vep_regulatory'] = arg_dict['vep_regulatory']
+    conf_options['conf']['vep']['vep_lof_prediction'] = arg_dict['vep_lof_prediction']
+    conf_options['conf']['vep']['vep_coding_only'] = arg_dict['vep_coding_only']
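As an aside to the `append_to_conf_file()` helper shown earlier in this diff: the generic branch writes a TOML-style `[[annotation]]` block that vcfanno consumes. A minimal sketch of that string assembly, with the track file name and INFO tags invented purely for illustration:

```python
def build_vcfanno_block(track_fname, info_tags):
    # Sketch (assumption): mirrors the generic 'else' branch of append_to_conf_file(),
    # which annotates from a VCF track via 'fields' plus one 'concat' op per field.
    lines = ['[[annotation]]', 'file="' + str(track_fname) + '"']
    lines.append('fields = ["' + '","'.join(info_tags) + '"]')
    lines.append('ops=["' + '","'.join(['concat'] * len(info_tags)) + '"]')
    return '\n'.join(lines) + '\n\n'

print(build_vcfanno_block('clinvar.vcf.gz', ['CLINVAR_MSID', 'CLINVAR_CLNSIG']))
```

Appending one such block per data source to a single conf file is what lets a single vcfanno pass annotate the query VCF against all enabled tracks.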
+    conf_options['conf']['vep']['vep_no_intergenic'] = arg_dict['vep_no_intergenic']
+
+    run_vep(arg_dict['vep_cache_dir'], conf_options, arg_dict['vcf_file_in'], arg_dict['vcf_file_out'], logger, arg_dict['debug'])
+
+def run_vep(vep_cache_dir, conf_options, input_vcf, output_vcf, logger, debug = False):
+
+    output_vcf_gz = f'{output_vcf}.gz'
+    genome_assembly = conf_options['genome_assembly']
+
+    fasta_assembly = os.path.join(
+        vep_cache_dir, 'homo_sapiens',
+        f'{gvanno_vars.VEP_VERSION}_{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}',
+        f'Homo_sapiens.{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}.dna.primary_assembly.fa.gz')
+    ancestor_assembly = os.path.join(
+        vep_cache_dir, 'homo_sapiens',
+        f'{gvanno_vars.VEP_VERSION}_{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}',
+        f'human_ancestor.fa.gz')
+
+    plugins_in_use = "NearestExonJB, LoF"
+
+    # List all VEP flags used when calling VEP
+    vep_flags = (
+        f'--hgvs --af_gnomad --variant_class --domains --symbol --protein --ccds --mane '
+        f'--uniprot --appris --biotype --tsl --canonical --format vcf --cache --numbers '
+        f'--total_length --allele_number --failed 1 --no_stats --no_escape --xref_refseq --vcf '
+        f'--check_ref --dont_skip --flag_pick_allele_gene --plugin NearestExonJB,max_range=50000 '
+        f'--force_overwrite --species homo_sapiens --offline')
+    vep_options = (
+        f'--dir {vep_cache_dir} --assembly {gvanno_vars.VEP_ASSEMBLY[genome_assembly]} --cache_version {gvanno_vars.VEP_VERSION} '
+        f'--fasta {fasta_assembly} '
+        f'--pick_order {conf_options["conf"]["vep"]["vep_pick_order"]} '
+        f'--buffer_size {conf_options["conf"]["vep"]["vep_buffer_size"]} '
+        f'--fork {conf_options["conf"]["vep"]["vep_n_forks"]} '
+        f'{vep_flags} '
+        f'{"--verbose" if debug else "--quiet"}')
+
+    gencode_set_in_use = "GENCODE - all transcripts"
+    if conf_options['conf']['vep']['vep_no_intergenic'] is True:
+        vep_options += ' --no_intergenic'
+    if conf_options['conf']['vep']['vep_regulatory'] is True:
+        vep_options += ' --regulatory'
+    if conf_options['conf']['vep']['vep_coding_only'] is True:
+        vep_options += ' --coding_only'
+    if conf_options['conf']['vep']['vep_gencode_basic'] is True:
+        vep_options += ' --gencode_basic'
+        gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)"
+
+    ## LOFTEE plugin - variant loss-of-function annotation
+    loftee_dir = get_loftee_dir()
+    assert os.path.isdir(loftee_dir), f'LoF VEP plugin is not found in {loftee_dir}. Please make sure you have installed the pcgr conda package and have the corresponding conda environment active.'
+    vep_options += f" --plugin LoF,loftee_path:{loftee_dir},human_ancestor_fa:{ancestor_assembly},use_gerp_end_trunc:0 --dir_plugins {loftee_dir}"
+
+    logger.info(f'VEP configuration: Version: {gvanno_vars.VEP_VERSION}' + \
+                f', GENCODE release {gvanno_vars.GENCODE_VERSION[genome_assembly]}, genome assembly {conf_options["genome_assembly"]}')
+    logger.info(f'VEP configuration - one primary consequence block pr. alternative allele/gene (--flag_pick_allele_gene)')
+    logger.info(f'VEP configuration - transcript pick order: {conf_options["conf"]["vep"]["vep_pick_order"]}')
+    logger.info(f'VEP configuration - transcript pick order: see https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options')
+    logger.info(f'VEP configuration - GENCODE set: {gencode_set_in_use}')
+    logger.info(f'VEP configuration - skip intergenic variants: {"ON" if conf_options["conf"]["vep"]["vep_no_intergenic"] is True else "OFF"}')
+    logger.info(f'VEP configuration - regulatory variant annotation: {"ON" if conf_options["conf"]["vep"]["vep_regulatory"] is True else "OFF"}')
+    logger.info(f'VEP configuration - loss-of-function prediction: {"ON" if conf_options["conf"]["vep"]["vep_lof_prediction"] is True else "OFF"}')
+
+    logger.info((
+        f'VEP configuration - buffer size/number of forks: '
+        f'{conf_options["conf"]["vep"]["vep_buffer_size"]}/{conf_options["conf"]["vep"]["vep_n_forks"]}'))
+    logger.info(f'VEP - plugins in use: {plugins_in_use}')
+
+    # Compose full VEP command
+    vep_main_command = f'vep --input_file {input_vcf} --output_file {output_vcf} {vep_options}'
+    vep_bgzip_command = f'bgzip -f -c {output_vcf} > {output_vcf_gz}'
+    vep_tabix_command = f'tabix -f -p vcf {output_vcf_gz}'
+    if debug:
+        print(vep_main_command)
+
+    check_subprocess(logger, vep_main_command, debug)
+    check_subprocess(logger, vep_bgzip_command, debug)
+    check_subprocess(logger, vep_tabix_command, debug)
+    logger.info('Finished gvanno-vep')
+
+    return 0
+
+if __name__ == "__main__":
+    __main__()
+
diff --git a/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc b/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc
deleted file mode 100644
index 0da3ccd..0000000
Binary files a/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc and /dev/null differ
diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
deleted file mode 100755
index 5678c9a..0000000
--- a/src/gvanno/lib/annoutils.py
+++ /dev/null
@@ -1,985 +0,0 @@
-#!/usr/bin/env python
-
-import os,re,sys
-import csv
-import logging
-import gzip
-from cyvcf2 import VCF, Writer
-import subprocess
-
-
-csv.field_size_limit(500 * 1024 * 1024)
-threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
-                            'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
-                            'Tyr':'Y','Val':'V','Ter':'X'}
-
-
-
-def read_genexref_namemap(logger, gene_xref_namemap_tsv):
-    namemap_xref = {}  ## dictionary returned
-    if not os.path.exists(gene_xref_namemap_tsv):
-        logger.info("ERROR: File '" + str(gene_xref_namemap_tsv) + "' does not exist - exiting")
-        exit(1)
-        #return namemap_xref
-    tsvfile = open(gene_xref_namemap_tsv, 'r')
-    reader = csv.DictReader(tsvfile, delimiter='\t')
-    for row in reader:
-        namemap_xref[row['name']] = int(row['index'])
-    tsvfile.close()
-
-    return namemap_xref
-
-def read_infotag_file(logger, vcf_info_tags_tsv):
-    """
-    Function that reads a file that lists VCF INFO tags produced by
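Stepping outside the diff for a moment: the pattern in `run_vep()` above is plain string composition of the VEP command line from a nested options dictionary. A reduced, hypothetical sketch of that assembly (only a subset of the flags from the code above; all file names invented):

```python
def compose_vep_command(input_vcf, output_vcf, cache_dir, assembly,
                        fork=4, buffer_size=1000, debug=False):
    # Sketch of the command assembly in run_vep(); flag subset only (assumption),
    # the real function adds many more annotation flags and plugin options.
    flags = '--vcf --format vcf --cache --offline --no_stats --force_overwrite'
    opts = (f'--dir {cache_dir} --assembly {assembly} '
            f'--buffer_size {buffer_size} --fork {fork} {flags} '
            f'{"--verbose" if debug else "--quiet"}')
    return f'vep --input_file {input_vcf} --output_file {output_vcf} {opts}'

cmd = compose_vep_command('query.vcf.gz', 'query_vep.vcf', 'vep_cache', 'GRCh38')
```

Keeping the command as a single string matches how `check_subprocess()` shells it out, and makes the `--debug` echo of the full command trivial.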
-    PCGR/CPSR/gvanno.
-    An example line of the VCF info tag file is the following:
-
-    tag    number    type    description    category
-    Consequence    .    String    "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)."    vep
-
-    A dictionary is returned, with the tag as the key, and the full dictionary record as the value
-    """
-    info_tag_xref = {}  ## dictionary returned
-    if not os.path.exists(vcf_info_tags_tsv):
-        logger.info("ERROR: File '" + str(vcf_info_tags_tsv) + "' does not exist - exiting")
-        return info_tag_xref
-    tsvfile = open(vcf_info_tags_tsv, 'r')
-    reader = csv.DictReader(tsvfile, delimiter='\t')
-    for row in reader:
-        if not row['tag'] in info_tag_xref:
-            info_tag_xref[row['tag']] = row
-    tsvfile.close()
-
-    return info_tag_xref
-
-def check_subprocess(command):
-    # if debug:
-    #     logger.info(command)
-    try:
-        output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
-        if len(output) > 0:
-            print(str(output.decode()).rstrip())
-    except subprocess.CalledProcessError as e:
-        print(e.output.decode())
-        exit(0)
-
-
-def is_integer(n):
-    try:
-        float(n)
-    except ValueError:
-        return False
-    else:
-        return float(n).is_integer()
-
-def error_message(message, logger):
-    logger.error('')
-    logger.error(message)
-    logger.error('')
-    exit(1)
-
-def write_pass_vcf(annotated_vcf, logger):
-    """
-    Function that prints all PASS variants from a VCF file. The new VCF file appends '.pass.' to the filename.
-
-    """
-    out_vcf = re.sub(r'\.vcf\.gz$','.pass.vcf',annotated_vcf)
-    vcf = VCF(annotated_vcf)
-    w = Writer(out_vcf, vcf)
-
-    num_rejected = 0
-    num_pass = 0
-    for rec in vcf:
-        if rec.FILTER is None or rec.FILTER == 'None':
-            w.write_record(rec)
-            num_pass += 1
-        else:
-            num_rejected += 1
-
-    vcf.close()
-    w.close()
-
-    logger.info('Number of non-PASS/REJECTED variant calls: ' + str(num_rejected))
-    logger.info('Number of PASSed variant calls: ' + str(num_pass))
-    if num_pass == 0:
-        logger.warning('There are zero variants with a \'PASS\' filter in the VCF file')
-        os.system('bgzip -dc ' + str(annotated_vcf) + ' | egrep \'^#\' > ' + str(out_vcf))
-    #else:
-    os.system('bgzip -f ' + str(out_vcf))
-    os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz')
-
-    return
-
-def warn_message(message, logger):
-    logger.warning(message)
-
-def getlogger(logger_name):
-    logger = logging.getLogger(logger_name)
-    logger.setLevel(logging.DEBUG)
-
-    # create console handler and set level to debug
-    ch = logging.StreamHandler(sys.stdout)
-    ch.setLevel(logging.DEBUG)
-
-    # add ch to logger
-    logger.addHandler(ch)
-
-    # create formatter
-    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S")
-
-    # add formatter to ch
-    ch.setFormatter(formatter)
-
-    return logger
-
-
-def is_valid_vcf(input_vcf, output_dir, logger, debug):
-    """
-    Function that reads the output file of EBIvariation/vcf-validator and reports potential errors and validation status
-    """
-
-    logger.info('Validating VCF file with EBIvariation/vcf-validator (v0.9.3)')
-
-    ## Filename for output file from VCF validation
-    vcf_validation_output_file = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)', '.vcf_validator_output', os.path.basename(input_vcf)))
-
-    ## Command for running vcf-validator
-    command_v42 = 'vcf_validator -i ' + str(input_vcf) + ' -l warning -r text -o ' + str(output_dir) + ' > ' + str(vcf_validation_output_file) + ' 2>&1'
-    if debug:
-        logger.info(command_v42)
-    try:
-        subprocess.run(command_v42, stderr=subprocess.STDOUT, shell=True)
-    except subprocess.CalledProcessError as e:
-        print(e.output.decode())
-        exit(0)
-
-    ## read output file from vcf-validator and get name of log/error file
-    validation_results = {}
-    validation_results['validation_status'] = 0
-    validation_results['error_messages'] = []
-    validation_log_fname = None
-    if os.path.exists(vcf_validation_output_file):
-        f = open(vcf_validation_output_file, 'r')
-        for line in f:
-            if 'Text report written to' in line.rstrip():
-                validation_log_fname = re.split(r' : ', line.rstrip(), maxsplit=2)[1]
-        f.close()
-        if not debug:
-            check_subprocess('rm -f ' + str(vcf_validation_output_file))
-
-    if validation_log_fname is None:
-        err_msg = 'Cannot find file with error messages from vcf_validator'
-        return error_message(err_msg, logger)
-
-    if os.path.exists(validation_log_fname):
-        f = open(validation_log_fname, 'r')
-        for line in f:
-            if not re.search(r' \(warning\)$|^Reading from ', line.rstrip()):  ## ignore warnings
-                if line.startswith('Line '):
-                    validation_results['error_messages'].append('ERROR: ' + line.rstrip())
-                if 'the input file is valid' in line.rstrip():  ## valid VCF
-                    validation_results['validation_status'] = 1
-                if 'the input file is not valid' in line.rstrip():  ## non-valid VCF
-                    validation_results['validation_status'] = 0
-        f.close()
-        if not debug:
-            check_subprocess('rm -f ' + str(validation_log_fname))
-    else:
-        err_msg = str(validation_log_fname) + ' does not exist'
-        return error_message(err_msg, logger)
-
-    if validation_results['validation_status'] == 0:
-        error_string_42 = '\n'.join(validation_results['error_messages'])
-        validation_status = 'According to the VCF specification, the VCF file (' + str(input_vcf) + ') is NOT valid'
-        err_msg = validation_status + ':\n' + str(error_string_42)
-        return error_message(err_msg, logger)
-    else:
-        validation_status = 'According to the VCF specification, the VCF file ' + str(input_vcf) + ' is valid'
-        logger.info(validation_status)
-        return 0
-
-
-def map_regulatory_variant_annotations(vep_csq_records):
-    """
-    Function that considers an array of VEP CSQ records and appends all regulatory variant consequence annotations (open chromatin, TF_binding_site,
-    CTCF_binding_site, promoter (flanks), enhancers) into a single comma-separated string. Each individual regulatory annotation is formatted as:
-    ||||||||
-    """
-
-    regulatory_annotation = '.'
-    if len(vep_csq_records) == 1:
-        return regulatory_annotation
-
-    j = 0
-    regulatory_annotations = []
-    while j < len(vep_csq_records):
-        missing_key_annotation = False
-        for k in ['Feature_type','Consequence','Feature']:
-            if not k in vep_csq_records[j].keys():
-                missing_key_annotation = True
-
-        if missing_key_annotation is False:
-
-            ## RegulatoryFeature annotations - open chromatin, promoters (flanks), enhancers, CTCF binding sites
-            if vep_csq_records[j]['Feature_type'] == 'RegulatoryFeature':
-                biotype = ""
-                if re.match(r"^(enhancer|promoter|open|CTCF|TF_)", vep_csq_records[j]['BIOTYPE']):
-                    biotype = vep_csq_records[j]['BIOTYPE']
-
-                annotation = str(vep_csq_records[j]['Consequence']) + '|' + \
-                    str(vep_csq_records[j]['Feature_type']) + '|' + \
-                    str(vep_csq_records[j]['Feature']) + '|' + \
-                    str(biotype) + '|||||'
-
-                regulatory_annotations.append(annotation)
-
-            ## MotifFeature annotations (TF)
-            if vep_csq_records[j]['Feature_type'] == 'MotifFeature':
-                missing_motif_annotation = False
-                for annotation in ['MOTIF_NAME','MOTIF_POS','HIGH_INF_POS','MOTIF_SCORE_CHANGE','TRANSCRIPTION_FACTORS']:
-                    if not annotation in vep_csq_records[j].keys():
-                        missing_motif_annotation = True
-
-                if missing_motif_annotation is False:
-                    annotation = str(vep_csq_records[j]['Consequence']) + '|' + \
-                        str(vep_csq_records[j]['Feature_type']) + '|' + \
-                        str(vep_csq_records[j]['Feature']) + '|TF_binding_site|' + \
-                        str(vep_csq_records[j]['MOTIF_NAME']) + '|' + \
-                        str(vep_csq_records[j]['MOTIF_POS']) + '|' + \
-                        str(vep_csq_records[j]['HIGH_INF_POS']) + '|' + \
-                        str(vep_csq_records[j]['MOTIF_SCORE_CHANGE']) + '|' + \
-                        str(vep_csq_records[j]['TRANSCRIPTION_FACTORS'])
-
-                    regulatory_annotations.append(annotation)
-
-        j = j + 1
-
-    if len(regulatory_annotations) > 0:
-        regulatory_annotation = ','.join(regulatory_annotations)
-
-    return(regulatory_annotation)
-
-def get_correct_cpg_transcript(vep_csq_records):
-    """
-    Function that considers an array of VEP CSQ records and picks the most relevant consequence (and gene) from
-    neighbouring genes/transcripts of relevance for cancer predisposition (cpg = cancer predisposition gene)
-    """
-
-
-    csq_idx = 0
-    if len(vep_csq_records) == 1:
-        return csq_idx
-
-
-    ## some variants are assigned multiple transcript consequences
-    ## if cancer predisposition genes are in the vicinity of other genes, choose the cancer predisposition gene
-    ## if there are neighbouring cancer-predisposition genes, choose custom gene, preferring coding change (see below, KLLN/PTEN, XPC/TMEM43, NTHL1/TSC2)
-    csq_idx_dict = {}
-    for g in ['KLLN','PTEN','XPC','TMEM43','NTHL1','TSC2']:
-        csq_idx_dict[g] = {}
-        csq_idx_dict[g]['idx'] = -1
-        csq_idx_dict[g]['coding'] = False
-
-    j = 0
-    while j < len(vep_csq_records):
-        if 'CANCER_PREDISPOSITION_SOURCE' in vep_csq_records[j].keys() or 'GE_PANEL_ID' in vep_csq_records[j].keys():
-            csq_idx = j
-        if 'SYMBOL' in vep_csq_records[j].keys():
-            if vep_csq_records[j]['SYMBOL'] in csq_idx_dict.keys():
-                csq_idx_dict[str(vep_csq_records[j]['SYMBOL'])]['idx'] = j
-                if vep_csq_records[j]['CODING_STATUS'] == 'coding':
-                    csq_idx = j  # prefer coding over anything else
-                    csq_idx_dict[str(vep_csq_records[j]['SYMBOL'])]['coding'] = True
-        j = j + 1
-
-    if csq_idx_dict['KLLN']['idx'] != -1 and csq_idx_dict['PTEN']['idx'] != -1:
-        csq_idx = csq_idx_dict['PTEN']['idx']
-        if csq_idx_dict['KLLN']['coding'] is True:
-            csq_idx = csq_idx_dict['KLLN']['idx']
-
-    if csq_idx_dict['XPC']['idx'] != -1 and csq_idx_dict['TMEM43']['idx'] != -1:
-        csq_idx = csq_idx_dict['XPC']['idx']
-        if csq_idx_dict['TMEM43']['coding'] is True:
-            csq_idx = csq_idx_dict['TMEM43']['idx']
-
-    if csq_idx_dict['TSC2']['idx'] != -1 and csq_idx_dict['NTHL1']['idx'] != -1:
-        csq_idx = csq_idx_dict['TSC2']['idx']
-        if csq_idx_dict['NTHL1']['coding'] is True:
-            csq_idx = csq_idx_dict['NTHL1']['idx']
-
-    if csq_idx is None:
-        csq_idx = 0
-    return csq_idx
-
-
-def threeToOneAA(aa_change):
-
-    for three_letter_aa in threeLettertoOneLetterAA.keys():
-        aa_change = aa_change.replace(three_letter_aa, threeLettertoOneLetterAA[three_letter_aa])
-
-    aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)','fs',aa_change)
-    return aa_change
-
-def map_variant_effect_predictors(rec, algorithms):
-
-    dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms)
-    if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None:
-        return
-    gene_id = str(rec.INFO.get('Gene'))
-    consequence = str(rec.INFO.get('Consequence'))
-
-    dbnsfp_key = ''
-
-    found_key = 0
-    if not rec.INFO.get('HGVSp_short') is None and not rec.INFO.get('HGVSp_short') == '.':
-        aa_change = str(rec.INFO.get('HGVSp_short'))
-        dbnsfp_key = gene_id + ':' + str(aa_change)
-        if dbnsfp_key in dbnsfp_predictions:
-            found_key = 1
-
-    if found_key == 0 and re.search('splice_', consequence):
-        dbnsfp_key = gene_id
-
-    if dbnsfp_key != '':
-        if dbnsfp_key in dbnsfp_predictions:
-            rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key]
-            for algo_pred in rec.INFO['EFFECT_PREDICTIONS'].split('&'):
-                if algo_pred.startswith('sift:'):
-                    rec.INFO['DBNSFP_SIFT'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('provean:'):
-                    rec.INFO['DBNSFP_PROVEAN'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('m-cap:'):
-                    rec.INFO['DBNSFP_M_CAP'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutpred:'):
-                    rec.INFO['DBNSFP_MUTPRED'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('metarnn:'):
-                    rec.INFO['DBNSFP_META_RNN'] = str(algo_pred.split(':')[1])
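A side note to the deleted `threeToOneAA()` helper above: the three-to-one letter amino-acid conversion can be exercised in isolation. This sketch copies the mapping from the module header above into a local dictionary (trimmed frameshift regex; behaviour otherwise as in the original):

```python
import re

# Copied from the threeLettertoOneLetterAA dict in the deleted module above
THREE_TO_ONE = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
                'Ile':'I','Leu':'L','Lys':'K','Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
                'Tyr':'Y','Val':'V','Ter':'X'}

def three_to_one_aa(aa_change):
    # Replace each three-letter code with its one-letter equivalent, then
    # collapse frameshift notation (e.g. 'GfsX12' -> 'fs'), as in threeToOneAA().
    for three, one in THREE_TO_ONE.items():
        aa_change = aa_change.replace(three, one)
    return re.sub(r'[A-Z]fsX([0-9]+|\?)', 'fs', aa_change)

print(three_to_one_aa('p.Val600Glu'))  # the familiar BRAF V600E shorthand
```

Note the ordering subtlety: plain string replacement works here because the three-letter codes never overlap the one-letter outputs in HGVS protein strings.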
-                if algo_pred.startswith('fathmm:'):
-                    rec.INFO['DBNSFP_FATHMM'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('fathmm_mkl_coding:'):
-                    rec.INFO['DBNSFP_FATHMM_MKL'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutationtaster:'):
-                    rec.INFO['DBNSFP_MUTATIONTASTER'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutationassessor:'):
-                    rec.INFO['DBNSFP_MUTATIONASSESSOR'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('deogen2:'):
-                    rec.INFO['DBNSFP_DEOGEN2'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('primateai:'):
-                    rec.INFO['DBNSFP_PRIMATEAI'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('list_s2:'):
-                    rec.INFO['DBNSFP_LIST_S2'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('gerp_rs:'):
-                    rec.INFO['DBNSFP_GERP'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('bayesdel_addaf:'):
-                    rec.INFO['DBNSFP_BAYESDEL_ADDAF'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('aloft:'):
-                    rec.INFO['DBNSFP_ALOFTPRED'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('splice_site_rf:'):
-                    rec.INFO['DBNSFP_SPLICE_SITE_RF'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('splice_site_ada:'):
-                    rec.INFO['DBNSFP_SPLICE_SITE_ADA'] = str(algo_pred.split(':')[1])
-
-
-
-def detect_reserved_info_tag(tag, tag_name, logger):
-    reserved_tags = ['AA','AC','AF','AN','BQ','CIGAR','DB','DP','END','H2','H3','MQ','MQ0','NS','SB','SOMATIC','VALIDATED','1000G']
-    if tag in reserved_tags:
-        err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (INFO)'
-        return error_message(err_msg, logger)
-
-    reserved_format_tags = ['GT','DP','FT','GL','GLE','GQ','PL','HQ','PS','PQ','EC','MQ']
-    if tag in reserved_format_tags:
-        err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (FORMAT)'
-        return error_message(err_msg, logger)
-
-
-def assign_cds_exon_intron_annotations(csq_record):
-    csq_record['CODING_STATUS'] = 'noncoding'
-    csq_record['EXONIC_STATUS'] = 'nonexonic'
-    csq_record['SPLICE_DONOR_RELEVANT'] = False
-    csq_record['NULL_VARIANT'] = False
-    csq_record['INTRON_POSITION'] = 0
-    csq_record['EXON_POSITION'] = 0
-
-    coding_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|protein_altering|inframe_)"
-    wes_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|inframe_|protein_altering|synonymous)"
-    null_pattern = r"^(stop_|frameshift_)"
-    if re.match(coding_csq_pattern, str(csq_record['Consequence'])):
-        csq_record['CODING_STATUS'] = 'coding'
-
-    if re.match(wes_csq_pattern, str(csq_record['Consequence'])):
-        csq_record['EXONIC_STATUS'] = 'exonic'
-
-    if re.match(null_pattern, str(csq_record['Consequence'])):
-        csq_record['NULL_VARIANT'] = True
-
-    if re.match(r"^splice_region_variant", str(csq_record['Consequence'])) and re.search(r'(\+3(A|G)>|\+4G>|\+5G>)', str(csq_record['HGVSc'])):
-        csq_record['SPLICE_DONOR_RELEVANT'] = True
-
-    if re.match(r"splice_region_variant|intron_variant", str(csq_record['Consequence'])):
-        match = re.search(r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc']))
-        if match is not None:
-            pos = re.sub(r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)", "", match.group(0))
-            if is_integer(pos):
-                csq_record['INTRON_POSITION'] = int(pos)
-
-    if 'NearestExonJB' in csq_record.keys():
-        if not csq_record['NearestExonJB'] is None:
-            if re.match(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) and str(csq_record['NearestExonJB']) != "":
-                exon_pos_info = csq_record['NearestExonJB'].split("+")
-                if len(exon_pos_info) == 4:
-                    if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end":
-                        csq_record['EXON_POSITION'] = -int(exon_pos_info[1])
-                    if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start":
-                        csq_record['EXON_POSITION'] = int(exon_pos_info[1])
-
-
-    for m in ['HGVSp_short','CDS_CHANGE']:
-        csq_record[m] = '.'
-    if not csq_record['HGVSc'] is None:
-        if csq_record['HGVSc'] != '.':
-            if 'splice_acceptor_variant' in csq_record['Consequence'] or 'splice_donor_variant' in csq_record['Consequence']:
-                key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc'])
-                csq_record['CDS_CHANGE'] = key
-    if csq_record['Amino_acids'] is None or csq_record['Protein_position'] is None or csq_record['Consequence'] is None:
-        return
-    if not csq_record['Protein_position'] is None:
-        if csq_record['Protein_position'].startswith('-'):
-            return
-
-    protein_change = '.'
-    if '/' in csq_record['Protein_position']:
-        protein_position = str(csq_record['Protein_position'].split('/')[0])
-        if '-' in protein_position:
-            if protein_position.split('-')[0].isdigit():
-                csq_record['AMINO_ACID_START'] = protein_position.split('-')[0]
-            if protein_position.split('-')[1].isdigit():
-                csq_record['AMINO_ACID_END'] = protein_position.split('-')[1]
-        else:
-            if protein_position.isdigit():
-                csq_record['AMINO_ACID_START'] = protein_position
-                csq_record['AMINO_ACID_END'] = protein_position
-
-    if not csq_record['HGVSp'] is None:
-        if csq_record['HGVSp'] != '.':
-            if ':' in csq_record['HGVSp']:
-                protein_identifier = str(csq_record['HGVSp'].split(':')[0])
-                if protein_identifier.startswith('ENSP'):
-                    protein_change_VEP = str(csq_record['HGVSp'].split(':')[1])
-                    protein_change = threeToOneAA(protein_change_VEP)
-
-    if 'synonymous_variant' in csq_record['Consequence']:
-        protein_change = 'p.' + str(csq_record['Amino_acids']) + str(protein_position) + str(csq_record['Amino_acids'])
-    if 'stop_lost' in str(csq_record['Consequence']) and '/' in str(csq_record['Amino_acids']):
-        protein_change = 'p.X' + str(protein_position) + str(csq_record['Amino_acids']).split('/')[1]
-
-    csq_record['HGVSp_short'] = protein_change
-    exon_number = 'NA'
-    if not csq_record['EXON'] is None:
-        if csq_record['EXON'] != '.':
-            if '/' in csq_record['EXON']:
-                exon_number = str(csq_record['EXON']).split('/')[0]
-                num_exons = str(csq_record['EXON']).split('/')[1]
-                if exon_number == num_exons:
-                    csq_record['LAST_EXON'] = True
-
-    if not csq_record['INTRON'] is None:
-        if csq_record['INTRON'] != '.':
-            if '/' in csq_record['INTRON']:
-                intron_number = str(csq_record['INTRON']).split('/')[0]
-                num_introns = str(csq_record['INTRON']).split('/')[1]
-                if intron_number == num_introns:
-                    csq_record['LAST_INTRON'] = True
-
-    if not csq_record['HGVSc'] is None:
-        if csq_record['HGVSc'] != '.':
-            if protein_change != '.':
-                key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc']) + ':exon' + str(exon_number) + ':' + str(protein_change)
-                csq_record['CDS_CHANGE'] = key
-
-    return
-
-def map_dbnsfp_predictions(dbnsfp_tag, algorithms):
-
-    effect_predictions = {}
-
-    for v in dbnsfp_tag.split(','):
-
-        dbnsfp_info = v.split('|')
-        if len(dbnsfp_info) == 1:
-            return effect_predictions
-        ref_aa = dbnsfp_info[0]
-        alt_aa = dbnsfp_info[1]
-        all_ids = dbnsfp_info[4].split('&')
-        unique_ids = {}
-        for s in all_ids:
-            unique_ids[s] = 1
-
-        isoform_aa_keys = []
-        if ref_aa != '.' and alt_aa != '.' and ref_aa != '' and alt_aa != '':
-            aa_pos = dbnsfp_info[6].split('&')
-            for pos in aa_pos:
-                for gene_id in unique_ids:
-                    k = str(gene_id) + ':p.' + str(ref_aa) + pos + str(alt_aa)
-                    isoform_aa_keys.append(k)
-        else:
-            #continue
-            for gene_id in unique_ids:
-                isoform_aa_keys.append(gene_id)
-
-        algorithm_raw_predictions = {}
-
-        i = 7
-        v = 0
-
-        if len(algorithms) != len(dbnsfp_info[7:]):
-            return effect_predictions
-
-        while i < len(dbnsfp_info):
-            algorithm_raw_predictions[str(algorithms[v]).lower()] = dbnsfp_info[i].split('&')
-            i = i + 1
-            v = v + 1
-        dbnsfp_predictions = {}
-
-        for k in isoform_aa_keys:
-            if not k in dbnsfp_predictions:
-                dbnsfp_predictions[k] = {}
-            all_preds = []
-            for algo in algorithm_raw_predictions.keys():
-                unique_algo_predictions = {}
-                for pred in algorithm_raw_predictions[algo]:
-                    if pred != '':
-                        if not pred in unique_algo_predictions:
-                            unique_algo_predictions[pred] = 1
-                    else:
-                        unique_algo_predictions['.'] = 1
-                if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys():
-                    del unique_algo_predictions['.']
-                dbnsfp_predictions[k][algo] = str(algo) + ':' + '|'.join(unique_algo_predictions.keys())
-                all_preds.append(dbnsfp_predictions[k][algo])
-            effect_predictions[k] = '&'.join(all_preds)
-
-    return effect_predictions
-
-
-def make_transcript_xref_map(rec, fieldmap, xref_tag = 'PCGR_ONCO_XREF'):
-    transcript_xref_map = {}
-    if not rec.INFO.get(xref_tag) is None:
-        for tref in rec.INFO.get(xref_tag).split(','):
-            xrefs = tref.split('|')
-            ensembl_transcript_id = str(xrefs[0])
-            transcript_xref_map[ensembl_transcript_id] = {}
-            for annotation in fieldmap.keys():
-                annotation_index = fieldmap[annotation]
-                if annotation_index > (len(xrefs) - 1):
-                    continue
-                if xrefs[annotation_index] != '':
-                    transcript_xref_map[ensembl_transcript_id][annotation] = xrefs[annotation_index]
-
-    return(transcript_xref_map)
-
-def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
-    vep_to_pcgr_af = {'gnomADe_AMR_AF':'AMR_AF_GNOMAD',
-                      'gnomADe_AFR_AF':'AFR_AF_GNOMAD',
-                      'gnomADe_EAS_AF':'EAS_AF_GNOMAD',
-                      'gnomADe_NFE_AF':'NFE_AF_GNOMAD',
-                      'gnomADe_AF':'GLOBAL_AF_GNOMAD',
-                      'gnomADe_SAS_AF':'SAS_AF_GNOMAD',
-                      'gnomADe_OTH_AF':'OTH_AF_GNOMAD',
-                      'gnomADe_ASJ_AF':'ASJ_AF_GNOMAD',
-                      'gnomADe_FIN_AF':'FIN_AF_GNOMAD',
-                      'gnomADg_AMR_AF':'AMR_AF_GNOMADg',
-                      'gnomADg_AFR_AF':'AFR_AF_GNOMADg',
-                      'gnomADg_EAS_AF':'EAS_AF_GNOMADg',
-                      'gnomADg_NFE_AF':'NFE_AF_GNOMADg',
-                      'gnomADg_AF':'GLOBAL_AF_GNOMADg',
-                      'gnomADg_SAS_AF':'SAS_AF_GNOMADg',
-                      'gnomADg_OTH_AF':'OTH_AF_GNOMADg',
-                      'gnomADg_ASJ_AF':'ASJ_AF_GNOMADg',
-                      'gnomADg_FIN_AF':'FIN_AF_GNOMADg',
-                      'AFR_AF':'AFR_AF_1KG',
-                      'AMR_AF':'AMR_AF_1KG',
-                      'SAS_AF':'SAS_AF_1KG',
-                      'EUR_AF':'EUR_AF_1KG',
-                      'EAS_AF':'EAS_AF_1KG',
-                      'AF':'GLOBAL_AF_1KG'}
-
-    vcf = VCF(query_vcf)
-    vep_csq_index2fields = {}
-    vep_csq_fields2index = {}
-    dbnsfp_prediction_algorithms = []
-    for e in vcf.header_iter():
-        header_element = e.info()
-        if 'ID' in header_element.keys():
-            identifier = str(header_element['ID'])
-            if identifier == 'CSQ' or identifier == 'DBNSFP':
-                description = str(header_element['Description'])
-                if 'Format: ' in description:
-                    subtags = description.split('Format: ')[1].split('|')
-                    if identifier == 'CSQ':
-                        i = 0
-                        for t in subtags:
-                            v = t.replace('"','')
-                            if t in vep_to_pcgr_af:
-                                v = str(vep_to_pcgr_af[t])
-                            if v in info_tags_wanted:
-                                vep_csq_index2fields[i] = v
-                                vep_csq_fields2index[v] = i
-                            i = i + 1
-                    if identifier == 'DBNSFP':
-                        i = 7
-                        while(i < len(subtags)):
-                            dbnsfp_prediction_algorithms.append(str(re.sub(r'((_score)|(_pred))"*$','',subtags[i])))
-                            i = i + 1
-
-    vep_dbnsfp_meta_info = {}
-    vep_dbnsfp_meta_info['vep_csq_fieldmap'] = {}
-    vep_dbnsfp_meta_info['vep_csq_fieldmap']['field2index'] = vep_csq_fields2index
-    vep_dbnsfp_meta_info['vep_csq_fieldmap']['index2field'] = vep_csq_index2fields
-    vep_dbnsfp_meta_info['dbnsfp_prediction_algorithms'] = dbnsfp_prediction_algorithms
-
-    return vep_dbnsfp_meta_info
-
-def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = 0):
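Outside the diff: the header introspection in `vep_dbnsfp_meta_vcf()` above boils down to splitting the CSQ header `Description` on `'Format: '` and indexing the pipe-separated subfields. A standalone sketch of that step (the description string below is hypothetical, with only four subfields):

```python
def csq_field_maps(description):
    # Build index<->field maps from a VEP CSQ header Description,
    # mirroring the subtag loop in vep_dbnsfp_meta_vcf() (tag renaming omitted).
    subtags = description.split('Format: ')[1].replace('"', '').split('|')
    index2field = {i: tag for i, tag in enumerate(subtags)}
    field2index = {tag: i for i, tag in enumerate(subtags)}
    return index2field, field2index

desc = 'Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Feature'
i2f, f2i = csq_field_maps(desc)
```

Having both directions of the map is what lets the CSQ parser below address subfields by name (`field2index`) while still iterating blocks positionally (`index2field`).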
-    all_csq_pick = []
-    all_transcript_consequences = []
-
-
-    varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
-
-    for csq in rec.INFO.get(csq_identifier).split(','):
-        csq_fields = csq.split('|')
-
-        entrezgene = '.'
-
-        ## The Entrez gene identifier is not provided directly by VEP - pull it from 'transcript_xref_map' for a given transcript-specific CSQ block
-        ## - used for the 'consequence_entry' objects that are added to the 'vep_all_csq' array
-        k = 0
-        while(k < len(csq_fields)):
-            if k in vep_csq_fields_map['index2field']:
-                if vep_csq_fields_map['index2field'][k] == 'Feature':
-                    ensembl_transcript_id = csq_fields[k]
-                    if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
-                        if ensembl_transcript_id in transcript_xref_map.keys():
-                            if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
-                                entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
-            k = k + 1
-
-
-        if pick_only is False:
-            j = 0
-            csq_record = {}
-
-            while(j < len(csq_fields)):
-                if j in vep_csq_fields_map['index2field']:
-                    if csq_fields[j] != '':
-                        csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
-                j = j + 1
-            all_csq_pick.append(csq_record)
-
-        else:
-            ## loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
-            if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1":  ## only consider the primary/picked consequence(s) when expanding with annotation tags
-                j = 0
-                csq_record = {}
-
-                ## loop over block annotation elements (separated with '|'), and assign their values to the csq_record dictionary object
-                while(j < len(csq_fields)):
-                    if j in vep_csq_fields_map['index2field']:
-                        if csq_fields[j] != '':
-                            csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
-                            if vep_csq_fields_map['index2field'][j] == 'Feature':
-                                ensembl_transcript_id = str(csq_fields[j])
-                                if ensembl_transcript_id in transcript_xref_map:
-                                    for annotation in transcript_xref_map[ensembl_transcript_id].keys():
-                                        if annotation != 'SYMBOL':
-                                            ## assign additional gene/transcript annotations from the custom transcript
-                                            ## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object
-                                            csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
-                                            #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation]))
-                                        else:
-                                            #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation]))
-                                            ## if VEP does not provide SYMBOL, append the SYMBOL provided by the xref map
-                                            if csq_record['SYMBOL'] is None:
-                                                csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
-                                else:
-                                    if re.match(r'ENST', ensembl_transcript_id):
-                                        logger.warning('Could not find transcript xrefs for ' + str(ensembl_transcript_id))
-
-                            ## Specifically assign PFAM protein domain as a csq_record key
-                            if vep_csq_fields_map['index2field'][j] == 'DOMAINS':
-                                domain_identifiers = str(csq_fields[j]).split('&')
-                                for v in domain_identifiers:
-                                    if v.startswith('Pfam'):
-                                        csq_record['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$','',re.sub(r'Pfam:','',v)))
-
-                                csq_record['DOMAINS'] = None
-                            ## Assign COSMIC/DBSNP mutation IDs as individual key,value pairs in the csq_record object
-                            if vep_csq_fields_map['index2field'][j] == 'Existing_variation':
-                                var_identifiers = str(csq_fields[j]).split('&')
-                                cosmic_identifiers = []
-                                dbsnp_identifiers = []
-                                for v in var_identifiers:
-                                    if v.startswith('COSV'):
-                                        cosmic_identifiers.append(v)
-                                    if v.startswith('COSM'):
-                                        cosmic_identifiers.append(v)
-                                    if v.startswith('rs'):
-                                        dbsnp_identifiers.append(v)
-                                if len(cosmic_identifiers) > 0:
-                                    csq_record['COSMIC_MUTATION_ID'] = '&'.join(cosmic_identifiers)
-                                if len(dbsnp_identifiers) > 0:
-                                    csq_record['DBSNPRSID'] = '&'.join(dbsnp_identifiers)
-                        else:
-                            csq_record[vep_csq_fields_map['index2field'][j]] = None
-                    j = j + 1
-
-                ## Assign coding status, protein change, coding sequence change, last exon/intron status etc.
-                assign_cds_exon_intron_annotations(csq_record)
-                ## Append transcript consequence to all_csq_pick
-                all_csq_pick.append(csq_record)
-            symbol = "."
-            hgvsc = "."
-            hgvsp = "."
-            if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
-                symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
-            if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
-                hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
-            if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
-                hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
-            consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
-                                 str(symbol) + ":" +
-                                 str(entrezgene) + ":" +
-                                 str(hgvsc) + ":" +
-                                 str(hgvsp) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
-            all_transcript_consequences.append(consequence_entry)
-
-
-    vep_chosen_csq_idx = 0
-    vep_csq_results = {}
-    vep_csq_results['picked_gene_csq'] = all_csq_pick
-    vep_csq_results['all_csq'] = all_transcript_consequences
-    vep_csq_results['picked_csq'] = None
-
-    ## If multiple transcript-specific variant consequences are highlighted by --flag_pick_allele_gene,
-    ## prioritize/choose the block of consequence which has
-    ## - a gene with BIOTYPE equal to 'protein-coding' (the other picked transcript/gene may potentially carry another BIOTYPE)
-    ## - a gene consequence classified as 'exonic' (the other picked transcript/gene likely carries a nonexonic consequence)
-    if len(vep_csq_results['picked_gene_csq']) > 0:
-        vep_selected_idx = {}
-        vep_selected_idx['exonic_status'] = {}
-        vep_selected_idx['consequence'] = {}
-
-        #if debug:
-        #    print("Picked VEP blocks: " + str(vep_csq_results['picked_gene_csq']))
-
-        i = 0
-        picked_blocks = []
-        for trans_rec in
vep_csq_results['picked_gene_csq']: - - if debug: - biotype = '.' - consequence = '.' - exonic_status = '.' - genesymbol = '.' - distance = '.' - feature = '.' - if not trans_rec['Consequence'] is None: - consequence = trans_rec['Consequence'] - if not trans_rec['SYMBOL'] is None: - genesymbol = trans_rec['SYMBOL'] - if not trans_rec['BIOTYPE'] is None: - biotype = trans_rec['BIOTYPE'] - if not trans_rec['EXONIC_STATUS'] is None: - exonic_status = trans_rec['EXONIC_STATUS'] - if not trans_rec['DISTANCE'] is None: - distance = trans_rec['DISTANCE'] - if not trans_rec['Feature'] is None: - feature = trans_rec['Feature'] - block = str(genesymbol) + ':' + str(feature) + ':' + str(consequence) + ':' + str(distance) + ':' + str(biotype) + ':' + str(exonic_status) - picked_blocks.append(block) - - - if 'BIOTYPE' in trans_rec and 'Consequence' in trans_rec and 'EXONIC_STATUS' in trans_rec: - if not trans_rec['BIOTYPE'] is None and not trans_rec['Consequence'] is None: - if trans_rec['BIOTYPE'] == "protein_coding": - ## for protein-coding genes - record the exonic variant consequence status (exonic/nonexonic) - vep_selected_idx['exonic_status'][i] = trans_rec['EXONIC_STATUS'] - vep_selected_idx['consequence'][i] = trans_rec['Consequence'] - i = i + 1 - - - if debug: - print(str(varkey) + " - picked CSQ blocks: " + ' /// '.join(picked_blocks)) - - - ## when multiple transcript gene blocks are picked by VEP, prioritize the block with 'exonic' consequence - if len(vep_selected_idx['exonic_status'].keys()) > 1: - exonic_cons_found = 0 - for j in vep_selected_idx['exonic_status'].keys(): - if vep_selected_idx['exonic_status'][j] == 'exonic': - exonic_cons_found = 1 - vep_chosen_csq_idx = j - - ## if multiple non-exonic variants are found, prioritize UTR variants over other nonexonic - if exonic_cons_found == 0: - for j in vep_selected_idx['consequence'].keys(): - if vep_selected_idx['consequence'][j] == '5_prime_UTR_variant' or vep_selected_idx['consequence'][j] == 
'3_prime_UTR_variant': - vep_chosen_csq_idx = j - - else: - if len(vep_selected_idx['exonic_status'].keys()) == 1: - for k in vep_selected_idx['exonic_status'].keys(): - vep_chosen_csq_idx = k - - vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx] - - else: - print('ERROR: No VEP block chosen by --pick_allele_gene') - - return(vep_csq_results) - - -def read_cancer_hotspots(logger, hotspots_tsv): - - ## load cancer hotspot entries from 'cancer_hotspots.tsv' (provided by github.com/sigven/cancerHotspots) - - hotspots = {} ##dictionary returned - mutation_hotspots = {} - codon_hotspots = {} - splice_hotspots = {} - if not os.path.exists(hotspots_tsv): - logger.info("ERROR: File '" + str(hotspots_tsv) + "' does not exist - exiting") - exit(1) - tsvfile = open(hotspots_tsv, 'r') - reader = csv.DictReader(tsvfile, delimiter='\t') - for row in reader: - mutation_hotspots[str(row['entrezgene']) + '-' + row['hgvsp2']] = row - codon_hotspots[str(row['entrezgene']) + '-' + row['codon']] = row - if row['hgvsc'] != '.': - splice_hotspots[str(row['entrezgene']) + '-' + str(row['hgvsc'])] = row - tsvfile.close() - - hotspots['mutation'] = mutation_hotspots - hotspots['codon'] = codon_hotspots - hotspots['splice'] = splice_hotspots - return hotspots - - -def map_cancer_hotspots(transcript_csq_elements, cancer_hotspots, rec, principal_hgvsp, principal_hgvsc): - - unique_hotspot_mutations = {} - unique_hotspot_codons = {} - - principal_codon = '.' - if re.match(r'^(p.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_hgvsp): - codon_match = re.findall(r'[A-Z][0-9]{1,}',principal_hgvsp) - if len(codon_match) == 1: - principal_codon = codon_match[0] - - ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of - ## 1. 
Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots) - ## - assert whether a potentially hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal') - ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated) - ## - assert whether a potentially hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal') - - for csq in transcript_csq_elements: - (consequence, symbol, entrezgene, hgvsc, hgvsp, feature_type, feature, biotype) = csq.split(':') - - if not bool(re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence)) is True: - continue - - hgvsp_short = threeToOneAA(hgvsp) - hotspot_key_mutation = "." - codon_match = [] - if entrezgene != "." and hgvsp != ".": - hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short) - codon_match = re.findall(r'p.[A-Z][0-9]{1,}',hgvsp_short) - - if entrezgene != "." 
and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'): - hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc) - hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key) - - if hotspot_key_mutation == ".": - continue - - if hotspot_key_mutation in cancer_hotspots['mutation']: - hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] - hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] - unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype - - if hotspot_key_mutation in cancer_hotspots['splice']: - hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] - hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] - unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype - - - if len(codon_match) > 0: - hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0]) - - if hotspot_key_codon in cancer_hotspots['codon']: - unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \ - cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE'] - - if len(unique_hotspot_mutations.keys()) > 0: - if len(unique_hotspot_mutations.keys()) == 1: - for gene_mutation_key in unique_hotspot_mutations.keys(): - rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key] - - if gene_mutation_key.split('|')[0] == 'exonic': - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal' - hgvsp_candidate = 'p.' 
+ str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4]) - if hgvsp_candidate == principal_hgvsp: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' - else: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal' - hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4])) - if hgvsc_candidate == principal_hgvsc: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' - else: - ## multiple hotspot matches for alternative hgvsp keys - ## - keep only match against principal HGVSp - for hotspot_info in unique_hotspot_mutations.keys(): - if hotspot_info.split('|')[0] == 'exonic': - hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4]) - - if hgvsp_candidate == principal_hgvsp: - rec.INFO['MUTATION_HOTSPOT'] = hotspot_info - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' - else: - hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4])) - - if hgvsc_candidate == principal_hgvsc: - rec.INFO['MUTATION_HOTSPOT'] = hotspot_info - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' - - else: - if len(unique_hotspot_codons.keys()) > 0: - if len(unique_hotspot_codons.keys()) == 1: - for gene_codon_key in unique_hotspot_codons.keys(): - - if '|' in gene_codon_key: - - codon = str(gene_codon_key.split('|')[3]) - - if codon == principal_codon: - rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal' - else: - rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal' - - - return - - diff --git 
a/src/gvanno/lib/gvanno/annoutils.py b/src/gvanno/lib/gvanno/annoutils.py new file mode 100755 index 0000000..77464c8 --- /dev/null +++ b/src/gvanno/lib/gvanno/annoutils.py @@ -0,0 +1,349 @@ +#!/usr/bin/env python + +import os,re,sys +import csv +import logging +import gzip +from cyvcf2 import VCF, Writer + +from lib.gvanno import gvanno_vars +from lib.gvanno.utils import error_message, is_integer + +csv.field_size_limit(500 * 1024 * 1024) +threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H', + 'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W', + 'Tyr':'Y','Val':'V','Ter':'X'} + + + +def read_infotag_file(vcf_info_tags_tsv, scope = "vep"): + """ + Function that reads a file that lists VCF INFO tags produced by PCGR/CPSR/gvanno. + An example line of the VCF info tag file is the following: + + tag number type description category + Consequence . String "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)." 
vep + + A dictionary is returned, with the tag as the key, and the full dictionary record as the value + """ + info_tag_xref = {} # dictionary returned + if not os.path.exists(vcf_info_tags_tsv): + ## issue warning + return info_tag_xref + tsvfile = open(vcf_info_tags_tsv, 'r') + reader = csv.DictReader(tsvfile, delimiter='\t') + for row in reader: + if not row['tag'] in info_tag_xref: + if scope in row['category']: + info_tag_xref[row['tag']] = row + + return info_tag_xref + +def read_vcfanno_tag_file(vcfanno_tag_file, logger): + + infotag_results = {} + + if not os.path.exists(vcfanno_tag_file): + err_msg = f"vcfanno configuration file ({vcfanno_tag_file}) does not exist" + error_message(err_msg, logger) + + f = open(vcfanno_tag_file, "r") + info_elements = f.readlines() + f.close() + + + for i in info_elements: + elem_info_content = i.rstrip().split(',') + row = {} + for elem in elem_info_content: + if elem.startswith('Type'): + row['type'] = elem.replace('Type=', '') + if elem.startswith('##INFO'): + row['tag'] = elem.replace('##INFO=<ID=', '') + + if 'tag' in row and not row['tag'] in infotag_results: + infotag_results[row['tag']] = row + + return(infotag_results) + + +def read_genexref_namemap(gene_xref_namemap_tsv, logger): + """ + Function that reads a file that lists names of tags in GENE_TRANSCRIPT_XREF annotation. + """ + + namemap_xref = {} # dictionary returned + if not os.path.exists(gene_xref_namemap_tsv): + err_msg = f"gene_transcript_xref BED mapping file ({gene_xref_namemap_tsv}) does not exist" + error_message(err_msg, logger) + + with gzip.open(gene_xref_namemap_tsv, mode='rt') as f: + reader = csv.DictReader(f, delimiter='\t') + for row in reader: + namemap_xref[row['name']] = int(row['index']) + f.close() + + return namemap_xref + + +def write_pass_vcf(annotated_vcf, logger): + """ + Function prints all PASS variants from a VCF file. New VCF file appends '.pass.' to the filename.
+ """ + #out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf) + out_vcf = re.sub(r'\.vcf\.gz$', '.pass.vcf', annotated_vcf) + vcf = VCF(annotated_vcf) + w = Writer(out_vcf, vcf) + + num_rejected = 0 + num_pass = 0 + for rec in vcf: + if rec.FILTER is None or rec.FILTER == 'None': + w.write_record(rec) + num_pass += 1 + else: + num_rejected += 1 + + vcf.close() + w.close() + + logger.info('Number of non-PASS/REJECTED variant calls: ' + + str(num_rejected)) + logger.info('Number of PASSed variant calls: ' + str(num_pass)) + if num_pass == 0: + logger.warning( + 'There are zero variants with a \'PASS\' filter in the VCF file') + os.system('bgzip -dc ' + str(annotated_vcf) + + ' | egrep \'^#\' > ' + str(out_vcf)) + # else: + os.system('bgzip -f ' + str(out_vcf)) + os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') + + return + + +def map_regulatory_variant_annotations(vep_csq_records): + """ + Function that considers an array of VEP CSQ records and appends all regulatory variant consequent annotations (open chromatin, TF_binding_site, + CTCF_binding_site, promoter (flanks), enhancers ) into a single comma-separated string. Each individual regulatory annotation is formatted as: + |||||||| + """ + + regulatory_annotation = '.'
+ if len(vep_csq_records) == 1: + return regulatory_annotation + + j = 0 + regulatory_annotations = [] + while j < len(vep_csq_records): + missing_key_annotation = False + for k in ['Feature_type', 'Consequence', 'Feature']: + if not k in vep_csq_records[j].keys(): + missing_key_annotation = True + + if missing_key_annotation is False: + + # RegulatoryFeature annotations - open chromatin, promoters (flanks), enhancers, CTCF binding sites + if vep_csq_records[j]['Feature_type'] == 'RegulatoryFeature': + biotype = "" + if re.match(r"^(enhancer|promoter|open|CTCF|TF_)", vep_csq_records[j]['BIOTYPE']): + biotype = vep_csq_records[j]['BIOTYPE'] + + annotation = str(vep_csq_records[j]['Consequence']) + '|' + \ + str(vep_csq_records[j]['Feature_type']) + '|' + \ + str(vep_csq_records[j]['Feature']) + '|' + \ + str(biotype) + '|||||' + + regulatory_annotations.append(annotation) + + # MotifFeature annotations (TF) + if vep_csq_records[j]['Feature_type'] == 'MotifFeature': + missing_motif_annotation = False + for annotation in ['MOTIF_NAME', 'MOTIF_POS', 'HIGH_INF_POS', + 'MOTIF_SCORE_CHANGE', 'TRANSCRIPTION_FACTORS']: + if not annotation in vep_csq_records[j].keys(): + missing_motif_annotation = True + + if missing_motif_annotation is False: + annotation = str(vep_csq_records[j]['Consequence']) + '|' + \ + str(vep_csq_records[j]['Feature_type']) + '|' + \ + str(vep_csq_records[j]['Feature']) + '|TF_binding_site|' + \ + str(vep_csq_records[j]['MOTIF_NAME']) + '|' + \ + str(vep_csq_records[j]['MOTIF_POS']) + '|' + \ + str(vep_csq_records[j]['HIGH_INF_POS']) + '|' + \ + str(vep_csq_records[j]['MOTIF_SCORE_CHANGE']) + '|' + \ + str(vep_csq_records[j]['TRANSCRIPTION_FACTORS']) + + regulatory_annotations.append(annotation) + + j = j + 1 + + if len(regulatory_annotations) > 0: + regulatory_annotation = ','.join(regulatory_annotations) + + return (regulatory_annotation) + +def threeToOneAA(aa_change): + + for three_letter_aa in threeLettertoOneLetterAA.keys(): + aa_change = 
aa_change.replace( + three_letter_aa, threeLettertoOneLetterAA[three_letter_aa]) + + aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)', 'fs', aa_change) + return aa_change + + +def assign_cds_exon_intron_annotations(csq_record): + + csq_record['CODING_STATUS'] = 'noncoding' + csq_record['EXONIC_STATUS'] = 'nonexonic' + csq_record['SPLICE_DONOR_RELEVANT'] = False + csq_record['NULL_VARIANT'] = False + csq_record['INTRON_POSITION'] = 0 + csq_record['EXON_POSITION'] = 0 + csq_record['CDS_CHANGE'] = '.' + csq_record['HGVSp_short'] = '.' + csq_record['PROTEIN_CHANGE'] = '.' + csq_record['EXON_AFFECTED'] = '.' + csq_record['LOSS_OF_FUNCTION'] = False + + if re.search(gvanno_vars.CSQ_CODING_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['CODING_STATUS'] = 'coding' + + if re.search(gvanno_vars.CSQ_CODING_SILENT_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['EXONIC_STATUS'] = 'exonic' + + if 'LoF' in csq_record: + csq_record['LOSS_OF_FUNCTION'] = False + if not csq_record['LoF'] is None: + if csq_record['LoF'] == 'HC': + csq_record['LOSS_OF_FUNCTION'] = True + + ## Don't list LoF as True if consequence is assigned as missense + if re.search(r'^missense_variant$', csq_record['Consequence']) is not None: + if csq_record['LOSS_OF_FUNCTION'] is True: + csq_record['LOSS_OF_FUNCTION'] = False + + + if re.search(gvanno_vars.CSQ_NULL_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['NULL_VARIANT'] = True + + if re.search(gvanno_vars.CSQ_SPLICE_DONOR_PATTERN, str(csq_record['Consequence'])) is not None \ + and re.search(r'(\+3(A|G)>|\+4A>|\+5G>)', str(csq_record['HGVSc'])) is not None: + csq_record['SPLICE_DONOR_RELEVANT'] = True + + if re.search(gvanno_vars.CSQ_SPLICE_REGION_PATTERN, str(csq_record['Consequence'])) is not None: + match = re.search( + r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc'])) + if match is not None: + pos = re.sub( + 
r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)", "", match.group(0)) + if is_integer(pos): + csq_record['INTRON_POSITION'] = int(pos) + + if 'NearestExonJB' in csq_record.keys(): + if not csq_record['NearestExonJB'] is None: + if re.search(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) is not None and str(csq_record['NearestExonJB']) != "": + exon_pos_info = csq_record['NearestExonJB'].split("+") + if len(exon_pos_info) == 4: + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end": + csq_record['EXON_POSITION'] = -int(exon_pos_info[1]) + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start": + csq_record['EXON_POSITION'] = int(exon_pos_info[1]) + + if not csq_record['HGVSc'] is None: + if csq_record['HGVSc'] != '.': + if 'splice_acceptor_variant' in csq_record['Consequence'] or 'splice_donor_variant' in csq_record['Consequence'] \ + or 'splice_donor_5th_base_variant' in csq_record['Consequence'] or 'splice_region_variant' in csq_record['Consequence'] \ + or 'splice_polypyrimidine_tract_variant' in csq_record['Consequence']: + key = str(csq_record['Consequence']) + \ + ':' + str(csq_record['HGVSc']) + csq_record['CDS_CHANGE'] = key + + if csq_record['Amino_acids'] is None or csq_record['Protein_position'] is None or csq_record['Consequence'] is None: + return(csq_record) + + if not csq_record['Protein_position'] is None: + if csq_record['Protein_position'].startswith('-'): + return(csq_record) + + protein_change = '.' 
+ if '/' in csq_record['Protein_position']: + protein_position = str(csq_record['Protein_position'].split('/')[0]) + if '-' in protein_position: + if protein_position.split('-')[0].isdigit(): + csq_record['AMINO_ACID_START'] = protein_position.split('-')[0] + if protein_position.split('-')[1].isdigit(): + csq_record['AMINO_ACID_END'] = protein_position.split('-')[1] + else: + if protein_position.isdigit(): + csq_record['AMINO_ACID_START'] = protein_position + csq_record['AMINO_ACID_END'] = protein_position + + if not csq_record['HGVSp'] is None: + if csq_record['HGVSp'] != '.': + if ':' in csq_record['HGVSp']: + protein_identifier = str(csq_record['HGVSp'].split(':')[0]) + if protein_identifier.startswith('ENSP'): + protein_change_VEP = str(csq_record['HGVSp'].split(':')[1]) + protein_change = threeToOneAA(protein_change_VEP) + csq_record['PROTEIN_CHANGE'] = protein_change_VEP + + if 'synonymous_variant' in csq_record['Consequence']: + protein_change = 'p.' + \ + str(csq_record['Amino_acids']) + \ + str(protein_position) + str(csq_record['Amino_acids']) + if 'stop_lost' in str(csq_record['Consequence']) and '/' in str(csq_record['Amino_acids']): + protein_change = 'p.X' + \ + str(protein_position) + \ + str(csq_record['Amino_acids']).split('/')[1] + + csq_record['HGVSp_short'] = protein_change + exon_number = 'NA' + if not csq_record['EXON'] is None: + if csq_record['EXON'] != '.': + if '/' in csq_record['EXON']: + exon_number = str(csq_record['EXON']).split('/')[0] + csq_record['EXON_AFFECTED'] = exon_number + if '-' in exon_number: + csq_record['EXON_AFFECTED'] = exon_number.split('-')[0] + num_exons = str(csq_record['EXON']).split('/')[1] + if exon_number == num_exons: + csq_record['LAST_EXON'] = True + + if not csq_record['INTRON'] is None: + if csq_record['INTRON'] != '.': + if '/' in csq_record['INTRON']: + intron_number = str(csq_record['INTRON']).split('/')[0] + num_introns = str(csq_record['INTRON']).split('/')[1] + if intron_number == num_introns: + 
csq_record['LAST_INTRON'] = True + + if not csq_record['HGVSc'] is None: + if csq_record['HGVSc'] != '.': + if protein_change != '.': + key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc']) + ':exon' + str(exon_number) + ':' + str(protein_change) + csq_record['CDS_CHANGE'] = key + + return(csq_record) + + +def make_transcript_xref_map(rec, fieldmap, xref_tag='GENE_TRANSCRIPT_XREF'): + transcript_xref_map = {} + if not rec.INFO.get(xref_tag) is None: + for tref in rec.INFO.get(xref_tag).split(','): + xrefs = tref.split('|') + ensembl_transcript_id = str(xrefs[0]) + transcript_xref_map[ensembl_transcript_id] = {} + for annotation in fieldmap.keys(): + annotation_index = fieldmap[annotation] + if annotation_index > (len(xrefs) - 1): + continue + if xrefs[annotation_index] != '': + transcript_xref_map[ensembl_transcript_id][annotation] = xrefs[annotation_index] + + return (transcript_xref_map) diff --git a/src/gvanno/lib/gvanno/dbnsfp.py b/src/gvanno/lib/gvanno/dbnsfp.py new file mode 100755 index 0000000..64375ba --- /dev/null +++ b/src/gvanno/lib/gvanno/dbnsfp.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python + +import os +import re + +from cyvcf2 import VCF + +def map_variant_effect_predictors(rec, algorithms): + + dbnsfp_predictions = map_dbnsfp_predictions( + str(rec.INFO.get('DBNSFP')), algorithms) + if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None: + return + gene_id = str(rec.INFO.get('Gene')) + consequence = str(rec.INFO.get('Consequence')) + + dbnsfp_key = '' + + found_key = 0 + if not rec.INFO.get('HGVSp_short') is None and not rec.INFO.get('HGVSp_short') == '.': + aa_change = str(rec.INFO.get('HGVSp_short')) + dbnsfp_key = gene_id + ':' + str(aa_change) + if dbnsfp_key in dbnsfp_predictions: + found_key = 1 + + if found_key == 0 and re.search('splice_', consequence): + dbnsfp_key = gene_id + + algo_mapping = { + 'sift': 'DBNSFP_SIFT', + 'provean': 'DBNSFP_PROVEAN', + 'm-cap': 'DBNSFP_M_CAP', + 'mutpred': 'DBNSFP_MUTPRED', + 
'metarnn': 'DBNSFP_META_RNN', + 'fathmm': 'DBNSFP_FATHMM', + 'fathmm_mkl_coding': 'DBNSFP_FATHMM_MKL', + 'mutationassessor': 'DBNSFP_MUTATIONASSESSOR', + 'mutationtaster': 'DBNSFP_MUTATIONTASTER', + 'deogen2': 'DBNSFP_DEOGEN2', + 'primateai': 'DBNSFP_PRIMATEAI', + 'list_s2': 'DBNSFP_LIST_S2', + 'gerp_rs': 'DBNSFP_GERP', + 'aloft': 'DBNSFP_ALOFTPRED', + 'bayesdel_addaf': 'DBNSFP_BAYESDEL_ADDAF', + 'splice_site_ada': 'DBNSFP_SPLICE_SITE_ADA', + 'splice_site_rf': 'DBNSFP_SPLICE_SITE_RF' + } + + if dbnsfp_key != '': + if dbnsfp_key in dbnsfp_predictions: + rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key] + for algo_pred in rec.INFO['EFFECT_PREDICTIONS'].split('&'): + if algo_pred.split(':')[0] in algo_mapping: + rec.INFO[algo_mapping[algo_pred.split(':')[0]]] = str(algo_pred.split(':')[1]) + + +def map_dbnsfp_predictions(dbnsfp_tag, algorithms): + + effect_predictions = {} + for v in dbnsfp_tag.split(','): + dbnsfp_info = v.split('|') + if len(dbnsfp_info) == 1: + return effect_predictions + ref_aa = dbnsfp_info[0] + alt_aa = dbnsfp_info[1] + all_ids = dbnsfp_info[3].split('&') + unique_ids = {} + for s in all_ids: + unique_ids[s] = 1 + + isoform_aa_keys = [] + if ref_aa != '.' and alt_aa != '.' and ref_aa != '' and alt_aa != '': + aa_pos = dbnsfp_info[5].split('&') + for pos in aa_pos: + for gene_id in unique_ids: + k = str(gene_id) + ':p.' 
+ str(ref_aa) + pos + str(alt_aa) + isoform_aa_keys.append(k) + else: + # continue + for gene_id in unique_ids: + isoform_aa_keys.append(gene_id) + + algorithm_raw_predictions = {} + + i = 6 + v = 0 + + if len(algorithms) != len(dbnsfp_info[6:]): + return effect_predictions + + while i < len(dbnsfp_info): + algorithm_raw_predictions[str( + algorithms[v]).lower()] = dbnsfp_info[i].split('&') + i = i + 1 + v = v + 1 + dbnsfp_predictions = {} + + for k in isoform_aa_keys: + if not k in dbnsfp_predictions: + dbnsfp_predictions[k] = {} + all_preds = [] + for algo in algorithm_raw_predictions.keys(): + unique_algo_predictions = {} + for pred in algorithm_raw_predictions[algo]: + if pred != '': + if not pred in unique_algo_predictions: + unique_algo_predictions[pred] = 1 + else: + unique_algo_predictions['.'] = 1 + if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys(): + del unique_algo_predictions['.'] + dbnsfp_predictions[k][algo] = str( + algo) + ':' + '|'.join(unique_algo_predictions.keys()) + all_preds.append(dbnsfp_predictions[k][algo]) + effect_predictions[k] = '&'.join(all_preds) + + return effect_predictions + + +def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted): + + vcf = VCF(query_vcf) + vep_csq_index2fields = {} + vep_csq_fields2index = {} + dbnsfp_prediction_algorithms = [] + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys(): + identifier = str(header_element['ID']) + if identifier == 'CSQ' or identifier == 'DBNSFP': + description = str(header_element['Description']) + if 'Format: ' in description: + subtags = description.split('Format: ')[1].split('|') + if identifier == 'CSQ': + i = 0 + for t in subtags: + v = t.replace('"', '') + if v in info_tags_wanted: + vep_csq_index2fields[i] = v + vep_csq_fields2index[v] = i + i = i + 1 + if identifier == 'DBNSFP': + i = 6 + while (i < len(subtags)): + dbnsfp_prediction_algorithms.append( + str(re.sub(r'((_score)|(_pred))"*$', '', 
subtags[i]))) + i = i + 1 + + vep_dbnsfp_meta_info = {} + vep_dbnsfp_meta_info['vep_csq_fieldmap'] = {} + vep_dbnsfp_meta_info['vep_csq_fieldmap']['field2index'] = vep_csq_fields2index + vep_dbnsfp_meta_info['vep_csq_fieldmap']['index2field'] = vep_csq_index2fields + vep_dbnsfp_meta_info['dbnsfp_prediction_algorithms'] = dbnsfp_prediction_algorithms + + return vep_dbnsfp_meta_info + diff --git a/src/gvanno/lib/gvanno/gvanno_vars.py b/src/gvanno/lib/gvanno/gvanno_vars.py new file mode 100644 index 0000000..81b0c8c --- /dev/null +++ b/src/gvanno/lib/gvanno/gvanno_vars.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python + +GVANNO_VERSION = '1.7.0' +DB_VERSION = '20231222' + +## MISCELLANEOUS +NCBI_BUILD_MAF = 'GRCh38' +MAX_VARIANTS_FOR_REPORT = 500_000 +CODING_EXOME_SIZE_MB = 34.0 +RECOMMENDED_N_MUT_SIGNATURE = 200 + +## GENCODE +GENCODE_VERSION = {'grch38': 44,'grch37': 19} + +## vcfanno +VCFANNO_MAX_PROC = 15 + +## VEP settings/versions +VEP_VERSION = '110' +VEP_ASSEMBLY = {'grch38': 'GRCh38','grch37': 'GRCh37'} +VEP_MIN_FORKS = 1 +VEP_MAX_FORKS = 8 +VEP_MIN_BUFFER_SIZE = 50 +VEP_MAX_BUFFER_SIZE = 30000 +VEP_PICK_CRITERIA = ['mane_select','mane_plus_clinical','canonical','appris','tsl','biotype','ccds','rank','length'] + +## https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences +VEP_consequence_rank = { + 'transcript_ablation': 1, + 'splice_acceptor_variant': 2, + 'splice_donor_variant': 3, + 'stop_gained': 4, + 'frameshift_variant': 5, + 'stop_lost': 6, + 'start_lost' : 7, + 'transcript_amplification': 8, + 'feature_elongation': 9, + 'feature_truncation': 10, + 'inframe_insertion': 11, + 'inframe_deletion': 12, + 'missense_variant': 13, + 'protein_altering_variant': 14, + 'splice_donor_5th_base_variant': 15, + 'splice_region_variant': 16, + 'splice_donor_region_variant': 17, + 'splice_polypyrimidine_tract_variant': 18, + 'incomplete_terminal_codon_variant': 19, + 'start_retained_variant': 20, + 'stop_retained_variant': 21, + 
'synonymous_variant': 22, + 'coding_sequence_variant': 23, + 'mature_miRNA_variant': 24, + '5_prime_UTR_variant': 25, + '3_prime_UTR_variant': 26, + 'non_coding_transcript_exon_variant': 27, + 'intron_variant': 28, + 'NMD_transcript_variant': 29, + 'non_coding_transcript_variant': 30, + 'coding_transcript_variant': 31, + 'upstream_gene_variant': 32, + 'downstream_gene_variant': 33, + 'TFBS_ablation': 34, + 'TFBS_amplification': 35, + 'TF_binding_site_variant': 36, + 'regulatory_region_ablation': 37, + 'regulatory_region_amplification': 38, + 'regulatory_region_variant': 39, + 'intergenic_variant': 40, + 'sequence_variant': 41 +} + +CSQ_MISSENSE_PATTERN = r"^missense_variant" +CSQ_CODING_PATTERN = r"(stop_(lost|gained)|start_lost|frameshift_|missense_|splice_(donor|acceptor)|protein_altering|inframe_)" +CSQ_CODING_SILENT_PATTERN = r"(stop_(lost|gained)|start_lost|frameshift_|missense_|splice_(donor|acceptor)|protein_altering|inframe_|synonymous|(start|stop)_retained)" +CSQ_NULL_PATTERN = r"(stop_gained|frameshift_)" +CSQ_SPLICE_REGION_PATTERN = r"(splice_|intron_variant)" +CSQ_SPLICE_DONOR_PATTERN = r"(splice_region_variant|splice_donor_variant|splice_donor_region_variant|splice_donor_5th_base_variant)" + diff --git a/src/gvanno/lib/gvanno/mutation_hotspot.py b/src/gvanno/lib/gvanno/mutation_hotspot.py new file mode 100644 index 0000000..d3061f2 --- /dev/null +++ b/src/gvanno/lib/gvanno/mutation_hotspot.py @@ -0,0 +1,167 @@ +#!/usr/bin/env python + +import os,re +import csv +import gzip +from lib.gvanno.annoutils import threeToOneAA + +from typing import Dict +from logging import Logger + +def load_mutation_hotspots(hotspots_fname: str, logger: Logger) -> Dict[str, Dict[str, Dict[str, str]]]: + """ + Load mutation hotspots from a file and create a dictionary of hotspots. + + Parameters: + hotspots_fname (str): The file path of the mutation hotspots file. + logger (Logger): The logger object to log messages. 
+ + Returns: + Dict[str, Dict[str, Dict[str, str]]]: A dictionary containing mutation hotspots categorized by mutation type. + The dictionary has three keys: 'mutation', 'codon', and 'splice'. + Each key maps to a sub-dictionary that contains the hotspots for that mutation type. + The sub-dictionaries are indexed by a combination of gene and mutation identifier. + + Raises: + SystemExit: If the mutation hotspots file does not exist. + + """ + hotspots: Dict[str, Dict[str, Dict[str, str]]] = { + 'mutation': {}, + 'codon': {}, + 'splice': {} + } + + if not os.path.exists(hotspots_fname): + logger.error(f"File '{hotspots_fname}' does not exist - exiting") + exit(1) + + with gzip.open(hotspots_fname, mode='rt') as f: + reader = csv.DictReader(f, delimiter='\t') + for row in reader: + gene = str(row['entrezgene']) + hgvsp2 = row['hgvsp2'] + codon = row['codon'] + hgvsc = row['hgvsc'] + + hotspots['mutation'][gene + '-' + hgvsp2] = row + hotspots['codon'][gene + '-' + codon] = row + if hgvsc != '.': + hotspots['splice'][gene + '-' + hgvsc] = row + + return hotspots + + +def match_csq_mutation_hotspot(transcript_csq_elements, cancer_hotspots, rec, principal_csq_properties): + + """ + Function that matches consequence entries from VEP (transcript_csq_elements) with entries in mutation hotspots + """ + + unique_hotspot_mutations = {} + unique_hotspot_codons = {} + + principal_hgvsp = principal_csq_properties['hgvsp'] + principal_hgvsc = principal_csq_properties['hgvsc'] + principal_entrezgene = principal_csq_properties['entrezgene'] + principal_codon = principal_csq_properties['codon'] + + ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of + ## 1.
Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots) + ## - assess whether a potential hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal') + ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated) + ## - assess whether a potential hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal') + + for csq in transcript_csq_elements: + (consequence, symbol, entrezgene, hgvsc, hgvsp, exon, feature_type, feature, biotype) = csq.split(':') + + if not re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence): + continue + + hgvsp_short = threeToOneAA(hgvsp) + hotspot_key_mutation = "." + codon_match = [] + if entrezgene != "." and hgvsp != ".": + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short) + codon_match = re.findall(r'p.[A-Z][0-9]{1,}',hgvsp_short) + + if entrezgene != "."
and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'): + hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc) + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key) + + if hotspot_key_mutation == ".": + continue + + if hotspot_key_mutation in cancer_hotspots['mutation']: + hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype + + if hotspot_key_mutation in cancer_hotspots['splice']: + hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype + + + if len(codon_match) > 0: + hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0]) + + if hotspot_key_codon in cancer_hotspots['codon']: + unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \ + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE'] + + if len(unique_hotspot_mutations.keys()) > 0: + if len(unique_hotspot_mutations.keys()) == 1: + for gene_mutation_key in unique_hotspot_mutations.keys(): + rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key] + + if gene_mutation_key.split('|')[0] == 'exonic': + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal' + hgvsp_candidate = 'p.' 
+ str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4]) + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal' + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4])) + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + else: + ## multiple hotspot matches for alternative hgvsp keys + ## - keep only match against principal HGVSp + for hotspot_info in unique_hotspot_mutations.keys(): + if hotspot_info.split('|')[0] == 'exonic': + hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4]) + + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4])) + + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + + else: + if len(unique_hotspot_codons.keys()) > 0: + if len(unique_hotspot_codons.keys()) == 1: + for gene_codon_key in unique_hotspot_codons.keys(): + + if '|' in gene_codon_key: + + codon = str(gene_codon_key.split('|')[3]) + + if codon == principal_codon: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal' + else: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal' + + + return diff --git 
a/src/gvanno/lib/oncogenicity.py b/src/gvanno/lib/gvanno/oncogenicity.py similarity index 89% rename from src/gvanno/lib/oncogenicity.py rename to src/gvanno/lib/gvanno/oncogenicity.py index 28aa347..50544a9 100644 --- a/src/gvanno/lib/oncogenicity.py +++ b/src/gvanno/lib/gvanno/oncogenicity.py @@ -115,16 +115,16 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): "SYMBOL", "ONCOGENE", "ONCOGENE_EVIDENCE", - "TUMOR_SUPPRESSOR", - "TUMOR_SUPPRESSOR_EVIDENCE", - "LoF", + "TSG", + "TSG_EVIDENCE", + "LOSS_OF_FUNCTION", "INTRON_POSITION", "EXON_POSITION", - "EAS_AF_GNOMAD", - "NFE_AF_GNOMAD", - "AFR_AF_GNOMAD", - "AMR_AF_GNOMAD", - "SAS_AF_GNOMAD", + "gnomADe_EAS_AF", + "gnomADe_NFE_AF", + "gnomADe_AFR_AF", + "gnomADe_AMR_AF", + "gnomADe_SAS_AF", "DBNSFP_SIFT", "DBNSFP_PROVEAN", "DBNSFP_META_RNN", @@ -143,26 +143,22 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): variant_data = {} for col in required_oncogenicity_vars: if rec.INFO.get(col) is None: - if col == "TUMOR_SUPPRESSOR" or col == "ONCOGENE": + if col == "TSG" or col == "ONCOGENE": variant_data[col] = False - elif col == "LoF": + elif col == "LOSS_OF_FUNCTION": variant_data['LOSS_OF_FUNCTION'] = False else: variant_data[col] = None else: if rec.INFO.get(col) == '': variant_data[col] = True - else: - if col == "LoF": - if rec.INFO.get(col) == "HC": - variant_data["LOSS_OF_FUNCTION"] = True - else: - variant_data[col] = rec.INFO.get(col) + else: + variant_data[col] = rec.INFO.get(col) for code in clingen_vicc_ev_codes: variant_data[code] = False - - dbnsfp_minimum_majority = 7 + + dbnsfp_minimum_majority = 6 dbnsfp_maximum_minority = 2 dbnsfp_minimum_algos_called = dbnsfp_minimum_majority @@ -250,54 +246,55 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): variant_data['CLINGEN_VICC_OP3'] = True - if "EAS_AF_GNOMAD" in variant_data.keys() and \ - "SAS_AF_GNOMAD" in variant_data.keys() and \ - "AMR_AF_GNOMAD" in variant_data.keys() and \ - "AFR_AF_GNOMAD" in 
variant_data.keys() and \ - "NFE_AF_GNOMAD" in variant_data.keys(): + if "gnomADe_EAS_AF" in variant_data.keys() and \ + "gnomADe_SAS_AF" in variant_data.keys() and \ + "gnomADe_AMR_AF" in variant_data.keys() and \ + "gnomADe_AFR_AF" in variant_data.keys() and \ + "gnomADe_NFE_AF" in variant_data.keys(): ## check if variant has MAF > 0.01 (SBVS1) or > 0.05 in any of five major gnomAD subpopulations (exome set) - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: if not variant_data[pop] is None: ## MAF for this population >= 0.05 if float(variant_data[pop]) >= 0.05: variant_data["CLINGEN_VICC_SBVS1"] = True - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: if not variant_data[pop] is None: ## MAF for this population >= 0.01 (< 0.05) if float(variant_data[pop]) >= 0.01 and variant_data["CLINGEN_VICC_SBVS1"] is False: variant_data["CLINGEN_VICC_SBS1"] = True - missing_pop_freq = 0 + #missing_pop_freq = 0 approx_zero_pop_freq = 0 - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: ## no MAF recorded in gnomAD for this population if variant_data[pop] is None: - missing_pop_freq = missing_pop_freq + 1 + approx_zero_pop_freq = approx_zero_pop_freq + 1 else: ## Very low MAF for this population if float(variant_data[pop]) < 0.0001: approx_zero_pop_freq = approx_zero_pop_freq + 1 ## check if variant is missing or with MAF approximately zero in all five major gnomAD subpopulations (exome set) - if missing_pop_freq == 5 or approx_zero_pop_freq == 5: + if approx_zero_pop_freq == 5: variant_data["CLINGEN_VICC_OP4"] = True + ## check if variant is a 
loss-of-function variant (LOFTEE) in a tumor suppressor gene (Cancer Gene Census/CancerMine) - if "TUMOR_SUPPRESSOR" in variant_data.keys() and \ + if "TSG" in variant_data.keys() and \ "ONCOGENE" in variant_data.keys() and \ "LOSS_OF_FUNCTION" in variant_data.keys() and \ "Consequence" in variant_data.keys(): - if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TUMOR_SUPPRESSOR'] is True: + if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TSG'] is True: variant_data['CLINGEN_VICC_OVS1'] = True ## check if variant is creating a stop-lost or protein-length change in oncogene/tumor suppressor genes if variant_data['CLINGEN_VICC_OVS1'] is False and \ ((re.match(r'^(inframe_deletion|inframe_insertion)', variant_data['Consequence']) and \ - (variant_data['TUMOR_SUPPRESSOR'] is True or variant_data['ONCOGENE'] is True)) or \ + (variant_data['TSG'] is True or variant_data['ONCOGENE'] is True)) or \ (re.match(r'^(stop_lost)', variant_data['Consequence']) and \ - variant_data['TUMOR_SUPPRESSOR'] is True)): + variant_data['TSG'] is True)): variant_data['CLINGEN_VICC_OM2'] = True ## check if variant is silent (synonymous|splice) and outside critical splice region @@ -393,25 +390,25 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): else: variant_data[e] = f'{variant_data[e]}|{code}' - lb_upper_limit = -1 - lb_lower_limit = -6 - b_upper_limit = -7 + likely_benign_upper_limit = -1 + likely_benign_lower_limit = -6 + benign_upper_limit = -7 #vus_lower_limit = 0 #vus_upper_limit = 4 - lo_lower_limit = 6 - lo_upper_limit = 9 - o_lower_limit = 10 + likely_oncogenic_lower_limit = 5 + likely_oncogenic_upper_limit = 9 + oncogenic_lower_limit = 10 variant_data['ONCOGENICITY_SCORE'] = onc_score_benign + onc_score_pathogenic - if variant_data['ONCOGENICITY_SCORE'] <= lb_upper_limit and \ - variant_data['ONCOGENICITY_SCORE'] >= lb_lower_limit: + if variant_data['ONCOGENICITY_SCORE'] <= likely_benign_upper_limit and \ + 
variant_data['ONCOGENICITY_SCORE'] >= likely_benign_lower_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Benign" - if variant_data['ONCOGENICITY_SCORE'] >= lo_lower_limit and \ - variant_data['ONCOGENICITY_SCORE'] <= lo_upper_limit: + if variant_data['ONCOGENICITY_SCORE'] >= likely_oncogenic_lower_limit and \ + variant_data['ONCOGENICITY_SCORE'] <= likely_oncogenic_upper_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Oncogenic" - if variant_data['ONCOGENICITY_SCORE'] <= b_upper_limit: + if variant_data['ONCOGENICITY_SCORE'] <= benign_upper_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Benign" - if variant_data['ONCOGENICITY_SCORE'] >= o_lower_limit: + if variant_data['ONCOGENICITY_SCORE'] >= oncogenic_lower_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Oncogenic" for e in ['ONCOGENICITY_SCORE', diff --git a/src/gvanno/lib/gvanno/utils.py b/src/gvanno/lib/gvanno/utils.py new file mode 100644 index 0000000..ba3b878 --- /dev/null +++ b/src/gvanno/lib/gvanno/utils.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python +import sys +import subprocess +import shutil +import logging +import os +import time +import errno +import platform +import string +import random + + +def getlogger(logger_name): + logger = logging.getLogger(logger_name) + logger.setLevel(logging.DEBUG) + # create console handler and set level to debug + ch = logging.StreamHandler(sys.stdout) + ch.setLevel(logging.DEBUG) + # add ch to logger + logger.addHandler(ch) + # create formatter + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S") + # add formatter to ch + ch.setFormatter(formatter) + return logger + +def error_message(message, logger): + logger.error("") + logger.error(message) + logger.error("") + sys.exit(1) + +def warn_message(message, logger): + logger.warning("") + logger.warning(message) + logger.warning("") + +def random_id_generator(size = 10, chars = string.ascii_lowercase + string.digits): + 
return ''.join(random.choice(chars) for _ in range(size)) + +#def random_string(length): +# letters = string.ascii_lowercase +# result_str = ''.join(random.choice(letters) for i in range(length)) +# return result_str + +def check_subprocess(logger, command, debug): + if debug: + logger.info(command) + try: + output = subprocess.check_output( + str(command), stderr=subprocess.STDOUT, shell=True) + if len(output) > 0: + print(str(output.decode()).rstrip()) + except subprocess.CalledProcessError as e: + print(e.output.decode()) + exit(1) + +def get_loftee_dir(): + #pcgr_conda_env = conda_prefix_basename() + return "/opt/vep/src/ensembl-vep/modules" + +def get_pcgr_bin(): + """Return abs path to the bin directory of the running Python, e.g. conda/env/pcgr/bin + """ + return os.path.dirname(os.path.realpath(sys.executable)) + +def perl_cmd(): + """Retrieve abs path to locally installed conda Perl or first in PATH, + e.g. conda/env/pcgr/bin/perl + """ + perl = shutil.which(os.path.join(get_pcgr_bin(), "perl")) + if perl: + return perl + else: + return shutil.which("perl") + +def get_perl_exports(): + """Environmental exports to use conda installed perl. + """ + perl_path = os.path.normpath(perl_cmd()) # /conda/env/pcgr/bin/perl + perl_path_parent = os.path.dirname(perl_path) # /conda/env/pcgr/bin + out = f"unset PERL5LIB && export PATH={perl_path_parent}:\"$PATH\"" + return out + +def is_integer(n): + try: + float(n) + except ValueError: + return False + else: + return float(n).is_integer() + +# https://stackoverflow.com/a/10840586/2169986 +def remove_file(filename): + try: + os.remove(filename) + except OSError as e: + if e.errno != errno.ENOENT: # errno.ENOENT = no such file or directory + raise # re-raise exception if a different error occurred + + +# from https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/utils.py +def safe_makedir(dname): + """Make a directory if it does not exist, handling concurrent race conditions. + """ + if not dname: + return dname + num_tries = 0 + max_tries = 5 + while not os.path.exists(dname): + # we could get an error here if multiple processes are creating + # the directory at the same time.
Grr, concurrency. + try: + os.makedirs(dname) + except OSError: + if num_tries > max_tries: + raise + num_tries += 1 + time.sleep(2) + return dname + +def sort_bed(unsorted_bed_fname: str, sorted_bed_fname: str, debug = False, logger = None): + + ## Sort regions in target BED + if os.path.exists(unsorted_bed_fname) and os.stat(unsorted_bed_fname).st_size != 0: + cmd_sort_custom_bed1 = 'egrep \'^[0-9]\' ' + str(unsorted_bed_fname) + \ + ' | sort -k1,1n -k2,2n -k3,3n > ' + str(sorted_bed_fname) + cmd_sort_custom_bed2 = 'egrep -v \'^[0-9]\' ' + str(unsorted_bed_fname) + \ + ' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3n >> ' + str(sorted_bed_fname) + + check_subprocess(logger, cmd_sort_custom_bed1, debug) + check_subprocess(logger, cmd_sort_custom_bed2, debug) + if not debug: + remove_file(str(unsorted_bed_fname)) + else: + err_msg = 'File ' + str(unsorted_bed_fname) + ' does not exist or is empty' + error_message(err_msg, logger) + + +def check_file_exists(fname: str, logger = None) -> bool: + err = 0 + if not os.path.isfile(fname): + err = 1 + else: + if os.stat(fname).st_size == 0: + err = 1 + if err == 1: + err_msg = f"File {fname} does not exist or has zero size" + error_message(err_msg, logger) + + return(True) + +def check_tabix_file(fname: str, logger = None) -> bool: + tabix_file = os.path.abspath(f'{fname}.tbi') + if not os.path.exists(tabix_file): + err_msg = f"Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file ({fname}
Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" + error_message(err_msg, logger) + else: + ## check file size is more than zero + check_file_exists(tabix_file) + return(True) diff --git a/src/gvanno/lib/gvanno/variant.py b/src/gvanno/lib/gvanno/variant.py new file mode 100644 index 0000000..6b78d75 --- /dev/null +++ b/src/gvanno/lib/gvanno/variant.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python + +import os +import re +import pandas as pd +import numpy as np +import warnings + +warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) + + +def set_genotype(variant_set: pd.DataFrame, logger) -> pd.DataFrame: + """ + Set verbose genotype (homozygous, heterozygous) for each variant + """ + variant_set['GENOTYPE'] = '.' + if {'GT'}.issubset(variant_set.columns): + logger.info("Assignment of genotype (homozygous, heterozygous) for each variant based on 'GT' tag") + heterozygous_states = [] + ref_allele_index = 0 + while ref_allele_index < 20: + alt_allele_index = ref_allele_index + 1 + while alt_allele_index <= 20: + phased_gt_1 = str(ref_allele_index) + "|" + str(alt_allele_index) + phased_gt_2 = str(alt_allele_index) + "|" + str(ref_allele_index) + unphased_gt_1 = str(ref_allele_index) + "/" + str(alt_allele_index) + unphased_gt_2 = str(alt_allele_index) + "/" + str(ref_allele_index) + gts = [phased_gt_1, phased_gt_2, unphased_gt_1, unphased_gt_2] + heterozygous_states.extend(gts) + + alt_allele_index = alt_allele_index + 1 + ref_allele_index = ref_allele_index + 1 + + homozygous_states = [] + hom_allele_index = 1 + while hom_allele_index <= 20: + phased_gt = str(hom_allele_index) + "|" + str(hom_allele_index) + unphased_gt = str(hom_allele_index) + "/" + str(hom_allele_index) + homozygous_states.extend([phased_gt, unphased_gt]) + hom_allele_index = hom_allele_index + 1 + + variant_set.loc[variant_set['GT'].isin(homozygous_states), "GENOTYPE"] = "homozygous" + 
variant_set.loc[variant_set['GT'].isin(heterozygous_states),"GENOTYPE"] = "heterozygous" + else: + variant_set['GENOTYPE'] = "undefined" + + variant_set = variant_set.astype({'GENOTYPE':'string'}) + return(variant_set) + +def append_annotations(vcf2tsv_gz_fname: str, gvanno_db_dir: str, logger): + + clinvar_tsv_fname = os.path.join(gvanno_db_dir, 'variant','tsv','clinvar', 'clinvar.tsv.gz') + protein_domain_tsv_fname = os.path.join(gvanno_db_dir, 'misc','tsv','protein_domain', 'protein_domain.tsv.gz') + gene_xref_tsv_fname = os.path.join(gvanno_db_dir, 'gene','tsv','gene_transcript_xref', 'gene_transcript_xref.tsv.gz') + vcf2tsv_df = None + clinvar_data_df = None + + num_recs_with_clinvar_hits = 0 + + if os.path.exists(vcf2tsv_gz_fname): + vcf2tsv_df = pd.read_csv( + vcf2tsv_gz_fname, skiprows=[0], sep="\t", na_values=".", + low_memory = False) + if {'CHROM','POS','REF','ALT','CLINVAR_MSID','PFAM_DOMAIN','ENTREZGENE'}.issubset(vcf2tsv_df.columns): + for elem in ['CHROM','POS','REF','ALT','CLINVAR_MSID','PFAM_DOMAIN','ENTREZGENE']: + vcf2tsv_df = vcf2tsv_df.astype({elem:'string'}) + vcf2tsv_df['CLINVAR_MSID'] = vcf2tsv_df['CLINVAR_MSID'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df['PFAM_DOMAIN'] = vcf2tsv_df['PFAM_DOMAIN'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df['ENTREZGENE'] = vcf2tsv_df['ENTREZGENE'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df["VAR_ID"] = vcf2tsv_df["CHROM"].str.cat( + vcf2tsv_df["POS"], sep = "_").str.cat( + vcf2tsv_df["REF"], sep = "_").str.cat( + vcf2tsv_df["ALT"], sep = "_") + + if {'CLINVAR_TRAITS_ALL'}.issubset(vcf2tsv_df.columns): + vcf2tsv_df.drop('CLINVAR_TRAITS_ALL', inplace=True, axis=1) + + ## check number of variants with ClinVar ID's + num_recs_with_clinvar_hits = vcf2tsv_df["CLINVAR_MSID"].notna().sum() + ## check number of variants with PFAM ID's + num_recs_with_pfam_hits = vcf2tsv_df["PFAM_DOMAIN"].notna().sum() + ## check number of variants with Ensembl gene ID's + 
num_recs_with_entrez_hits = vcf2tsv_df["ENTREZGENE"].notna().sum() + + #print(str(num_recs_with_entrez_hits)) + ## merge variant set with ClinVar trait and variant origin annotations + if num_recs_with_clinvar_hits > 0: + if os.path.exists(clinvar_tsv_fname): + clinvar_data_df = pd.read_csv( + clinvar_tsv_fname, sep="\t", + usecols=["variation_id","origin_simple","VAR_ID","trait"], + low_memory = False) + clinvar_data_df['CLINVAR_TRAITS_ALL'] = clinvar_data_df['origin_simple'].str.capitalize().str.cat( + clinvar_data_df['trait'], sep = " - ") + clinvar_data_df['CLINVAR_MSID'] = clinvar_data_df['variation_id'] + clinvar_data_df = clinvar_data_df.astype({'CLINVAR_MSID':'string'}) + clinvar_data_df['CLINVAR_MSID'] = clinvar_data_df['CLINVAR_MSID'].str.replace("\\.[0-9]{1,}$", "", regex = True) + clinvar_data_df = clinvar_data_df[['VAR_ID','CLINVAR_MSID','CLINVAR_TRAITS_ALL']] + + vcf2tsv_df = vcf2tsv_df.merge( + clinvar_data_df, left_on=["VAR_ID", "CLINVAR_MSID"], right_on=["VAR_ID", "CLINVAR_MSID"], how="left") + else: + logger.error(f"Could not find {clinvar_tsv_fname} needed for ClinVar variant annotation - exiting") + else: + vcf2tsv_df['CLINVAR_TRAITS_ALL'] = '.' + + + ## merge variant set with PFAM domain annotations + if num_recs_with_pfam_hits > 0: + + vcf2tsv_df.drop('PFAM_DOMAIN_NAME', inplace=True, axis=1) + + if os.path.exists(protein_domain_tsv_fname): + prot_domains_data_df = pd.read_csv( + protein_domain_tsv_fname, sep="\t", usecols=["pfam_id","pfam_name"]).drop_duplicates() + prot_domains_data_df.rename(columns = {'pfam_id':'PFAM_DOMAIN', 'pfam_name':'PFAM_DOMAIN_NAME'}, inplace = True) + vcf2tsv_df = vcf2tsv_df.merge(prot_domains_data_df, left_on=["PFAM_DOMAIN"], right_on=["PFAM_DOMAIN"], how="left") + else: + logger.error(f"Could not find {protein_domain_tsv_fname} needed for PFAM domain annotation - exiting") + else: + vcf2tsv_df['PFAM_DOMAIN_NAME'] = '.' 
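The ClinVar and PFAM merges above follow the same pandas recipe: left-join the variant table against a small lookup table on a shared key, so every variant row is preserved and unmatched rows simply get missing values. A minimal, self-contained sketch of that pattern (column names mirror the ones used here, but the rows are made-up toy data, not project data):

```python
import pandas as pd

# Toy variant records; one row lacks a PFAM domain identifier
variants = pd.DataFrame({
    "VAR_ID": ["1_100_A_G", "2_200_C_T", "3_300_G_A"],
    "PFAM_DOMAIN": ["PF00069", None, "PF07714"],
})

# Toy lookup table mapping PFAM identifiers to readable names
domains = pd.DataFrame({
    "PFAM_DOMAIN": ["PF00069", "PF07714"],
    "PFAM_DOMAIN_NAME": ["Protein kinase domain", "Protein tyrosine kinase"],
})

# how="left" keeps all variant rows; rows without a match get NaN
annotated = variants.merge(domains, on="PFAM_DOMAIN", how="left")
```

Dropping any pre-existing `PFAM_DOMAIN_NAME` column before the merge, as the code above does, avoids pandas producing `_x`/`_y` suffixed duplicate columns.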
+ + if num_recs_with_entrez_hits > 0: + + if {'GENENAME'}.issubset(vcf2tsv_df.columns): + vcf2tsv_df.drop('GENENAME', inplace=True, axis=1) + + if os.path.exists(gene_xref_tsv_fname): + gene_xref_df = pd.read_csv( + gene_xref_tsv_fname, sep="\t", na_values=".", + usecols=["entrezgene","name"]) + gene_xref_df = gene_xref_df[gene_xref_df['entrezgene'].notna()].drop_duplicates() + gene_xref_df["entrezgene"] = gene_xref_df["entrezgene"].astype(float).astype(int).astype(str) + vcf2tsv_df["ENTREZGENE"] = vcf2tsv_df["ENTREZGENE"].astype(str) + vcf2tsv_df.loc[vcf2tsv_df["ENTREZGENE"].isna(), "ENTREZGENE"] = "-1" + gene_xref_df.rename(columns = {'entrezgene':'ENTREZGENE', 'name':'GENENAME'}, inplace = True) + vcf2tsv_df = vcf2tsv_df.merge(gene_xref_df, left_on=["ENTREZGENE"], right_on=["ENTREZGENE"], how="left") + vcf2tsv_df["ENTREZGENE"] = vcf2tsv_df['ENTREZGENE'].str.replace("\\.[0-9]{1,}$", "", regex = True) + else: + logger.error(f"Could not find {gene_xref_tsv_fname} needed for gene name annotation - exiting") + else: + vcf2tsv_df['GENENAME'] = '.'
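The `astype(float).astype(int).astype(str)` chain above exists because pandas parses integer columns containing missing values as floats, so identifiers arrive as strings like `'673.0'`; round-tripping through int recovers a plain key that can be joined against. A small sketch of the idea with toy values:

```python
import pandas as pd

# Entrez-style identifiers as they may arrive from a TSV parsed with NaNs
ids = pd.Series(["673.0", "1029.0", "7157"])

# float -> int drops the trailing '.0'; str restores a join-friendly key
clean = ids.astype(float).astype(int).astype(str)
```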
+ + return(vcf2tsv_df) + +def clean_annotations(variant_set: pd.DataFrame, sample_id, genome_assembly, logger) -> pd.DataFrame: + + if {'Consequence','EFFECT_PREDICTIONS','CLINVAR_CONFLICTED'}.issubset(variant_set.columns): + variant_set.rename(columns = {'Consequence':'CONSEQUENCE'}, inplace = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("\\.&|\\.$", "NA&", regex = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("&$", "", regex = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("&", ", ", regex = True) + variant_set['clinvar_conflicted_bool'] = True + variant_set.loc[variant_set['CLINVAR_CONFLICTED'] == 1, "clinvar_conflicted_bool"] = True + variant_set.loc[variant_set['CLINVAR_CONFLICTED'] != 1, "clinvar_conflicted_bool"] = False + variant_set.drop('CLINVAR_CONFLICTED', inplace=True, axis=1) + variant_set.rename(columns = {'clinvar_conflicted_bool':'CLINVAR_CONFLICTED'}, inplace = True) + + + if not {'VCF_SAMPLE_ID'}.issubset(variant_set.columns): + variant_set['VCF_SAMPLE_ID'] = str(sample_id) + variant_set['SAMPLE_ID'] = str(sample_id) + variant_set['GENOME_VERSION'] = str(genome_assembly) + if {'CHROM','POS','REF','ALT',}.issubset(variant_set.columns): + variant_set['GENOMIC_CHANGE'] = variant_set['CHROM'].astype(str) + ":g." 
+ variant_set['POS'].astype(str) + \ + variant_set['REF'].astype(str) + ">" + variant_set['ALT'].astype(str) + + ## Make sure that specific tags are formatted as integers (not float) during to_csv export + if {'AMINO_ACID_END','AMINO_ACID_START'}.issubset(variant_set.columns): + variant_set.loc[variant_set['AMINO_ACID_START'].isna(),"AMINO_ACID_START"] = -1 + variant_set.loc[variant_set['AMINO_ACID_END'].isna(),"AMINO_ACID_END"] = -1 + variant_set['AMINO_ACID_END'] = variant_set['AMINO_ACID_END'].astype(float).astype(int) + variant_set['AMINO_ACID_START'] = variant_set['AMINO_ACID_START'].astype(float).astype(int) + + for elem in ['NUM_SUBMITTERS','ALLELE_ID','ENTREZGENE','REVIEW_STATUS_STARS']: + vcf_info_tag = 'CLINVAR_' + str(elem) + if vcf_info_tag in variant_set.columns: + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag] = \ + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag].astype(str).astype(float).astype(int) + + for vcf_info_tag in ['ONCOGENE_RANK','TSG_RANK','TCGA_PANCANCER_COUNT','CGC_TIER','DISTANCE', + 'EXON_AFFECTED','INTRON_POSITION','EXON_POSITION']: + if vcf_info_tag in variant_set.columns: + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag] = \ + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag].astype(str).astype(float).astype(int) + + variant_set = set_genotype(variant_set, logger) + + return variant_set \ No newline at end of file diff --git a/src/gvanno/lib/gvanno/vcf.py b/src/gvanno/lib/gvanno/vcf.py new file mode 100644 index 0000000..b39fa48 --- /dev/null +++ b/src/gvanno/lib/gvanno/vcf.py @@ -0,0 +1,92 @@ +#!/usr/bin/env python + +import logging + +from lib.gvanno.utils import error_message, warn_message, check_subprocess +from cyvcf2 import VCF +from typing import Union + +def get_vcf_info_tags(vcf_fname): + vcf = VCF(vcf_fname) + info_tags = {} + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in 
header_element.keys(): + if header_element['HeaderType'] == 'INFO': + info_tags[str(header_element['ID'])] = 1 + + return info_tags + + +def print_vcf_header(vcf_fname, vcfheader_file, logger, chromline_only=False): + if chromline_only: + check_subprocess( + logger, f'bgzip -dc {vcf_fname} | egrep \'^#\' | egrep \'^#CHROM\' >> {vcfheader_file}', debug=False) + else: + check_subprocess( + logger, f'bgzip -dc {vcf_fname} | egrep \'^#\' | egrep -v \'^#CHROM\' > {vcfheader_file}', debug=False) + +def detect_reserved_info_tag(tag, tag_name, logger): + reserved_tags = ['AA', 'AC', 'AF', 'AN', 'BQ', 'CIGAR', 'DB', 'DP', 'END', + 'H2', 'H3', 'MQ', 'MQ0', 'NS', 'SB', 'SOMATIC', 'VALIDATED', '1000G'] + if tag in reserved_tags: + err_msg = f'Custom INFO tag ({tag_name}) needs another name - \'{tag}\' is a reserved field in the VCF specification (INFO)' + return error_message(err_msg, logger) + + reserved_format_tags = ['GT', 'DP', 'FT', 'GL', + 'GLE', 'GQ', 'PL', 'HQ', 'PS', 'PQ', 'EC', 'MQ'] + if tag in reserved_format_tags: + err_msg = f'Custom INFO tag ({tag_name}) needs another name - \'{tag}\' is a reserved field in the VCF specification (FORMAT)' + return error_message(err_msg, logger) + +def check_retained_vcf_info_tags(vcf: VCF, retained_info_tags: str, logger: logging.Logger) -> int: + + """ + Function that compares the INFO tags in the query VCF with the INFO tags requested by the user for retention in the output + If any retained tag is not present in the query VCF, an error is raised + """ + + tags = str(retained_info_tags).split(',') + info_elements_query_vcf = [] + + #vcf = VCF(input_vcf) + logger.info('Checking if existing INFO tags of query VCF file match retained INFO tags set by the user') + ret = 1 + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): + if header_element['HeaderType'] == 'INFO': + info_elements_query_vcf.append(header_element['ID']) + + for t in tags: + if not t
in info_elements_query_vcf: + err_msg = "Retained INFO tag '" + str(t) + "' not found among INFO tags in query VCF - make sure retained VCF INFO tags are set correctly" + error_message(err_msg, logger) + else: + logger.info("Retained INFO tag '" + str(t) + "' detected among INFO tags in query VCF") + + return ret + + + +def check_existing_vcf_info_tags(vcf: VCF, populated_infotags: dict, logger) -> int: + + """ + Function that compares the INFO tags in the query VCF with the INFO tags generated by gvanno + If any tags coincide, an error is raised + """ + + logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') + ret = 1 + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): + if header_element['HeaderType'] == 'INFO': + if header_element['ID'] in populated_infotags.keys(): + err_msg = 'INFO tag ' + str(header_element['ID']) + ' in the query VCF coincides with a VCF ' + \ + 'annotation tag produced by gvanno - please remove or rename this tag in your query VCF' + return error_message(err_msg, logger) + + logger.info('No query VCF INFO tags coincide with INFO tags populated by gvanno') + return ret + diff --git a/src/gvanno/lib/gvanno/vep.py b/src/gvanno/lib/gvanno/vep.py new file mode 100644 index 0000000..f0b50c0 --- /dev/null +++ b/src/gvanno/lib/gvanno/vep.py @@ -0,0 +1,322 @@ +#!/usr/bin/env python + +import os,re +import csv +import gzip + +from lib.gvanno.annoutils import assign_cds_exon_intron_annotations +from lib.gvanno import gvanno_vars + + +def get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map): + """ + Generates a dictionary object containing the annotations of a CSQ record. + + Parameters: + - csq_fields (list): A list of CSQ fields. + - varkey (str): The VARKEY value. + - logger (Logger): The logger object.
+    - vep_csq_fields_map (dict): A dictionary mapping VEP CSQ fields to their indices.
+    - transcript_xref_map (dict): A dictionary mapping Ensembl transcript IDs to their annotations.
+
+    Returns:
+    - csq_record (dict): A dictionary object containing the annotations of a CSQ record.
+    """
+
+    j = 0
+    csq_record = {}
+
+    csq_record['VARKEY'] = varkey
+    ensembl_transcript_id = '.'
+
+    # loop over block annotation elements (separated with '|'), and assign their values to the csq_record dictionary object
+    while (j < len(csq_fields)):
+        if j in vep_csq_fields_map['index2field']:
+
+            ## consider non-empty CSQ fields
+            if csq_fields[j] != '':
+                csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
+                if vep_csq_fields_map['index2field'][j] == 'Feature':
+
+                    ensembl_transcript_id = str(csq_fields[j])
+                    if ensembl_transcript_id in transcript_xref_map:
+                        for annotation in transcript_xref_map[ensembl_transcript_id].keys():
+                            if annotation != 'SYMBOL':
+                                ## assign additional (non-VEP provided) gene/transcript annotations from the custom
+                                ## transcript_xref_map as key,value pairs in the csq_record object
+                                csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
+                    else:
+                        if re.match(r'ENST', ensembl_transcript_id):
+                            logger.warning(
+                                'Could not find transcript xrefs for ' + str(ensembl_transcript_id))
+
+                # Specifically assign PFAM protein domain as a csq_record key
+                if vep_csq_fields_map['index2field'][j] == 'DOMAINS':
+                    domain_identifiers = str(csq_fields[j]).split('&')
+                    for v in domain_identifiers:
+                        if v.startswith('Pfam'):
+                            csq_record['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$', '', re.sub(r'Pfam:', '', v)))
+
+                # Assign COSMIC/DBSNP mutation IDs as individual key,value pairs in the csq_record object
+                if vep_csq_fields_map['index2field'][j] == 'Existing_variation':
+                    var_identifiers = str(csq_fields[j]).split('&')
+                    parsed_identifiers = {'COSMIC_MUTATION_ID': [], 'DBSNPRSID': []}
+                    for v in var_identifiers:
+                        if v.startswith('COSV') or v.startswith('COSM'):
+                            parsed_identifiers['COSMIC_MUTATION_ID'].append(v)
+                        if v.startswith('rs'):
+                            parsed_identifiers['DBSNPRSID'].append(v)
+                    for db in parsed_identifiers.keys():
+                        if len(parsed_identifiers[db]) > 0:
+                            csq_record[db] = '&'.join(parsed_identifiers[db])
+
+                ## Sort (potentially multiple) variant consequence elements from VEP (they appear unsorted in some cases)
+                ## Example: intron_variant&splice_region_variant
+                if vep_csq_fields_map['index2field'][j] == "Consequence":
+                    consequence_elements = sorted(str(csq_fields[j]).split('&'))
+                    csq_record['Consequence'] = '&'.join(consequence_elements)
+
+            else:
+                csq_record[vep_csq_fields_map['index2field'][j]] = None
+        j = j + 1
+
+    ## if VEP/Ensembl does not provide a symbol, use symbol provided by PCGR/CPSR gene_transcript_xref map
+    if csq_record['SYMBOL'] is None and ensembl_transcript_id != ".":
+        if ensembl_transcript_id in transcript_xref_map:
+            csq_record['SYMBOL'] = transcript_xref_map[ensembl_transcript_id]['SYMBOL']
+
+    # Assign coding status, protein change, coding sequence change, last exon/intron status etc. as VCF INFO tags
+    csq_record = assign_cds_exon_intron_annotations(csq_record)
+
+    return csq_record
+
+
+def pick_single_gene_csq(vep_csq_results,
+                         pick_criteria_ordered="mane_select,mane_plus_clinical,canonical,appris,tsl,biotype,ccds,rank,length",
+                         logger=None):
+
+    csq_candidates = []
+
+    for csq_elem in vep_csq_results['picked_gene_csq']:
+        if csq_elem is None:
+            continue
+        csq_candidate = {}
+
+        ## default values (undefined properties)
+        csq_candidate['mane_select'] = 1
+        csq_candidate['mane_plus_clinical'] = 1
+        csq_candidate['canonical'] = 1
+        csq_candidate['appris'] = 8
+        csq_candidate['biotype'] = 1
+        csq_candidate['tsl'] = 6
+        csq_candidate['ccds'] = 1
+        csq_candidate['rank'] = 42
+
+        ## set to picked as default
+        csq_candidate['PICKED'] = True
+        csq_candidate['varkey'] = csq_elem['VARKEY']
+
+        ## MANE Select status - lower value prioritized
+        if csq_elem['MANE_SELECT'] is not None:
+            csq_candidate['mane_select'] = 0
+
+        ## MANE Plus Clinical status - lower value prioritized
+        if csq_elem['MANE_PLUS_CLINICAL'] is not None:
+            csq_candidate['mane_plus_clinical'] = 0
+
+        ## CANONICAL status - lower value prioritized
+        if csq_elem['CANONICAL'] is not None:
+            if csq_elem['CANONICAL'] is True:
+                csq_candidate['canonical'] = 0
+
+        ## APPRIS level - lower value prioritized
+        if csq_elem['APPRIS'] is not None:
+            if not 'ALTERNATIVE' in csq_elem['APPRIS']:
+                csq_candidate['appris'] = int(re.sub(r'[A-Z]{1,}:?', '', csq_elem['APPRIS']))
+            else:
+                csq_candidate['appris'] = int(re.sub(r'ALTERNATIVE:', '', csq_elem['APPRIS'])) + 5
+
+        ## Biotype - lower value prioritized
+        if csq_elem['BIOTYPE'] is not None:
+            if csq_elem['BIOTYPE'] == 'protein_coding':
+                csq_candidate['biotype'] = 0
+
+        ## CCDS - lower value prioritized
+        if csq_elem['CCDS'] is not None:
+            csq_candidate['ccds'] = 0
+
+        ## Consequence rank - lower value prioritized
+        if csq_elem['Consequence'] is not None:
+            main_cons = csq_elem['Consequence'].split('&')[0]
+            if main_cons in gvanno_vars.VEP_consequence_rank:
+                csq_candidate['rank'] = int(gvanno_vars.VEP_consequence_rank[main_cons])
+            else:
+                warn_msg = f"Missing Consequence in gvanno_vars.VEP_consequence_rank: {csq_elem['Consequence']} - '{main_cons}'"
+                if logger is not None:
+                    logger.warning(warn_msg)
+
+        ## TSL - lower value prioritized
+        if csq_elem['TSL'] is not None:
+            csq_candidate['tsl'] = int(csq_elem['TSL'])
+
+        csq_candidates.append(csq_candidate)
+
+    ## Go through pick criteria in pre-defined order
+    ## - set 'PICKED' = False for all csq elements with a score above the minimum value for a given criterion
+    ## - when there is only one element with 'PICKED' equal to True, break out of the loop and report the chosen transcript CSQ element
+    chosen_csq_index = 0
+    for rank_criterion in pick_criteria_ordered.split(','):
+        if rank_criterion != 'length':
+            lowest_score = 100
+            i = 0
+            for candidate in csq_candidates:
+                if candidate[rank_criterion] <= lowest_score:
+                    lowest_score = candidate[rank_criterion]
+                i = i + 1
+
+            for candidate in csq_candidates:
+                if candidate[rank_criterion] > lowest_score:
+                    candidate['PICKED'] = False
+
+            j = 0
+            num_picked = 0
+            for candidate in csq_candidates:
+                if candidate['PICKED'] is True:
+                    num_picked += 1
+                    chosen_csq_index = j
+                j = j + 1
+
+            if num_picked == 1:
+                break
+
+    return chosen_csq_index
+
+def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, vep_pick_order, logger, pick_only=True,
+                  csq_identifier='CSQ', debug=0):
+    """
+    Function that parses the comma-separated CSQ elements found in the rec.INFO object (VCF)
+    - creates an individual CSQ record for all transcript-specific elements provided as comma-separated elements in the CSQ tag
+    - each individual record is gathered as a dictionary of properties (defined by vep_csq_fields_map), i.e.
+      'CSQ=A|missense_variant|KRAS++' in the VCF INFO element gives csq_record['Consequence'] = 'missense_variant',
+      csq_record['SYMBOL'] = 'KRAS' etc.
+    - if argument 'pick_only' is True, only elements with 'PICK' == 1 are chosen
+    """
+
+    all_csq_pick = []
+    alternative_csq_pick = []
+    all_transcript_consequences = []
+
+    varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
+
+    ## Retrieve the INFO element provided by VEP (default 'CSQ') in the VCF object, and
+    ## loop through all transcript-specific consequence blocks provided, e.g.
+    ## CSQ=A|intron_variant|||.., A|splice_region_variant|||, and so on.
+    for csq in rec.INFO.get(csq_identifier).split(','):
+        csq_fields = csq.split('|')
+
+        entrezgene = '.'
+
+        ## Entrez gene identifier is not provided by VEP - pull it from 'transcript_xref_map' for a given
+        ## transcript-specific CSQ block
+        ## - used for 'consequence_entry' objects that are added to the 'vep_all_csq' array
+        k = 0
+        while (k < len(csq_fields)):
+            if k in vep_csq_fields_map['index2field']:
+                if vep_csq_fields_map['index2field'][k] == 'Feature':
+                    ensembl_transcript_id = csq_fields[k]
+                    if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
+                        if ensembl_transcript_id in transcript_xref_map.keys():
+                            if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
+                                entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
+            k = k + 1
+
+        ## CPSR - consider all consequences (considering that a variant may overlap other, non-CPSR targets)
+        if pick_only is False:
+            csq_record = get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map)
+            if 'Feature_type' in csq_record:
+                if csq_record['Feature_type'] == 'RegulatoryFeature':
+                    all_csq_pick.append(csq_record)
+                    alternative_csq_pick.append(csq_record)
+                if csq_record['Feature_type'] == 'Transcript':
+                    all_csq_pick.append(csq_record)
+                else:
+                    ## intergenic
+                    if csq_record['Feature_type'] is None:
+                        alternative_csq_pick.append(csq_record)
+
+        ## PCGR - consider picked consequence only
+        else:
+            # loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
+            # only consider the primary/picked consequence when expanding with annotation tags
+
+            if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1":
+                csq_record = get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map)
+                # Append transcript consequence to all_csq_pick
+                all_csq_pick.append(csq_record)
+                symbol = "."
+                hgvsc = "."
+                hgvsp = "."
+                exon = "."
+                if csq_fields[vep_csq_fields_map['field2index']['EXON']] != "":
+                    if "/" in csq_fields[vep_csq_fields_map['field2index']['EXON']]:
+                        exon = str(csq_fields[vep_csq_fields_map['field2index']['EXON']].split('/')[0])
+                if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
+                    symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
+                if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
+                    hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
+                if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
+                    hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
+                consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
+                                     str(symbol) + ":" +
+                                     str(entrezgene) + ":" +
+                                     str(hgvsc) + ":" +
+                                     str(hgvsp) + ":" +
+                                     str(exon) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
+                all_transcript_consequences.append(consequence_entry)
+
+    ## CPSR - consider all picked VEP blocks
+    if pick_only is False:
+        if len(all_csq_pick) == 0:
+            if len(alternative_csq_pick) > 0:
+                all_csq_pick = alternative_csq_pick
+            else:
+                logger.info("NO VEP BLOCK FOUND")
+                exit(1)
+    else:
+        ## don't consider regulatory consequence as single chosen consequence
+        if len(all_csq_pick) == 1 and len(alternative_csq_pick) > 0:
+            if all_csq_pick[0]['Feature_type'] == 'RegulatoryFeature':
+                all_csq_pick = alternative_csq_pick
+
+    vep_csq_results = {}
+    vep_csq_results['picked_gene_csq'] = all_csq_pick
+    vep_csq_results['all_csq'] = all_transcript_consequences
+    vep_csq_results['picked_csq'] = None
+    vep_chosen_csq_idx = 0
+
+    ## If multiple transcript-specific variant consequences highlighted by --pick_allele_gene,
+    ## prioritize/choose block of consequence according to 'vep_pick_order'
+    if len(vep_csq_results['picked_gene_csq']) > 1:
+        vep_chosen_csq_idx = pick_single_gene_csq(vep_csq_results, pick_criteria_ordered=vep_pick_order, logger=logger)
+        vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+    else:
+        ## check that size is 1, otherwise prompt error below
+        if len(vep_csq_results['picked_gene_csq']) == 1:
+            vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+        else:
+            logger.error('ERROR: No VEP block chosen by --pick_allele_gene')
+
+    return vep_csq_results
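The transcript prioritization performed by `pick_single_gene_csq` boils down to an iterative elimination: for each criterion in the configured pick order, candidates scoring above the criterion's minimum are dropped, and the loop stops as soon as a single candidate survives. A simplified, self-contained sketch of that scheme (the candidate dicts and helper below are hypothetical illustrations, not gvanno's actual data structures):

```python
def pick_candidate_index(candidates, criteria):
    """Return the index of the single prioritized candidate.

    Candidates are plain dicts of numeric scores; for every criterion,
    a LOWER score is prioritized (as in gvanno's pick logic).
    """
    picked = list(range(len(candidates)))
    for criterion in criteria:
        # keep only still-picked candidates holding the minimum score
        lowest = min(candidates[i][criterion] for i in picked)
        picked = [i for i in picked if candidates[i][criterion] == lowest]
        if len(picked) == 1:
            break
    return picked[0]

# Hypothetical transcript candidates (0 = property present/preferred)
transcripts = [
    {'mane_select': 1, 'canonical': 0, 'rank': 10},  # canonical, no MANE
    {'mane_select': 0, 'canonical': 1, 'rank': 12},  # MANE Select
    {'mane_select': 1, 'canonical': 1, 'rank': 2},   # highest-impact consequence
]

# 'mane_select' is evaluated first, so the MANE Select transcript wins
print(pick_candidate_index(transcripts, ['mane_select', 'canonical', 'rank']))  # → 1
```

Reordering the criteria changes the outcome: with `['rank', 'canonical', 'mane_select']`, the transcript carrying the highest-impact consequence (index 2) would be chosen instead, which is why `vep_pick_order` is exposed as a configurable parameter.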