diff --git a/README.md b/README.md index 287bcfc..a23ce3f 100755 --- a/README.md +++ b/README.md @@ -12,11 +12,15 @@ ### Overview -The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is built with the [Docker](https://www.docker.com) technology, and it can also be installed through the [Singularity](https://sylabs.io/docs/) framework. +The generic variant annotator (*gvanno*) is a software package intended for simple analysis and interpretation of human DNA variants. Variants and genes are annotated with disease-related and functional associations. Technically, the workflow is developed in Python, and it relies upon [Docker](https://www.docker.com) / [Singularity](https://sylabs.io/docs/) technology for encapsulation of software dependencies. *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short insertions or deletions (indels). The workflow relies heavily upon [Ensembl's Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html) and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record **per sample variant**. 
-### News +### News +- December 29th 2023 - **1.7.0 release** + - Data updates: ClinVar, GENCODE, GWAS catalog + - Software updates: VEP + - Improved Singularity support - April 27th 2023 - **1.6.0 release** @@ -28,16 +32,16 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - Added option `--vep_coding_only` - only report variants that fall into coding regions of transcripts (VEP option `--coding_only`) -### Annotation resources (v1.6.0) +### Annotation resources (v1.7.0) -- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v109 (GENCODE v43/v19 as the gene reference dataset) -- [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021) +- [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v110 (GENCODE v44/v19 as the gene reference dataset) +- [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.5, November 2023) - [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP - [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP -- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2023) +- [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2023) - [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 50, March 2023) - [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer -- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (March 27th 2023) +- [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - 
Catalog of published genome-wide association studies (November 2023) ### Getting started @@ -49,11 +53,15 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - *Other utilities* - The script that installs the reference data requires that the user has `bgzip` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (i.e. `gzip`, `tar`) - - **NOTE**: We strongly recommend that _gvanno_ is installed on a MacOS or Linux/UNIX operating system + The script that installs the reference data requires that the user has `bgzip` and `tabix` installed. See [here](http://www.htslib.org/download/) for instructions. The script also requires that basic Linux/UNIX commands are available (e.g. `gzip`, `tar`) -#### STEP 1: Installation of Docker + **NOTE**: _gvanno_ should be installed on a macOS or Linux/UNIX operating system + +#### STEP 1: Installation of Docker/Singularity + +- The _gvanno_ workflow can be executed with either _Docker_ or _Singularity_ container technology + +##### Installation of Docker 1. [Install the Docker engine](https://docs.docker.com/engine/installation/) on your preferred platform - installing [Docker on Linux](https://docs.docker.com/engine/installation/linux/) @@ -65,33 +73,33 @@ The generic variant annotator (*gvanno*) is a software package intended for simp - CPUs: minimum 4 - [How to - Mac OS X](https://docs.docker.com/docker-for-mac/#advanced) -##### 1.1: Installation of Singularity (_IN DEVELOPMENT_) - -0. **Note: this works for Singularity version 3.0 and higher**. +##### Installation of Singularity 1. [Install Singularity](https://sylabs.io/docs/) -2. Test that singularity works by running `singularity --version` - -3. If you are in the gvanno directory, build the singularity image like so: - - `cd src` - - `sudo ./buildSingularity.sh` #### STEP 2: Download *gvanno* and data bundle -1. 
[Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.6.0) +1. [Download and unpack the latest release](https://github.com/sigven/gvanno/releases/tag/v1.7.0) 2. Install the assembly-specific VEP cache, and gvanno-specific reference data using the `download_gvanno_refdata.py` script, i.e.: - `python download_gvanno_refdata.py --download_dir --genome_assembly grch38` - **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 30Gb pr. assembly-specific bundle) + **NOTE**: This can take a considerable amount of time depending on your local bandwidth (approx 20Gb per assembly-specific bundle) + + +3. Pull container images + + * Docker + * Pull the [gvanno Docker image (v1.7.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 3.8Gb): + + * `docker pull sigven/gvanno:1.7.0` (gvanno annotation engine) + + * Singularity + * Download the [gvanno SIF image (v1.7.0)](https://insilico.hpc.uio.no/pcgr/gvanno/gvanno_1.7.0.sif) (approx 1.2Gb) and use this as the argument for `--sif_file` in the `gvanno.py` run script. -3. Pull the [gvanno Docker image (1.6.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 4.11Gb): - - `docker pull sigven/gvanno:1.6.0` (gvanno annotation engine) #### STEP 3: Input preprocessing @@ -105,7 +113,7 @@ We **strongly** recommend that the input VCF is compressed and indexed using [bg Run the workflow with **gvanno.py**, which takes the following arguments and options: -``` +``` usage: gvanno.py -h [options] --query_vcf @@ -121,7 +129,7 @@ Required arguments: --query_vcf QUERY_VCF VCF input file with germline query variants (SNVs/InDels). --gvanno_dir GVANNO_DIR - Directory that contains the gvanno reference data, e.g. ~/gvanno-1.6.0 + Directory that contains the gvanno reference data, e.g. 
~/gvanno-1.7.0 --output_dir OUTPUT_DIR Output directory --genome_assembly {grch37,grch38} @@ -132,9 +140,9 @@ Required arguments: Sample identifier - prefix for output files VEP optional arguments: ---vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). ---vep_gencode_all Consider all GENCODE transcripts with Variant Effect Predictor (VEP) (option --gencode_basic in VEP is used by default in gvanno). ---vep_lof_prediction Predict loss-of-function variants with Loftee plugin in Variant Effect Predictor (VEP), default: False +--vep_regulatory Enable Variant Effect Predictor (VEP) to look for overlap with regulatory regions (option --regulatory in VEP). +--vep_gencode_basic Consider only basic GENCODE transcripts with Variant Effect Predictor (VEP). +--vep_lof_prediction Predict loss-of-function variants with the LOFTEE plugin in Variant Effect Predictor (VEP), default: False --vep_n_forks VEP_N_FORKS Number of forks for Variant Effect Predictor (VEP) processing, default: 4 --vep_buffer_size VEP_BUFFER_SIZE @@ -143,7 +151,7 @@ VEP optional arguments: --vep_pick_order VEP_PICK_ORDER Comma-separated string of ordered transcript properties for primary variant pick in Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane ---vep_skip_intergenic +--vep_no_intergenic Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: False --vep_coding_only Only report variants falling into coding regions of transcripts (VEP), default: False @@ -160,25 +168,40 @@ Other optional arguments: --oncogenicity_annotation Classify variants according to oncogenicity (Horak et al., Genet Med, 2022) --debug Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc. 
+--sif_file gvanno SIF image file for usage of gvanno workflow with option '--container singularity' ``` -The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command: +The *examples* folder contains an example VCF file. Analysis of the example VCF can be performed by the following command (Docker-based): -``` -python ~/gvanno-1.6.0/gvanno.py ---query_vcf ~/gvanno-1.6.0/examples/example.grch37.vcf.gz ---gvanno_dir ~/gvanno-1.6.0 ---output_dir ~/gvanno-1.6.0 +``` +python ~/gvanno-1.7.0/gvanno.py +--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz +--gvanno_dir ~/gvanno-1.7.0 +--output_dir ~/gvanno-1.7.0 --sample_id example --genome_assembly grch37 --container docker --force_overwrite ``` +or Singularity-based: + +``` +python ~/gvanno-1.7.0/gvanno.py +--query_vcf ~/gvanno-1.7.0/examples/example.grch37.vcf.gz +--gvanno_dir ~/gvanno-1.7.0 +--output_dir ~/gvanno-1.7.0 +--sample_id example +--genome_assembly grch37 +--container singularity +--sif_file gvanno_1.7.0.sif +--force_overwrite +``` + -This command will run the Docker-based *gvanno* workflow and produce the following output files in the *examples* folder: +Either command will run the *gvanno* workflow and produce the following output files in the *examples* folder: -1. **example_gvanno_pass_grch37.vcf.gz (.tbi)** - Bgzipped VCF file with rich set of functional/clinical annotations -2. **example_gvanno_pass_grch37.tsv.gz** - Compressed TSV file with rich set of functional/clinical annotations +1. **example_gvanno_grch37.pass.vcf.gz (.tbi)** - Bgzipped VCF file with a rich set of functional/clinical variant and gene annotations +2. **example_gvanno_grch37.pass.tsv.gz** - Compressed TSV file with a rich set of functional/clinical variant and gene annotations Similar files are produced for all variants, not only variants with a *PASS* designation in the VCF FILTER column. 
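The README examples above differ only in the `--container` choice and, for Singularity, the extra `--sif_file` argument. A minimal sketch of assembling such an invocation programmatically — `build_gvanno_command` is a hypothetical helper, not part of gvanno; it only mirrors the documented rule that `--container singularity` requires `--sif_file`:

```python
def build_gvanno_command(query_vcf, gvanno_dir, output_dir, sample_id,
                         genome_assembly, container, sif_file=None):
    """Assemble a gvanno.py invocation as an argument list (illustrative only)."""
    if container == "singularity" and sif_file is None:
        # gvanno.py itself exits with an error in this case
        raise ValueError("--container singularity requires --sif_file")
    cmd = [
        "python", f"{gvanno_dir}/gvanno.py",
        "--query_vcf", query_vcf,
        "--gvanno_dir", gvanno_dir,
        "--output_dir", output_dir,
        "--sample_id", sample_id,
        "--genome_assembly", genome_assembly,
        "--container", container,
        "--force_overwrite",
    ]
    if sif_file is not None:
        cmd += ["--sif_file", sif_file]
    return cmd
```

The resulting list could be passed to `subprocess.run` on a host where gvanno and its data bundle are installed.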
diff --git a/download_gvanno_refdata.py b/download_gvanno_refdata.py index f0b94bb..457d158 100755 --- a/download_gvanno_refdata.py +++ b/download_gvanno_refdata.py @@ -8,16 +8,16 @@ import sys import locale import errno -#import wget import urllib.request as urllib2 from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.6.0' -REFDATA_VERSION = '20230425' -ENSEMBL_VERSION = '109' -GENCODE_VERSION = 'v43' +GVANNO_VERSION = '1.7.0' +REFDATA_VERSION = '20231224' +ENSEMBL_VERSION = '110' +GENCODE_VERSION = 'v44' VEP_ASSEMBLY = "GRCh38" HOST_GVANNO_REFDATA_URL = "http://insilico.hpc.uio.no/pcgr/gvanno/" +HOST_HUMAN_ANCESTOR = "https://personal.broadinstitute.org/konradk/loftee_data/" def __main__(): @@ -35,7 +35,7 @@ def __main__(): 'download directory already exist.\nYou can force the overwrite of existing download directory by using this flag, default: %(default)s') optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION)) optional.add_argument('--clean_raw_files',action="store_true", help="Delete raw compressed tar files (i.e. 
VEP) after download and unzip + untar has been conducted successfully") - optional.add_argument("--debug", action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.") + optional.add_argument('--debug', action="store_true", help="Print full commands to log and do not delete intermediate files with warnings etc.") required.add_argument('--download_dir',help='Destination directory for downloaded reference data', required = True) required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Choose build-specific reference data for download: grch37 or grch38', required = True) @@ -65,7 +65,6 @@ def __main__(): os.mkdir(arg_dict['db_assembly_dir']) os.mkdir(arg_dict['vep_assembly_dir']) - download_gvanno_ref_data(arg_dict = arg_dict) @@ -190,7 +189,7 @@ def download_gvanno_ref_data(arg_dict): vep_assembly_dir = os.path.join(os.path.abspath(arg_dict['download_dir']),'data',arg_dict['genome_assembly'], '.vep') datasets = {} - for db in ['vep_cache','vep_fasta','gvanno_custom']: + for db in ['vep_cache','vep_fasta','gvanno_custom','human_ancestor']: datasets[db] = {} datasets[db]['remote_url'] = 'NA' datasets[db]['local_path'] = 'NA' @@ -208,8 +207,10 @@ def download_gvanno_ref_data(arg_dict): ) logger = getlogger('download-vep-cache') - datasets['vep_cache']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz") - datasets['vep_fasta']['local_path'] = os.path.join(arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz") + datasets['vep_cache']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], f"homo_sapiens_vep_{ENSEMBL_VERSION}_{VEP_ASSEMBLY}.tar.gz") + datasets['vep_fasta']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f"Homo_sapiens.{VEP_ASSEMBLY}.dna.primary_assembly.fa.gz") 
datasets['vep_fasta']['local_path_uncompressed'] = re.sub(r'.gz','',datasets['vep_fasta']['local_path']) vep_cache_bytes_remote = get_url_num_bytes(url = datasets['vep_cache']['remote_url'], logger = logger) @@ -273,15 +274,38 @@ def download_gvanno_ref_data(arg_dict): check_subprocess(command = command_unzip_fasta, logger = logger) command_bgzip_fasta = f"bgzip {datasets['vep_fasta']['local_path_uncompressed']}" check_subprocess(command = command_bgzip_fasta, logger = logger) - - + logger = getlogger('download-human-ancestor') + logger.info("Downloading human ancestor FASTA files") + for postfix in ['gz', 'gz.fai', 'gz.gzi']: + datasets['human_ancestor']['remote_url'] = f'{HOST_HUMAN_ANCESTOR}{VEP_ASSEMBLY}/human_ancestor.fa.{postfix}' + datasets['human_ancestor']['local_path'] = os.path.join( + arg_dict['vep_assembly_dir'], "homo_sapiens", f"{ENSEMBL_VERSION}_{VEP_ASSEMBLY}", f'human_ancestor.fa.{postfix}') + + logger = getlogger('download-human-ancestor') + custom_cache_bytes_remote = get_url_num_bytes(url = datasets['human_ancestor']['remote_url'], logger = logger) + logger.info('Human ancestor FASTA - remote target file ' + str(datasets['human_ancestor']['remote_url'])) + logger.info('Human ancestor FASTA - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger)) + logger.info('Human ancestor FASTA - local destination file: ' + str(datasets['human_ancestor']['local_path'])) + + + if os.path.exists(datasets['human_ancestor']['local_path']): + if os.path.getsize(datasets['human_ancestor']['local_path']) == custom_cache_bytes_remote: + logger.info('Human ancestor FASTA already downloaded') + else: + logger.info('Human ancestor FASTA - download in progress - this can take a while ... ') + urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path']) + else: + logger.info('Human ancestor FASTA - download in progress - this can take a while ... 
') + urllib2.urlretrieve(datasets['human_ancestor']['remote_url'], datasets['human_ancestor']['local_path']) + + datasets['gvanno_custom']['remote_url'] = f'{HOST_GVANNO_REFDATA_URL}gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz' datasets['gvanno_custom']['local_path'] = os.path.join(arg_dict['download_dir'], f'gvanno.databundle.{arg_dict["genome_assembly"]}.{REFDATA_VERSION}.tgz') logger = getlogger('download-gvanno-custom') custom_cache_bytes_remote = get_url_num_bytes(url = datasets['gvanno_custom']['remote_url'], logger = logger) - logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / cancerhotspots ++") + logger.info("Downloading custom gvanno variant datasets: Clinvar / dbNSFP / ncER / GWAS catalog / cancerhotspots ++") logger.info('Custom gvanno datasets - remote target file ' + str(datasets['gvanno_custom']['remote_url'])) logger.info('Custom gvanno datasets - size: ' + pretty_print(custom_cache_bytes_remote, logger = logger)) logger.info('Custom gvanno datasets - local destination file: ' + str(datasets['gvanno_custom']['local_path'])) diff --git a/examples/example_small.grch37.vcf.gz b/examples/example_small.grch37.vcf.gz new file mode 100644 index 0000000..429b8b5 Binary files /dev/null and b/examples/example_small.grch37.vcf.gz differ diff --git a/examples/example_small.grch37.vcf.gz.tbi b/examples/example_small.grch37.vcf.gz.tbi new file mode 100644 index 0000000..542e4b2 Binary files /dev/null and b/examples/example_small.grch37.vcf.gz.tbi differ diff --git a/examples/example.grch38.vcf.gz b/examples/example_small.grch38.vcf.gz similarity index 100% rename from examples/example.grch38.vcf.gz rename to examples/example_small.grch38.vcf.gz diff --git a/examples/example.grch38.vcf.gz.tbi b/examples/example_small.grch38.vcf.gz.tbi old mode 100755 new mode 100644 similarity index 100% rename from examples/example.grch38.vcf.gz.tbi rename to examples/example_small.grch38.vcf.gz.tbi diff --git a/gvanno.py 
b/gvanno.py index 34ee0f4..8e2eba5 100755 --- a/gvanno.py +++ b/gvanno.py @@ -1,7 +1,5 @@ #!/usr/bin/env python -import csv -import re import argparse import os import subprocess @@ -11,10 +9,10 @@ import platform from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.6.0' -DB_VERSION = 'GVANNO_DB_VERSION = 20230425' -VEP_VERSION = '109' -GENCODE_VERSION = 'v43' +GVANNO_VERSION = '1.7.0' +DB_VERSION = 'GVANNO_DB_VERSION = 20231224' +VEP_VERSION = '110' +GENCODE_VERSION = 'v44' VEP_ASSEMBLY = "GRCh38" DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION) @@ -35,10 +33,10 @@ def __main__(): optional.add_argument('--force_overwrite', action = "store_true", help='By default, the script will fail with an error if any ' + \ 'output file already exists.\nYou can force the overwrite of existing result files by using this flag, default: %(default)s') optional.add_argument('--version', action='version', version='%(prog)s ' + str(GVANNO_VERSION)) - optional.add_argument('--no_vcf_validate', action = "store_true",help="Skip validation of input VCF with Ensembl's vcf-validator, default: %(default)s") - optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. default is the host system user ID. If you are experiencing permission errors, try setting this up to root (`--docker-uid root`)') + optional.add_argument('--docker_uid', dest = 'docker_user_id', help = 'Docker user ID. default is the host system user ID. 
' + \ + 'If you are experiencing permission errors, try setting it to root (`--docker_uid root`)') optional_vep.add_argument('--vep_regulatory', action='store_true', help = 'Enable VEP to look for overlap with regulatory regions (option --regulatory in VEP).') - optional_vep.add_argument('--vep_gencode_all', action='store_true', help = 'Consider all GENCODE transcripts with VEP (option --gencode_basic in VEP is used by default in gvanno).') + optional_vep.add_argument('--vep_gencode_basic', action='store_true', help = 'Consider only basic GENCODE transcripts with VEP (option --gencode_basic in VEP).') optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \ "in VEP, default: %(default)s") optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for VEP processing, default: %(default)s") @@ -46,18 +44,19 @@ "for VEP processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s") optional_vep.add_argument('--vep_pick_order', default = "mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", help="Comma-separated string " + \ "of ordered transcript properties for primary variant pick in\nVEP processing, default: %(default)s") - optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s") + optional_vep.add_argument('--vep_no_intergenic', action = "store_true", help="Skip intergenic variants (VEP), default: %(default)s") optional_vep.add_argument('--vep_coding_only', action = "store_true", help="Only return consequences that fall in the coding regions of transcripts (VEP), default: %(default)s") optional.add_argument('--vcfanno_n_processes', default = 4, help="Number of processes for vcfanno " + \ "processing (see https://github.com/brentp/vcfanno#-p), default: %(default)s") 
optional.add_argument('--oncogenicity_annotation', action ='store_true', help = 'Classify variants according to oncogenicity (Horak et al., Genet Med, 2022)') optional.add_argument("--debug", action="store_true", help="Print full Docker/Singularity commands to log and do not delete intermediate files with warnings etc.") + optional.add_argument("--sif_file", help="gvanno SIF file for usage of gvanno workflow with option '--container singularity'", default = None) required.add_argument('--query_vcf', help='VCF input file with query variants (SNVs/InDels).', required = True) required.add_argument('--gvanno_dir',help='Directory that contains the gvanno data bundle, e.g. ~/gvanno-' + str(GVANNO_VERSION), required = True) required.add_argument('--output_dir',help='Output directory', required = True) required.add_argument('--genome_assembly',choices = ['grch37','grch38'], help='Genome assembly build: grch37 or grch38', required = True) - required.add_argument('--container', choices = ['docker', 'singularity'], action = "store",help="Run gvanno with docker or singularity") + required.add_argument('--container', choices = ['docker', 'singularity'], action = "store",help="Run gvanno with docker or singularity (the latter requires --sif_file)") required.add_argument('--sample_id',help="Sample identifier - prefix for output files", required = True) args = parser.parse_args() @@ -133,6 +132,15 @@ def verify_input_files(arg_dict, logger): err_msg = "Output directory (" + str(output_dir_full) + ") does not exist" gvanno_error_message(err_msg,logger) + if arg_dict['container'] == 'singularity': + if arg_dict['sif_file'] is None: + err_msg = "Please specify a singularity image when running in singularity mode (option '--sif_file')" + gvanno_error_message(err_msg,logger) + else: + if not os.path.exists(os.path.abspath(arg_dict['sif_file'])): + err_msg = "Singularity image file ('--sif_file' " + str(arg_dict['sif_file']) + ") does not exist" + gvanno_error_message(err_msg,logger) + 
## check if input vcf exist if not arg_dict['query_vcf'] is None: if not os.path.exists(os.path.abspath(arg_dict['query_vcf'])): @@ -147,7 +155,8 @@ def verify_input_files(arg_dict, logger): if os.path.abspath(arg_dict['query_vcf']).endswith('.vcf.gz'): tabix_file = arg_dict['query_vcf'] + '.tbi' if not os.path.exists(os.path.abspath(tabix_file)): - err_msg = "Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file (" + os.path.abspath(arg_dict['query_vcf']) + "). Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" + err_msg = "Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file (" + \ + os.path.abspath(arg_dict['query_vcf']) + "). Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" gvanno_error_message(err_msg,logger) input_vcf_basename = os.path.basename(str(arg_dict['query_vcf'])) @@ -192,7 +201,8 @@ def verify_input_files(arg_dict, logger): f_rel_not.close() if compliant_data_bundle == 0: - err_msg = 'The gvanno data bundle is not compliant with the software version - please download the latest software and data bundle (see https://github.com/sigven/gvanno for instructions)' + err_msg = 'The gvanno data bundle is not compliant with the software version - please download the ' + \ + 'latest software and data bundle (see https://github.com/sigven/gvanno for instructions)' gvanno_error_message(err_msg,logger) host_directories = {} @@ -242,6 +252,7 @@ def run_gvanno(arg_dict, host_directories): output_pass_vcf = 'None' uid = '' docker_user_id = arg_dict['docker_user_id'] + debug = arg_dict['debug'] global GENCODE_VERSION, VEP_ASSEMBLY if arg_dict['genome_assembly'] == 'grch37': @@ -258,30 +269,48 @@ def run_gvanno(arg_dict, host_directories): uid = getpass.getuser() if uid == '': - logger.warning('Was not able to get user id/username for logged-in user on the underlying platform (platform.system(): ' + str(platform.system()) + ', sys.platform: ' + 
str(sys.platform) + '), now running gvanno as root') + logger.warning('Was not able to get user id/username for logged-in user on the underlying platform (platform.system(): ' + \ + str(platform.system()) + ', sys.platform: ' + str(sys.platform) + '), now running gvanno as root') uid = 'root' vepdb_dir_host = os.path.join(str(host_directories['db_dir_host']),'.vep') - vcf_validation = 1 - if arg_dict['no_vcf_validate']: - vcf_validation = 0 data_dir = '/data' output_dir = '/workdir/output' vep_dir = '/usr/local/share/vep/data' input_vcf_docker = 'None' + data_dir_assembly = os.path.join(data_dir, "data", str(arg_dict["genome_assembly"])) + + #genome_assembly = arg_dict['genome_assembly'] + + conf_options = {} + conf_options['sample_id'] = arg_dict['sample_id'] + conf_options['genome_assembly'] = arg_dict['genome_assembly'] + conf_options['conf'] = {} + conf_options['conf']['vep'] = {} + conf_options['conf']['vep']['vep_n_forks'] = arg_dict['vep_n_forks'] + conf_options['conf']['vep']['vep_pick_order'] = arg_dict['vep_pick_order'] + conf_options['conf']['vep']['vep_buffer_size'] = arg_dict['vep_buffer_size'] + conf_options['conf']['vep']['vep_gencode_basic'] = arg_dict['vep_gencode_basic'] + conf_options['conf']['vep']['vep_regulatory'] = arg_dict['vep_regulatory'] + conf_options['conf']['vep']['vep_lof_prediction'] = arg_dict['vep_lof_prediction'] + conf_options['conf']['vep']['vep_no_intergenic'] = arg_dict['vep_no_intergenic'] + conf_options['conf']['vep']['vep_coding_only'] = arg_dict['vep_coding_only'] + if host_directories['input_vcf_basename_host'] != 'NA': input_vcf_docker = '/workdir/input_vcf/' + str(host_directories['input_vcf_basename_host']) - vep_volume_mapping = str(vepdb_dir_host) + ":/usr/local/share/vep/data" - databundle_volume_mapping = str(host_directories['base_dir_host']) + ":/data" + vep_volume_mapping = str(vepdb_dir_host) + ":" + str(vep_dir) + databundle_volume_mapping = str(host_directories['base_dir_host']) + ":" + str(data_dir) 
input_vcf_volume_mapping = str(host_directories['input_vcf_dir_host']) + ":/workdir/input_vcf" - output_volume_mapping = str(host_directories['output_dir_host']) + ":/workdir/output" + output_volume_mapping = str(host_directories['output_dir_host']) + ":" + str(output_dir) if arg_dict['container'] == 'docker': - container_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(vep_volume_mapping) + " -v=" + str(output_volume_mapping) + container_command_run1 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + \ + " -v=" + str(vep_volume_mapping) + " -v=" + str(output_volume_mapping) elif arg_dict['container'] == 'singularity': - container_command_run1 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + str(vep_volume_mapping) + " -B " + str(output_volume_mapping) + container_command_run1 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + \ + str(vep_volume_mapping) + " -B " + str(output_volume_mapping) if host_directories['input_vcf_dir_host'] != 'NA' and arg_dict['container'] == 'docker': container_command_run1 = container_command_run1 + " -v=" + str(input_vcf_volume_mapping) @@ -291,7 +320,7 @@ def run_gvanno(arg_dict, host_directories): if arg_dict['container'] == 'docker': container_command_run1 = container_command_run1 + " -w=/workdir/output " + str(DOCKER_IMAGE_VERSION) + " sh -c \"" elif arg_dict['container'] == 'singularity': - container_command_run1 = container_command_run1 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" + container_command_run1 = container_command_run1 + " -W /workdir/output " + arg_dict['sif_file'] + " sh -c \"" if arg_dict['container'] == 'docker': container_command_run2 = "docker run --rm -t -u " + str(uid) + " -v=" + str(databundle_volume_mapping) + " -v=" + str(output_volume_mapping) @@ -299,7 +328,7 @@ def run_gvanno(arg_dict, host_directories): docker_command_run_end = '\"' elif 
arg_dict['container'] == 'singularity': container_command_run2 = "singularity exec " + " -B " + str(databundle_volume_mapping) + " -B " + str(output_volume_mapping) - container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" + container_command_run2 = container_command_run2 + " -W /workdir/output " + arg_dict['sif_file'] + " sh -c \"" docker_command_run_end = '\"' if arg_dict['debug']: @@ -311,96 +340,71 @@ def run_gvanno(arg_dict, host_directories): logger.info("--- Generic variant annotation (gvanno) workflow ----") logger.info("Sample name: " + str(arg_dict['sample_id'])) logger.info("Genome assembly: " + str(arg_dict['genome_assembly'])) - print() - - ## GVANNO|validate - verify input file (contents/format) - logger = getlogger('gvanno-validate-input') - logger.info("STEP 0: Validate input data") - vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \ - str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end - if arg_dict['debug']: - logger.info(vcf_validate_command) - - check_subprocess(vcf_validate_command) - logger.info('Finished') if not input_vcf_docker == 'None': - ## Define input, output and temporary file names - output_vcf = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_' + str(arg_dict['genome_assembly']) + '.vcf.gz') - output_tsv = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_' + str(arg_dict['genome_assembly']) + '.tsv') - output_pass_vcf = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_pass_' + str(arg_dict['genome_assembly']) + '.vcf.gz') - output_pass_tsv = os.path.join(output_dir, str(arg_dict['sample_id']) + '_gvanno_pass_' + str(arg_dict['genome_assembly']) + '.tsv') - input_vcf_gvanno_ready = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf.gz',host_directories['input_vcf_basename_host'])) - vep_vcf = 
re.sub(r'(\.vcf$|\.vcf\.gz$)','.vep.vcf',input_vcf_gvanno_ready) - vep_vcfanno_vcf = re.sub(r'(\.vcf$|\.vcf\.gz$)','.vep.vcfanno.vcf',input_vcf_gvanno_ready) - vep_vcfanno_annotated_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated',vep_vcfanno_vcf) + '.gz' - vep_vcfanno_annotated_pass_vcf = re.sub(r'\.vcfanno','.vcfanno.annotated.pass',vep_vcfanno_vcf) + '.gz' + # Define temporary output file names + prefix = os.path.join(output_dir, f'{conf_options["sample_id"]}_gvanno_{conf_options["genome_assembly"]}') - ## Path for human genome assembly and human ancestor (FASTA) - fasta_assembly = os.path.join(vep_dir, "homo_sapiens", str(VEP_VERSION) + "_" + str(VEP_ASSEMBLY), "Homo_sapiens." + str(VEP_ASSEMBLY) + ".dna.primary_assembly.fa.gz") - ancestor_assembly = os.path.join(vep_dir, "homo_sapiens", str(VEP_VERSION) + "_" + str(VEP_ASSEMBLY), "human_ancestor.fa.gz") - - ## List all VEP flags used when calling VEP - loftee_dir = '/opt/vep/src/ensembl-vep/modules' - plugins_in_use = "NearestExonJB" - vep_flags = "--hgvs --dont_skip --failed 1 --af --af_1kg --af_gnomade --af_gnomadg --variant_class --domains --symbol --protein --ccds " + \ - "--uniprot --appris --biotype --canonical --format vcf --mane --cache --numbers --total_length --allele_number --no_escape " + \ - "--xref_refseq --plugin NearestExonJB,max_range=50000" - vep_options = "--vcf --quiet --check_ref --flag_pick_allele_gene --pick_order " + str(arg_dict['vep_pick_order']) + \ - " --force_overwrite --species homo_sapiens --assembly " + str(VEP_ASSEMBLY) + " --offline --fork " + \ - str(arg_dict['vep_n_forks']) + " " + str(vep_flags) + " --dir /usr/local/share/vep/data" - - gencode_set_in_use = "GENCODE - all transcripts" - if arg_dict['vep_gencode_all'] == 0: - vep_options = vep_options + " --gencode_basic" - gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)" - if arg_dict['vep_skip_intergenic'] == 1: - vep_options = vep_options + " --no_intergenic" - if arg_dict['vep_coding_only'] == 1: - 
vep_options = vep_options + " --coding_only" - if arg_dict['vep_regulatory'] == 1: - vep_options = vep_options + " --regulatory" - if arg_dict['vep_lof_prediction'] == 1: - plugins_in_use = plugins_in_use + ", LoF" - vep_options += " --plugin LoF,loftee_path:" + loftee_dir + ",human_ancestor_fa:" + str(ancestor_assembly) + ",use_gerp_end_trunc:0 --dir_plugins " + loftee_dir - - ## Compose full VEP command - vep_main_command = str(container_command_run1) + "vep --input_file " + str(input_vcf_gvanno_ready) + " --output_file " + str(vep_vcf) + \ - " " + str(vep_options) + " --buffer_size " + str(arg_dict['vep_buffer_size']) + " --fasta " + str(fasta_assembly) + docker_command_run_end - vep_bgzip_command = container_command_run1 + "bgzip -f -c " + str(vep_vcf) + " > " + str(vep_vcf) + ".gz" + docker_command_run_end - vep_tabix_command = str(container_command_run1) + "tabix -f -p vcf " + str(vep_vcf) + ".gz" + docker_command_run_end - - ## GVANNO|VEP - run consequence annotation with Variant Effect Predictor - logger = getlogger('gvanno-vep') - print() - logger.info("STEP 1: Basic variant annotation with Variant Effect Predictor (v" + str(VEP_VERSION) + ", GENCODE " + str(GENCODE_VERSION) + ", " + str(arg_dict['genome_assembly']) + ")") - logger.info("VEP configuration - one primary consequence block pr. 
alternative allele (--flag_pick_allele)") - logger.info("VEP configuration - transcript pick order: " + str(arg_dict['vep_pick_order'])) - logger.info("VEP configuration - transcript pick order: See more at https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options") - logger.info("VEP configuration - GENCODE set: " + str(gencode_set_in_use)) - logger.info("VEP configuration - buffer size: " + str(arg_dict['vep_buffer_size'])) - logger.info("VEP configuration - skip intergenic: " + str(arg_dict['vep_skip_intergenic'])) - logger.info("VEP configuration - coding only: " + str(arg_dict['vep_coding_only'])) - logger.info("VEP configuration - look for overlap with regulatory regions: " + str(arg_dict['vep_regulatory'])) - logger.info("VEP configuration - number of forks: " + str(arg_dict['vep_n_forks'])) - logger.info("VEP configuration - loss-of-function prediction: " + str(arg_dict['vep_lof_prediction'])) - logger.info("VEP configuration - plugins in use: " + str(plugins_in_use)) - - if arg_dict['debug']: - logger.info(vep_main_command) - check_subprocess(vep_main_command) - check_subprocess(vep_bgzip_command) - check_subprocess(vep_tabix_command) - logger.info("Finished") - - ## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources - print() + input_vcf_validated = f'{prefix}.gvanno_ready.vcf' + vep_vcf = f'{prefix}.vep.vcf' + vep_vcfanno_vcf = f'{prefix}.vep.vcfanno.vcf' + vep_vcfanno_summarised_vcf = f'{prefix}.vep.vcfanno.summarised.vcf' + vep_vcfanno_summarised_pass_vcf = f'{prefix}.vep.vcfanno.summarised.pass.vcf' + output_vcf = f'{prefix}.vcf.gz' + output_pass_vcf = f'{prefix}.pass.vcf.gz' + output_vcf2tsv = f'{prefix}.vcf2tsv.tsv' + output_pass_vcf2tsv = f'{prefix}.pass.vcf2tsv.tsv' + output_pass_tsv = f'{prefix}.pass.tsv.gz' + + # gvanno|validate_input - verify that VCF is of appropriate format + logger = getlogger("gvanno-validate-input") + print('') + logger.info("gvanno - STEP 0: Validate input data and 
options") + + vcf_validate_command = ( + f'{container_command_run1}' + f'gvanno_validate_input.py ' + f'{data_dir} ' + f'{input_vcf_docker} ' + f'{input_vcf_validated} ' + f'{conf_options["genome_assembly"]} ' + f'{conf_options["sample_id"]} ' + f'--output_dir {output_dir} ' + f'{"--debug " if debug else ""}' + f'{docker_command_run_end}' + ) + check_subprocess(vcf_validate_command) + logger.info('Finished gvanno-validate-input') + print('----') + logger = getlogger("gvanno-run-vep") + logger.info("gvanno - STEP 1: Variant Effect Predictor (VEP)") + gvanno_vep_command = ( + f'{container_command_run1}' + f'gvanno_vep.py ' + f'{vep_dir} ' + f'{input_vcf_validated}.gz ' + f'{vep_vcf} ' + f'{conf_options["genome_assembly"]} ' + f'{conf_options["conf"]["vep"]["vep_pick_order"]} ' + f'{"--vep_regulatory" if conf_options["conf"]["vep"]["vep_regulatory"] else ""} ' + f'--vep_buffer_size {int(conf_options["conf"]["vep"]["vep_buffer_size"])} ' + f'--vep_n_forks {int(conf_options["conf"]["vep"]["vep_n_forks"])} ' + f'{"--vep_gencode_basic" if conf_options["conf"]["vep"]["vep_gencode_basic"] else ""} ' + f'{"--vep_lof_prediction" if conf_options["conf"]["vep"]["vep_lof_prediction"] else ""} ' + f'{"--vep_no_intergenic" if conf_options["conf"]["vep"]["vep_no_intergenic"] else ""} ' + f'{"--debug " if debug else ""}' + f'{docker_command_run_end}' + ) + check_subprocess(gvanno_vep_command) + + ## gvanno|vcfanno - annotate VCF against a number of variant annotation resources + print("----") logger = getlogger('gvanno-vcfanno') logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (Clinvar, ncER, dbNSFP, GWAS catalog)") logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes'])) gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \ - " --dbnsfp --clinvar --ncer --gvanno_xref --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + 
\ + " --dbnsfp --gene_transcript_xref --clinvar --ncer --gwas " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ " " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end if arg_dict['debug']: @@ -408,28 +412,37 @@ def run_gvanno(arg_dict, host_directories): check_subprocess(gvanno_vcfanno_command) logger.info("Finished") - ## GVANNO|summarise - expand annotations in VEP and vcfanno-annotated VCF file - print() + ## gvanno|summarise - expand annotations in VEP and vcfanno-annotated VCF file + print("----") logger = getlogger("gvanno-summarise") logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise") logger.info("Configuration - oncogenicity classification: " + str(int(arg_dict['oncogenicity_annotation']))) - gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \ - os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \ - " " + str(int(arg_dict['oncogenicity_annotation'])) + " " + str(int(arg_dict['vep_regulatory'])) + " " + \ - str(int(arg_dict['debug'])) + docker_command_run_end + gvanno_summarise_command = ( + f'{container_command_run2}' + f'gvanno_summarise.py ' + f'{vep_vcfanno_vcf}.gz ' + f'{vep_vcfanno_summarised_vcf} ' + f'{int(arg_dict["vep_regulatory"])} ' + f'{int(arg_dict["oncogenicity_annotation"])} ' + f'{conf_options["conf"]["vep"]["vep_pick_order"]} ' + f'{data_dir_assembly} ' + f'{"--debug " if debug else ""}' + f'--compress_output_vcf ' + f'{docker_command_run_end}' + ) if arg_dict['debug']: logger.info(gvanno_summarise_command) check_subprocess(gvanno_summarise_command) logger.info("Finished") - ## GVANNO|clean - move output files and clean up temporary files - create_output_vcf_command1 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_vcf) + ' ' + str(output_vcf) + "\"" - create_output_vcf_command2 = str(container_command_run2) + 'mv ' + 
str(vep_vcfanno_annotated_vcf) + '.tbi ' + str(output_vcf) + '.tbi' + "\"" - create_output_vcf_command3 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + ' ' + str(output_pass_vcf) + "\"" - create_output_vcf_command4 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_annotated_pass_vcf) + '.tbi ' + str(output_pass_vcf) + '.tbi' + "\"" - clean_command = str(container_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_annotated_vcf) + ' ' + \ - str(vep_vcfanno_annotated_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_gvanno_ready) + "* " + docker_command_run_end + ## gvanno|clean - move output files and clean up temporary files + create_output_vcf_command1 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_vcf) + '.gz ' + str(output_vcf) + "\"" + create_output_vcf_command2 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_vcf) + '.gz.tbi ' + str(output_vcf) + '.tbi' + "\"" + create_output_vcf_command3 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_pass_vcf) + '.gz ' + str(output_pass_vcf) + "\"" + create_output_vcf_command4 = str(container_command_run2) + 'mv ' + str(vep_vcfanno_summarised_pass_vcf) + '.gz.tbi ' + str(output_pass_vcf) + '.tbi' + "\"" + clean_command = str(container_command_run2) + 'rm -f ' + str(vep_vcf) + '* ' + str(vep_vcfanno_summarised_vcf) + ' ' + \ + str(vep_vcfanno_summarised_pass_vcf) + '* ' + str(vep_vcfanno_vcf) + '* ' + str(input_vcf_validated) + "* " + docker_command_run_end check_subprocess(create_output_vcf_command1) check_subprocess(create_output_vcf_command2) check_subprocess(create_output_vcf_command3) @@ -437,21 +450,41 @@ def run_gvanno(arg_dict, host_directories): if not arg_dict['debug']: check_subprocess(clean_command) - print() - ## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv + print("----") + ## gvanno|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv logger = 
getlogger("gvanno-vcf2tsv") logger.info("STEP 4: Converting genomic VCF to TSV with https://github.com/sigven/vcf2tsvpy") - gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_tsv) + docker_command_run_end - gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_tsv) + docker_command_run_end + gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_pass_vcf) + " --compress --out_tsv " + str(output_pass_vcf2tsv) + docker_command_run_end + gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsvpy --input_vcf " + str(output_vcf) + " --compress --keep_rejected --out_tsv " + str(output_vcf2tsv) + docker_command_run_end logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only") check_subprocess(gvanno_vcf2tsv_command_pass) logger.info("Conversion of VCF variant data to records of tab-separated values - PASS and non-PASS variants") check_subprocess(gvanno_vcf2tsv_command_all) logger.info("Finished") + + print("----") + ## gvanno|finalize - append ClinVar traits, gene names and protein domain annotations to the final TSV + logger = getlogger("gvanno-finalize") + logger.info("STEP 5: Appending ClinVar traits, official gene names, and protein domain annotations") + gvanno_append_tsv_command = ( + f'{container_command_run2}' + f'gvanno_finalize.py ' + f'{data_dir_assembly} ' + f'{output_pass_vcf2tsv}.gz ' + f'{output_pass_tsv} ' + f'{arg_dict["genome_assembly"]} ' + f'{arg_dict["sample_id"]}' + f'{docker_command_run_end}' + ) + check_subprocess(gvanno_append_tsv_command) + logger.info("Finished") + if not debug: + clean_command2 = str(container_command_run2) + 'rm -f ' + str(output_pass_vcf2tsv) + '* ' + \ + str(output_vcf2tsv) + "* " + docker_command_run_end + check_subprocess(clean_command2) 
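The refactored workflow steps above all follow the same pattern: a container-wrapped command string is assembled with f-strings, where optional flags are spliced in via conditional fragments so a disabled option contributes an empty string rather than a dangling token. A minimal stand-alone sketch of that pattern (the function name and example flags are illustrative, not part of gvanno):

```python
def compose_step_command(container_prefix, script, positionals, toggles, debug=False):
    """Build a container-wrapped command string from positional args and boolean toggles.

    container_prefix is assumed to end with the opening quote of an `sh -c "` wrapper,
    mirroring container_command_run1/2 in the workflow; the final '"' closes it.
    """
    # Only enabled toggles contribute a flag; disabled ones vanish entirely
    flag_part = ' '.join(flag for flag, enabled in toggles.items() if enabled)
    return (
        f'{container_prefix}'
        f'{script} '
        f'{" ".join(str(p) for p in positionals)} '
        f'{flag_part + " " if flag_part else ""}'
        f'{"--debug " if debug else ""}'
        '"'  # closes the sh -c "..." opened by the container prefix
    )

cmd = compose_step_command(
    'singularity exec gvanno.sif sh -c "',
    'gvanno_vep.py',
    ['input.vcf.gz', 'output.vep.vcf', 'grch38'],
    {'--vep_regulatory': True, '--vep_lof_prediction': False},
    debug=True,
)
```

The payoff of this style over the older `str(x) + " " + str(y)` concatenation is that each flag's presence is decided inline, in one readable expression per step.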
+ - #return - print if __name__=="__main__": __main__() diff --git a/src/Dockerfile b/src/Dockerfile index 1a0a2e2..c0a3f36 100644 --- a/src/Dockerfile +++ b/src/Dockerfile @@ -1,3 +1,4 @@ +ARG BRANCH=release/110 ################################################### # Stage 1 - docker container to build ensembl-vep # @@ -23,7 +24,7 @@ RUN apt-get update && apt-get -y install \ ENV OPT /opt/vep ENV OPT_SRC $OPT/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV BRANCH=release/109 +ARG BRANCH # Working directory WORKDIR $OPT_SRC @@ -100,7 +101,6 @@ RUN apt-get update && apt-get -y install \ perl \ perl-base \ unzip \ - wget \ vim && \ apt-get -y purge manpages-dev && \ apt-get clean && \ @@ -113,10 +113,10 @@ ENV PERL5LIB_TMP $PERL5LIB:$OPT_SRC/ensembl-vep:$OPT_SRC/ensembl-vep/modules ENV PERL5LIB $PERL5LIB_TMP:$OPT_SRC/bioperl-live ENV KENT_SRC $OPT/src/kent-335_base/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV MACHTYPE x86_64 ENV DEPS $OPT_SRC ENV PATH $OPT_SRC/ensembl-vep:$OPT_SRC/var_c_code:$PATH ENV LANG_VAR en_US.UTF-8 +ARG BRANCH # Create vep user RUN useradd -r -m -U -d "$OPT" -s /bin/bash -c "VEP User" -p '' vep && usermod -a -G sudo vep && mkdir -p $OPT_SRC @@ -139,7 +139,8 @@ RUN perl Makefile.PL && make && make install && rm -f Makefile* cpanfile WORKDIR $OPT_SRC # Install/compile more libraries -RUN ensembl-vep/travisci/build_c.sh && \ +RUN export MACHTYPE=$(uname -m) &&\ + ensembl-vep/travisci/build_c.sh && \ # Remove unused Bio-DB-HTS files rm -rf Bio-HTS/cpanfile Bio-HTS/Build.PL Bio-HTS/Build Bio-HTS/_build Bio-HTS/INSTALL.pl && \ # Install ensembl perl dependencies (cpanm) @@ -151,7 +152,9 @@ RUN ensembl-vep/travisci/build_c.sh && \ echo "$LANG_VAR UTF-8" >> /etc/locale.gen && locale-gen en_US.utf8 && \ /usr/sbin/update-locale LANG=$LANG_VAR && \ # Copy htslib executables. 
It also requires the packages 'zlib1g-dev', 'libbz2-dev' and 'liblzma-dev' - cp $HTSLIB_DIR/bgzip $HTSLIB_DIR/tabix $HTSLIB_DIR/htsfile /usr/local/bin/ + cp $HTSLIB_DIR/bgzip $HTSLIB_DIR/tabix $HTSLIB_DIR/htsfile /usr/local/bin/ && \ + # Remove CPAN cache + rm -rf /root/.cpanm ENV LC_ALL $LANG_VAR ENV LANG $LANG_VAR @@ -160,43 +163,32 @@ ENV LANG $LANG_VAR USER vep ENV PERL5LIB $PERL5LIB_TMP - -USER vep -ENV PERL5LIB $PERL5LIB_TMP +# Setup Docker environment for when users run VEP and INSTALL.pl in Docker image: +# - skip VEP updates in INSTALL.pl +ENV VEP_NO_UPDATE 1 +# - avoid Faidx/HTSLIB installation in INSTALL.pl +ENV VEP_NO_HTSLIB 1 +# - skip plugin installation in INSTALL.pl +ENV VEP_NO_PLUGINS 1 +# - set plugins directory for VEP and INSTALL.pl +ENV VEP_DIR_PLUGINS /plugins +ENV VEP_PLUGINSDIR $VEP_DIR_PLUGINS +WORKDIR $VEP_DIR_PLUGINS WORKDIR / ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules ADD NearestExonJB.pm $OPT/src/ensembl-vep/modules -#RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm - -# Final steps -WORKDIR $OPT_SRC/ensembl-vep # Update bash profile +WORKDIR $OPT_SRC/ensembl-vep RUN echo >> $OPT/.profile && \ echo PATH=$PATH:\$PATH >> $OPT/.profile && \ echo export PATH >> $OPT/.profile && \ - # Run INSTALL.pl and remove the ensemb-vep tests and travis - ./INSTALL.pl -a a -l -n && rm -rf t travisci .travis.yml - - -# Install dependencies for VEP plugins: -#USER root -#ENV PLUGIN_DEPS "https://raw.githubusercontent.com/Ensembl/VEP_plugins/$BRANCH/config" -## - Ubuntu packages -#RUN curl -O "$PLUGIN_DEPS/ubuntu-packages.txt" && \ -# apt-get update && apt-get install -y --no-install-recommends \ -# $(sed s/\#.*//g ubuntu-packages.txt) && \ -# rm -rf /var/lib/apt/lists/* ubuntu-packages.txt -## - Perl modules -#RUN curl -O "$PLUGIN_DEPS/cpanfile" && \ -# cpanm --installdeps --with-recommends . 
&& \ -# rm -rf /root/.cpanm cpanfile -## - Python packages -#RUN curl -O "$PLUGIN_DEPS/requirements.txt" && \ -# python -m pip install --no-cache-dir -r requirements.txt && \ -# rm requirements.txt + # Install Ensembl API and plugins + ./INSTALL.pl --auto ap --plugins all --pluginsdir $VEP_DIR_PLUGINS --no_update --no_htslib && \ + # Remove the ensembl-vep tests and travis + rm -rf t travisci .travis.yml ###################################################### # Stage 3 - adding dependencies for gvanno analysis # @@ -210,17 +202,15 @@ RUN apt-get update \ && ln -s /usr/bin/python3 python \ && pip3 install --upgrade pip -RUN apt-get update && apt-get -y install apache2 apt-utils build-essential cpanminus curl git libmysqlclient-dev libpng-dev libssl-dev manpages mysql-client openssl perl perl-base unzip vim wget sudo +RUN apt-get update && apt-get -y install apache2 apt-utils build-essential curl git libpng-dev libssl-dev manpages openssl unzip vim wget sudo # install ensembl dependencies -RUN cpanm Test::Object PPI::Document Task::Weaken Test::SubCalls Test::Object DBI DBD::mysql Archive::Zip Perl::Critic Set::IntervalTree -RUN apt-get update && apt-get install apt-transport-https USER root WORKDIR / ENV PACKAGE_BIO="libhts2 bedtools" ENV PACKAGE_DEV="gfortran gcc-multilib autoconf liblzma-dev libncurses5-dev libblas-dev liblapack-dev libssh2-1-dev libxml2-dev vim libssl-dev libcairo2-dev libbz2-dev libcurl4-openssl-dev" -ENV PYTHON_MODULES="numpy==1.19.2 cython==0.29.21 scipy==1.5.3 pandas==1.1.3 cyvcf2==0.20.9" +ENV PYTHON_MODULES="numpy cython scipy pandas cyvcf2" RUN apt-get update \ && apt-get install -y --no-install-recommends \ nano ed locales vim-tiny fonts-texgyre \ @@ -229,17 +219,12 @@ RUN apt-get update \ RUN apt-get autoremove -## Install vcfanno version v0.3.2 -RUN wget https://github.com/brentp/vcfanno/releases/download/v0.3.2/vcfanno_linux64 && \ +## Install vcfanno version v0.3.5 +RUN wget 
https://github.com/brentp/vcfanno/releases/download/v0.3.5/vcfanno_linux64 && \ mv vcfanno_linux64 vcfanno && \ mv vcfanno /usr/local/bin && \ chmod 755 /usr/local/bin/vcfanno -## Install Ensembl's Vcf-validator -RUN wget https://github.com/EBIvariation/vcf-validator/releases/download/v0.9.3/vcf_validator_linux && \ -mv vcf_validator_linux /usr/local/bin/vcf_validator && \ -chmod 755 /usr/local/bin/vcf_validator - USER root WORKDIR / @@ -259,20 +244,22 @@ RUN apt-get update \ USER root WORKDIR / +USER root ## vt - variant tool set + vcf2tsvpy - install using Conda ## primary use of vt in GVANNO: decomposition of multiallelic variants in a VCF file # install Conda -RUN wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh \ +RUN wget http://repo.continuum.io/miniconda/Miniconda3-py37_4.8.2-Linux-x86_64.sh -O miniconda.sh \ && chmod 0755 miniconda.sh RUN ["/bin/bash", "-c", "/miniconda.sh -b -p /conda"] RUN rm miniconda.sh # update conda & install vt + vcf2tsvpy -RUN /conda/bin/conda update conda -RUN /conda/bin/conda update python -RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy python=3.6 +#RUN /conda/bin/conda update conda +#RUN /conda/bin/conda update python +#RUN /conda/bin/conda install python=3.6 +RUN /conda/bin/conda install -c bioconda vt vcf2tsvpy bcftools ## Clean Up RUN apt-get clean autoclean diff --git a/src/buildDocker.sh b/src/buildDocker.sh index 8962fd6..5e2522b 100755 --- a/src/buildDocker.sh +++ b/src/buildDocker.sh @@ -3,5 +3,5 @@ tar czvfh gvanno.tgz gvanno/ echo "Build the Docker Image" TAG=`date "+%Y%m%d"` -docker build -t sigven/gvanno:$TAG --rm=true . +docker build --no-cache -t sigven/gvanno:$TAG --rm=true . 
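The buildDocker.sh change above keeps the date-based tagging scheme (`TAG=$(date "+%Y%m%d")`) while adding `--no-cache`. For completeness, the same tagging logic could be driven from Python, consistent with the rest of the workflow; `build_image` and its `dry_run` switch are illustrative, not part of gvanno:

```python
import datetime
import subprocess

def build_image(repo='sigven/gvanno', context='.', dry_run=True):
    """Build the Docker image with a YYYYMMDD date tag, as buildDocker.sh does."""
    tag = datetime.date.today().strftime('%Y%m%d')
    cmd = ['docker', 'build', '--no-cache', '-t', f'{repo}:{tag}', '--rm=true', context]
    if dry_run:
        # Return the command string instead of invoking docker
        return ' '.join(cmd)
    subprocess.run(cmd, check=True)
    return f'{repo}:{tag}'
```

Date tags make it easy to line up an image with the data bundle release it was built against, at the cost of not being semantic versions.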
diff --git a/src/gvanno/gvanno_finalize.py b/src/gvanno/gvanno_finalize.py new file mode 100755 index 0000000..fa27cc6 --- /dev/null +++ b/src/gvanno/gvanno_finalize.py @@ -0,0 +1,32 @@ +#!/usr/bin/env python + +import os,re +import argparse + +from lib.gvanno.variant import append_annotations, clean_annotations +from lib.gvanno.utils import getlogger + + +def __main__(): + parser = argparse.ArgumentParser(description='Finalize gvanno-annotated TSV file - append final annotations') + parser.add_argument('gvanno_db_dir', help='Assembly-specific directory with gvanno data bundle files') + parser.add_argument('tsv_file_in', help='vcf2tsv-converted VCF file with gvanno-annotated variants (SNVs/InDels)') + parser.add_argument('tsv_file_out', help='TSV file with cleaned gvanno-annotated variants (SNVs/InDels)') + parser.add_argument('genome_assembly', help='Genome assembly') + parser.add_argument('sample_id', help='Sample identifier') + parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s") + args = parser.parse_args() + + logger = getlogger('gvanno-finalize') + + arg_dict = vars(args) + + variant_set = \ + append_annotations( + arg_dict['tsv_file_in'], gvanno_db_dir = arg_dict['gvanno_db_dir'], logger = logger) + variant_set = clean_annotations(variant_set, arg_dict['sample_id'], arg_dict['genome_assembly'], logger = logger) + variant_set.fillna('.').to_csv(arg_dict['tsv_file_out'], sep="\t", compression="gzip", index=False) + + +if __name__=="__main__": __main__() + diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py index bd57c2e..7f550d4 100755 --- a/src/gvanno/gvanno_summarise.py +++ b/src/gvanno/gvanno_summarise.py @@ -3,155 +3,200 @@ import csv import re import argparse -from cyvcf2 import VCF, Writer -import gzip +import cyvcf2 import os -import annoutils -import oncogenicity -logger = annoutils.getlogger('gvanno-gene-annotate') -csv.field_size_limit(500 * 1024 * 1024) +from 
lib.gvanno.annoutils import read_infotag_file, make_transcript_xref_map, read_genexref_namemap, write_pass_vcf, map_regulatory_variant_annotations +from lib.gvanno.vep import parse_vep_csq +from lib.gvanno.dbnsfp import vep_dbnsfp_meta_vcf, map_variant_effect_predictors +from lib.gvanno.oncogenicity import assign_oncogenicity_evidence +from lib.gvanno.mutation_hotspot import load_mutation_hotspots, match_csq_mutation_hotspot +from lib.gvanno.utils import error_message, check_subprocess, getlogger +from lib.gvanno.vep import parse_vep_csq +csv.field_size_limit(500 * 1024 * 1024) def __main__(): - - parser = argparse.ArgumentParser(description='Gene annotations from gvanno pipeline (SNVs/InDels)') - parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)') - parser.add_argument('gvanno_db_dir',help='gvanno data directory') - parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)') - parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)') - parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)') - parser.add_argument('debug',default=0,type=int,help='Output extensive debugging information') - args = parser.parse_args() - - extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.oncogenicity_annotation, args.regulatory_annotation, args.debug) - -def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, oncogenicity_annotation = 0, regulatory_annotation = 0, debug = 0): - """ - Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from - 1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc. - 2. Gene annotations, e.g. known oncogenes/tumor suppressors - 3. Cancer hotspot mutation sites - 4. 
Variant effect predictions - """ - - ## read VEP and PCGR tags to be appended to VCF file - vcf_infotags_meta = annoutils.read_infotag_file(logger, os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) - gvanno_xref_map = annoutils.read_genexref_namemap(logger, os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref_namemap.tsv')) - out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf) - - meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta) - dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms'] - vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap'] - vcf = VCF(query_vcf) - for tag in vcf_infotags_meta: - if lof_prediction == 0 and regulatory_annotation == 0: - if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - elif lof_prediction == 1 and regulatory_annotation == 0: - if not tag.startswith('REGULATORY_'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - elif lof_prediction == 0 and regulatory_annotation == 1: - if not tag.startswith('LoF'): - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - else: - vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']), \ - 'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])}) - - - w = Writer(out_vcf, vcf) - current_chrom = None - num_chromosome_records_processed = 0 - num_records_filtered = 0 - - cancer_hotspots = annoutils.read_cancer_hotspots(logger, 
os.path.join(gvanno_db_directory,'cancer_hotspots', 'cancer_hotspots.tsv')) - - vcf_info_element_types = {} - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element: - identifier = str(header_element['ID']) - fieldtype = str(header_element['Type']) - vcf_info_element_types[identifier] = fieldtype - - for rec in vcf: - if current_chrom is None: - current_chrom = str(rec.CHROM) - num_chromosome_records_processed = 0 - else: - if str(rec.CHROM) != current_chrom: - logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom)) + parser = argparse.ArgumentParser(description='Summarise VEP annotations (gene/variant) from gvanno pipeline (SNVs/InDels)') + parser.add_argument('vcf_file_in', help='Bgzipped VCF file with VEP-annotated query variants (SNVs/InDels)') + parser.add_argument('vcf_file_out', help='Bgzipped VCF file with extended VEP-annotated query variants (SNVs/InDels)') + parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)') + parser.add_argument('oncogenicity_annotation',default=0,type=int,help='Include oncogenicity annotation (0/1)') + parser.add_argument('vep_pick_order', default="mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length", + help=f"Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence") + parser.add_argument('gvanno_db_dir',help='gvanno data directory') + parser.add_argument('--compress_output_vcf', action="store_true", default=False, help="Compress output VCF file") + + parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s") + args = parser.parse_args() + + logger = getlogger('gvanno-gene-annotate') + + arg_dict = vars(args) + + 
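Both the old and the rewritten record loop track per-chromosome progress the same way: remember the current chromosome, and emit a summary each time it changes (plus once at the end). A minimal stand-alone sketch of that bookkeeping, using plain `(chrom, pos)` tuples in place of cyvcf2 records:

```python
def summarise_per_chromosome(records, log=print):
    """Log a per-chromosome variant count while streaming records in chromosome order."""
    current_chrom, n = None, 0
    for chrom, _pos in records:
        if current_chrom is not None and chrom != current_chrom:
            # Chromosome changed: flush the count for the one just finished
            log(f'Completed summary of functional annotations for {n} variants on chr{current_chrom}')
            n = 0
        current_chrom = chrom
        n += 1
    if current_chrom is not None:
        # Flush the final chromosome
        log(f'Completed summary of functional annotations for {n} variants on chr{current_chrom}')
```

This only works because VCF records arrive sorted by chromosome; an unsorted stream would log the same chromosome multiple times.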
extend_vcf_annotations(arg_dict, logger) + +def extend_vcf_annotations(arg_dict, logger): + """ + Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from + 1. CSQ elements for the primary (i.e. "picked") gene transcript consequence from VEP, e.g. SYMBOL, Feature, Gene, Consequence etc. + 2. Gene annotations (GENE_TRANSCRIPT_XREF), + 3. Variant effect predictions - dbNSFP + + Moreover, it performs two important matching procedures, using + 4. Information from VEP's CSQ information (HGVSp/HGVSc) to match known mutation hotspots in cancer + + Finally, it assesses somatic variant oncogenicity, using + 5. Gene annotations (tumor suppressor, oncogene) and variant annotations (loss-of-function, gnomAD variant frequencies, variant effect predictions). + Variant oncogenicity levels are provided for all variants using a recommended five-level scheme ("Oncogenic", "Likely oncogenic", "VUS", "Likely Benign", "Benign") + - Recommended scoring scheme for variant oncogenicity classification outlined by VICC/ClinGen consortia (Horak et al., Genet Med, 2022) + + List of VCF INFO tags appended by this procedure is defined by the 'infotags' files in the gvanno_db_dir + """ + + vcf_infotags = {} + + vcf_infotags['other'] = read_infotag_file(os.path.join(arg_dict['gvanno_db_dir'], 'vcf_infotags_gvanno.tsv'), scope = "gvanno") + vcf_infotags['vep'] = read_infotag_file(os.path.join(arg_dict['gvanno_db_dir'], 'vcf_infotags_vep.tsv'), scope = "vep") + vcf_infotags['other'].update(vcf_infotags['vep']) + vcf_info_metadata = vcf_infotags['other'] + + gene_transcript_xref_map = read_genexref_namemap( + os.path.join(arg_dict['gvanno_db_dir'], 'gene','tsv','gene_transcript_xref', 'gene_transcript_xref_bedmap.tsv.gz'), logger) + cancer_hotspots = load_mutation_hotspots( + os.path.join(arg_dict['gvanno_db_dir'], 'misc','tsv','hotspot', 'hotspot.tsv.gz'), logger) + + out_vcf = re.sub(r'(\.gz)$','',arg_dict['vcf_file_out']) + + meta_vep_dbnsfp_info = 
vep_dbnsfp_meta_vcf(arg_dict['vcf_file_in'], vcf_info_metadata) + dbnsfp_prediction_algorithms = meta_vep_dbnsfp_info['dbnsfp_prediction_algorithms'] + vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap'] + + vcf = cyvcf2.VCF(arg_dict['vcf_file_in']) + for tag in sorted(vcf_info_metadata): + if arg_dict['regulatory_annotation'] == 0: + if not tag.startswith('REGULATORY_'): + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_info_metadata[tag]['description']),'Type':str(vcf_info_metadata[tag]['type']), 'Number': str(vcf_info_metadata[tag]['number'])}) + else: + vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_info_metadata[tag]['description']),'Type':str(vcf_info_metadata[tag]['type']), 'Number': str(vcf_info_metadata[tag]['number'])}) + + w = cyvcf2.Writer(arg_dict['vcf_file_out'], vcf) + current_chrom = None + num_chromosome_records_processed = 0 + + vcf_info_element_types = {} + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element and 'HeaderType' in header_element and 'Type' in header_element: + identifier = str(header_element['ID']) + fieldtype = str(header_element['Type']) + vcf_info_element_types[identifier] = fieldtype + + vars_no_csq = list() + for rec in vcf: + alt_allele = ','.join(rec.ALT) + pos = rec.start + 1 + variant_id = f"g.{rec.CHROM}:{pos}{rec.REF}>{alt_allele}" + if current_chrom is None: current_chrom = str(rec.CHROM) num_chromosome_records_processed = 0 - if rec.INFO.get('CSQ') is None: - num_records_filtered = num_records_filtered + 1 - #logger.warning('Variant record ' + str(variant_id) + ' does not have CSQ tag from Variant Effect Predictor (--vep_skip_intergenic or --vep_coding_only turned ON?) 
- variant will be skipped') - continue - num_chromosome_records_processed += 1 - gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF") - - if regulatory_annotation == 1: - csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ') - - if 'picked_gene_csq' in csq_record_results_all: - vep_csq_records_all = csq_record_results_all['picked_gene_csq'] - rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all) - - vep_csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = debug) - - - principal_hgvsp = '.' - principal_hgvsc = '.' - if 'picked_csq' in vep_csq_record_results: - csq_record = vep_csq_record_results['picked_csq'] - for k in csq_record: - if k in vcf_info_element_types: - if vcf_info_element_types[k] == "Flag": - #rec.INFO[k] = False - if csq_record[k] == "1": - rec.INFO[k] = True - else: - if not csq_record[k] is None: - rec.INFO[k] = csq_record[k] - - if k == 'HGVSp_short': - principal_hgvsp = csq_record[k] + else: + if str(rec.CHROM) != current_chrom: + if not current_chrom is None: + logger.info(f"Completed summary of functional annotations for {num_chromosome_records_processed} variants on chr{current_chrom}") + current_chrom = str(rec.CHROM) + num_chromosome_records_processed = 0 + if rec.INFO.get('CSQ') is None: + + vars_no_csq.append(variant_id) + continue + + num_chromosome_records_processed += 1 + transcript_xref_map = make_transcript_xref_map(rec, gene_transcript_xref_map, xref_tag = "GENE_TRANSCRIPT_XREF") + + vep_csq_record_results = {} + vep_csq_record_results = \ + parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, arg_dict['vep_pick_order'], + logger, pick_only = False, csq_identifier = 'CSQ') + + if 'picked_gene_csq' in vep_csq_record_results and bool(arg_dict['regulatory_annotation']) 
is True: + rec.INFO['REGULATORY_ANNOTATION'] = map_regulatory_variant_annotations( + vep_csq_record_results['picked_gene_csq']) + + principal_csq_properties = {} + principal_csq_properties['hgvsp'] = '.' + principal_csq_properties['hgvsc'] = '.' + principal_csq_properties['entrezgene'] = '.' + principal_csq_properties['exon'] = '.' + principal_csq_properties['codon'] = '.' + principal_csq_properties['lof'] = '.' + + if 'picked_csq' in vep_csq_record_results: + csq_record = vep_csq_record_results['picked_csq'] + for k in csq_record: + if k in vcf_info_element_types: + if vcf_info_element_types[k] == "Flag" and csq_record[k] == "1": + rec.INFO[k] = True + else: + if not csq_record[k] is None: + rec.INFO[k] = csq_record[k] + + if k == 'HGVSp_short': + principal_csq_properties['hgvsp'] = csq_record[k] + if re.match(r'^(p.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_csq_properties['hgvsp']): + codon_match = re.findall(r'[A-Z][0-9]{1,}', principal_csq_properties['hgvsp']) + if len(codon_match) == 1: + principal_csq_properties['codon'] = 'p.' 
+ codon_match[0] - if k == 'HGVSc': - principal_hgvsc = csq_record[k].split(':')[1] - #else: - # print("missing\t" + str(k)) - - if 'all_csq' in vep_csq_record_results: - rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq']) - annoutils.map_cancer_hotspots(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_hgvsp, principal_hgvsc) - - if not rec.INFO.get('DBNSFP') is None: - annoutils.map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) - - if oncogenicity_annotation == 1: - oncogenicity.assign_oncogenicity_evidence(rec, tumortype = "Any") - w.write_record(rec) - w.close() - logger.info('Completed summary of functional annotations for ' + str(num_chromosome_records_processed) + ' variants on chromosome ' + str(current_chrom)) - vcf.close() - logger.info("Number of variant calls filtered by VEP (No CSQ tag, '--vep_coding_only' / '--vep_skip_intergenic'): " + str(num_records_filtered)) - - if os.path.exists(out_vcf): - if os.path.getsize(out_vcf) > 0: - os.system('bgzip -f ' + str(out_vcf)) - os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') - annotated_vcf = out_vcf + '.gz' - annoutils.write_pass_vcf(annotated_vcf, logger) - else: - annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) - else: - annoutils.error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4 (gvanno-writer)', logger) - -if __name__=="__main__": __main__() - - - + if k == 'HGVSc': + principal_csq_properties['hgvsc'] = csq_record[k].split(':')[1] + + if k == 'ENTREZGENE': + principal_csq_properties['entrezgene'] = csq_record[k] + + if k == 'LOSS_OF_FUNCTION': + principal_csq_properties['lof'] = csq_record[k] + + if k == 'EXON': + if "/" in csq_record[k]: + principal_csq_properties['exon'] = csq_record[k].split('/')[0] + + if 'all_csq' in vep_csq_record_results: + rec.INFO['VEP_ALL_CSQ'] = ','.join(vep_csq_record_results['all_csq']) + 
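As an aside, the codon extraction from the picked consequence's short HGVSp notation (seen just above) can be sketched as a standalone helper. This is a minimal, hypothetical re-implementation for illustration only (the regexes mirror the snippet, with the dot escaped; `extract_codon` is not a name from the codebase):

```python
import re

def extract_codon(hgvsp_short: str) -> str:
    """Pull the reference amino acid + position (e.g. 'p.V600') out of a
    short HGVS protein notation (e.g. 'p.V600E'). Returns '.' when the
    string does not look like a simple single-codon substitution."""
    if re.match(r'^(p\.[A-Z][0-9]{1,}[A-Za-z]{1,})', hgvsp_short):
        codon_match = re.findall(r'[A-Z][0-9]{1,}', hgvsp_short)
        # exactly one codon match -> unambiguous substitution site
        if len(codon_match) == 1:
            return 'p.' + codon_match[0]
    return '.'
```

Note that multi-codon events such as `p.L747_T751del` yield more than one match and therefore fall through to `'.'`, which mirrors the behavior of the hunk above.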
match_csq_mutation_hotspot(vep_csq_record_results['all_csq'], cancer_hotspots, rec, principal_csq_properties) + + if not rec.INFO.get('DBNSFP') is None: + map_variant_effect_predictors(rec, dbnsfp_prediction_algorithms) + + if arg_dict['oncogenicity_annotation'] == 1: + assign_oncogenicity_evidence(rec, tumortype = "Any") + + if "GENE_TRANSCRIPT_XREF" in vcf_info_element_types: + gene_xref_tag = rec.INFO.get('GENE_TRANSCRIPT_XREF') + if not gene_xref_tag is None: + del rec.INFO['GENE_TRANSCRIPT_XREF'] + w.write_record(rec) + if vars_no_csq: + logger.warning(f"There were {len(vars_no_csq)} records with no CSQ tag from VEP (was --vep_no_intergenic flag set?). Skipping them and showing (up to) the first 100:") + print('----') + print(', '.join(vars_no_csq[:100])) + print('----') + w.close() + if current_chrom is not None: + logger.info(f"Completed summary of functional annotations for {num_chromosome_records_processed} variants on chr{current_chrom}") + vcf.close() + + if os.path.exists(arg_dict['vcf_file_out']): + if os.path.getsize(arg_dict['vcf_file_out']) > 0: + if arg_dict['compress_output_vcf'] is True: + check_subprocess(logger, f'bgzip -f {out_vcf}', debug=arg_dict['debug']) + check_subprocess(logger, f'tabix -f -p vcf {out_vcf}.gz', debug=arg_dict['debug']) + summarized_vcf = f'{arg_dict["vcf_file_out"]}.gz' + write_pass_vcf(summarized_vcf, logger) + else: + error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4', logger) + else: + error_message('No remaining PASS variants found in query VCF - exiting and skipping STEP 4', logger) + +if __name__=="__main__": + __main__() diff --git a/src/gvanno/gvanno_validate_input.py b/src/gvanno/gvanno_validate_input.py index 8701a64..1d77835 100755 --- a/src/gvanno/gvanno_validate_input.py +++ b/src/gvanno/gvanno_validate_input.py @@ -7,118 +7,145 @@ import subprocess import logging import sys -import annoutils import pandas as np from cyvcf2 import VCF +from lib.gvanno.vcf import 
check_existing_vcf_info_tags +from lib.gvanno.annoutils import read_infotag_file +from lib.gvanno.utils import getlogger, random_id_generator, check_subprocess, remove_file + + def __main__(): parser = argparse.ArgumentParser(description='Verify input data for gvanno') parser.add_argument('gvanno_dir',help='Docker location of gvanno base directory with accompanying data directory, e.g. /data') parser.add_argument('input_vcf', help='VCF input file with query variants (SNVs/InDels)') - parser.add_argument('vcf_validation',type=int, help="Perform VCF validation with Ensembl's vcf-validator") + parser.add_argument('validated_vcf', help="Validated VCF file with query variants (SNVs/InDels)") parser.add_argument('genome_assembly',help='grch37 or grch38') + parser.add_argument('sample_id',help='Sample identifier') parser.add_argument('--output_dir', dest='output_dir', help='Output directory', default='/workdir/output') - - + parser.add_argument('--debug', action='store_true', help="Print debug messages") args = parser.parse_args() - ret = validate_gvanno_input(args.gvanno_dir, args.input_vcf, args.vcf_validation, args.genome_assembly, args.output_dir) + ret = validate_gvanno_input(args.gvanno_dir, args.input_vcf, args.validated_vcf, args.sample_id, args.genome_assembly, args.output_dir, args.debug) if ret != 0: sys.exit(-1) -def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, logger): - - """ - Function that compares the INFO tags in the query VCF and the INFO tags generated by gvanno - If any coinciding tags, an error will be returned - """ - - gvanno_infotags_desc = annoutils.read_infotag_file(logger, os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) - - vcf = VCF(input_vcf) - logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') - ret = 1 - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): - 
if header_element['HeaderType'] == 'INFO': - if header_element['ID'] in gvanno_infotags_desc.keys(): - err_msg = 'INFO tag ' + str(header_element['ID']) + ' in the query VCF coincides with a VCF annotation tag produced by gvanno - please remove or rename this tag in your query VCF' - return annoutils.error_message(err_msg, logger) - - logger.info('No query VCF INFO tags coincide with gvanno INFO tags') - return ret -def simplify_vcf(input_vcf, vcf, output_dir, logger): - - """ - Function that performs tre things on the validated input VCF: - 1. Strip of any genotype data - 2. If VCF have variants with multiple alternative alleles ("multiallelic", e.g. 'A,T'), these are decomposed into variants with a single alternative allele - 3. Final VCF file is sorted and indexed (bgzip + tabix) - """ - - input_vcf_gvanno_ready = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.tmp.vcf',os.path.basename(input_vcf))) - input_vcf_gvanno_ready_decomposed = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)','.gvanno_ready.vcf',os.path.basename(input_vcf))) - - multiallelic_alt = 0 - for rec in vcf: - POS = rec.start + 1 - alt = ",".join(str(n) for n in rec.ALT) - if len(rec.ALT) > 1: - logger.warning("Multiallelic site detected:" + str(rec.CHROM) + '\t' + str(POS) + '\t' + str(rec.REF) + '\t' + str(alt)) - multiallelic_alt = 1 - - command_vcf_sample_free1 = 'egrep \'^#\' ' + str(input_vcf) + ' > ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free2 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep \'^[0-9]\' | sort -k1,1n -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free3 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free4 = 'egrep -v \'^#\' ' + str(input_vcf) + ' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep -v \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' 
+ str(input_vcf_gvanno_ready) - if input_vcf.endswith('.gz'): - command_vcf_sample_free1 = 'bgzip -dc ' + str(input_vcf) + ' | egrep \'^#\' > ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free2 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep \'^[0-9]\' | sort -k1,1n -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free3 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - command_vcf_sample_free4 = 'bgzip -dc ' + str(input_vcf) + ' | egrep -v \'^#\' | sed \'s/^chr//\' | egrep -v \'^[0-9]\' | egrep -v \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3 -k4,4 >> ' + str(input_vcf_gvanno_ready) - - os.system(command_vcf_sample_free1) - os.system(command_vcf_sample_free2) - os.system(command_vcf_sample_free3) - os.system(command_vcf_sample_free4) - - if multiallelic_alt == 1: - logger.info('Decomposing multi-allelic sites in input VCF file using \'vt decompose\'') - command_decompose = 'vt decompose -s ' + str(input_vcf_gvanno_ready) + ' > ' + str(input_vcf_gvanno_ready_decomposed) + ' 2> ' + os.path.join(output_dir, 'decompose.log') - os.system(command_decompose) - else: - command_copy = 'cp ' + str(input_vcf_gvanno_ready) + ' ' + str(input_vcf_gvanno_ready_decomposed) - os.system(command_copy) - os.system('bgzip -f ' + str(input_vcf_gvanno_ready_decomposed)) - os.system('tabix -p vcf ' + str(input_vcf_gvanno_ready_decomposed) + '.gz') - os.system('rm -f ' + str(input_vcf_gvanno_ready) + ' /workdir/output/decompose.log') - -def validate_gvanno_input(gvanno_directory, input_vcf, vcf_validation, genome_assembly, output_dir): + +def simplify_vcf(input_vcf, validated_vcf, vcf, output_dir, sample_id, logger, debug): + """ + input_vcf: path to input VCF + validated_vcf: path to validated VCF + vcf: parsed cyvcf2 object + Function that performs the following on the validated input VCF: + 1. 
Strip off any genotype data + 2. If VCF has variants with multiple alternative alleles ("multiallelic", e.g. 'A,T'), + these are decomposed into variants with a single alternative allele + 3. Final VCF file is sorted and indexed (bgzip + tabix) + """ + + random_id = random_id_generator(15) + + temp_files = {} + temp_files['vcf_1'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_1.vcf') + temp_files['vcf_2'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_2.vcf.gz') + temp_files['vcf_3'] = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}_3.vcf.gz') + bcftools_simplify_log = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.bcftools.{random_id}.log') + vt_decompose_log = \ + os.path.join(output_dir, f'{sample_id}.gvanno_validate.vt_decompose.{random_id}.log') + + multiallelic_list = list() + for rec in vcf: + POS = rec.start + 1 + alt = ",".join(str(n) for n in rec.ALT) + if len(rec.ALT) > 1: + variant_id = f"{rec.CHROM}:{POS}_{rec.REF}->{alt}" + multiallelic_list.append(variant_id) + + logger.info('Extracting variants on autosomal/sex/mito chromosomes only (1-22, X, Y, M/MT) with bcftools') + # bgzip + tabix required for sorting + cmd_vcf1 = f'bcftools view {input_vcf} | bgzip -cf > {temp_files["vcf_2"]} && tabix -p vcf {temp_files["vcf_2"]} && ' + \ + f'bcftools sort --temp-dir {output_dir} -Oz {temp_files["vcf_2"]} > {temp_files["vcf_3"]} 2> {bcftools_simplify_log} && ' + \ + f'tabix -p vcf {temp_files["vcf_3"]}' + # Keep only autosomal/sex/mito chrom (handle hg38 and hg19), remove FORMAT metadata lines, keep cols 1-8, sub chr prefix + chrom_to_keep = [str(x) for x in [*range(1,23), 'X', 'Y', 'M', 'MT']] + chrom_to_keep = ','.join([*['chr' + chrom for chrom in chrom_to_keep], *[chrom for chrom in chrom_to_keep]]) + cmd_vcf2 = f'bcftools view --regions {chrom_to_keep} {temp_files["vcf_3"]} | egrep -v \'^##FORMAT=\' ' + \ + f'| cut -f1-8 | sed \'s/^chr//\' > 
{temp_files["vcf_1"]}' + + check_subprocess(logger, cmd_vcf1, debug) + check_subprocess(logger, cmd_vcf2, debug) + + if multiallelic_list: + logger.warning(f"There were {len(multiallelic_list)} multiallelic sites detected. Showing (up to) the first 100:") + print('----') + print(', '.join(multiallelic_list[:100])) + print('----') + logger.info('Decomposing multi-allelic sites in input VCF file using \'vt decompose\'') + command_decompose = f'vt decompose -s {temp_files["vcf_1"]} > {validated_vcf} 2> {vt_decompose_log}' + check_subprocess(logger, command_decompose, debug) + else: + logger.info('All sites seem to be decomposed - skipping decomposition!') + check_subprocess(logger, f'cp {temp_files["vcf_1"]} {validated_vcf}', debug) + + keep_uncompressed = False + # need to keep uncompressed copy for vcf2maf.pl if selected + bgzip_cmd = f"bgzip -cf {validated_vcf} > {validated_vcf}.gz" if keep_uncompressed else f"bgzip -f {validated_vcf}" + check_subprocess(logger, bgzip_cmd, debug) + check_subprocess(logger, f'tabix -p vcf {validated_vcf}.gz', debug) + + if os.path.exists(f'{validated_vcf}.gz') and os.path.getsize(f'{validated_vcf}.gz') > 0: + vcf = VCF(f'{validated_vcf}.gz') + i = 0 + for rec in vcf: + i = i + 1 + if len(vcf.seqnames) == 0 or i == 0: + logger.info('') + logger.info("Input VCF contains NO valid variants after VCF cleaning - quitting workflow") + logger.info('') + exit(1) + + if not debug: + remove_file(temp_files["vcf_1"]) + remove_file(temp_files["vcf_2"]) + remove_file(temp_files["vcf_3"]) + remove_file(temp_files["vcf_2"] + str('.tbi')) + remove_file(temp_files["vcf_3"] + str('.tbi')) + remove_file(bcftools_simplify_log) + remove_file(vt_decompose_log) + +def validate_gvanno_input(gvanno_directory, input_vcf, validated_vcf, sample_id, genome_assembly, output_dir, debug): """ Function that reads the input file to gvanno (VCF file) and performs the following checks: - 1. 
Check that VCF file is properly formatted (according to EBIvariation/vcf-validator - VCF v4.2) - 2. Check that no INFO annotation tags in the query VCF coincides with those generated by gvanno - 3. Check that if VCF have variants with multiple alternative alleles (e.g. 'A,T') run vt decompose - 4. Any genotype data from VCF input file is stripped, and the resulting VCF file is sorted and indexed (bgzip + tabix) + 1. Check that no INFO annotation tags in the query VCF coincide with those generated by gvanno + 2. Check that, if the VCF has variants with multiple alternative alleles (e.g. 'A,T'), vt decompose is run + 3. Any genotype data from the VCF input file is stripped, and the resulting VCF file is sorted and indexed (bgzip + tabix) """ - logger = annoutils.getlogger('gvanno-validate-input') + logger = getlogger('gvanno-validate-input') if not input_vcf == 'None': - if vcf_validation == 1: - valid_vcf = annoutils.is_valid_vcf(input_vcf, output_dir, logger, False) - if valid_vcf == -1: - return -1 - else: - logger.info('Skipping validation of VCF file - as provided by option --no_vcf_validate') - tag_check = check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, logger) + vcf_object = VCF(input_vcf) + + ## Check that VCF does not contain INFO tags that will be appended with gvanno annotation + vcf_infotags = {} + vcf_infotags['gvanno'] = read_infotag_file( + os.path.join(gvanno_directory,'data',genome_assembly, 'vcf_infotags_gvanno.tsv'), scope = "gvanno") + vcf_infotags['vep'] = read_infotag_file( + os.path.join(gvanno_directory,'data',genome_assembly, 'vcf_infotags_vep.tsv'), scope = "vep") + vcf_infotags['gvanno'].update(vcf_infotags['vep']) + vcf_tags_gvanno = vcf_infotags['gvanno'] + + tag_check = check_existing_vcf_info_tags(vcf_object, vcf_tags_gvanno, logger) if tag_check == -1: - return -1 + return -1 vcf = VCF(input_vcf) - simplify_vcf(input_vcf, vcf, output_dir, logger) + simplify_vcf(input_vcf, validated_vcf, vcf, output_dir, sample_id, 
logger, debug) return 0 diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index d972a5d..1b5bfc7 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -1,138 +1,167 @@ #!/usr/bin/env python import argparse -from cyvcf2 import VCF import random -import annoutils -import os -import re -import sys +import re, os +import glob -logger = annoutils.getlogger('gvanno-vcfanno') +from lib.gvanno.vcf import get_vcf_info_tags, print_vcf_header +from lib.gvanno.utils import check_subprocess, random_id_generator, getlogger, remove_file +from lib.gvanno.annoutils import read_vcfanno_tag_file def __main__(): - parser = argparse.ArgumentParser(description='Run brentp/vcfanno - annotate a VCF file against multiple VCF files in parallel', formatter_class=argparse.ArgumentDefaultsHelpFormatter) - parser.add_argument('query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)') - parser.add_argument('out_vcf', help='Output VCF file with appended annotations from multiple VCF files') - parser.add_argument('gvanno_db_dir', help='gvanno data directory') - parser.add_argument('--num_processes', help="Number of processes vcfanno can use during annotation", default=4) - parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") - parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar") - parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions") - parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)") - parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)") - - - args = 
parser.parse_args() - query_info_tags = get_vcf_info_tags(args.query_vcf) - vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt' - vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml' - print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False) - run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas) - - -def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource): - for t in datasource_info_tags: - if t in query_info_tags: - logger.warning("Query VCF has INFO tag " + str(t) + ' - this is also present in the ' + str(datasource) + ' VCF/BED annotation file. This tag will be overwritten if not renamed in the query VCF') - append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname) - append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file) - -def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref, gwas): - """ - Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files - """ - clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP", - "CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT", - "CLINVAR_REVIEW_STATUS_STARS","CLINVAR_CLASSIFICATION"] - dbnsfp_info_tags = ["DBNSFP"] - gvanno_xref_info_tags = ["GVANNO_XREF"] - gwas_info_tags = ["GWAS_HIT"] - ncer_info_tags = ["NCER_PERCENTILE"] - - if clinvar is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, 
clinvar_info_tags, query_info_tags, "clinvar") - if dbnsfp is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, dbnsfp_info_tags, query_info_tags, "dbnsfp") - if gvanno_xref is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, gvanno_xref_info_tags, query_info_tags, "gvanno_xref") - if gwas is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, gwas_info_tags, query_info_tags, "gwas") - if ncer is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, ncer_info_tags, query_info_tags, "ncer") - - out_vcf_vcfanno_unsorted1 = output_vcf + '.tmp.unsorted.1' - query_prefix = re.sub('\.vcf.gz$','',query_vcf) - print_vcf_header(query_vcf, vcfheader_file, chromline_only = True) - command1 = "vcfanno -p=" + str(num_processes) + " " + str(vcfanno_conf_fname) + " " + str(query_vcf) + " > " + str(out_vcf_vcfanno_unsorted1) + " 2> " + str(query_prefix) + '.vcfanno.log' - os.system(command1) - - os.system('cat ' + str(vcfheader_file) + ' > ' + str(output_vcf)) - os.system('cat ' + str(out_vcf_vcfanno_unsorted1) + ' | grep -v \'^#\' >> ' + str(output_vcf)) - os.system('rm -f ' + str(output_vcf) + '.tmp*') - os.system('bgzip -f ' + str(output_vcf)) - os.system('tabix -f -p vcf ' + str(output_vcf) + '.gz') - return 0 - -def append_to_vcf_header(gvanno_db_directory, datasource, vcfheader_file): - """ - Function that appends the VCF header information for a given 'datasource' (containing INFO tag formats/descriptions, and datasource version) - """ - vcf_info_tags_file = str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcfanno.vcf_info_tags.txt' - os.system('cat ' + str(vcf_info_tags_file) + ' >> ' + str(vcfheader_file)) - - -def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, vcfanno_conf_fname): - """ - Function that appends 
data to a vcfanno conf file ('vcfanno_conf_fname') according to user-defined ('datasource'). - The datasource defines the set of tags that will be appended during annotation - """ - fh = open(vcfanno_conf_fname,'a') - if datasource != 'gvanno_xref' and datasource != 'ncer': - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcf.gz"\n') - fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]' - ops = ['concat'] * len(datasource_info_tags) - ops_string = 'ops=["' + '","'.join(ops) + '"]' - fh.write(fields_string + '\n') - fh.write(ops_string + '\n\n') - else: - if datasource == 'gvanno_xref' or datasource == 'ncer': - if datasource == 'gvanno_xref': - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') - fh.write('columns=[4]\n') - names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]' - fh.write(names_string +'\n') - fh.write('ops=["concat"]\n\n') - - else: ## ncER - fh.write('[[annotation]]\n') - fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') - fh.write('columns=[4]\n') - names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]' - fh.write(names_string +'\n') - fh.write('ops=["mean"]\n\n') - fh.close() - return - -def get_vcf_info_tags(vcffile): - vcf = VCF(vcffile) - info_tags = {} - for e in vcf.header_iter(): - header_element = e.info() - if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): - if header_element['HeaderType'] == 'INFO': - info_tags[str(header_element['ID'])] = 1 - - return info_tags - - -def print_vcf_header(query_vcf, vcfheader_file, chromline_only = False): - if chromline_only == True: - os.system('bgzip -dc ' + str(query_vcf) + ' | egrep \'^#\' | egrep \'^#CHROM\' >> ' + str(vcfheader_file)) - else: - os.system('bgzip -dc ' + str(query_vcf) + ' | egrep 
\'^#\' | egrep -v \'^#CHROM\' > ' + str(vcfheader_file)) - -if __name__=="__main__": __main__() + parser = argparse.ArgumentParser(description='Run brentp/vcfanno - annotate a VCF file against multiple VCF files in parallel', + formatter_class=argparse.ArgumentDefaultsHelpFormatter) + parser.add_argument( + 'query_vcf', help='Bgzipped input VCF file with query variants (SNVs/InDels)') + parser.add_argument( + 'out_vcf', help='Output VCF file with appended annotations from multiple VCF files') + parser.add_argument( + 'gvanno_db_dir', help='gvanno assembly-specific data directory') + parser.add_argument( + '--num_processes', help="Number of processes vcfanno can use during annotation", default=4) + parser.add_argument("--clinvar", action="store_true", + help="Annotate VCF with annotations from ClinVar") + parser.add_argument("--ncer", action="store_true", + help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") + parser.add_argument("--dbnsfp", action="store_true", + help="Annotate VCF with annotations from database of non-synonymous functional predictions") + parser.add_argument("--gene_transcript_xref", action="store_true", + help="Annotate VCF with transcript annotations from PCGR (drug targets, actionable genes, cancer gene roles, etc)") + parser.add_argument("--gwas", action="store_true", + help="Annotate VCF against moderate-to-low cancer risk variants, as identified from genome-wide association studies (GWAS)") + parser.add_argument("--debug", action="store_true", default=False, + help="Print full commands to log, keep temporary and log files, default: %(default)s") + + args = parser.parse_args() + + logger = getlogger('gvanno-vcfanno') + + query_info_tags = get_vcf_info_tags(args.query_vcf) + vcfheader_file = args.out_vcf + '.tmp.' 
+ \ + str(random.randrange(0, 10000000)) + '.header.txt' + conf_fname = args.out_vcf + '.tmp.conf.toml' + print_vcf_header(args.query_vcf, vcfheader_file, + logger, chromline_only=False) + + vcfanno_tracks = {} + ## BED + vcfanno_tracks['gene_transcript_xref'] = args.gene_transcript_xref + + ## VCF + vcfanno_tracks['gwas'] = args.gwas + vcfanno_tracks['dbnsfp'] = args.dbnsfp + vcfanno_tracks['clinvar'] = args.clinvar + vcfanno_tracks['ncer'] = args.ncer + vcfanno_tracks['gene_transcript_xref'] = args.gene_transcript_xref + + run_vcfanno(args.num_processes, args.query_vcf, vcfanno_tracks, query_info_tags, vcfheader_file, + args.gvanno_db_dir, conf_fname, args.out_vcf, args.debug, logger) + + +def run_vcfanno(num_processes, query_vcf, vcfanno_tracks, query_info_tags, vcfheader_file, gvanno_db_dir, conf_fname, + output_vcf, debug, logger): + + """ + Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files + """ + + ## Collect metadata (VCF INFO tags) for annotations populated with vcfanno + metadata_vcf_infotags = {} + infotags = {} + + track_file_info = {} + + ## INFO tags used for vcfanno annotation + track_file_info['tags_fname'] = {} + + ## Source file (VCF/BED) used for vcfanno annotation + track_file_info['track_fname'] = {} + + for variant_track in ['clinvar','gwas','dbnsfp']: + track_file_info['tags_fname'][variant_track] = os.path.join(gvanno_db_dir,'variant','vcf', variant_track, f'{variant_track}.vcfanno.vcf_info_tags.txt') + track_file_info['track_fname'][variant_track] = os.path.join(gvanno_db_dir,'variant','vcf', variant_track, f'{variant_track}.vcf.gz') + + track_file_info['tags_fname']['gene_transcript_xref'] = os.path.join(gvanno_db_dir,'gene','bed', 'gene_transcript_xref', 'gene_transcript_xref.vcfanno.vcf_info_tags.txt') + track_file_info['track_fname']['gene_transcript_xref'] = os.path.join(gvanno_db_dir,'gene','bed', 'gene_transcript_xref', 'gene_transcript_xref.bed.gz') + 
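The per-track path bookkeeping being built here (a tags file and a source VCF per variant track under the assembly-specific data directory) can be sketched as a standalone helper. A minimal sketch, assuming the same directory layout as the hunk above; `build_track_paths` is a hypothetical name, not part of the codebase:

```python
import os

def build_track_paths(db_dir, variant_tracks=('clinvar', 'gwas', 'dbnsfp')):
    """Map each VCF-based annotation track to its vcfanno source file and to
    the sidecar file listing the INFO tags that track contributes."""
    paths = {'tags_fname': {}, 'track_fname': {}}
    for track in variant_tracks:
        base = os.path.join(db_dir, 'variant', 'vcf', track)
        # sidecar with the INFO tag definitions appended to the VCF header
        paths['tags_fname'][track] = os.path.join(base, f'{track}.vcfanno.vcf_info_tags.txt')
        # bgzipped VCF that vcfanno reads annotations from
        paths['track_fname'][track] = os.path.join(base, f'{track}.vcf.gz')
    return paths
```

BED-based tracks (`gene_transcript_xref`, `ncer`) live under different subdirectories and are added separately, as in the hunk above.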
track_file_info['tags_fname']['ncer'] = os.path.join(gvanno_db_dir,'misc','bed', 'ncer', 'ncer.vcfanno.vcf_info_tags.txt')
+    track_file_info['track_fname']['ncer'] = os.path.join(gvanno_db_dir,'misc','bed', 'ncer', 'ncer.bed.gz')
+
+
+    for track in track_file_info['tags_fname']:
+
+        if not vcfanno_tracks[track] is True:
+            continue
+
+        infotags_vcfanno = read_vcfanno_tag_file(track_file_info['tags_fname'][track], logger)
+        infotags[track] = infotags_vcfanno.keys()
+        for tag in infotags_vcfanno:
+            if tag in query_info_tags:
+                logger.warning("Query VCF has INFO tag " + str(tag) + ' - this is also present in the ' + str(
+                    track) + ' VCF/BED annotation file. This tag will be overwritten if not renamed in the query VCF')
+            metadata_vcf_infotags[tag] = infotags_vcfanno[tag]
+
+        ## append track to vcfanno configuration file
+        append_to_conf_file(track, infotags[track], track_file_info['track_fname'][track], conf_fname)
+
+        ## Append VCF INFO tags to VCF header file
+        check_subprocess(logger, f'cat {track_file_info["tags_fname"][track]} >> {vcfheader_file}', debug=False)
+
+    random_id = random_id_generator(10)
+    query_prefix = re.sub(r'\.vcf.gz$', '', query_vcf)
+    print_vcf_header(query_vcf, vcfheader_file, logger, chromline_only=True)
+
+    vcfanno_command = (
+        f"vcfanno -p={num_processes} {conf_fname} {query_vcf} > {query_prefix}.{random_id}.tmp.vcfanno.unsorted.vcf 2> "
+        f"{query_prefix}.{random_id}.tmp.vcfanno.log"
+    )
+
+    if debug:
+        logger.info(f"vcfanno command: {vcfanno_command}")
+    check_subprocess(logger, vcfanno_command, debug)
+
+    check_subprocess(
+        logger, f'cat {vcfheader_file} > {output_vcf}', debug=False)
+    check_subprocess(
+        logger, f"cat {query_prefix}.{random_id}.tmp.vcfanno.unsorted.vcf | grep -v '^#' >> {output_vcf}", debug=False)
+    check_subprocess(logger, f'bgzip -f {output_vcf}', debug)
+    check_subprocess(logger, f'tabix -f -p vcf {output_vcf}.gz', debug)
+    if not debug:
+        for intermediate_file in glob.glob(f"{query_prefix}.{random_id}.tmp.vcfanno*"):
+            remove_file(intermediate_file)
+
+    return
+
+
+def append_to_conf_file(datasource, datasource_info_tags, datasource_track_fname, conf_fname):
+    """
+    Function that appends an annotation block to a vcfanno configuration file ('conf_fname') for a user-defined data source ('datasource').
+    The datasource defines the set of tags that will be appended during annotation
+    """
+    fh = open(conf_fname, 'a')
+    fh.write('[[annotation]]\n')
+    fh.write('file="' + str(datasource_track_fname) + '"\n')
+    if datasource == 'ncer' or datasource == 'gerp':
+        fh.write('columns=[4]\n')
+        names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]'
+        fh.write(names_string + '\n')
+        fh.write('ops=["mean"]\n\n')
+    elif datasource == 'gene_transcript_xref' or datasource == 'rmsk':
+        fh.write('columns=[4]\n')
+        names_string = 'names=["' + '","'.join(datasource_info_tags) + '"]'
+        fh.write(names_string + '\n')
+        fh.write('ops=["concat"]\n\n')
+    else:
+        fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]'
+        ops = ['concat'] * len(datasource_info_tags)
+        ops_string = 'ops=["' + '","'.join(ops) + '"]'
+        fh.write(fields_string + '\n')
+        fh.write(ops_string + '\n\n')
+    fh.close()
+    return
+
+
+
+if __name__ == "__main__":
+    __main__()
diff --git a/src/gvanno/gvanno_vep.py b/src/gvanno/gvanno_vep.py
new file mode 100755
index 0000000..ea85f58
--- /dev/null
+++ b/src/gvanno/gvanno_vep.py
@@ -0,0 +1,125 @@
+#!/usr/bin/env python
+
+import os,re
+import argparse
+
+from lib.gvanno.utils import get_loftee_dir, getlogger, check_subprocess
+from lib.gvanno import gvanno_vars
+
+
+def __main__():
+    parser = argparse.ArgumentParser(description='Run VEP on a VCF file (SNVs/InDels)')
+    parser.add_argument('vep_cache_dir', help='Directory with VEP cache files')
+    parser.add_argument('vcf_file_in', help='Bgzipped VCF file with validated query variants (SNVs/InDels)')
+    parser.add_argument('vcf_file_out', help='Bgzipped VCF file with VEP-annotated query variants (SNVs/InDels)')
+    parser.add_argument('genome_assembly', help='Genome assembly')
+    parser.add_argument('vep_pick_order', default="mane_select,mane_plus_clinical,canonical,appris,biotype,ccds,rank,tsl,length",
+                        help="Comma-separated string of ordered transcript/variant properties for selection of primary variant consequence")
+    parser.add_argument('--vep_regulatory', action="store_true", help='Inclusion of VEP regulatory annotations')
+
+    parser.add_argument('--vep_buffer_size', default=1000, type=int, help='Buffer size for VEP')
+    parser.add_argument('--vep_gencode_basic', action="store_true", help='Only consider basic GENCODE transcripts')
+    parser.add_argument('--vep_n_forks', default=4, type=int, help='Number of forks for VEP processing')
+    parser.add_argument('--vep_lof_prediction', action="store_true", help='Perform LoF prediction with the LOFTEE plugin in VEP')
+    parser.add_argument('--vep_coding_only', action="store_true", help="Only consider coding variants")
+    parser.add_argument('--vep_no_intergenic', action="store_true", help="Skip intergenic variants")
+    parser.add_argument("--debug", action="store_true", default=False, help="Print full commands to log, default: %(default)s")
+    args = parser.parse_args()
+
+    logger = getlogger('gvanno-vep')
+
+    arg_dict = vars(args)
+    conf_options = {}
+    conf_options['genome_assembly'] = arg_dict['genome_assembly']
+    conf_options['conf'] = {}
+    conf_options['conf']['vep'] = {}
+    conf_options['conf']['vep']['vep_n_forks'] = arg_dict['vep_n_forks']
+    conf_options['conf']['vep']['vep_pick_order'] = arg_dict['vep_pick_order']
+    conf_options['conf']['vep']['vep_buffer_size'] = arg_dict['vep_buffer_size']
+    conf_options['conf']['vep']['vep_gencode_basic'] = arg_dict['vep_gencode_basic']
+    conf_options['conf']['vep']['vep_regulatory'] = arg_dict['vep_regulatory']
+    conf_options['conf']['vep']['vep_lof_prediction'] = arg_dict['vep_lof_prediction']
+    conf_options['conf']['vep']['vep_coding_only'] = arg_dict['vep_coding_only']
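As an aside to the `append_to_conf_file()` helper shown earlier in this diff: the generic branch writes a TOML-style `[[annotation]]` block that vcfanno consumes. A minimal sketch of that string assembly, with the track file name and INFO tags invented purely for illustration:

```python
def build_vcfanno_block(track_fname, info_tags):
    # Sketch (assumption): mirrors the generic 'else' branch of append_to_conf_file(),
    # which annotates from a VCF track via 'fields' plus one 'concat' op per field.
    lines = ['[[annotation]]', 'file="' + str(track_fname) + '"']
    lines.append('fields = ["' + '","'.join(info_tags) + '"]')
    lines.append('ops=["' + '","'.join(['concat'] * len(info_tags)) + '"]')
    return '\n'.join(lines) + '\n\n'

print(build_vcfanno_block('clinvar.vcf.gz', ['CLINVAR_MSID', 'CLINVAR_CLNSIG']))
```

Appending one such block per data source to a single conf file is what lets a single vcfanno pass annotate the query VCF against all enabled tracks.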
+    conf_options['conf']['vep']['vep_no_intergenic'] = arg_dict['vep_no_intergenic']
+
+    run_vep(arg_dict['vep_cache_dir'], conf_options, arg_dict['vcf_file_in'], arg_dict['vcf_file_out'], logger, arg_dict['debug'])
+
+def run_vep(vep_cache_dir, conf_options, input_vcf, output_vcf, logger, debug = False):
+
+    output_vcf_gz = f'{output_vcf}.gz'
+    genome_assembly = conf_options['genome_assembly']
+
+    fasta_assembly = os.path.join(
+        vep_cache_dir, 'homo_sapiens',
+        f'{gvanno_vars.VEP_VERSION}_{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}',
+        f'Homo_sapiens.{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}.dna.primary_assembly.fa.gz')
+    ancestor_assembly = os.path.join(
+        vep_cache_dir, 'homo_sapiens',
+        f'{gvanno_vars.VEP_VERSION}_{gvanno_vars.VEP_ASSEMBLY[genome_assembly]}',
+        f'human_ancestor.fa.gz')
+
+    plugins_in_use = "NearestExonJB, LoF"
+
+    # List all VEP flags used when calling VEP
+    vep_flags = (
+        f'--hgvs --af_gnomad --variant_class --domains --symbol --protein --ccds --mane '
+        f'--uniprot --appris --biotype --tsl --canonical --format vcf --cache --numbers '
+        f'--total_length --allele_number --failed 1 --no_stats --no_escape --xref_refseq --vcf '
+        f'--check_ref --dont_skip --flag_pick_allele_gene --plugin NearestExonJB,max_range=50000 '
+        f'--force_overwrite --species homo_sapiens --offline')
+    vep_options = (
+        f'--dir {vep_cache_dir} --assembly {gvanno_vars.VEP_ASSEMBLY[genome_assembly]} --cache_version {gvanno_vars.VEP_VERSION} '
+        f'--fasta {fasta_assembly} '
+        f'--pick_order {conf_options["conf"]["vep"]["vep_pick_order"]} '
+        f'--buffer_size {conf_options["conf"]["vep"]["vep_buffer_size"]} '
+        f'--fork {conf_options["conf"]["vep"]["vep_n_forks"]} '
+        f'{vep_flags} '
+        f'{"--verbose" if debug else "--quiet"}')
+
+    gencode_set_in_use = "GENCODE - all transcripts"
+    if conf_options['conf']['vep']['vep_no_intergenic'] is True:
+        vep_options += ' --no_intergenic'
+    if conf_options['conf']['vep']['vep_regulatory'] is True:
+        vep_options += ' --regulatory'
+    if conf_options['conf']['vep']['vep_coding_only'] is True:
+        vep_options += ' --coding_only'
+    if conf_options['conf']['vep']['vep_gencode_basic'] is True:
+        vep_options += ' --gencode_basic'
+        gencode_set_in_use = "GENCODE - basic transcript set (--gencode_basic)"
+
+    ## LOFTEE plugin - variant loss-of-function annotation
+    loftee_dir = get_loftee_dir()
+    assert os.path.isdir(loftee_dir), f'LoF VEP plugin is not found in {loftee_dir}. Please make sure you have installed the pcgr conda package and have the corresponding conda environment active.'
+    vep_options += f" --plugin LoF,loftee_path:{loftee_dir},human_ancestor_fa:{ancestor_assembly},use_gerp_end_trunc:0 --dir_plugins {loftee_dir}"
+
+    logger.info(f'VEP configuration: Version: {gvanno_vars.VEP_VERSION}' + \
+                f', GENCODE release {gvanno_vars.GENCODE_VERSION[genome_assembly]}, genome assembly {conf_options["genome_assembly"]}')
+    logger.info(f'VEP configuration - one primary consequence block pr. alternative allele/gene (--flag_pick_allele_gene)')
+    logger.info(f'VEP configuration - transcript pick order: {conf_options["conf"]["vep"]["vep_pick_order"]}')
+    logger.info(f'VEP configuration - transcript pick order: see https://www.ensembl.org/info/docs/tools/vep/script/vep_other.html#pick_options')
+    logger.info(f'VEP configuration - GENCODE set: {gencode_set_in_use}')
+    logger.info(f'VEP configuration - skip intergenic variants: {"ON" if conf_options["conf"]["vep"]["vep_no_intergenic"] is True else "OFF"}')
+    logger.info(f'VEP configuration - regulatory variant annotation: {"ON" if conf_options["conf"]["vep"]["vep_regulatory"] is True else "OFF"}')
+    logger.info(f'VEP configuration - loss-of-function prediction: {"ON" if conf_options["conf"]["vep"]["vep_lof_prediction"] is True else "OFF"}')
+
+    logger.info((
+        f'VEP configuration - buffer size/number of forks: '
+        f'{conf_options["conf"]["vep"]["vep_buffer_size"]}/{conf_options["conf"]["vep"]["vep_n_forks"]}'))
+    logger.info(f'VEP - plugins in use: {plugins_in_use}')
+
+    # Compose full VEP command
+    vep_main_command = f'vep --input_file {input_vcf} --output_file {output_vcf} {vep_options}'
+    vep_bgzip_command = f'bgzip -f -c {output_vcf} > {output_vcf_gz}'
+    vep_tabix_command = f'tabix -f -p vcf {output_vcf_gz}'
+    if debug:
+        print(vep_main_command)
+
+    check_subprocess(logger, vep_main_command, debug)
+    check_subprocess(logger, vep_bgzip_command, debug)
+    check_subprocess(logger, vep_tabix_command, debug)
+    logger.info('Finished gvanno-vep')
+
+    return 0
+
+if __name__ == "__main__":
+    __main__()
+
diff --git a/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc b/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc
deleted file mode 100644
index 0da3ccd..0000000
Binary files a/src/gvanno/lib/__pycache__/annoutils.cpython-36.pyc and /dev/null differ
diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py
deleted file mode 100755
index 5678c9a..0000000
--- a/src/gvanno/lib/annoutils.py
+++ /dev/null
@@ -1,985 +0,0 @@
-#!/usr/bin/env python
-
-import os,re,sys
-import csv
-import logging
-import gzip
-from cyvcf2 import VCF, Writer
-import subprocess
-
-
-csv.field_size_limit(500 * 1024 * 1024)
-threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
-                            'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
-                            'Tyr':'Y','Val':'V','Ter':'X'}
-
-
-
-def read_genexref_namemap(logger, gene_xref_namemap_tsv):
-    namemap_xref = {}  ## dictionary returned
-    if not os.path.exists(gene_xref_namemap_tsv):
-        logger.info("ERROR: File '" + str(gene_xref_namemap_tsv) + "' does not exist - exiting")
-        exit(1)
-        #return namemap_xref
-    tsvfile = open(gene_xref_namemap_tsv, 'r')
-    reader = csv.DictReader(tsvfile, delimiter='\t')
-    for row in reader:
-        namemap_xref[row['name']] = int(row['index'])
-    tsvfile.close()
-
-    return namemap_xref
-
-def read_infotag_file(logger, vcf_info_tags_tsv):
-    """
-    Function that reads a file that lists VCF INFO tags produced by
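Stepping outside the diff for a moment: the pattern in `run_vep()` above is plain string composition of the VEP command line from a nested options dictionary. A reduced, hypothetical sketch of that assembly (only a subset of the flags from the code above; all file names invented):

```python
def compose_vep_command(input_vcf, output_vcf, cache_dir, assembly,
                        fork=4, buffer_size=1000, debug=False):
    # Sketch of the command assembly in run_vep(); flag subset only (assumption),
    # the real function adds many more annotation flags and plugin options.
    flags = '--vcf --format vcf --cache --offline --no_stats --force_overwrite'
    opts = (f'--dir {cache_dir} --assembly {assembly} '
            f'--buffer_size {buffer_size} --fork {fork} {flags} '
            f'{"--verbose" if debug else "--quiet"}')
    return f'vep --input_file {input_vcf} --output_file {output_vcf} {opts}'

cmd = compose_vep_command('query.vcf.gz', 'query_vep.vcf', 'vep_cache', 'GRCh38')
```

Keeping the command as a single string matches how `check_subprocess()` shells it out, and makes the `--debug` echo of the full command trivial.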
-    PCGR/CPSR/gvanno.
-    An example line of the VCF info tag file is the following:
-
-    tag    number    type    description    category
-    Consequence    .    String    "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)."    vep
-
-    A dictionary is returned, with the tag as the key, and the full dictionary record as the value
-    """
-    info_tag_xref = {}  ## dictionary returned
-    if not os.path.exists(vcf_info_tags_tsv):
-        logger.info("ERROR: File '" + str(vcf_info_tags_tsv) + "' does not exist - exiting")
-        return info_tag_xref
-    tsvfile = open(vcf_info_tags_tsv, 'r')
-    reader = csv.DictReader(tsvfile, delimiter='\t')
-    for row in reader:
-        if not row['tag'] in info_tag_xref:
-            info_tag_xref[row['tag']] = row
-    tsvfile.close()
-
-    return info_tag_xref
-
-def check_subprocess(command):
-    # if debug:
-    #     logger.info(command)
-    try:
-        output = subprocess.check_output(str(command), stderr=subprocess.STDOUT, shell=True)
-        if len(output) > 0:
-            print(str(output.decode()).rstrip())
-    except subprocess.CalledProcessError as e:
-        print(e.output.decode())
-        exit(0)
-
-
-def is_integer(n):
-    try:
-        float(n)
-    except ValueError:
-        return False
-    else:
-        return float(n).is_integer()
-
-def error_message(message, logger):
-    logger.error('')
-    logger.error(message)
-    logger.error('')
-    exit(1)
-
-def write_pass_vcf(annotated_vcf, logger):
-    """
-    Function that prints all PASS variants from a VCF file. The new VCF file appends '.pass.' to the filename.
-
-    """
-    out_vcf = re.sub(r'\.vcf\.gz$','.pass.vcf',annotated_vcf)
-    vcf = VCF(annotated_vcf)
-    w = Writer(out_vcf, vcf)
-
-    num_rejected = 0
-    num_pass = 0
-    for rec in vcf:
-        if rec.FILTER is None or rec.FILTER == 'None':
-            w.write_record(rec)
-            num_pass += 1
-        else:
-            num_rejected += 1
-
-    vcf.close()
-    w.close()
-
-    logger.info('Number of non-PASS/REJECTED variant calls: ' + str(num_rejected))
-    logger.info('Number of PASSed variant calls: ' + str(num_pass))
-    if num_pass == 0:
-        logger.warning('There are zero variants with a \'PASS\' filter in the VCF file')
-        os.system('bgzip -dc ' + str(annotated_vcf) + ' | egrep \'^#\' > ' + str(out_vcf))
-    #else:
-    os.system('bgzip -f ' + str(out_vcf))
-    os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz')
-
-    return
-
-def warn_message(message, logger):
-    logger.warning(message)
-
-def getlogger(logger_name):
-    logger = logging.getLogger(logger_name)
-    logger.setLevel(logging.DEBUG)
-
-    # create console handler and set level to debug
-    ch = logging.StreamHandler(sys.stdout)
-    ch.setLevel(logging.DEBUG)
-
-    # add ch to logger
-    logger.addHandler(ch)
-
-    # create formatter
-    formatter = logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S")
-
-    # add formatter to ch
-    ch.setFormatter(formatter)
-
-    return logger
-
-
-def is_valid_vcf(input_vcf, output_dir, logger, debug):
-    """
-    Function that reads the output file of EBIvariation/vcf-validator and reports potential errors and validation status
-    """
-
-    logger.info('Validating VCF file with EBIvariation/vcf-validator (v0.9.3)')
-
-    ## Filename for output file from VCF validation
-    vcf_validation_output_file = os.path.join(output_dir, re.sub(r'(\.vcf$|\.vcf\.gz$)', '.vcf_validator_output', os.path.basename(input_vcf)))
-
-    ## Command for running vcf-validator
-    command_v42 = 'vcf_validator -i ' + str(input_vcf) + ' -l warning -r text -o ' + str(output_dir) + ' > ' + str(vcf_validation_output_file) + ' 2>&1'
-    if debug:
-        logger.info(command_v42)
-    try:
-        subprocess.run(command_v42, stderr=subprocess.STDOUT, shell=True)
-    except subprocess.CalledProcessError as e:
-        print(e.output.decode())
-        exit(0)
-
-    ## read output file from vcf-validator and get name of log/error file
-    validation_results = {}
-    validation_results['validation_status'] = 0
-    validation_results['error_messages'] = []
-    validation_log_fname = None
-    if os.path.exists(vcf_validation_output_file):
-        f = open(vcf_validation_output_file, 'r')
-        for line in f:
-            if 'Text report written to' in line.rstrip():
-                validation_log_fname = re.split(r' : ', line.rstrip(), maxsplit=2)[1]
-        f.close()
-        if not debug:
-            check_subprocess('rm -f ' + str(vcf_validation_output_file))
-
-    if validation_log_fname is None:
-        err_msg = 'Cannot find file with error messages from vcf_validator'
-        return error_message(err_msg, logger)
-
-    if os.path.exists(validation_log_fname):
-        f = open(validation_log_fname, 'r')
-        for line in f:
-            if not re.search(r' \(warning\)$|^Reading from ', line.rstrip()):  ## ignore warnings
-                if line.startswith('Line '):
-                    validation_results['error_messages'].append('ERROR: ' + line.rstrip())
-                if 'the input file is valid' in line.rstrip():  ## valid VCF
-                    validation_results['validation_status'] = 1
-                if 'the input file is not valid' in line.rstrip():  ## non-valid VCF
-                    validation_results['validation_status'] = 0
-        f.close()
-        if not debug:
-            check_subprocess('rm -f ' + str(validation_log_fname))
-    else:
-        err_msg = str(validation_log_fname) + ' does not exist'
-        return error_message(err_msg, logger)
-
-    if validation_results['validation_status'] == 0:
-        error_string_42 = '\n'.join(validation_results['error_messages'])
-        validation_status = 'According to the VCF specification, the VCF file (' + str(input_vcf) + ') is NOT valid'
-        err_msg = validation_status + ':\n' + str(error_string_42)
-        return error_message(err_msg, logger)
-    else:
-        validation_status = 'According to the VCF specification, the VCF file ' + str(input_vcf) + ' is valid'
-        logger.info(validation_status)
-        return 0
-
-
-def map_regulatory_variant_annotations(vep_csq_records):
-    """
-    Function that considers an array of VEP CSQ records and appends all regulatory variant consequence annotations (open chromatin, TF_binding_site,
-    CTCF_binding_site, promoter (flanks), enhancers) into a single comma-separated string. Each individual regulatory annotation is formatted as:
-    ||||||||
-    """
-
-    regulatory_annotation = '.'
-    if len(vep_csq_records) == 1:
-        return regulatory_annotation
-
-    j = 0
-    regulatory_annotations = []
-    while j < len(vep_csq_records):
-        missing_key_annotation = False
-        for k in ['Feature_type','Consequence','Feature']:
-            if not k in vep_csq_records[j].keys():
-                missing_key_annotation = True
-
-        if missing_key_annotation is False:
-
-            ## RegulatoryFeature annotations - open chromatin, promoters (flanks), enhancers, CTCF binding sites
-            if vep_csq_records[j]['Feature_type'] == 'RegulatoryFeature':
-                biotype = ""
-                if re.match(r"^(enhancer|promoter|open|CTCF|TF_)", vep_csq_records[j]['BIOTYPE']):
-                    biotype = vep_csq_records[j]['BIOTYPE']
-
-                annotation = str(vep_csq_records[j]['Consequence']) + '|' + \
-                    str(vep_csq_records[j]['Feature_type']) + '|' + \
-                    str(vep_csq_records[j]['Feature']) + '|' + \
-                    str(biotype) + '|||||'
-
-                regulatory_annotations.append(annotation)
-
-            ## MotifFeature annotations (TF)
-            if vep_csq_records[j]['Feature_type'] == 'MotifFeature':
-                missing_motif_annotation = False
-                for annotation in ['MOTIF_NAME','MOTIF_POS','HIGH_INF_POS','MOTIF_SCORE_CHANGE','TRANSCRIPTION_FACTORS']:
-                    if not annotation in vep_csq_records[j].keys():
-                        missing_motif_annotation = True
-
-                if missing_motif_annotation is False:
-                    annotation = str(vep_csq_records[j]['Consequence']) + '|' + \
-                        str(vep_csq_records[j]['Feature_type']) + '|' + \
-                        str(vep_csq_records[j]['Feature']) + '|TF_binding_site|' + \
-                        str(vep_csq_records[j]['MOTIF_NAME']) + '|' + \
-                        str(vep_csq_records[j]['MOTIF_POS']) + '|' + \
-                        str(vep_csq_records[j]['HIGH_INF_POS']) + '|' + \
-                        str(vep_csq_records[j]['MOTIF_SCORE_CHANGE']) + '|' + \
-                        str(vep_csq_records[j]['TRANSCRIPTION_FACTORS'])
-
-                    regulatory_annotations.append(annotation)
-
-        j = j + 1
-
-    if len(regulatory_annotations) > 0:
-        regulatory_annotation = ','.join(regulatory_annotations)
-
-    return(regulatory_annotation)
-
-def get_correct_cpg_transcript(vep_csq_records):
-    """
-    Function that considers an array of VEP CSQ records and picks the most relevant consequence (and gene) from
-    neighbouring genes/transcripts of relevance for cancer predisposition (cpg = cancer predisposition gene)
-    """
-
-
-    csq_idx = 0
-    if len(vep_csq_records) == 1:
-        return csq_idx
-
-
-    ## some variants are assigned multiple transcript consequences
-    ## if cancer predisposition genes are in the vicinity of other genes, choose the cancer predisposition gene
-    ## if there are neighbouring cancer-predisposition genes, choose custom gene, preferring coding change (see below, KLLN/PTEN, XPC/TMEM43, NTHL1/TSC2)
-    csq_idx_dict = {}
-    for g in ['KLLN','PTEN','XPC','TMEM43','NTHL1','TSC2']:
-        csq_idx_dict[g] = {}
-        csq_idx_dict[g]['idx'] = -1
-        csq_idx_dict[g]['coding'] = False
-
-    j = 0
-    while j < len(vep_csq_records):
-        if 'CANCER_PREDISPOSITION_SOURCE' in vep_csq_records[j].keys() or 'GE_PANEL_ID' in vep_csq_records[j].keys():
-            csq_idx = j
-        if 'SYMBOL' in vep_csq_records[j].keys():
-            if vep_csq_records[j]['SYMBOL'] in csq_idx_dict.keys():
-                csq_idx_dict[str(vep_csq_records[j]['SYMBOL'])]['idx'] = j
-                if vep_csq_records[j]['CODING_STATUS'] == 'coding':
-                    csq_idx = j  # prefer coding over anything else
-                    csq_idx_dict[str(vep_csq_records[j]['SYMBOL'])]['coding'] = True
-        j = j + 1
-
-    if csq_idx_dict['KLLN']['idx'] != -1 and csq_idx_dict['PTEN']['idx'] != -1:
-        csq_idx = csq_idx_dict['PTEN']['idx']
-        if csq_idx_dict['KLLN']['coding'] is True:
-            csq_idx = csq_idx_dict['KLLN']['idx']
-
-    if csq_idx_dict['XPC']['idx'] != -1 and csq_idx_dict['TMEM43']['idx'] != -1:
-        csq_idx = csq_idx_dict['XPC']['idx']
-        if csq_idx_dict['TMEM43']['coding'] is True:
-            csq_idx = csq_idx_dict['TMEM43']['idx']
-
-    if csq_idx_dict['TSC2']['idx'] != -1 and csq_idx_dict['NTHL1']['idx'] != -1:
-        csq_idx = csq_idx_dict['TSC2']['idx']
-        if csq_idx_dict['NTHL1']['coding'] is True:
-            csq_idx = csq_idx_dict['NTHL1']['idx']
-
-    if csq_idx is None:
-        csq_idx = 0
-    return csq_idx
-
-
-def threeToOneAA(aa_change):
-
-    for three_letter_aa in threeLettertoOneLetterAA.keys():
-        aa_change = aa_change.replace(three_letter_aa, threeLettertoOneLetterAA[three_letter_aa])
-
-    aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)','fs',aa_change)
-    return aa_change
-
-def map_variant_effect_predictors(rec, algorithms):
-
-    dbnsfp_predictions = map_dbnsfp_predictions(str(rec.INFO.get('DBNSFP')), algorithms)
-    if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None:
-        return
-    gene_id = str(rec.INFO.get('Gene'))
-    consequence = str(rec.INFO.get('Consequence'))
-
-    dbnsfp_key = ''
-
-    found_key = 0
-    if not rec.INFO.get('HGVSp_short') is None and not rec.INFO.get('HGVSp_short') == '.':
-        aa_change = str(rec.INFO.get('HGVSp_short'))
-        dbnsfp_key = gene_id + ':' + str(aa_change)
-        if dbnsfp_key in dbnsfp_predictions:
-            found_key = 1
-
-    if found_key == 0 and re.search('splice_', consequence):
-        dbnsfp_key = gene_id
-
-    if dbnsfp_key != '':
-        if dbnsfp_key in dbnsfp_predictions:
-            rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key]
-            for algo_pred in rec.INFO['EFFECT_PREDICTIONS'].split('&'):
-                if algo_pred.startswith('sift:'):
-                    rec.INFO['DBNSFP_SIFT'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('provean:'):
-                    rec.INFO['DBNSFP_PROVEAN'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('m-cap:'):
-                    rec.INFO['DBNSFP_M_CAP'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutpred:'):
-                    rec.INFO['DBNSFP_MUTPRED'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('metarnn:'):
-                    rec.INFO['DBNSFP_META_RNN'] = str(algo_pred.split(':')[1])
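A side note to the deleted `threeToOneAA()` helper above: the three-to-one letter amino-acid conversion can be exercised in isolation. This sketch copies the mapping from the module header above into a local dictionary (trimmed frameshift regex; behaviour otherwise as in the original):

```python
import re

# Copied from the threeLettertoOneLetterAA dict in the deleted module above
THREE_TO_ONE = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H',
                'Ile':'I','Leu':'L','Lys':'K','Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W',
                'Tyr':'Y','Val':'V','Ter':'X'}

def three_to_one_aa(aa_change):
    # Replace each three-letter code with its one-letter equivalent, then
    # collapse frameshift notation (e.g. 'GfsX12' -> 'fs'), as in threeToOneAA().
    for three, one in THREE_TO_ONE.items():
        aa_change = aa_change.replace(three, one)
    return re.sub(r'[A-Z]fsX([0-9]+|\?)', 'fs', aa_change)

print(three_to_one_aa('p.Val600Glu'))  # the familiar BRAF V600E shorthand
```

Note the ordering subtlety: plain string replacement works here because the three-letter codes never overlap the one-letter outputs in HGVS protein strings.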
-                if algo_pred.startswith('fathmm:'):
-                    rec.INFO['DBNSFP_FATHMM'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('fathmm_mkl_coding:'):
-                    rec.INFO['DBNSFP_FATHMM_MKL'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutationtaster:'):
-                    rec.INFO['DBNSFP_MUTATIONTASTER'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('mutationassessor:'):
-                    rec.INFO['DBNSFP_MUTATIONASSESSOR'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('deogen2:'):
-                    rec.INFO['DBNSFP_DEOGEN2'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('primateai:'):
-                    rec.INFO['DBNSFP_PRIMATEAI'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('list_s2:'):
-                    rec.INFO['DBNSFP_LIST_S2'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('gerp_rs:'):
-                    rec.INFO['DBNSFP_GERP'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('bayesdel_addaf:'):
-                    rec.INFO['DBNSFP_BAYESDEL_ADDAF'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('aloft:'):
-                    rec.INFO['DBNSFP_ALOFTPRED'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('splice_site_rf:'):
-                    rec.INFO['DBNSFP_SPLICE_SITE_RF'] = str(algo_pred.split(':')[1])
-                if algo_pred.startswith('splice_site_ada:'):
-                    rec.INFO['DBNSFP_SPLICE_SITE_ADA'] = str(algo_pred.split(':')[1])
-
-
-
-def detect_reserved_info_tag(tag, tag_name, logger):
-    reserved_tags = ['AA','AC','AF','AN','BQ','CIGAR','DB','DP','END','H2','H3','MQ','MQ0','NS','SB','SOMATIC','VALIDATED','1000G']
-    if tag in reserved_tags:
-        err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (INFO)'
-        return error_message(err_msg, logger)
-
-    reserved_format_tags = ['GT','DP','FT','GL','GLE','GQ','PL','HQ','PS','PQ','EC','MQ']
-    if tag in reserved_format_tags:
-        err_msg = 'Custom INFO tag (' + str(tag_name) + ') needs another name - ' + str(tag) + ' is a reserved field in the VCF specification (FORMAT)'
-        return error_message(err_msg, logger)
-
-
-def assign_cds_exon_intron_annotations(csq_record):
-    csq_record['CODING_STATUS'] = 'noncoding'
-    csq_record['EXONIC_STATUS'] = 'nonexonic'
-    csq_record['SPLICE_DONOR_RELEVANT'] = False
-    csq_record['NULL_VARIANT'] = False
-    csq_record['INTRON_POSITION'] = 0
-    csq_record['EXON_POSITION'] = 0
-
-    coding_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|protein_altering|inframe_)"
-    wes_csq_pattern = r"^(stop_|start_lost|frameshift_|missense_|splice_donor|splice_acceptor|inframe_|protein_altering|synonymous)"
-    null_pattern = r"^(stop_|frameshift_)"
-    if re.match(coding_csq_pattern, str(csq_record['Consequence'])):
-        csq_record['CODING_STATUS'] = 'coding'
-
-    if re.match(wes_csq_pattern, str(csq_record['Consequence'])):
-        csq_record['EXONIC_STATUS'] = 'exonic'
-
-    if re.match(null_pattern, str(csq_record['Consequence'])):
-        csq_record['NULL_VARIANT'] = True
-
-    if re.match(r"^splice_region_variant", str(csq_record['Consequence'])) and re.search(r'(\+3(A|G)>|\+4G>|\+5G>)', str(csq_record['HGVSc'])):
-        csq_record['SPLICE_DONOR_RELEVANT'] = True
-
-    if re.match(r"splice_region_variant|intron_variant", str(csq_record['Consequence'])):
-        match = re.search(r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc']))
-        if match is not None:
-            pos = re.sub(r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)", "", match.group(0))
-            if is_integer(pos):
-                csq_record['INTRON_POSITION'] = int(pos)
-
-    if 'NearestExonJB' in csq_record.keys():
-        if not csq_record['NearestExonJB'] is None:
-            if re.match(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) and str(csq_record['NearestExonJB']) != "":
-                exon_pos_info = csq_record['NearestExonJB'].split("+")
-                if len(exon_pos_info) == 4:
-                    if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end":
-                        csq_record['EXON_POSITION'] = -int(exon_pos_info[1])
-                    if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start":
-                        csq_record['EXON_POSITION'] = int(exon_pos_info[1])
-
-
-    for m in ['HGVSp_short','CDS_CHANGE']:
-        csq_record[m] = '.'
-    if not csq_record['HGVSc'] is None:
-        if csq_record['HGVSc'] != '.':
-            if 'splice_acceptor_variant' in csq_record['Consequence'] or 'splice_donor_variant' in csq_record['Consequence']:
-                key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc'])
-                csq_record['CDS_CHANGE'] = key
-    if csq_record['Amino_acids'] is None or csq_record['Protein_position'] is None or csq_record['Consequence'] is None:
-        return
-    if not csq_record['Protein_position'] is None:
-        if csq_record['Protein_position'].startswith('-'):
-            return
-
-    protein_change = '.'
-    if '/' in csq_record['Protein_position']:
-        protein_position = str(csq_record['Protein_position'].split('/')[0])
-        if '-' in protein_position:
-            if protein_position.split('-')[0].isdigit():
-                csq_record['AMINO_ACID_START'] = protein_position.split('-')[0]
-            if protein_position.split('-')[1].isdigit():
-                csq_record['AMINO_ACID_END'] = protein_position.split('-')[1]
-        else:
-            if protein_position.isdigit():
-                csq_record['AMINO_ACID_START'] = protein_position
-                csq_record['AMINO_ACID_END'] = protein_position
-
-    if not csq_record['HGVSp'] is None:
-        if csq_record['HGVSp'] != '.':
-            if ':' in csq_record['HGVSp']:
-                protein_identifier = str(csq_record['HGVSp'].split(':')[0])
-                if protein_identifier.startswith('ENSP'):
-                    protein_change_VEP = str(csq_record['HGVSp'].split(':')[1])
-                    protein_change = threeToOneAA(protein_change_VEP)
-
-    if 'synonymous_variant' in csq_record['Consequence']:
-        protein_change = 'p.' + str(csq_record['Amino_acids']) + str(protein_position) + str(csq_record['Amino_acids'])
-    if 'stop_lost' in str(csq_record['Consequence']) and '/' in str(csq_record['Amino_acids']):
-        protein_change = 'p.X' + str(protein_position) + str(csq_record['Amino_acids']).split('/')[1]
-
-    csq_record['HGVSp_short'] = protein_change
-    exon_number = 'NA'
-    if not csq_record['EXON'] is None:
-        if csq_record['EXON'] != '.':
-            if '/' in csq_record['EXON']:
-                exon_number = str(csq_record['EXON']).split('/')[0]
-                num_exons = str(csq_record['EXON']).split('/')[1]
-                if exon_number == num_exons:
-                    csq_record['LAST_EXON'] = True
-
-    if not csq_record['INTRON'] is None:
-        if csq_record['INTRON'] != '.':
-            if '/' in csq_record['INTRON']:
-                intron_number = str(csq_record['INTRON']).split('/')[0]
-                num_introns = str(csq_record['INTRON']).split('/')[1]
-                if intron_number == num_introns:
-                    csq_record['LAST_INTRON'] = True
-
-    if not csq_record['HGVSc'] is None:
-        if csq_record['HGVSc'] != '.':
-            if protein_change != '.':
-                key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc']) + ':exon' + str(exon_number) + ':' + str(protein_change)
-                csq_record['CDS_CHANGE'] = key
-
-    return
-
-def map_dbnsfp_predictions(dbnsfp_tag, algorithms):
-
-    effect_predictions = {}
-
-    for v in dbnsfp_tag.split(','):
-
-        dbnsfp_info = v.split('|')
-        if len(dbnsfp_info) == 1:
-            return effect_predictions
-        ref_aa = dbnsfp_info[0]
-        alt_aa = dbnsfp_info[1]
-        all_ids = dbnsfp_info[4].split('&')
-        unique_ids = {}
-        for s in all_ids:
-            unique_ids[s] = 1
-
-        isoform_aa_keys = []
-        if ref_aa != '.' and alt_aa != '.' and ref_aa != '' and alt_aa != '':
-            aa_pos = dbnsfp_info[6].split('&')
-            for pos in aa_pos:
-                for gene_id in unique_ids:
-                    k = str(gene_id) + ':p.' + str(ref_aa) + pos + str(alt_aa)
-                    isoform_aa_keys.append(k)
-        else:
-            #continue
-            for gene_id in unique_ids:
-                isoform_aa_keys.append(gene_id)
-
-        algorithm_raw_predictions = {}
-
-        i = 7
-        v = 0
-
-        if len(algorithms) != len(dbnsfp_info[7:]):
-            return effect_predictions
-
-        while i < len(dbnsfp_info):
-            algorithm_raw_predictions[str(algorithms[v]).lower()] = dbnsfp_info[i].split('&')
-            i = i + 1
-            v = v + 1
-        dbnsfp_predictions = {}
-
-        for k in isoform_aa_keys:
-            if not k in dbnsfp_predictions:
-                dbnsfp_predictions[k] = {}
-            all_preds = []
-            for algo in algorithm_raw_predictions.keys():
-                unique_algo_predictions = {}
-                for pred in algorithm_raw_predictions[algo]:
-                    if pred != '':
-                        if not pred in unique_algo_predictions:
-                            unique_algo_predictions[pred] = 1
-                    else:
-                        unique_algo_predictions['.'] = 1
-                if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys():
-                    del unique_algo_predictions['.']
-                dbnsfp_predictions[k][algo] = str(algo) + ':' + '|'.join(unique_algo_predictions.keys())
-                all_preds.append(dbnsfp_predictions[k][algo])
-            effect_predictions[k] = '&'.join(all_preds)
-
-    return effect_predictions
-
-
-def make_transcript_xref_map(rec, fieldmap, xref_tag = 'PCGR_ONCO_XREF'):
-    transcript_xref_map = {}
-    if not rec.INFO.get(xref_tag) is None:
-        for tref in rec.INFO.get(xref_tag).split(','):
-            xrefs = tref.split('|')
-            ensembl_transcript_id = str(xrefs[0])
-            transcript_xref_map[ensembl_transcript_id] = {}
-            for annotation in fieldmap.keys():
-                annotation_index = fieldmap[annotation]
-                if annotation_index > (len(xrefs) - 1):
-                    continue
-                if xrefs[annotation_index] != '':
-                    transcript_xref_map[ensembl_transcript_id][annotation] = xrefs[annotation_index]
-
-    return(transcript_xref_map)
-
-def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted):
-    vep_to_pcgr_af = {'gnomADe_AMR_AF':'AMR_AF_GNOMAD',
-                      'gnomADe_AFR_AF':'AFR_AF_GNOMAD',
-                      'gnomADe_EAS_AF':'EAS_AF_GNOMAD',
-                      'gnomADe_NFE_AF':'NFE_AF_GNOMAD',
-                      'gnomADe_AF':'GLOBAL_AF_GNOMAD',
-                      'gnomADe_SAS_AF':'SAS_AF_GNOMAD',
-                      'gnomADe_OTH_AF':'OTH_AF_GNOMAD',
-                      'gnomADe_ASJ_AF':'ASJ_AF_GNOMAD',
-                      'gnomADe_FIN_AF':'FIN_AF_GNOMAD',
-                      'gnomADg_AMR_AF':'AMR_AF_GNOMADg',
-                      'gnomADg_AFR_AF':'AFR_AF_GNOMADg',
-                      'gnomADg_EAS_AF':'EAS_AF_GNOMADg',
-                      'gnomADg_NFE_AF':'NFE_AF_GNOMADg',
-                      'gnomADg_AF':'GLOBAL_AF_GNOMADg',
-                      'gnomADg_SAS_AF':'SAS_AF_GNOMADg',
-                      'gnomADg_OTH_AF':'OTH_AF_GNOMADg',
-                      'gnomADg_ASJ_AF':'ASJ_AF_GNOMADg',
-                      'gnomADg_FIN_AF':'FIN_AF_GNOMADg',
-                      'AFR_AF':'AFR_AF_1KG',
-                      'AMR_AF':'AMR_AF_1KG',
-                      'SAS_AF':'SAS_AF_1KG',
-                      'EUR_AF':'EUR_AF_1KG',
-                      'EAS_AF':'EAS_AF_1KG',
-                      'AF':'GLOBAL_AF_1KG'}
-
-    vcf = VCF(query_vcf)
-    vep_csq_index2fields = {}
-    vep_csq_fields2index = {}
-    dbnsfp_prediction_algorithms = []
-    for e in vcf.header_iter():
-        header_element = e.info()
-        if 'ID' in header_element.keys():
-            identifier = str(header_element['ID'])
-            if identifier == 'CSQ' or identifier == 'DBNSFP':
-                description = str(header_element['Description'])
-                if 'Format: ' in description:
-                    subtags = description.split('Format: ')[1].split('|')
-                    if identifier == 'CSQ':
-                        i = 0
-                        for t in subtags:
-                            v = t.replace('"','')
-                            if t in vep_to_pcgr_af:
-                                v = str(vep_to_pcgr_af[t])
-                            if v in info_tags_wanted:
-                                vep_csq_index2fields[i] = v
-                                vep_csq_fields2index[v] = i
-                            i = i + 1
-                    if identifier == 'DBNSFP':
-                        i = 7
-                        while(i < len(subtags)):
-                            dbnsfp_prediction_algorithms.append(str(re.sub(r'((_score)|(_pred))"*$','',subtags[i])))
-                            i = i + 1
-
-    vep_dbnsfp_meta_info = {}
-    vep_dbnsfp_meta_info['vep_csq_fieldmap'] = {}
-    vep_dbnsfp_meta_info['vep_csq_fieldmap']['field2index'] = vep_csq_fields2index
-    vep_dbnsfp_meta_info['vep_csq_fieldmap']['index2field'] = vep_csq_index2fields
-    vep_dbnsfp_meta_info['dbnsfp_prediction_algorithms'] = dbnsfp_prediction_algorithms
-
-    return vep_dbnsfp_meta_info
-
-def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ', debug = 0):
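Outside the diff: the header introspection in `vep_dbnsfp_meta_vcf()` above boils down to splitting the CSQ header `Description` on `'Format: '` and indexing the pipe-separated subfields. A standalone sketch of that step (the description string below is hypothetical, with only four subfields):

```python
def csq_field_maps(description):
    # Build index<->field maps from a VEP CSQ header Description,
    # mirroring the subtag loop in vep_dbnsfp_meta_vcf() (tag renaming omitted).
    subtags = description.split('Format: ')[1].replace('"', '').split('|')
    index2field = {i: tag for i, tag in enumerate(subtags)}
    field2index = {tag: i for i, tag in enumerate(subtags)}
    return index2field, field2index

desc = 'Consequence annotations from Ensembl VEP. Format: Allele|Consequence|SYMBOL|Feature'
i2f, f2i = csq_field_maps(desc)
```

Having both directions of the map is what lets the CSQ parser below address subfields by name (`field2index`) while still iterating blocks positionally (`index2field`).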
-    all_csq_pick = []
-    all_transcript_consequences = []
-
-
-    varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
-
-    for csq in rec.INFO.get(csq_identifier).split(','):
-        csq_fields = csq.split('|')
-
-        entrezgene = '.'
-
-        ## The Entrez gene identifier is not provided directly by VEP - pull it from 'transcript_xref_map' for a given transcript-specific CSQ block
-        ## - used for the 'consequence_entry' objects that are added to the 'vep_all_csq' array
-        k = 0
-        while(k < len(csq_fields)):
-            if k in vep_csq_fields_map['index2field']:
-                if vep_csq_fields_map['index2field'][k] == 'Feature':
-                    ensembl_transcript_id = csq_fields[k]
-                    if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
-                        if ensembl_transcript_id in transcript_xref_map.keys():
-                            if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
-                                entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
-            k = k + 1
-
-
-        if pick_only is False:
-            j = 0
-            csq_record = {}
-
-            while(j < len(csq_fields)):
-                if j in vep_csq_fields_map['index2field']:
-                    if csq_fields[j] != '':
-                        csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
-                j = j + 1
-            all_csq_pick.append(csq_record)
-
-        else:
-            ## loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
-            if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1":  ## only consider the primary/picked consequence(s) when expanding with annotation tags
-                j = 0
-                csq_record = {}
-
-                ## loop over block annotation elements (separated with '|'), and assign their values to the csq_record dictionary object
-                while(j < len(csq_fields)):
-                    if j in vep_csq_fields_map['index2field']:
-                        if csq_fields[j] != '':
-                            csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
-                            if vep_csq_fields_map['index2field'][j] == 'Feature':
-                                ensembl_transcript_id = str(csq_fields[j])
-                                if ensembl_transcript_id in transcript_xref_map:
-                                    for annotation in transcript_xref_map[ensembl_transcript_id].keys():
-                                        if annotation != 'SYMBOL':
-                                            ## assign additional gene/transcript annotations from the custom transcript
-                                            ## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object
-                                            csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
-                                            #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation]))
-                                        else:
-                                            #print(str(annotation) + "\t" + str(transcript_xref_map[ensembl_transcript_id][annotation]))
-                                            ## if VEP does not provide SYMBOL, append the SYMBOL provided by the xref map
-                                            if csq_record['SYMBOL'] is None:
-                                                csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
-                                else:
-                                    if re.match(r'ENST', ensembl_transcript_id):
-                                        logger.warning('Could not find transcript xrefs for ' + str(ensembl_transcript_id))
-
-                            ## Specifically assign PFAM protein domain as a csq_record key
-                            if vep_csq_fields_map['index2field'][j] == 'DOMAINS':
-                                domain_identifiers = str(csq_fields[j]).split('&')
-                                for v in domain_identifiers:
-                                    if v.startswith('Pfam'):
-                                        csq_record['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$','',re.sub(r'Pfam:','',v)))
-
-                                csq_record['DOMAINS'] = None
-                            ## Assign COSMIC/DBSNP mutation IDs as individual key,value pairs in the csq_record object
-                            if vep_csq_fields_map['index2field'][j] == 'Existing_variation':
-                                var_identifiers = str(csq_fields[j]).split('&')
-                                cosmic_identifiers = []
-                                dbsnp_identifiers = []
-                                for v in var_identifiers:
-                                    if v.startswith('COSV'):
-                                        cosmic_identifiers.append(v)
-                                    if v.startswith('COSM'):
-                                        cosmic_identifiers.append(v)
-                                    if v.startswith('rs'):
-                                        dbsnp_identifiers.append(v)
-                                if len(cosmic_identifiers) > 0:
-                                    csq_record['COSMIC_MUTATION_ID'] = '&'.join(cosmic_identifiers)
-                                if len(dbsnp_identifiers) > 0:
-                                    csq_record['DBSNPRSID'] = '&'.join(dbsnp_identifiers)
-                        else:
-                            csq_record[vep_csq_fields_map['index2field'][j]] = None
-                    j = j + 1
-
-                ## Assign coding status, protein change, coding sequence change, last exon/intron status etc.
-                assign_cds_exon_intron_annotations(csq_record)
-                ## Append transcript consequence to all_csq_pick
-                all_csq_pick.append(csq_record)
-            symbol = "."
-            hgvsc = "."
-            hgvsp = "."
-            if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
-                symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
-            if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
-                hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
-            if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
-                hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
-            consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
-                                 str(symbol) + ":" +
-                                 str(entrezgene) + ":" +
-                                 str(hgvsc) + ":" +
-                                 str(hgvsp) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
-                                 str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
-            all_transcript_consequences.append(consequence_entry)
-
-
-    vep_chosen_csq_idx = 0
-    vep_csq_results = {}
-    vep_csq_results['picked_gene_csq'] = all_csq_pick
-    vep_csq_results['all_csq'] = all_transcript_consequences
-    vep_csq_results['picked_csq'] = None
-
-    ## If multiple transcript-specific variant consequences are highlighted by --flag_pick_allele_gene,
-    ## prioritize/choose the block of consequence which has
-    ## - a gene with BIOTYPE equal to 'protein-coding' (the other picked transcript/gene may potentially carry another BIOTYPE)
-    ## - a gene consequence classified as 'exonic' (the other picked transcript/gene likely carries a nonexonic consequence)
-    if len(vep_csq_results['picked_gene_csq']) > 0:
-        vep_selected_idx = {}
-        vep_selected_idx['exonic_status'] = {}
-        vep_selected_idx['consequence'] = {}
-
-        #if debug:
-        #    print("Picked VEP blocks: " + str(vep_csq_results['picked_gene_csq']))
-
-        i = 0
-        picked_blocks = []
-        for trans_rec in
vep_csq_results['picked_gene_csq']: - - if debug: - biotype = '.' - consequence = '.' - exonic_status = '.' - genesymbol = '.' - distance = '.' - feature = '.' - if not trans_rec['Consequence'] is None: - consequence = trans_rec['Consequence'] - if not trans_rec['SYMBOL'] is None: - genesymbol = trans_rec['SYMBOL'] - if not trans_rec['BIOTYPE'] is None: - biotype = trans_rec['BIOTYPE'] - if not trans_rec['EXONIC_STATUS'] is None: - exonic_status = trans_rec['EXONIC_STATUS'] - if not trans_rec['DISTANCE'] is None: - distance = trans_rec['DISTANCE'] - if not trans_rec['Feature'] is None: - feature = trans_rec['Feature'] - block = str(genesymbol) + ':' + str(feature) + ':' + str(consequence) + ':' + str(distance) + ':' + str(biotype) + ':' + str(exonic_status) - picked_blocks.append(block) - - - if 'BIOTYPE' in trans_rec and 'Consequence' in trans_rec and 'EXONIC_STATUS' in trans_rec: - if not trans_rec['BIOTYPE'] is None and not trans_rec['Consequence'] is None: - if trans_rec['BIOTYPE'] == "protein_coding": - ## for protein-coding genes - record the exonic variant consequence status (exonic/nonexonic) - vep_selected_idx['exonic_status'][i] = trans_rec['EXONIC_STATUS'] - vep_selected_idx['consequence'][i] = trans_rec['Consequence'] - i = i + 1 - - - if debug: - print(str(varkey) + " - picked CSQ blocks: " + ' /// '.join(picked_blocks)) - - - ## when multiple transcript gene blocks are picked by VEP, prioritize the block with 'exonic' consequence - if len(vep_selected_idx['exonic_status'].keys()) > 1: - exonic_cons_found = 0 - for j in vep_selected_idx['exonic_status'].keys(): - if vep_selected_idx['exonic_status'][j] == 'exonic': - exonic_cons_found = 1 - vep_chosen_csq_idx = j - - ## if multiple non-exonic variants are found, prioritize UTR variants over other nonexonic - if exonic_cons_found == 0: - for j in vep_selected_idx['consequence'].keys(): - if vep_selected_idx['consequence'][j] == '5_prime_UTR_variant' or vep_selected_idx['consequence'][j] == 
'3_prime_UTR_variant': - vep_chosen_csq_idx = j - - else: - if len(vep_selected_idx['exonic_status'].keys()) == 1: - for k in vep_selected_idx['exonic_status'].keys(): - vep_chosen_csq_idx = k - - vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx] - - else: - print('ERROR: No VEP block chosen by --pick_allele_gene') - - return(vep_csq_results) - - -def read_cancer_hotspots(logger, hotspots_tsv): - - ## load cancer hotspot entries from 'cancer_hotspots.tsv' (provided by github.com/sigven/cancerHotspots) - - hotspots = {} ##dictionary returned - mutation_hotspots = {} - codon_hotspots = {} - splice_hotspots = {} - if not os.path.exists(hotspots_tsv): - logger.info("ERROR: File '" + str(hotspots_tsv) + "' does not exist - exiting") - exit(1) - tsvfile = open(hotspots_tsv, 'r') - reader = csv.DictReader(tsvfile, delimiter='\t') - for row in reader: - mutation_hotspots[str(row['entrezgene']) + '-' + row['hgvsp2']] = row - codon_hotspots[str(row['entrezgene']) + '-' + row['codon']] = row - if row['hgvsc'] != '.': - splice_hotspots[str(row['entrezgene']) + '-' + str(row['hgvsc'])] = row - tsvfile.close() - - hotspots['mutation'] = mutation_hotspots - hotspots['codon'] = codon_hotspots - hotspots['splice'] = splice_hotspots - return hotspots - - -def map_cancer_hotspots(transcript_csq_elements, cancer_hotspots, rec, principal_hgvsp, principal_hgvsc): - - unique_hotspot_mutations = {} - unique_hotspot_codons = {} - - principal_codon = '.' - if re.match(r'^(p.[A-Z]{1}[0-9]{1,}[A-Za-z]{1,})', principal_hgvsp): - codon_match = re.findall(r'[A-Z][0-9]{1,}',principal_hgvsp) - if len(codon_match) == 1: - principal_codon = codon_match[0] - - ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of - ## 1. 
Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots) - ## - assert whether a potentially hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal') - ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated) - ## - assert whether a potentially hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal') - - for csq in transcript_csq_elements: - (consequence, symbol, entrezgene, hgvsc, hgvsp, feature_type, feature, biotype) = csq.split(':') - - if not bool(re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence)) is True: - continue - - hgvsp_short = threeToOneAA(hgvsp) - hotspot_key_mutation = "." - codon_match = [] - if entrezgene != "." and hgvsp != ".": - hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short) - codon_match = re.findall(r'p.[A-Z][0-9]{1,}',hgvsp_short) - - if entrezgene != "." 
and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'): - hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc) - hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key) - - if hotspot_key_mutation == ".": - continue - - if hotspot_key_mutation in cancer_hotspots['mutation']: - hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] - hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] - unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype - - if hotspot_key_mutation in cancer_hotspots['splice']: - hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] - hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] - unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype - - - if len(codon_match) > 0: - hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0]) - - if hotspot_key_codon in cancer_hotspots['codon']: - unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \ - cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE'] - - if len(unique_hotspot_mutations.keys()) > 0: - if len(unique_hotspot_mutations.keys()) == 1: - for gene_mutation_key in unique_hotspot_mutations.keys(): - rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key] - - if gene_mutation_key.split('|')[0] == 'exonic': - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal' - hgvsp_candidate = 'p.' 
+ str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4]) - if hgvsp_candidate == principal_hgvsp: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' - else: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal' - hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4])) - if hgvsc_candidate == principal_hgvsc: - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' - else: - ## multiple hotspot matches for alternative hgvsp keys - ## - keep only match against principal HGVSp - for hotspot_info in unique_hotspot_mutations.keys(): - if hotspot_info.split('|')[0] == 'exonic': - hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4]) - - if hgvsp_candidate == principal_hgvsp: - rec.INFO['MUTATION_HOTSPOT'] = hotspot_info - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' - else: - hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4])) - - if hgvsc_candidate == principal_hgvsc: - rec.INFO['MUTATION_HOTSPOT'] = hotspot_info - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' - - else: - if len(unique_hotspot_codons.keys()) > 0: - if len(unique_hotspot_codons.keys()) == 1: - for gene_codon_key in unique_hotspot_codons.keys(): - - if '|' in gene_codon_key: - - codon = str(gene_codon_key.split('|')[3]) - - if codon == principal_codon: - rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal' - else: - rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key - rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] - rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal' - - - return - - diff --git 
a/src/gvanno/lib/gvanno/annoutils.py b/src/gvanno/lib/gvanno/annoutils.py new file mode 100755 index 0000000..77464c8 --- /dev/null +++ b/src/gvanno/lib/gvanno/annoutils.py @@ -0,0 +1,349 @@ +#!/usr/bin/env python + +import os,re,sys +import csv +import logging +import gzip +from cyvcf2 import VCF, Writer + +from lib.gvanno import gvanno_vars +from lib.gvanno.utils import error_message, is_integer + +csv.field_size_limit(500 * 1024 * 1024) +threeLettertoOneLetterAA = {'Ala':'A','Arg':'R','Asn':'N','Asp':'D','Cys':'C','Glu':'E','Gln':'Q','Gly':'G','His':'H', + 'Ile':'I','Leu':'L','Lys':'K', 'Met':'M','Phe':'F','Pro':'P','Ser':'S','Thr':'T','Trp':'W', + 'Tyr':'Y','Val':'V','Ter':'X'} + + + +def read_infotag_file(vcf_info_tags_tsv, scope = "vep"): + """ + Function that reads a file that lists VCF INFO tags produced by PCGR/CPSR/gvanno. + An example line of the VCF info tag file is the following: + + tag number type description category + Consequence . String "Impact modifier for the consequence type (picked by VEP's --flag_pick_allele option)." 
vep + + A dictionary is returned, with the tag as the key, and the full dictionary record as the value + """ + info_tag_xref = {} # dictionary returned + if not os.path.exists(vcf_info_tags_tsv): + ## issue warning + return info_tag_xref + tsvfile = open(vcf_info_tags_tsv, 'r') + reader = csv.DictReader(tsvfile, delimiter='\t') + for row in reader: + if not row['tag'] in info_tag_xref: + if scope in row['category']: + info_tag_xref[row['tag']] = row + + return info_tag_xref + +def read_vcfanno_tag_file(vcfanno_tag_file, logger): + + infotag_results = {} + + if not os.path.exists(vcfanno_tag_file): + err_msg = f"vcfanno configuration file ({vcfanno_tag_file}) does not exist" + error_message(err_msg, logger) + + f = open(vcfanno_tag_file, "r") + info_elements = f.readlines() + f.close() + + + for i in info_elements: + elem_info_content = i.rstrip().split(',') + row = {} + for elem in elem_info_content: + if elem.startswith('Type'): + row['type'] = elem.replace('Type=', '') + if elem.startswith('##INFO'): + row['tag'] = elem.replace('##INFO=<ID=', '') + + if 'tag' in row and not row['tag'] in infotag_results: + infotag_results[row['tag']] = row + + return(infotag_results) + + +def read_genexref_namemap(gene_xref_namemap_tsv, logger): + """ + Function that reads a file that lists names of tags in GENE_TRANSCRIPT_XREF annotation. + """ + + namemap_xref = {} # dictionary returned + if not os.path.exists(gene_xref_namemap_tsv): + err_msg = f"gene_transcript_xref BED mapping file ({gene_xref_namemap_tsv}) does not exist" + error_message(err_msg, logger) + + with gzip.open(gene_xref_namemap_tsv, mode='rt') as f: + reader = csv.DictReader(f, delimiter='\t') + for row in reader: + namemap_xref[row['name']] = int(row['index']) + f.close() + + return namemap_xref + + +def write_pass_vcf(annotated_vcf, logger): + """ + Function prints all PASS variants from a VCF file. New VCF file appends '.pass.' to the filename.
+ """ + #out_vcf = re.sub(r'\.annotated\.vcf\.gz$','.annotated.pass.vcf',annotated_vcf) + out_vcf = re.sub(r'\.vcf\.gz$', '.pass.vcf', annotated_vcf) + vcf = VCF(annotated_vcf) + w = Writer(out_vcf, vcf) + + num_rejected = 0 + num_pass = 0 + for rec in vcf: + if rec.FILTER is None or rec.FILTER == 'None': + w.write_record(rec) + num_pass += 1 + else: + num_rejected += 1 + + vcf.close() + w.close() + + logger.info('Number of non-PASS/REJECTED variant calls: ' + + str(num_rejected)) + logger.info('Number of PASSed variant calls: ' + str(num_pass)) + if num_pass == 0: + logger.warning( + 'There are zero variants with a \'PASS\' filter in the VCF file') + os.system('bgzip -dc ' + str(annotated_vcf) + + ' | egrep \'^#\' > ' + str(out_vcf)) + # else: + os.system('bgzip -f ' + str(out_vcf)) + os.system('tabix -f -p vcf ' + str(out_vcf) + '.gz') + + return + + +def map_regulatory_variant_annotations(vep_csq_records): + """ + Function that considers an array of VEP CSQ records and appends all regulatory variant consequent annotations (open chromatin, TF_binding_site, + CTCF_binding_site, promoter (flanks), enhancers ) into a single comma-separated string. Each individual regulatory annotation is formatted as: + |||||||| + """ + + regulatory_annotation = '.'
+ if len(vep_csq_records) == 1: + return regulatory_annotation + + j = 0 + regulatory_annotations = [] + while j < len(vep_csq_records): + missing_key_annotation = False + for k in ['Feature_type', 'Consequence', 'Feature']: + if not k in vep_csq_records[j].keys(): + missing_key_annotation = True + + if missing_key_annotation is False: + + # RegulatoryFeature annotations - open chromatin, promoters (flanks), enhancers, CTCF binding sites + if vep_csq_records[j]['Feature_type'] == 'RegulatoryFeature': + biotype = "" + if re.match(r"^(enhancer|promoter|open|CTCF|TF_)", vep_csq_records[j]['BIOTYPE']): + biotype = vep_csq_records[j]['BIOTYPE'] + + annotation = str(vep_csq_records[j]['Consequence']) + '|' + \ + str(vep_csq_records[j]['Feature_type']) + '|' + \ + str(vep_csq_records[j]['Feature']) + '|' + \ + str(biotype) + '|||||' + + regulatory_annotations.append(annotation) + + # MotifFeature annotations (TF) + if vep_csq_records[j]['Feature_type'] == 'MotifFeature': + missing_motif_annotation = False + for annotation in ['MOTIF_NAME', 'MOTIF_POS', 'HIGH_INF_POS', + 'MOTIF_SCORE_CHANGE', 'TRANSCRIPTION_FACTORS']: + if not annotation in vep_csq_records[j].keys(): + missing_motif_annotation = True + + if missing_motif_annotation is False: + annotation = str(vep_csq_records[j]['Consequence']) + '|' + \ + str(vep_csq_records[j]['Feature_type']) + '|' + \ + str(vep_csq_records[j]['Feature']) + '|TF_binding_site|' + \ + str(vep_csq_records[j]['MOTIF_NAME']) + '|' + \ + str(vep_csq_records[j]['MOTIF_POS']) + '|' + \ + str(vep_csq_records[j]['HIGH_INF_POS']) + '|' + \ + str(vep_csq_records[j]['MOTIF_SCORE_CHANGE']) + '|' + \ + str(vep_csq_records[j]['TRANSCRIPTION_FACTORS']) + + regulatory_annotations.append(annotation) + + j = j + 1 + + if len(regulatory_annotations) > 0: + regulatory_annotation = ','.join(regulatory_annotations) + + return (regulatory_annotation) + +def threeToOneAA(aa_change): + + for three_letter_aa in threeLettertoOneLetterAA.keys(): + aa_change = 
aa_change.replace( + three_letter_aa, threeLettertoOneLetterAA[three_letter_aa]) + + aa_change = re.sub(r'[A-Z]{1}fsX([0-9]{1,}|\?)', 'fs', aa_change) + return aa_change + + +def assign_cds_exon_intron_annotations(csq_record): + + csq_record['CODING_STATUS'] = 'noncoding' + csq_record['EXONIC_STATUS'] = 'nonexonic' + csq_record['SPLICE_DONOR_RELEVANT'] = False + csq_record['NULL_VARIANT'] = False + csq_record['INTRON_POSITION'] = 0 + csq_record['EXON_POSITION'] = 0 + csq_record['CDS_CHANGE'] = '.' + csq_record['HGVSp_short'] = '.' + csq_record['PROTEIN_CHANGE'] = '.' + csq_record['EXON_AFFECTED'] = '.' + csq_record['LOSS_OF_FUNCTION'] = False + + if re.search(gvanno_vars.CSQ_CODING_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['CODING_STATUS'] = 'coding' + + if re.search(gvanno_vars.CSQ_CODING_SILENT_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['EXONIC_STATUS'] = 'exonic' + + if 'LoF' in csq_record: + csq_record['LOSS_OF_FUNCTION'] = False + if not csq_record['LoF'] is None: + if csq_record['LoF'] == 'HC': + csq_record['LOSS_OF_FUNCTION'] = True + + ## Don't list LoF as True if consequence is assigned as missense + if re.search(r'^missense_variant$', csq_record['Consequence']) is not None: + if csq_record['LOSS_OF_FUNCTION'] is True: + csq_record['LOSS_OF_FUNCTION'] = False + + + if re.search(gvanno_vars.CSQ_NULL_PATTERN, str(csq_record['Consequence'])) is not None: + csq_record['NULL_VARIANT'] = True + + if re.search(gvanno_vars.CSQ_SPLICE_DONOR_PATTERN, str(csq_record['Consequence'])) is not None \ + and re.search(r'(\+3(A|G)>|\+4A>|\+5G>)', str(csq_record['HGVSc'])) is not None: + csq_record['SPLICE_DONOR_RELEVANT'] = True + + if re.search(gvanno_vars.CSQ_SPLICE_REGION_PATTERN, str(csq_record['Consequence'])) is not None: + match = re.search( + r"((-|\+)[0-9]{1,}(dup|del|inv|((ins|del|dup|inv|delins)(A|G|C|T){1,})|(A|C|T|G){1,}>(A|G|C|T){1,}))$", str(csq_record['HGVSc'])) + if match is not None: + pos = re.sub( + 
r"(\+|dup|del|delins|ins|inv|(A|G|C|T){1,}|>)", "", match.group(0)) + if is_integer(pos): + csq_record['INTRON_POSITION'] = int(pos) + + if 'NearestExonJB' in csq_record.keys(): + if not csq_record['NearestExonJB'] is None: + if re.search(r"synonymous_|missense_|stop_|inframe_|start_", str(csq_record['Consequence'])) is not None and str(csq_record['NearestExonJB']) != "": + exon_pos_info = csq_record['NearestExonJB'].split("+") + if len(exon_pos_info) == 4: + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "end": + csq_record['EXON_POSITION'] = -int(exon_pos_info[1]) + if is_integer(exon_pos_info[1]) and str(exon_pos_info[2]) == "start": + csq_record['EXON_POSITION'] = int(exon_pos_info[1]) + + if not csq_record['HGVSc'] is None: + if csq_record['HGVSc'] != '.': + if 'splice_acceptor_variant' in csq_record['Consequence'] or 'splice_donor_variant' in csq_record['Consequence'] \ + or 'splice_donor_5th_base_variant' in csq_record['Consequence'] or 'splice_region_variant' in csq_record['Consequence'] \ + or 'splice_polypyrimidine_tract_variant' in csq_record['Consequence']: + key = str(csq_record['Consequence']) + \ + ':' + str(csq_record['HGVSc']) + csq_record['CDS_CHANGE'] = key + + if csq_record['Amino_acids'] is None or csq_record['Protein_position'] is None or csq_record['Consequence'] is None: + return(csq_record) + + if not csq_record['Protein_position'] is None: + if csq_record['Protein_position'].startswith('-'): + return(csq_record) + + protein_change = '.' 
+ if '/' in csq_record['Protein_position']: + protein_position = str(csq_record['Protein_position'].split('/')[0]) + if '-' in protein_position: + if protein_position.split('-')[0].isdigit(): + csq_record['AMINO_ACID_START'] = protein_position.split('-')[0] + if protein_position.split('-')[1].isdigit(): + csq_record['AMINO_ACID_END'] = protein_position.split('-')[1] + else: + if protein_position.isdigit(): + csq_record['AMINO_ACID_START'] = protein_position + csq_record['AMINO_ACID_END'] = protein_position + + if not csq_record['HGVSp'] is None: + if csq_record['HGVSp'] != '.': + if ':' in csq_record['HGVSp']: + protein_identifier = str(csq_record['HGVSp'].split(':')[0]) + if protein_identifier.startswith('ENSP'): + protein_change_VEP = str(csq_record['HGVSp'].split(':')[1]) + protein_change = threeToOneAA(protein_change_VEP) + csq_record['PROTEIN_CHANGE'] = protein_change_VEP + + if 'synonymous_variant' in csq_record['Consequence']: + protein_change = 'p.' + \ + str(csq_record['Amino_acids']) + \ + str(protein_position) + str(csq_record['Amino_acids']) + if 'stop_lost' in str(csq_record['Consequence']) and '/' in str(csq_record['Amino_acids']): + protein_change = 'p.X' + \ + str(protein_position) + \ + str(csq_record['Amino_acids']).split('/')[1] + + csq_record['HGVSp_short'] = protein_change + exon_number = 'NA' + if not csq_record['EXON'] is None: + if csq_record['EXON'] != '.': + if '/' in csq_record['EXON']: + exon_number = str(csq_record['EXON']).split('/')[0] + csq_record['EXON_AFFECTED'] = exon_number + if '-' in exon_number: + csq_record['EXON_AFFECTED'] = exon_number.split('-')[0] + num_exons = str(csq_record['EXON']).split('/')[1] + if exon_number == num_exons: + csq_record['LAST_EXON'] = True + + if not csq_record['INTRON'] is None: + if csq_record['INTRON'] != '.': + if '/' in csq_record['INTRON']: + intron_number = str(csq_record['INTRON']).split('/')[0] + num_introns = str(csq_record['INTRON']).split('/')[1] + if intron_number == num_introns: + 
csq_record['LAST_INTRON'] = True + + if not csq_record['HGVSc'] is None: + if csq_record['HGVSc'] != '.': + if protein_change != '.': + key = str(csq_record['Consequence']) + ':' + str(csq_record['HGVSc']) + ':exon' + str(exon_number) + ':' + str(protein_change) + csq_record['CDS_CHANGE'] = key + + return(csq_record) + + +def make_transcript_xref_map(rec, fieldmap, xref_tag='GENE_TRANSCRIPT_XREF'): + transcript_xref_map = {} + if not rec.INFO.get(xref_tag) is None: + for tref in rec.INFO.get(xref_tag).split(','): + xrefs = tref.split('|') + ensembl_transcript_id = str(xrefs[0]) + transcript_xref_map[ensembl_transcript_id] = {} + for annotation in fieldmap.keys(): + annotation_index = fieldmap[annotation] + if annotation_index > (len(xrefs) - 1): + continue + if xrefs[annotation_index] != '': + transcript_xref_map[ensembl_transcript_id][annotation] = xrefs[annotation_index] + + return (transcript_xref_map) diff --git a/src/gvanno/lib/gvanno/dbnsfp.py b/src/gvanno/lib/gvanno/dbnsfp.py new file mode 100755 index 0000000..64375ba --- /dev/null +++ b/src/gvanno/lib/gvanno/dbnsfp.py @@ -0,0 +1,156 @@ +#!/usr/bin/env python + +import os +import re + +from cyvcf2 import VCF + +def map_variant_effect_predictors(rec, algorithms): + + dbnsfp_predictions = map_dbnsfp_predictions( + str(rec.INFO.get('DBNSFP')), algorithms) + if rec.INFO.get('Gene') is None or rec.INFO.get('Consequence') is None: + return + gene_id = str(rec.INFO.get('Gene')) + consequence = str(rec.INFO.get('Consequence')) + + dbnsfp_key = '' + + found_key = 0 + if not rec.INFO.get('HGVSp_short') is None and not rec.INFO.get('HGVSp_short') == '.': + aa_change = str(rec.INFO.get('HGVSp_short')) + dbnsfp_key = gene_id + ':' + str(aa_change) + if dbnsfp_key in dbnsfp_predictions: + found_key = 1 + + if found_key == 0 and re.search('splice_', consequence): + dbnsfp_key = gene_id + + algo_mapping = { + 'sift': 'DBNSFP_SIFT', + 'provean': 'DBNSFP_PROVEAN', + 'm-cap': 'DBNSFP_M_CAP', + 'mutpred': 'DBNSFP_MUTPRED', + 
'metarnn': 'DBNSFP_META_RNN', + 'fathmm': 'DBNSFP_FATHMM', + 'fathmm_mkl_coding': 'DBNSFP_FATHMM_MKL', + 'mutationassessor': 'DBNSFP_MUTATIONASSESSOR', + 'mutationtaster': 'DBNSFP_MUTATIONTASTER', + 'deogen2': 'DBNSFP_DEOGEN2', + 'primateai': 'DBNSFP_PRIMATEAI', + 'list_s2': 'DBNSFP_LIST_S2', + 'gerp_rs': 'DBNSFP_GERP', + 'aloft': 'DBNSFP_ALOFTPRED', + 'bayesdel_addaf': 'DBNSFP_BAYESDEL_ADDAF', + 'splice_site_ada': 'DBNSFP_SPLICE_SITE_ADA', + 'splice_site_rf': 'DBNSFP_SPLICE_SITE_RF' + } + + if dbnsfp_key != '': + if dbnsfp_key in dbnsfp_predictions: + rec.INFO['EFFECT_PREDICTIONS'] = dbnsfp_predictions[dbnsfp_key] + for algo_pred in rec.INFO['EFFECT_PREDICTIONS'].split('&'): + if algo_pred.split(':')[0] in algo_mapping: + rec.INFO[algo_mapping[algo_pred.split(':')[0]]] = str(algo_pred.split(':')[1]) + + +def map_dbnsfp_predictions(dbnsfp_tag, algorithms): + + effect_predictions = {} + for v in dbnsfp_tag.split(','): + dbnsfp_info = v.split('|') + if len(dbnsfp_info) == 1: + return effect_predictions + ref_aa = dbnsfp_info[0] + alt_aa = dbnsfp_info[1] + all_ids = dbnsfp_info[3].split('&') + unique_ids = {} + for s in all_ids: + unique_ids[s] = 1 + + isoform_aa_keys = [] + if ref_aa != '.' and alt_aa != '.' and ref_aa != '' and alt_aa != '': + aa_pos = dbnsfp_info[5].split('&') + for pos in aa_pos: + for gene_id in unique_ids: + k = str(gene_id) + ':p.' 
+ str(ref_aa) + pos + str(alt_aa) + isoform_aa_keys.append(k) + else: + # continue + for gene_id in unique_ids: + isoform_aa_keys.append(gene_id) + + algorithm_raw_predictions = {} + + i = 6 + v = 0 + + if len(algorithms) != len(dbnsfp_info[6:]): + return effect_predictions + + while i < len(dbnsfp_info): + algorithm_raw_predictions[str( + algorithms[v]).lower()] = dbnsfp_info[i].split('&') + i = i + 1 + v = v + 1 + dbnsfp_predictions = {} + + for k in isoform_aa_keys: + if not k in dbnsfp_predictions: + dbnsfp_predictions[k] = {} + all_preds = [] + for algo in algorithm_raw_predictions.keys(): + unique_algo_predictions = {} + for pred in algorithm_raw_predictions[algo]: + if pred != '': + if not pred in unique_algo_predictions: + unique_algo_predictions[pred] = 1 + else: + unique_algo_predictions['.'] = 1 + if len(unique_algo_predictions.keys()) > 1 and '.' in unique_algo_predictions.keys(): + del unique_algo_predictions['.'] + dbnsfp_predictions[k][algo] = str( + algo) + ':' + '|'.join(unique_algo_predictions.keys()) + all_preds.append(dbnsfp_predictions[k][algo]) + effect_predictions[k] = '&'.join(all_preds) + + return effect_predictions + + +def vep_dbnsfp_meta_vcf(query_vcf, info_tags_wanted): + + vcf = VCF(query_vcf) + vep_csq_index2fields = {} + vep_csq_fields2index = {} + dbnsfp_prediction_algorithms = [] + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys(): + identifier = str(header_element['ID']) + if identifier == 'CSQ' or identifier == 'DBNSFP': + description = str(header_element['Description']) + if 'Format: ' in description: + subtags = description.split('Format: ')[1].split('|') + if identifier == 'CSQ': + i = 0 + for t in subtags: + v = t.replace('"', '') + if v in info_tags_wanted: + vep_csq_index2fields[i] = v + vep_csq_fields2index[v] = i + i = i + 1 + if identifier == 'DBNSFP': + i = 6 + while (i < len(subtags)): + dbnsfp_prediction_algorithms.append( + str(re.sub(r'((_score)|(_pred))"*$', '', 
subtags[i]))) + i = i + 1 + + vep_dbnsfp_meta_info = {} + vep_dbnsfp_meta_info['vep_csq_fieldmap'] = {} + vep_dbnsfp_meta_info['vep_csq_fieldmap']['field2index'] = vep_csq_fields2index + vep_dbnsfp_meta_info['vep_csq_fieldmap']['index2field'] = vep_csq_index2fields + vep_dbnsfp_meta_info['dbnsfp_prediction_algorithms'] = dbnsfp_prediction_algorithms + + return vep_dbnsfp_meta_info + diff --git a/src/gvanno/lib/gvanno/gvanno_vars.py b/src/gvanno/lib/gvanno/gvanno_vars.py new file mode 100644 index 0000000..81b0c8c --- /dev/null +++ b/src/gvanno/lib/gvanno/gvanno_vars.py @@ -0,0 +1,78 @@ +#!/usr/bin/env python + +GVANNO_VERSION = '1.7.0' +DB_VERSION = '20231222' + +## MISCELLANEOUS +NCBI_BUILD_MAF = 'GRCh38' +MAX_VARIANTS_FOR_REPORT = 500_000 +CODING_EXOME_SIZE_MB = 34.0 +RECOMMENDED_N_MUT_SIGNATURE = 200 + +## GENCODE +GENCODE_VERSION = {'grch38': 44,'grch37': 19} + +## vcfanno +VCFANNO_MAX_PROC = 15 + +## VEP settings/versions +VEP_VERSION = '110' +VEP_ASSEMBLY = {'grch38': 'GRCh38','grch37': 'GRCh37'} +VEP_MIN_FORKS = 1 +VEP_MAX_FORKS = 8 +VEP_MIN_BUFFER_SIZE = 50 +VEP_MAX_BUFFER_SIZE = 30000 +VEP_PICK_CRITERIA = ['mane_select','mane_plus_clinical','canonical','appris','tsl','biotype','ccds','rank','length'] + +## https://www.ensembl.org/info/genome/variation/prediction/predicted_data.html#consequences +VEP_consequence_rank = { + 'transcript_ablation': 1, + 'splice_acceptor_variant': 2, + 'splice_donor_variant': 3, + 'stop_gained': 4, + 'frameshift_variant': 5, + 'stop_lost': 6, + 'start_lost' : 7, + 'transcript_amplification': 8, + 'feature_elongation': 9, + 'feature_truncation': 10, + 'inframe_insertion': 11, + 'inframe_deletion': 12, + 'missense_variant': 13, + 'protein_altering_variant': 14, + 'splice_donor_5th_base_variant': 15, + 'splice_region_variant': 16, + 'splice_donor_region_variant': 17, + 'splice_polypyrimidine_tract_variant': 18, + 'incomplete_terminal_codon_variant': 19, + 'start_retained_variant': 20, + 'stop_retained_variant': 21, + 
'synonymous_variant': 22, + 'coding_sequence_variant': 23, + 'mature_miRNA_variant': 24, + '5_prime_UTR_variant': 25, + '3_prime_UTR_variant': 26, + 'non_coding_transcript_exon_variant': 27, + 'intron_variant': 28, + 'NMD_transcript_variant': 29, + 'non_coding_transcript_variant': 30, + 'coding_transcript_variant': 31, + 'upstream_gene_variant': 32, + 'downstream_gene_variant': 33, + 'TFBS_ablation': 34, + 'TFBS_amplification': 35, + 'TF_binding_site_variant': 36, + 'regulatory_region_ablation': 37, + 'regulatory_region_amplification': 38, + 'regulatory_region_variant': 39, + 'intergenic_variant': 40, + 'sequence_variant': 41 +} + +CSQ_MISSENSE_PATTERN = r"^missense_variant" +CSQ_CODING_PATTERN = r"(stop_(lost|gained)|start_lost|frameshift_|missense_|splice_(donor|acceptor)|protein_altering|inframe_)" +CSQ_CODING_SILENT_PATTERN = r"(stop_(lost|gained)|start_lost|frameshift_|missense_|splice_(donor|acceptor)|protein_altering|inframe_|synonymous|(start|stop)_retained)" +CSQ_NULL_PATTERN = r"(stop_gained|frameshift_)" +CSQ_SPLICE_REGION_PATTERN = r"(splice_|intron_variant)" +CSQ_SPLICE_DONOR_PATTERN = r"(splice_region_variant|splice_donor_variant|splice_donor_region_variant|splice_donor_5th_base_variant)" + diff --git a/src/gvanno/lib/gvanno/mutation_hotspot.py b/src/gvanno/lib/gvanno/mutation_hotspot.py new file mode 100644 index 0000000..d3061f2 --- /dev/null +++ b/src/gvanno/lib/gvanno/mutation_hotspot.py @@ -0,0 +1,167 @@ +#!/usr/bin/env python + +import os,re +import csv +import gzip +from lib.gvanno.annoutils import threeToOneAA + +from typing import Dict +from logging import Logger + +def load_mutation_hotspots(hotspots_fname: str, logger: Logger) -> Dict[str, Dict[str, Dict[str, str]]]: + """ + Load mutation hotspots from a file and create a dictionary of hotspots. + + Parameters: + hotspots_fname (str): The file path of the mutation hotspots file. + logger (Logger): The logger object to log messages. 
+ + Returns: + Dict[str, Dict[str, Dict[str, str]]]: A dictionary containing mutation hotspots categorized by mutation type. + The dictionary has three keys: 'mutation', 'codon', and 'splice'. + Each key maps to a sub-dictionary that contains the hotspots for that mutation type. + The sub-dictionaries are indexed by a combination of gene and mutation identifier. + + Raises: + SystemExit: If the mutation hotspots file does not exist. + + """ + hotspots: Dict[str, Dict[str, Dict[str, str]]] = { + 'mutation': {}, + 'codon': {}, + 'splice': {} + } + + if not os.path.exists(hotspots_fname): + logger.error(f"File '{hotspots_fname}' does not exist - exiting") + exit(1) + + with gzip.open(hotspots_fname, mode='rt') as f: + reader = csv.DictReader(f, delimiter='\t') + for row in reader: + gene = str(row['entrezgene']) + hgvsp2 = row['hgvsp2'] + codon = row['codon'] + hgvsc = row['hgvsc'] + + hotspots['mutation'][gene + '-' + hgvsp2] = row + hotspots['codon'][gene + '-' + codon] = row + if hgvsc != '.': + hotspots['splice'][gene + '-' + hgvsc] = row + + return hotspots + + +def match_csq_mutation_hotspot(transcript_csq_elements, cancer_hotspots, rec, principal_csq_properties): + + """ + Function that matches consequence entries from VEP (transcript_csq_elements) with entries in mutation hotspots + """ + + unique_hotspot_mutations = {} + unique_hotspot_codons = {} + + principal_hgvsp = principal_csq_properties['hgvsp'] + principal_hgvsc = principal_csq_properties['hgvsc'] + principal_entrezgene = principal_csq_properties['entrezgene'] + principal_codon = principal_csq_properties['codon'] + + ## loop through all transcript-specific consequences ('csq_elements') for a given variant, check for the presence of + ## 1.
Exonic, protein-altering mutations (dictionary key = entrezgene + hgvsp) that overlap known cancer hotspots (https://github.com/sigven/cancerHotspots) + ## - assess whether a potential hit reflects the principal hgvsp ('by_hgvsp_principal') or an alternative hgvsp ('by_hgvsp_nonprincipal') + ## 2. Splice site mutations (dictionary key = entrezgene + hgvsc) that overlap known cancer hotspots (NOTE: only a subset have been curated) + ## - assess whether a potential hit reflects the principal hgvsc ('by_hgvsc_principal') or an alternative hgvsc ('by_hgvsc_nonprincipal') + + for csq in transcript_csq_elements: + (consequence, symbol, entrezgene, hgvsc, hgvsp, exon, feature_type, feature, biotype) = csq.split(':') + + if not re.search(r'^(missense|stop|start|inframe|splice_donor|splice_acceptor|frameshift)', consequence): + continue + + hgvsp_short = threeToOneAA(hgvsp) + hotspot_key_mutation = "." + codon_match = [] + if entrezgene != "." and hgvsp != ".": + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsp_short) + codon_match = re.findall(r'p.[A-Z][0-9]{1,}',hgvsp_short) + + if entrezgene != "."
and (consequence == 'splice_donor_variant' or consequence == 'splice_acceptor_variant'): + hgvsc_key = re.sub(r'>(A|G|C|T)$','',hgvsc) + hotspot_key_mutation = str(entrezgene) + '-' + str(hgvsc_key) + + if hotspot_key_mutation == ".": + continue + + if hotspot_key_mutation in cancer_hotspots['mutation']: + hotspot_info = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['mutation'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['exonic|' + str(hotspot_info)] = hotspot_info_ttype + + if hotspot_key_mutation in cancer_hotspots['splice']: + hotspot_info = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT2'] + hotspot_info_ttype = cancer_hotspots['splice'][hotspot_key_mutation]['MUTATION_HOTSPOT_CANCERTYPE'] + unique_hotspot_mutations['splice|' + str(hotspot_info)] = hotspot_info_ttype + + + if len(codon_match) > 0: + hotspot_key_codon = str(entrezgene) + '-' + str(codon_match[0]) + + if hotspot_key_codon in cancer_hotspots['codon']: + unique_hotspot_codons[str('exonic|') + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT2']] = \ + cancer_hotspots['codon'][hotspot_key_codon]['MUTATION_HOTSPOT_CANCERTYPE'] + + if len(unique_hotspot_mutations.keys()) > 0: + if len(unique_hotspot_mutations.keys()) == 1: + for gene_mutation_key in unique_hotspot_mutations.keys(): + rec.INFO['MUTATION_HOTSPOT'] = gene_mutation_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[gene_mutation_key] + + if gene_mutation_key.split('|')[0] == 'exonic': + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_nonprincipal' + hgvsp_candidate = 'p.' 
+ str(gene_mutation_key.split('|')[3]) + str(gene_mutation_key.split('|')[4]) + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_nonprincipal' + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(gene_mutation_key.split('|')[4])) + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + else: + ## multiple hotspot matches for alternative hgvsp keys + ## - keep only match against principal HGVSp + for hotspot_info in unique_hotspot_mutations.keys(): + if hotspot_info.split('|')[0] == 'exonic': + hgvsp_candidate = "p." + str(hotspot_info.split('|')[3]) + str(hotspot_info.split('|')[4]) + + if hgvsp_candidate == principal_hgvsp: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsp_principal' + else: + hgvsc_candidate = re.sub(r'>(A|G|C|T){1,}$', '' , str(hotspot_info.split('|')[4])) + + if hgvsc_candidate == principal_hgvsc: + rec.INFO['MUTATION_HOTSPOT'] = hotspot_info + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_mutations[hotspot_info] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_hgvsc_principal' + + else: + if len(unique_hotspot_codons.keys()) > 0: + if len(unique_hotspot_codons.keys()) == 1: + for gene_codon_key in unique_hotspot_codons.keys(): + + if '|' in gene_codon_key: + + codon = str(gene_codon_key.split('|')[3]) + + if codon == principal_codon: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_principal' + else: + rec.INFO['MUTATION_HOTSPOT'] = gene_codon_key + rec.INFO['MUTATION_HOTSPOT_CANCERTYPE'] = unique_hotspot_codons[gene_codon_key] + rec.INFO['MUTATION_HOTSPOT_MATCH'] = 'by_codon_nonprincipal' + + + return diff --git 
a/src/gvanno/lib/oncogenicity.py b/src/gvanno/lib/gvanno/oncogenicity.py similarity index 89% rename from src/gvanno/lib/oncogenicity.py rename to src/gvanno/lib/gvanno/oncogenicity.py index 28aa347..50544a9 100644 --- a/src/gvanno/lib/oncogenicity.py +++ b/src/gvanno/lib/gvanno/oncogenicity.py @@ -115,16 +115,16 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): "SYMBOL", "ONCOGENE", "ONCOGENE_EVIDENCE", - "TUMOR_SUPPRESSOR", - "TUMOR_SUPPRESSOR_EVIDENCE", - "LoF", + "TSG", + "TSG_EVIDENCE", + "LOSS_OF_FUNCTION", "INTRON_POSITION", "EXON_POSITION", - "EAS_AF_GNOMAD", - "NFE_AF_GNOMAD", - "AFR_AF_GNOMAD", - "AMR_AF_GNOMAD", - "SAS_AF_GNOMAD", + "gnomADe_EAS_AF", + "gnomADe_NFE_AF", + "gnomADe_AFR_AF", + "gnomADe_AMR_AF", + "gnomADe_SAS_AF", "DBNSFP_SIFT", "DBNSFP_PROVEAN", "DBNSFP_META_RNN", @@ -143,26 +143,22 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): variant_data = {} for col in required_oncogenicity_vars: if rec.INFO.get(col) is None: - if col == "TUMOR_SUPPRESSOR" or col == "ONCOGENE": + if col == "TSG" or col == "ONCOGENE": variant_data[col] = False - elif col == "LoF": + elif col == "LOSS_OF_FUNCTION": variant_data['LOSS_OF_FUNCTION'] = False else: variant_data[col] = None else: if rec.INFO.get(col) == '': variant_data[col] = True - else: - if col == "LoF": - if rec.INFO.get(col) == "HC": - variant_data["LOSS_OF_FUNCTION"] = True - else: - variant_data[col] = rec.INFO.get(col) + else: + variant_data[col] = rec.INFO.get(col) for code in clingen_vicc_ev_codes: variant_data[code] = False - - dbnsfp_minimum_majority = 7 + + dbnsfp_minimum_majority = 6 dbnsfp_maximum_minority = 2 dbnsfp_minimum_algos_called = dbnsfp_minimum_majority @@ -250,54 +246,55 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): variant_data['CLINGEN_VICC_OP3'] = True - if "EAS_AF_GNOMAD" in variant_data.keys() and \ - "SAS_AF_GNOMAD" in variant_data.keys() and \ - "AMR_AF_GNOMAD" in variant_data.keys() and \ - "AFR_AF_GNOMAD" in 
variant_data.keys() and \ - "NFE_AF_GNOMAD" in variant_data.keys(): + if "gnomADe_EAS_AF" in variant_data.keys() and \ + "gnomADe_SAS_AF" in variant_data.keys() and \ + "gnomADe_AMR_AF" in variant_data.keys() and \ + "gnomADe_AFR_AF" in variant_data.keys() and \ + "gnomADe_NFE_AF" in variant_data.keys(): ## check if variant has MAF > 0.01 (SBVS1) or > 0.05 in any of five major gnomAD subpopulations (exome set) - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: if not variant_data[pop] is None: ## MAF for this population >= 0.05 if float(variant_data[pop]) >= 0.05: variant_data["CLINGEN_VICC_SBVS1"] = True - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: if not variant_data[pop] is None: ## MAF for this population >= 0.01 (< 0.05) if float(variant_data[pop]) >= 0.01 and variant_data["CLINGEN_VICC_SBVS1"] is False: variant_data["CLINGEN_VICC_SBS1"] = True - missing_pop_freq = 0 + #missing_pop_freq = 0 approx_zero_pop_freq = 0 - for pop in ['EAS_AF_GNOMAD','SAS_AF_GNOMAD','AMR_AF_GNOMAD','AFR_AF_GNOMAD','NFE_AF_GNOMAD']: + for pop in ['gnomADe_SAS_AF','gnomADe_EAS_AF','gnomADe_AMR_AF','gnomADe_AFR_AF','gnomADe_NFE_AF']: ## no MAF recorded in gnomAD for this population if variant_data[pop] is None: - missing_pop_freq = missing_pop_freq + 1 + approx_zero_pop_freq = approx_zero_pop_freq + 1 else: ## Very low MAF for this population if float(variant_data[pop]) < 0.0001: approx_zero_pop_freq = approx_zero_pop_freq + 1 ## check if variant is missing or with MAF approximately zero in all five major gnomAD subpopulations (exome set) - if missing_pop_freq == 5 or approx_zero_pop_freq == 5: + if approx_zero_pop_freq == 5: variant_data["CLINGEN_VICC_OP4"] = True + ## check if variant is a 
loss-of-function variant (LOFTEE) in a tumor suppressor gene (Cancer Gene Census/CancerMine) - if "TUMOR_SUPPRESSOR" in variant_data.keys() and \ + if "TSG" in variant_data.keys() and \ "ONCOGENE" in variant_data.keys() and \ "LOSS_OF_FUNCTION" in variant_data.keys() and \ "Consequence" in variant_data.keys(): - if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TUMOR_SUPPRESSOR'] is True: + if variant_data['LOSS_OF_FUNCTION'] is True and variant_data['TSG'] is True: variant_data['CLINGEN_VICC_OVS1'] = True ## check if variant is creating a stop-lost or protein-length change in oncogene/tumor suppressor genes if variant_data['CLINGEN_VICC_OVS1'] is False and \ ((re.match(r'^(inframe_deletion|inframe_insertion)', variant_data['Consequence']) and \ - (variant_data['TUMOR_SUPPRESSOR'] is True or variant_data['ONCOGENE'] is True)) or \ + (variant_data['TSG'] is True or variant_data['ONCOGENE'] is True)) or \ (re.match(r'^(stop_lost)', variant_data['Consequence']) and \ - variant_data['TUMOR_SUPPRESSOR'] is True)): + variant_data['TSG'] is True)): variant_data['CLINGEN_VICC_OM2'] = True ## check if variant is silent (synonymous|splice) and outside critical splice region @@ -393,25 +390,25 @@ def assign_oncogenicity_evidence(rec = None, tumortype = "Any"): else: variant_data[e] = f'{variant_data[e]}|{code}' - lb_upper_limit = -1 - lb_lower_limit = -6 - b_upper_limit = -7 + likely_benign_upper_limit = -1 + likely_benign_lower_limit = -6 + benign_upper_limit = -7 #vus_lower_limit = 0 #vus_upper_limit = 4 - lo_lower_limit = 6 - lo_upper_limit = 9 - o_lower_limit = 10 + likely_oncogenic_lower_limit = 5 + likely_oncogenic_upper_limit = 9 + oncogenic_lower_limit = 10 variant_data['ONCOGENICITY_SCORE'] = onc_score_benign + onc_score_pathogenic - if variant_data['ONCOGENICITY_SCORE'] <= lb_upper_limit and \ - variant_data['ONCOGENICITY_SCORE'] >= lb_lower_limit: + if variant_data['ONCOGENICITY_SCORE'] <= likely_benign_upper_limit and \ + 
variant_data['ONCOGENICITY_SCORE'] >= likely_benign_lower_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Benign" - if variant_data['ONCOGENICITY_SCORE'] >= lo_lower_limit and \ - variant_data['ONCOGENICITY_SCORE'] <= lo_upper_limit: + if variant_data['ONCOGENICITY_SCORE'] >= likely_oncogenic_lower_limit and \ + variant_data['ONCOGENICITY_SCORE'] <= likely_oncogenic_upper_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Likely_Oncogenic" - if variant_data['ONCOGENICITY_SCORE'] <= b_upper_limit: + if variant_data['ONCOGENICITY_SCORE'] <= benign_upper_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Benign" - if variant_data['ONCOGENICITY_SCORE'] >= o_lower_limit: + if variant_data['ONCOGENICITY_SCORE'] >= oncogenic_lower_limit: variant_data['ONCOGENICITY_CLASSIFICATION'] = "Oncogenic" for e in ['ONCOGENICITY_SCORE', diff --git a/src/gvanno/lib/gvanno/utils.py b/src/gvanno/lib/gvanno/utils.py new file mode 100644 index 0000000..ba3b878 --- /dev/null +++ b/src/gvanno/lib/gvanno/utils.py @@ -0,0 +1,159 @@ +#!/usr/bin/env python +import sys +import subprocess +import shutil +import logging +import os +import time +import errno +import platform +import string +import random + + +def getlogger(logger_name): + logger = logging.getLogger(logger_name) + logger.setLevel(logging.DEBUG) + # create console handler and set level to debug + ch = logging.StreamHandler(sys.stdout) + ch.setLevel(logging.DEBUG) + # add ch to logger + logger.addHandler(ch) + # create formatter + formatter = logging.Formatter( + "%(asctime)s - %(name)s - %(levelname)s - %(message)s", "20%y-%m-%d %H:%M:%S") + # add formatter to ch + ch.setFormatter(formatter) + return logger + +def error_message(message, logger): + logger.error("") + logger.error(message) + logger.error("") + sys.exit(1) + +def warn_message(message, logger): + logger.warning("") + logger.warning(message) + logger.warning("") + +def random_id_generator(size = 10, chars = string.ascii_lowercase + string.digits): + 
return ''.join(random.choice(chars) for _ in range(size)) + +#def random_string(length): +# letters = string.ascii_lowercase +# result_str = ''.join(random.choice(letters) for i in range(length)) +# return result_str + +def check_subprocess(logger, command, debug): + if debug: + logger.info(command) + try: + output = subprocess.check_output( + str(command), stderr=subprocess.STDOUT, shell=True) + if len(output) > 0: + print(str(output.decode()).rstrip()) + except subprocess.CalledProcessError as e: + print(e.output.decode()) + exit(1) + +def get_loftee_dir(): + #pcgr_conda_env = conda_prefix_basename() + return "/opt/vep/src/ensembl-vep/modules" + +def get_pcgr_bin(): + """Return abs path to the bin directory of the running Python, e.g. conda/env/pcgr/bin + """ + return os.path.dirname(os.path.realpath(sys.executable)) + +def perl_cmd(): + """Retrieve abs path to locally installed conda Perl or first in PATH, + e.g. conda/env/pcgr/bin/perl + """ + perl = shutil.which(os.path.join(get_pcgr_bin(), "perl")) + if perl: + return perl + else: + return shutil.which("perl") + +def get_perl_exports(): + """Environmental exports to use conda installed perl. + """ + perl_path = os.path.normpath(perl_cmd()) # /conda/env/pcgr/bin/perl + perl_path_parent = os.path.dirname(perl_path) # /conda/env/pcgr/bin + out = f"unset PERL5LIB && export PATH={perl_path_parent}:\"$PATH\"" + return out + +def is_integer(n): + try: + float(n) + except ValueError: + return False + else: + return float(n).is_integer() + +# https://stackoverflow.com/a/10840586/2169986 +def remove_file(filename): + try: + os.remove(filename) + except OSError as e: + if e.errno != errno.ENOENT: # errno.ENOENT = no such file or directory + raise # re-raise exception if a different error occurred + + +# from https://github.com/bcbio/bcbio-nextgen/blob/master/bcbio/utils.py +def safe_makedir(dname): + """Make a directory if it does not exist, handling concurrent race conditions. + """ + if not dname: + return dname + num_tries = 0 + max_tries = 5 + while not os.path.exists(dname): + # we could get an error here if multiple processes are creating + # the directory at the same time.
Grr, concurrency. + try: + os.makedirs(dname) + except OSError: + if num_tries > max_tries: + raise + num_tries += 1 + time.sleep(2) + return dname + +def sort_bed(unsorted_bed_fname: str, sorted_bed_fname: str, debug = False, logger = None): + + ## Sort regions in target BED + if os.path.exists(unsorted_bed_fname) and os.stat(unsorted_bed_fname).st_size != 0: + cmd_sort_custom_bed1 = 'egrep \'^[0-9]\' ' + str(unsorted_bed_fname) + \ + ' | sort -k1,1n -k2,2n -k3,3n > ' + str(sorted_bed_fname) + cmd_sort_custom_bed2 = 'egrep -v \'^[0-9]\' ' + str(unsorted_bed_fname) + \ + ' | egrep \'^[XYM]\' | sort -k1,1 -k2,2n -k3,3n >> ' + str(sorted_bed_fname) + + check_subprocess(logger, cmd_sort_custom_bed1, debug) + check_subprocess(logger, cmd_sort_custom_bed2, debug) + if not debug: + remove_file(str(unsorted_bed_fname)) + else: + err_msg = 'File ' + str(unsorted_bed_fname) + ' does not exist or is empty' + error_message(err_msg, logger) + + +def check_file_exists(fname: str, logger = None) -> bool: + err = 0 + if not os.path.isfile(fname): + err = 1 + else: + if os.stat(fname).st_size == 0: + err = 1 + if err == 1: + err_msg = f"File {fname} does not exist or has zero size" + error_message(err_msg, logger) + + return(True) + +def check_tabix_file(fname: str, logger = None) -> bool: + tabix_file = os.path.abspath(f'{fname}.tbi') + if not os.path.exists(tabix_file): + err_msg = f"Tabix file (i.e. '.gz.tbi') is not present for the bgzipped VCF input file ({fname}
Please make sure your input VCF is properly compressed and indexed (bgzip + tabix)" + error_message(err_msg, logger) + else: + ## check file size is more than zero + check_file_exists(tabix_file) + return(True) diff --git a/src/gvanno/lib/gvanno/variant.py b/src/gvanno/lib/gvanno/variant.py new file mode 100644 index 0000000..6b78d75 --- /dev/null +++ b/src/gvanno/lib/gvanno/variant.py @@ -0,0 +1,190 @@ +#!/usr/bin/env python + +import os +import re +import pandas as pd +import numpy as np +import warnings + +warnings.simplefilter(action='ignore', category=pd.errors.PerformanceWarning) + + +def set_genotype(variant_set: pd.DataFrame, logger) -> pd.DataFrame: + """ + Set verbose genotype (homozygous, heterozygous) for each variant + """ + variant_set['GENOTYPE'] = '.' + if {'GT'}.issubset(variant_set.columns): + logger.info("Assignment of genotype (homozygous, heterozygous) for each variant based on 'GT' tag") + heterozygous_states = [] + ref_allele_index = 0 + while ref_allele_index < 20: + alt_allele_index = ref_allele_index + 1 + while alt_allele_index <= 20: + phased_gt_1 = str(ref_allele_index) + "|" + str(alt_allele_index) + phased_gt_2 = str(alt_allele_index) + "|" + str(ref_allele_index) + unphased_gt_1 = str(ref_allele_index) + "/" + str(alt_allele_index) + unphased_gt_2 = str(alt_allele_index) + "/" + str(ref_allele_index) + gts = [phased_gt_1, phased_gt_2, unphased_gt_1, unphased_gt_2] + heterozygous_states.extend(gts) + + alt_allele_index = alt_allele_index + 1 + ref_allele_index = ref_allele_index + 1 + + homozygous_states = [] + hom_allele_index = 1 + while hom_allele_index <= 20: + phased_gt = str(hom_allele_index) + "|" + str(hom_allele_index) + unphased_gt = str(hom_allele_index) + "/" + str(hom_allele_index) + homozygous_states.extend([phased_gt, unphased_gt]) + hom_allele_index = hom_allele_index + 1 + + variant_set.loc[variant_set['GT'].isin(homozygous_states), "GENOTYPE"] = "homozygous" + 
variant_set.loc[variant_set['GT'].isin(heterozygous_states),"GENOTYPE"] = "heterozygous" + else: + variant_set['GENOTYPE'] = "undefined" + + variant_set = variant_set.astype({'GENOTYPE':'string'}) + return(variant_set) + +def append_annotations(vcf2tsv_gz_fname: str, gvanno_db_dir: str, logger): + + clinvar_tsv_fname = os.path.join(gvanno_db_dir, 'variant','tsv','clinvar', 'clinvar.tsv.gz') + protein_domain_tsv_fname = os.path.join(gvanno_db_dir, 'misc','tsv','protein_domain', 'protein_domain.tsv.gz') + gene_xref_tsv_fname = os.path.join(gvanno_db_dir, 'gene','tsv','gene_transcript_xref', 'gene_transcript_xref.tsv.gz') + vcf2tsv_df = None + clinvar_data_df = None + + num_recs_with_clinvar_hits = 0 + + if os.path.exists(vcf2tsv_gz_fname): + vcf2tsv_df = pd.read_csv( + vcf2tsv_gz_fname, skiprows=[0], sep="\t", na_values=".", + low_memory = False) + if {'CHROM','POS','REF','ALT','CLINVAR_MSID','PFAM_DOMAIN','ENTREZGENE'}.issubset(vcf2tsv_df.columns): + for elem in ['CHROM','POS','REF','ALT','CLINVAR_MSID','PFAM_DOMAIN','ENTREZGENE']: + vcf2tsv_df = vcf2tsv_df.astype({elem:'string'}) + vcf2tsv_df['CLINVAR_MSID'] = vcf2tsv_df['CLINVAR_MSID'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df['PFAM_DOMAIN'] = vcf2tsv_df['PFAM_DOMAIN'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df['ENTREZGENE'] = vcf2tsv_df['ENTREZGENE'].str.replace("\\.[0-9]{1,}$", "", regex = True) + vcf2tsv_df["VAR_ID"] = vcf2tsv_df["CHROM"].str.cat( + vcf2tsv_df["POS"], sep = "_").str.cat( + vcf2tsv_df["REF"], sep = "_").str.cat( + vcf2tsv_df["ALT"], sep = "_") + + if {'CLINVAR_TRAITS_ALL'}.issubset(vcf2tsv_df.columns): + vcf2tsv_df.drop('CLINVAR_TRAITS_ALL', inplace=True, axis=1) + + ## check number of variants with ClinVar ID's + num_recs_with_clinvar_hits = vcf2tsv_df["CLINVAR_MSID"].notna().sum() + ## check number of variants with PFAM ID's + num_recs_with_pfam_hits = vcf2tsv_df["PFAM_DOMAIN"].notna().sum() + ## check number of variants with Ensembl gene ID's + 
num_recs_with_entrez_hits = vcf2tsv_df["ENTREZGENE"].notna().sum() + + #print(str(num_recs_with_entrez_hits)) + ## merge variant set with ClinVar trait and variant origin annotations + if num_recs_with_clinvar_hits > 0: + if os.path.exists(clinvar_tsv_fname): + clinvar_data_df = pd.read_csv( + clinvar_tsv_fname, sep="\t", + usecols=["variation_id","origin_simple","VAR_ID","trait"], + low_memory = False) + clinvar_data_df['CLINVAR_TRAITS_ALL'] = clinvar_data_df['origin_simple'].str.capitalize().str.cat( + clinvar_data_df['trait'], sep = " - ") + clinvar_data_df['CLINVAR_MSID'] = clinvar_data_df['variation_id'] + clinvar_data_df = clinvar_data_df.astype({'CLINVAR_MSID':'string'}) + clinvar_data_df['CLINVAR_MSID'] = clinvar_data_df['CLINVAR_MSID'].str.replace("\\.[0-9]{1,}$", "", regex = True) + clinvar_data_df = clinvar_data_df[['VAR_ID','CLINVAR_MSID','CLINVAR_TRAITS_ALL']] + + vcf2tsv_df = vcf2tsv_df.merge( + clinvar_data_df, left_on=["VAR_ID", "CLINVAR_MSID"], right_on=["VAR_ID", "CLINVAR_MSID"], how="left") + else: + logger.error(f"Could not find {clinvar_tsv_fname} needed for ClinVar variant annotation - exiting") + else: + vcf2tsv_df['CLINVAR_TRAITS_ALL'] = '.' + + + ## merge variant set with PFAM domain annotations + if num_recs_with_pfam_hits > 0: + + vcf2tsv_df.drop('PFAM_DOMAIN_NAME', inplace=True, axis=1) + + if os.path.exists(protein_domain_tsv_fname): + prot_domains_data_df = pd.read_csv( + protein_domain_tsv_fname, sep="\t", usecols=["pfam_id","pfam_name"]).drop_duplicates() + prot_domains_data_df.rename(columns = {'pfam_id':'PFAM_DOMAIN', 'pfam_name':'PFAM_DOMAIN_NAME'}, inplace = True) + vcf2tsv_df = vcf2tsv_df.merge(prot_domains_data_df, left_on=["PFAM_DOMAIN"], right_on=["PFAM_DOMAIN"], how="left") + else: + logger.error(f"Could not find {protein_domain_tsv_fname} needed for PFAM domain annotation - exiting") + else: + vcf2tsv_df['PFAM_DOMAIN_NAME'] = '.' 
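The ClinVar and PFAM merges above follow the same pandas recipe: left-join the variant table against a small lookup table on a shared key, so every variant row is preserved and unmatched rows simply get missing values. A minimal, self-contained sketch of that pattern (column names mirror the ones used here, but the rows are made-up toy data, not project data):

```python
import pandas as pd

# Toy variant records; one row lacks a PFAM domain identifier
variants = pd.DataFrame({
    "VAR_ID": ["1_100_A_G", "2_200_C_T", "3_300_G_A"],
    "PFAM_DOMAIN": ["PF00069", None, "PF07714"],
})

# Toy lookup table mapping PFAM identifiers to readable names
domains = pd.DataFrame({
    "PFAM_DOMAIN": ["PF00069", "PF07714"],
    "PFAM_DOMAIN_NAME": ["Protein kinase domain", "Protein tyrosine kinase"],
})

# how="left" keeps all variant rows; rows without a match get NaN
annotated = variants.merge(domains, on="PFAM_DOMAIN", how="left")
```

Dropping any pre-existing `PFAM_DOMAIN_NAME` column before the merge, as the code above does, avoids pandas producing `_x`/`_y` suffixed duplicate columns.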
+ + if num_recs_with_entrez_hits > 0: + + if {'GENENAME'}.issubset(vcf2tsv_df.columns): + vcf2tsv_df.drop('GENENAME', inplace=True, axis=1) + + if os.path.exists(gene_xref_tsv_fname): + gene_xref_df = pd.read_csv( + gene_xref_tsv_fname, sep="\t", na_values=".", + usecols=["entrezgene","name"]) + gene_xref_df = gene_xref_df[gene_xref_df['entrezgene'].notna()].drop_duplicates() + gene_xref_df["entrezgene"] = gene_xref_df["entrezgene"].astype(float).astype(int).astype(str) + vcf2tsv_df["ENTREZGENE"] = vcf2tsv_df["ENTREZGENE"].astype(str) + vcf2tsv_df.loc[vcf2tsv_df["ENTREZGENE"].isna(), "ENTREZGENE"] = "-1" + gene_xref_df.rename(columns = {'entrezgene':'ENTREZGENE', 'name':'GENENAME'}, inplace = True) + vcf2tsv_df = vcf2tsv_df.merge(gene_xref_df, left_on=["ENTREZGENE"], right_on=["ENTREZGENE"], how="left") + vcf2tsv_df["ENTREZGENE"] = vcf2tsv_df['ENTREZGENE'].str.replace("\\.[0-9]{1,}$", "", regex = True) + else: + logger.error(f"Could not find {gene_xref_tsv_fname} needed for gene name annotation - exiting") + else: + vcf2tsv_df['GENENAME'] = '.'
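The `astype(float).astype(int).astype(str)` chain above exists because pandas parses integer columns containing missing values as floats, so identifiers arrive as strings like `'673.0'`; round-tripping through int recovers a plain key that can be joined against. A small sketch of the idea with toy values:

```python
import pandas as pd

# Entrez-style identifiers as they may arrive from a TSV parsed with NaNs
ids = pd.Series(["673.0", "1029.0", "7157"])

# float -> int drops the trailing '.0'; str restores a join-friendly key
clean = ids.astype(float).astype(int).astype(str)
```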
+ + return(vcf2tsv_df) + +def clean_annotations(variant_set: pd.DataFrame, sample_id, genome_assembly, logger) -> pd.DataFrame: + + if {'Consequence','EFFECT_PREDICTIONS','CLINVAR_CONFLICTED'}.issubset(variant_set.columns): + variant_set.rename(columns = {'Consequence':'CONSEQUENCE'}, inplace = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("\\.&|\\.$", "NA&", regex = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("&$", "", regex = True) + variant_set['EFFECT_PREDICTIONS'] = variant_set['EFFECT_PREDICTIONS'].str.replace("&", ", ", regex = True) + variant_set['clinvar_conflicted_bool'] = True + variant_set.loc[variant_set['CLINVAR_CONFLICTED'] == 1, "clinvar_conflicted_bool"] = True + variant_set.loc[variant_set['CLINVAR_CONFLICTED'] != 1, "clinvar_conflicted_bool"] = False + variant_set.drop('CLINVAR_CONFLICTED', inplace=True, axis=1) + variant_set.rename(columns = {'clinvar_conflicted_bool':'CLINVAR_CONFLICTED'}, inplace = True) + + + if not {'VCF_SAMPLE_ID'}.issubset(variant_set.columns): + variant_set['VCF_SAMPLE_ID'] = str(sample_id) + variant_set['SAMPLE_ID'] = str(sample_id) + variant_set['GENOME_VERSION'] = str(genome_assembly) + if {'CHROM','POS','REF','ALT',}.issubset(variant_set.columns): + variant_set['GENOMIC_CHANGE'] = variant_set['CHROM'].astype(str) + ":g." 
+ variant_set['POS'].astype(str) + \ + variant_set['REF'].astype(str) + ">" + variant_set['ALT'].astype(str) + + ## Make sure that specific tags are formatted as integers (not float) during to_csv export + if {'AMINO_ACID_END','AMINO_ACID_START'}.issubset(variant_set.columns): + variant_set.loc[variant_set['AMINO_ACID_START'].isna(),"AMINO_ACID_START"] = -1 + variant_set.loc[variant_set['AMINO_ACID_END'].isna(),"AMINO_ACID_END"] = -1 + variant_set['AMINO_ACID_END'] = variant_set['AMINO_ACID_END'].astype(float).astype(int) + variant_set['AMINO_ACID_START'] = variant_set['AMINO_ACID_START'].astype(float).astype(int) + + for elem in ['NUM_SUBMITTERS','ALLELE_ID','ENTREZGENE','REVIEW_STATUS_STARS']: + vcf_info_tag = 'CLINVAR_' + str(elem) + if vcf_info_tag in variant_set.columns: + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag] = \ + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag].astype(str).astype(float).astype(int) + + for vcf_info_tag in ['ONCOGENE_RANK','TSG_RANK','TCGA_PANCANCER_COUNT','CGC_TIER','DISTANCE', + 'EXON_AFFECTED','INTRON_POSITION','EXON_POSITION']: + if vcf_info_tag in variant_set.columns: + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag] = \ + variant_set.loc[variant_set[vcf_info_tag].notna(), vcf_info_tag].astype(str).astype(float).astype(int) + + variant_set = set_genotype(variant_set, logger) + + return variant_set \ No newline at end of file diff --git a/src/gvanno/lib/gvanno/vcf.py b/src/gvanno/lib/gvanno/vcf.py new file mode 100644 index 0000000..b39fa48 --- /dev/null +++ b/src/gvanno/lib/gvanno/vcf.py @@ -0,0 +1,92 @@ +#!/usr/bin/env python + +import logging + +from lib.gvanno.utils import error_message, warn_message, check_subprocess +from cyvcf2 import VCF +from typing import Union + +def get_vcf_info_tags(vcf_fname): + vcf = VCF(vcf_fname) + info_tags = {} + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in 
header_element.keys(): + if header_element['HeaderType'] == 'INFO': + info_tags[str(header_element['ID'])] = 1 + + return info_tags + + +def print_vcf_header(vcf_fname, vcfheader_file, logger, chromline_only=False): + if chromline_only: + check_subprocess( + logger, f'bgzip -dc {vcf_fname} | egrep \'^#\' | egrep \'^#CHROM\' >> {vcfheader_file}', debug=False) + else: + check_subprocess( + logger, f'bgzip -dc {vcf_fname} | egrep \'^#\' | egrep -v \'^#CHROM\' > {vcfheader_file}', debug=False) + +def detect_reserved_info_tag(tag, tag_name, logger): + reserved_tags = ['AA', 'AC', 'AF', 'AN', 'BQ', 'CIGAR', 'DB', 'DP', 'END', + 'H2', 'H3', 'MQ', 'MQ0', 'NS', 'SB', 'SOMATIC', 'VALIDATED', '1000G'] + if tag in reserved_tags: + err_msg = f'Custom INFO tag ({tag_name}) needs another name - \'{tag}\' is a reserved field in the VCF specification (INFO)' + return error_message(err_msg, logger) + + reserved_format_tags = ['GT', 'DP', 'FT', 'GL', + 'GLE', 'GQ', 'PL', 'HQ', 'PS', 'PQ', 'EC', 'MQ'] + if tag in reserved_format_tags: + err_msg = f'Custom INFO tag ({tag_name}) needs another name - \'{tag}\' is a reserved field in the VCF specification (FORMAT)' + return error_message(err_msg, logger) + +def check_retained_vcf_info_tags(vcf: VCF, retained_info_tags: str, logger: logging.Logger) -> int: + + """ + Function that compares the INFO tags in the query VCF with the INFO tags requested by the user for retention in the output + If any retained tag is not present in the query VCF, an error is raised + """ + + tags = str(retained_info_tags).split(',') + info_elements_query_vcf = [] + + #vcf = VCF(input_vcf) + logger.info('Checking if existing INFO tags of query VCF file match retained INFO tags set by the user') + ret = 1 + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): + if header_element['HeaderType'] == 'INFO': + info_elements_query_vcf.append(header_element['ID']) + + for t in tags: + if not t
in info_elements_query_vcf: + err_msg = "Retained INFO tag '" + str(t) + "' not found among INFO tags in query VCF - make sure retained VCF INFO tags are set correctly" + error_message(err_msg, logger) + else: + logger.info("Retained INFO tag '" + str(t) + "' detected among INFO tags in query VCF") + + return ret + + + +def check_existing_vcf_info_tags(vcf: VCF, populated_infotags: dict, logger) -> int: + + """ + Function that compares the INFO tags in the query VCF with the INFO tags generated by gvanno + If any tags coincide, an error is raised + """ + + logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') + ret = 1 + for e in vcf.header_iter(): + header_element = e.info() + if 'ID' in header_element.keys() and 'HeaderType' in header_element.keys(): + if header_element['HeaderType'] == 'INFO': + if header_element['ID'] in populated_infotags.keys(): + err_msg = 'INFO tag ' + str(header_element['ID']) + ' in the query VCF coincides with a VCF ' + \ + 'annotation tag produced by gvanno - please remove or rename this tag in your query VCF' + return error_message(err_msg, logger) + + logger.info('No query VCF INFO tags coincide with INFO tags populated by gvanno') + return ret + diff --git a/src/gvanno/lib/gvanno/vep.py b/src/gvanno/lib/gvanno/vep.py new file mode 100644 index 0000000..f0b50c0 --- /dev/null +++ b/src/gvanno/lib/gvanno/vep.py @@ -0,0 +1,322 @@ +#!/usr/bin/env python + +import os,re +import csv +import gzip + +from lib.gvanno.annoutils import assign_cds_exon_intron_annotations +from lib.gvanno import gvanno_vars + + +def get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map): + """ + Generates a dictionary object containing the annotations of a CSQ record. + + Parameters: + - csq_fields (list): A list of CSQ fields. + - varkey (str): The VARKEY value. + - logger (Logger): The logger object.
+    - vep_csq_fields_map (dict): A dictionary mapping VEP CSQ fields to their indices.
+    - transcript_xref_map (dict): A dictionary mapping Ensembl transcript IDs to their annotations.
+
+    Returns:
+    - csq_record (dict): A dictionary object containing the annotations of a CSQ record.
+    """
+
+    j = 0
+    csq_record = {}
+
+    csq_record['VARKEY'] = varkey
+    ensembl_transcript_id = '.'
+
+    # loop over block annotation elements (separated with '|'), and assign their values to the csq_record dictionary object
+    while (j < len(csq_fields)):
+        if j in vep_csq_fields_map['index2field']:
+
+            ## consider non-empty CSQ fields
+            if csq_fields[j] != '':
+                csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j])
+                if vep_csq_fields_map['index2field'][j] == 'Feature':
+
+                    ensembl_transcript_id = str(csq_fields[j])
+                    if ensembl_transcript_id in transcript_xref_map:
+                        for annotation in transcript_xref_map[ensembl_transcript_id].keys():
+                            if annotation != 'SYMBOL':
+                                ## assign additional (non-VEP provided) gene/transcript annotations from the custom
+                                ## transcript_xref_map as key,value pairs in the csq_record object
+                                csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation]
+                    else:
+                        if re.match(r'ENST', ensembl_transcript_id):
+                            logger.warning(
+                                'Could not find transcript xrefs for ' + str(ensembl_transcript_id))
+
+                # Specifically assign PFAM protein domain as a csq_record key
+                if vep_csq_fields_map['index2field'][j] == 'DOMAINS':
+                    domain_identifiers = str(csq_fields[j]).split('&')
+                    for v in domain_identifiers:
+                        if v.startswith('Pfam'):
+                            csq_record['PFAM_DOMAIN'] = str(re.sub(r'\.[0-9]{1,}$', '', re.sub(r'Pfam:', '', v)))
+
+                # Assign COSMIC/DBSNP mutation IDs as individual key,value pairs in the csq_record object
+                if vep_csq_fields_map['index2field'][j] == 'Existing_variation':
+                    var_identifiers = str(csq_fields[j]).split('&')
+                    parsed_identifiers = {'COSMIC_MUTATION_ID': [], 'DBSNPRSID': []}
+                    for v in var_identifiers:
+                        if v.startswith('COSV') or v.startswith('COSM'):
+                            parsed_identifiers['COSMIC_MUTATION_ID'].append(v)
+                        if v.startswith('rs'):
+                            parsed_identifiers['DBSNPRSID'].append(v)
+                    for db in parsed_identifiers.keys():
+                        if len(parsed_identifiers[db]) > 0:
+                            csq_record[db] = '&'.join(parsed_identifiers[db])
+
+                ## Sort (potentially multiple) variant consequence elements from VEP (they appear unsorted in some cases)
+                ## Example: intron_variant&splice_region_variant
+                if vep_csq_fields_map['index2field'][j] == "Consequence":
+                    consequence_elements = sorted(str(csq_fields[j]).split('&'))
+                    csq_record['Consequence'] = '&'.join(consequence_elements)
+
+            else:
+                csq_record[vep_csq_fields_map['index2field'][j]] = None
+        j = j + 1
+
+    ## if VEP/Ensembl does not provide a symbol, use symbol provided by PCGR/CPSR gene_transcript_xref map
+    if csq_record['SYMBOL'] is None and ensembl_transcript_id != ".":
+        if ensembl_transcript_id in transcript_xref_map:
+            csq_record['SYMBOL'] = transcript_xref_map[ensembl_transcript_id]['SYMBOL']
+
+    # Assign coding status, protein change, coding sequence change, last exon/intron status etc. as VCF INFO tags
+    csq_record = assign_cds_exon_intron_annotations(csq_record)
+
+    return csq_record
+
+
+def pick_single_gene_csq(vep_csq_results,
+                         pick_criteria_ordered="mane_select,mane_plus_clinical,canonical,appris,tsl,biotype,ccds,rank,length",
+                         logger=None):
+
+    csq_candidates = []
+
+    for csq_elem in vep_csq_results['picked_gene_csq']:
+        if csq_elem is None:
+            continue
+        csq_candidate = {}
+
+        ## default values (undefined properties)
+        csq_candidate['mane_select'] = 1
+        csq_candidate['mane_plus_clinical'] = 1
+        csq_candidate['canonical'] = 1
+        csq_candidate['appris'] = 8
+        csq_candidate['biotype'] = 1
+        csq_candidate['tsl'] = 6
+        csq_candidate['ccds'] = 1
+        csq_candidate['rank'] = 42
+
+        ## set to picked as default
+        csq_candidate['PICKED'] = True
+        csq_candidate['varkey'] = csq_elem['VARKEY']
+
+        ## MANE Select status - lower value prioritized
+        if csq_elem['MANE_SELECT'] is not None:
+            csq_candidate['mane_select'] = 0
+
+        ## MANE Plus Clinical status - lower value prioritized
+        if csq_elem['MANE_PLUS_CLINICAL'] is not None:
+            csq_candidate['mane_plus_clinical'] = 0
+
+        ## CANONICAL status - lower value prioritized
+        if csq_elem['CANONICAL'] is not None:
+            if csq_elem['CANONICAL'] is True:
+                csq_candidate['canonical'] = 0
+
+        ## APPRIS level - lower value prioritized
+        if csq_elem['APPRIS'] is not None:
+            if not 'ALTERNATIVE' in csq_elem['APPRIS']:
+                csq_candidate['appris'] = int(re.sub(r'[A-Z]{1,}:?', '', csq_elem['APPRIS']))
+            else:
+                csq_candidate['appris'] = int(re.sub(r'ALTERNATIVE:', '', csq_elem['APPRIS'])) + 5
+
+        ## Biotype - lower value prioritized
+        if csq_elem['BIOTYPE'] is not None:
+            if csq_elem['BIOTYPE'] == 'protein_coding':
+                csq_candidate['biotype'] = 0
+
+        ## CCDS - lower value prioritized
+        if csq_elem['CCDS'] is not None:
+            csq_candidate['ccds'] = 0
+
+        ## Consequence rank - lower value prioritized
+        if csq_elem['Consequence'] is not None:
+            main_cons = csq_elem['Consequence'].split('&')[0]
+            if main_cons in gvanno_vars.VEP_consequence_rank:
+                csq_candidate['rank'] = int(gvanno_vars.VEP_consequence_rank[main_cons])
+            else:
+                warn_msg = f"Missing Consequence in gvanno_vars.VEP_consequence_rank: {csq_elem['Consequence']} - '{main_cons}'"
+                if logger is not None:
+                    logger.warning(warn_msg)
+
+        ## TSL - lower value prioritized
+        if csq_elem['TSL'] is not None:
+            csq_candidate['tsl'] = int(csq_elem['TSL'])
+
+        csq_candidates.append(csq_candidate)
+
+    ## Go through pick criteria in pre-defined order
+    ## - set 'PICKED' = False for all csq elements with a score above the minimum value for a given criterion
+    ## - when there is only one element with 'PICKED' equal to True, break out of the loop and report the chosen transcript CSQ element
+    chosen_csq_index = 0
+    for rank_criterion in pick_criteria_ordered.split(','):
+        if rank_criterion != 'length':
+            lowest_score = 100
+            i = 0
+            for candidate in csq_candidates:
+                if candidate[rank_criterion] <= lowest_score:
+                    lowest_score = candidate[rank_criterion]
+                i = i + 1
+
+            for candidate in csq_candidates:
+                if candidate[rank_criterion] > lowest_score:
+                    candidate['PICKED'] = False
+
+            j = 0
+            num_picked = 0
+            for candidate in csq_candidates:
+                if candidate['PICKED'] is True:
+                    num_picked += 1
+                    chosen_csq_index = j
+                j = j + 1
+
+            if num_picked == 1:
+                break
+
+    return chosen_csq_index
+
+def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, vep_pick_order, logger, pick_only=True,
+                  csq_identifier='CSQ', debug=0):
+    """
+    Function that parses the comma-separated CSQ elements found in the rec.INFO object (VCF)
+    - creates an individual CSQ record for all transcript-specific elements provided as comma-separated elements in the CSQ tag
+    - each individual record is gathered as a dictionary of properties (defined by vep_csq_fields_map), i.e.
+      'CSQ=A|missense_variant|KRAS++' in the VCF INFO element gives csq_record['Consequence'] = 'missense_variant',
+      csq_record['SYMBOL'] = 'KRAS' etc.
+    - if argument 'pick_only' is True, only elements with 'PICK' == 1 are chosen
+    """
+
+    all_csq_pick = []
+    alternative_csq_pick = []
+    all_transcript_consequences = []
+
+    varkey = str(rec.CHROM) + '_' + str(rec.POS) + '_' + str(rec.REF) + '_' + str(','.join(rec.ALT))
+
+    ## Retrieve the INFO element provided by VEP (default 'CSQ') in the VCF object, and
+    ## loop through all transcript-specific consequence blocks provided, e.g.
+    ## CSQ=A|intron_variant|||.., A|splice_region_variant|||, and so on.
+    for csq in rec.INFO.get(csq_identifier).split(','):
+        csq_fields = csq.split('|')
+
+        entrezgene = '.'
+
+        ## Entrez gene identifier is not provided by VEP - pull it from 'transcript_xref_map' for a given
+        ## transcript-specific CSQ block
+        ## - used for 'consequence_entry' objects that are added to the 'vep_all_csq' array
+        k = 0
+        while (k < len(csq_fields)):
+            if k in vep_csq_fields_map['index2field']:
+                if vep_csq_fields_map['index2field'][k] == 'Feature':
+                    ensembl_transcript_id = csq_fields[k]
+                    if ensembl_transcript_id != '' and ensembl_transcript_id.startswith('ENST'):
+                        if ensembl_transcript_id in transcript_xref_map.keys():
+                            if 'ENTREZGENE' in transcript_xref_map[ensembl_transcript_id].keys():
+                                entrezgene = transcript_xref_map[ensembl_transcript_id]['ENTREZGENE']
+            k = k + 1
+
+        ## CPSR - consider all consequences (considering that a variant may overlap other, non-CPSR targets)
+        if pick_only is False:
+            csq_record = get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map)
+            if 'Feature_type' in csq_record:
+                if csq_record['Feature_type'] == 'RegulatoryFeature':
+                    all_csq_pick.append(csq_record)
+                    alternative_csq_pick.append(csq_record)
+                if csq_record['Feature_type'] == 'Transcript':
+                    all_csq_pick.append(csq_record)
+                else:
+                    ## intergenic
+                    if csq_record['Feature_type'] is None:
+                        alternative_csq_pick.append(csq_record)
+
+        ## PCGR - consider picked consequence only
+        else:
+            # loop over VEP consequence blocks PICK'ed according to VEP's ranking scheme
+            # only consider the primary/picked consequence when expanding with annotation tags
+
+            if csq_fields[vep_csq_fields_map['field2index']['PICK']] == "1":
+                csq_record = get_csq_record_annotations(csq_fields, varkey, logger, vep_csq_fields_map, transcript_xref_map)
+                # Append transcript consequence to all_csq_pick
+                all_csq_pick.append(csq_record)
+                symbol = "."
+                hgvsc = "."
+                hgvsp = "."
+                exon = "."
+                if csq_fields[vep_csq_fields_map['field2index']['EXON']] != "":
+                    if "/" in csq_fields[vep_csq_fields_map['field2index']['EXON']]:
+                        exon = str(csq_fields[vep_csq_fields_map['field2index']['EXON']].split('/')[0])
+                if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "":
+                    symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']])
+                if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "":
+                    hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']].split(':')[1])
+                if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "":
+                    hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']].split(':')[1])
+                consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ":" +
+                                     str(symbol) + ":" +
+                                     str(entrezgene) + ":" +
+                                     str(hgvsc) + ":" +
+                                     str(hgvsp) + ":" +
+                                     str(exon) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ":" +
+                                     str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']]))
+                all_transcript_consequences.append(consequence_entry)
+
+    ## CPSR - consider all picked VEP blocks
+    if pick_only is False:
+        if len(all_csq_pick) == 0:
+            if len(alternative_csq_pick) > 0:
+                all_csq_pick = alternative_csq_pick
+            else:
+                logger.info("NO VEP BLOCK FOUND")
+                exit(1)
+    else:
+        ## don't consider regulatory consequence as single chosen consequence
+        if len(all_csq_pick) == 1 and len(alternative_csq_pick) > 0:
+            if all_csq_pick[0]['Feature_type'] == 'RegulatoryFeature':
+                all_csq_pick = alternative_csq_pick
+
+    vep_csq_results = {}
+    vep_csq_results['picked_gene_csq'] = all_csq_pick
+    vep_csq_results['all_csq'] = all_transcript_consequences
+    vep_csq_results['picked_csq'] = None
+    vep_chosen_csq_idx = 0
+
+    ## If multiple transcript-specific variant consequences highlighted by --pick_allele_gene,
+    ## prioritize/choose block of consequence according to 'vep_pick_order'
+    if len(vep_csq_results['picked_gene_csq']) > 1:
+        vep_chosen_csq_idx = pick_single_gene_csq(vep_csq_results, pick_criteria_ordered=vep_pick_order, logger=logger)
+        vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+    else:
+        ## check that size is 1, otherwise prompt error below
+        if len(vep_csq_results['picked_gene_csq']) == 1:
+            vep_csq_results['picked_csq'] = vep_csq_results['picked_gene_csq'][vep_chosen_csq_idx]
+        else:
+            logger.error('ERROR: No VEP block chosen by --pick_allele_gene')
+
+    return vep_csq_results
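The transcript prioritization performed by `pick_single_gene_csq` boils down to an iterative elimination: for each criterion in the configured pick order, candidates scoring above the criterion's minimum are dropped, and the loop stops as soon as a single candidate survives. A simplified, self-contained sketch of that scheme (the candidate dicts and helper below are hypothetical illustrations, not gvanno's actual data structures):

```python
def pick_candidate_index(candidates, criteria):
    """Return the index of the single prioritized candidate.

    Candidates are plain dicts of numeric scores; for every criterion,
    a LOWER score is prioritized (as in gvanno's pick logic).
    """
    picked = list(range(len(candidates)))
    for criterion in criteria:
        # keep only still-picked candidates holding the minimum score
        lowest = min(candidates[i][criterion] for i in picked)
        picked = [i for i in picked if candidates[i][criterion] == lowest]
        if len(picked) == 1:
            break
    return picked[0]

# Hypothetical transcript candidates (0 = property present/preferred)
transcripts = [
    {'mane_select': 1, 'canonical': 0, 'rank': 10},  # canonical, no MANE
    {'mane_select': 0, 'canonical': 1, 'rank': 12},  # MANE Select
    {'mane_select': 1, 'canonical': 1, 'rank': 2},   # highest-impact consequence
]

# 'mane_select' is evaluated first, so the MANE Select transcript wins
print(pick_candidate_index(transcripts, ['mane_select', 'canonical', 'rank']))  # → 1
```

Reordering the criteria changes the outcome: with `['rank', 'canonical', 'mane_select']`, the transcript carrying the highest-impact consequence (index 2) would be chosen instead, which is why `vep_pick_order` is exposed as a configurable parameter.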