sigven
diff --git a/README.md
+22-23 b/README.md
diff --git a/data-raw/RELEASE_NOTES
+14 b/data-raw/RELEASE_NOTES
diff --git a/gvanno.py
+8-7 b/gvanno.py
diff --git a/src/.DS_Store
0 Bytes b/src/.DS_Store
diff --git a/src/Dockerfile
+2-2 b/src/Dockerfile
diff --git a/src/buildDocker.sh
+1-1 b/src/buildDocker.sh
diff --git a/src/gvanno.tgz
960 Bytes b/src/gvanno.tgz
diff --git a/src/gvanno/gvanno_summarise.py
+25-9 b/src/gvanno/gvanno_summarise.py
@@ -15,30 +15,29 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
 
 ### News
-* April 22nd 2021 - **dev update**
-  * Data updates (ClinVar, UniProt, GWAS Catalog, dbNSFP, Pfam, Open Targets Platform)
-  * Software update (VEP 103)
+* May 24th 2021 - **1.4.2 release**
+  * Software update (VEP 104)
+  * Data updates: ClinVar, GWAS Catalog, CancerMine, Pfam, dbNSFP, UniProt
   * Two new options added:
-	  * `--vep_regulatory` - annotates variants for overlap with regulatory regions
+	  * `--vep_regulatory` - annotates variants for overlap with regulatory regions (details below)
 	  * `--docker-uid` - set Docker user id
-* December 7th 2020 - **1.4.1 release**
-  * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
-  * Software update (VEP 102)
-  * Skipped DisGenet annotations (Open Targets serve similar purpose)
+  * New variant annotations for enhanced non-coding interpretation:
+	  * _REGULATORY_ANNOTATION_ : A comma-separated list of regulatory annotations from VEP's `--regulatory` option, i.e. __TF_binding_site__, __enhancer/promoter/open_chromatin__, __CTCF_binding_site__ etc. Included when the `--vep_regulatory` option is turned on in gvanno.
+	  * _NCER_PERCENTILE_ : A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/).
 
 ### Annotation resources
 
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v103 (GENCODE v37/v19 as the gene reference dataset)
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v104 (GENCODE v38/v19 as the gene reference dataset)
 * [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
 * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
-* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
+* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2021)
-* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 34)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (May 2021)
+* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 35)
 * [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_02, February 2021)
 * [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_02, April 2021)
 * [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v34.0, March 2021)
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (April 12th 2021)
+* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 19th 2021)
 
 
 ### Getting started
@@ -72,15 +71,15 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Clone the latest version in development
+1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.2) (gvanno run script, v1.4.2)
 2. Download and unpack the latest assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210422.tgz) (approx 18Gb)
-   * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210422.tgz) (approx 20Gb)
+   * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210523.tgz) (approx 19Gb)
+   * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210523.tgz) (approx 20Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
     A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (dev)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
-   * `docker pull sigven/gvanno:dev` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (1.4.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
+   * `docker pull sigven/gvanno:1.4.2` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing
 
@@ -126,7 +125,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 				    Number of forks for Variant Effect Predictor (VEP) processing, default: 4
 	--vep_buffer_size VEP_BUFFER_SIZE
 				    Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
-				    - set lower to reduce memory usage, default: 5000
+				    - set lower to reduce memory usage, higher to increase speed, default: 500
 	--vep_pick_order VEP_PICK_ORDER
 				    Comma-separated string of ordered transcript properties for primary variant pick in
 				    Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
@@ -145,10 +144,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-	python ~/gvanno-dev/gvanno.py
-	--query_vcf ~/gvanno-dev/examples/example.grch37.vcf.gz
-	--gvanno_dir ~/gvanno-dev
-	--output_dir ~/gvanno-dev
+	python ~/gvanno-1.4.2/gvanno.py
+	--query_vcf ~/gvanno-1.4.2/examples/example.grch37.vcf.gz
+	--gvanno_dir ~/gvanno-1.4.2
+	--output_dir ~/gvanno-1.4.2
 	--sample_id example
 	--genome_assembly grch37
 	--container docker
 
@@ -0,0 +1,14 @@
+##GVANNO_SOFTWARE_VERSION = 1.4.3
+##GVANNO_DB_VERSION = 20210523
+pfam = v34.0 (March 2021)
+ncER = v1.0 (March 2019)
+uniprot = release 2021_02
+corum = release 3.0 (20180903)
+onekg = phase 3 (20130502)
+dbsnp = build 154/153
+dbnsfp = v4.2 (March 2021)
+gnomad = r2.1 (October 2018)
+gwas = May 2021 (20210519)
+clinvar = May 2021 (20210501)
+opentargets = 2021_02
+gencode = 38/19
@@ -11,10 +11,10 @@
 import platform
 from argparse import RawTextHelpFormatter
 
-GVANNO_VERSION = 'dev'
-DB_VERSION = 'GVANNO_DB_VERSION = 20210422'
-VEP_VERSION = '103'
-GENCODE_VERSION = '37'
+GVANNO_VERSION = '1.4.2'
+DB_VERSION = 'GVANNO_DB_VERSION = 20210523'
+VEP_VERSION = '104'
+GENCODE_VERSION = '38'
 VEP_ASSEMBLY = "GRCh38"
 DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
 
@@ -41,8 +41,8 @@ def __main__():
    optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \
       "in Variant Effect Predictor (VEP), default: %(default)s")
    optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for Variant Effect Predictor (VEP) processing, default: %(default)s")
-   optional_vep.add_argument('--vep_buffer_size', default = 5000, help="Variant buffer size (variants read into memory simultaneously) " + \
-      "for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, default: %(default)s")
+   optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
+      "for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
    optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
       "of ordered transcript properties for primary variant pick in\nVariant Effect Predictor (VEP) processing, default: %(default)s")
    optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: %(default)s")
@@ -384,7 +384,8 @@ def run_gvanno(arg_dict, host_directories):
       logger = getlogger("gvanno-summarise")
       logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise")
       gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \
-         os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + docker_command_run_end
+         os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \
+            " "  + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end
       check_subprocess(gvanno_summarise_command)
       logger.info("Finished")
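
The hunk above appends a fourth positional value to the `gvanno_summarise.py` command line, so the caller and the script's argument parser have to agree on both order and count. A minimal sketch of that hand-off, assuming hypothetical helper names (`build_summarise_args`, `parse_summarise_args`) and placeholder paths; the real command is assembled as a single string wrapped in the container invocation:

```python
import argparse

def build_summarise_args(vcf_path, db_dir, lof_prediction, regulatory):
    # Booleans are serialised as "0"/"1", mirroring str(int(...)) in the diff
    return [vcf_path, db_dir, str(int(lof_prediction)), str(int(regulatory))]

def parse_summarise_args(argv):
    # Same four positionals, in the same order, as gvanno_summarise.py declares
    parser = argparse.ArgumentParser()
    parser.add_argument('vcf_file')
    parser.add_argument('gvanno_db_dir')
    parser.add_argument('lof_prediction', type=int)
    parser.add_argument('regulatory_annotation', type=int)
    return parser.parse_args(argv)

# Placeholder paths, for illustration only
argv = build_summarise_args('sample.vcf.gz', '/data/grch37', True, False)
args = parse_summarise_args(argv)
print(args.lof_prediction, args.regulatory_annotation)
```

Because these are required positionals (no `nargs='?'`), argparse ignores a `default=0` on them, so the caller must always pass both flags explicitly.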
 
 
@@ -23,7 +23,7 @@ RUN apt-get update && apt-get -y install \
 ENV OPT /opt/vep
 ENV OPT_SRC $OPT/src
 ENV HTSLIB_DIR $OPT_SRC/htslib
-ENV BRANCH release/103
+ENV BRANCH release/104
 
 # Working directory
 WORKDIR $OPT_SRC
@@ -65,7 +65,7 @@ RUN if [ "$BRANCH" = "master" ]; \
   rm -rf kent-335_base_bak
 
 # Setup bioperl-ext
-WORKDIR bioperl-ext/Bio/Ext/Align/
+WORKDIR $OPT_SRC/bioperl-ext/Bio/Ext/Align/
 RUN perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
 
 # Install htslib binaries (for 'bgzip' and 'tabix')
 
@@ -4,5 +4,5 @@ cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
 tar czvfh gvanno.tgz gvanno/
 echo "Build the Docker Image"
 TAG=`date "+%Y%m%d"`
-docker build --no-cache -t sigven/gvanno:$TAG --rm=true .
+docker build -t sigven/gvanno:$TAG --rm=true .
 
@@ -18,11 +18,12 @@ def __main__():
    parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)')
    parser.add_argument('gvanno_db_dir',help='gvanno data directory')
    parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)')
+   parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)')
    args = parser.parse_args()
 
-   extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction)
+   extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation)
 
-def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
+def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0):
    """
    Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from
    1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc.
@@ -40,13 +41,19 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
    vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap']
    vcf = VCF(query_vcf)
    for tag in vcf_infotags_meta:
-      if lof_prediction == 0:
+      if lof_prediction == 0 and regulatory_annotation == 0:
+         if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'):
+            vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+      elif lof_prediction == 1 and regulatory_annotation == 0:
+         if not tag.startswith('REGULATORY_'):
+            vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+      elif lof_prediction == 0 and regulatory_annotation == 1:
          if not tag.startswith('LoF'):
             vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
       else:
          vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
 
-   
+
    w = Writer(out_vcf, vcf)
    current_chrom = None
    num_chromosome_records_processed = 0
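
Every arm of the four-way branch in this hunk adds the same header entry and differs only in which tag prefixes it skips. A minimal sketch of an equivalent filter, assuming both flags are strictly 0 or 1; `select_info_tags` is a hypothetical helper for illustration and not part of gvanno:

```python
def select_info_tags(tags, lof_prediction=0, regulatory_annotation=0):
    """Return the tags that would be added to the VCF header.

    Hypothetical helper; gvanno itself calls vcf.add_info_to_header()
    inline for each selected tag.
    """
    skipped_prefixes = []
    if lof_prediction == 0:
        skipped_prefixes.append('LoF')
    if regulatory_annotation == 0:
        skipped_prefixes.append('REGULATORY_')
    # str.startswith accepts a tuple of prefixes; an empty tuple matches nothing
    return [t for t in tags if not t.startswith(tuple(skipped_prefixes))]
```

With this shape, a further optional annotation group would need one more `if` rather than a doubling of the branch count.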
@@ -107,11 +114,20 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
       num_chromosome_records_processed += 1
       gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")
 
-      csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
-      if 'vep_all_csq' in csq_record_results:
-         rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results['vep_all_csq'])
-      if 'vep_block' in csq_record_results:
-         vep_csq_records = csq_record_results['vep_block']
+      if regulatory_annotation == 1:
+         csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ')
+
+         if 'vep_block' in csq_record_results_all:
+            vep_csq_records_all = csq_record_results_all['vep_block']
+            rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all)
+
+      csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
+
+      if 'vep_all_csq' in csq_record_results_pick:
+         rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq'])
+      if 'vep_block' in csq_record_results_pick:
+         vep_csq_records = csq_record_results_pick['vep_block']
+
          block_idx = 0
          record = vep_csq_records[block_idx]
          for k in record: