Commit ec027f5

1.4.2 release
1 parent 083abbf commit ec027f5

10 files changed (+202 -109 lines changed)

README.md

+22 -23
@@ -15,30 +15,29 @@ The germline variant annotator (*gvanno*) is a software package intended for ana
 *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations per variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__.
 
 ### News
-* April 22nd 2021 - **dev update**
-  * Data updates (ClinVar, UniProt, GWAS Catalog, dbNSFP, Pfam, Open Targets Platform)
-  * Software update (VEP 103)
+* May 24th 2021 - **1.4.2 release**
+  * Software update (VEP 104)
+  * Data updates: ClinVar, GWAS Catalog, CancerMine, Pfam, dbNSFP, UniProt
   * Two new options added:
-    * `--vep_regulatory` - annotates variants for overlap with regulatory regions
+    * `--vep_regulatory` - annotates variants for overlap with regulatory regions (details below)
     * `--docker-uid` - set Docker user id
-* December 7th 2020 - **1.4.1 release**
-  * Data updates (ClinVar, UniProt, GWAS Catalog, Open Targets Platform)
-  * Software update (VEP 102)
-  * Skipped DisGenet annotations (Open Targets serve similar purpose)
+  * New variant annotations for enhanced non-coding interpretation:
+    * _REGULATORY_ANNOTATION_: A comma-separated list of regulatory annotations from VEP's `--regulatory` option, e.g. __TF_binding_site__, __enhancer/promoter/open_chromatin__, __CTCF_binding_site__. Included when the `--vep_regulatory` option is turned on in gvanno.
+    * _NCER_PERCENTILE_: A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/).
 
 ### Annotation resources
 
-* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v103 (GENCODE v37/v19 as the gene reference dataset)
+* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v104 (GENCODE v38/v19 as the gene reference dataset)
 * [dbNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021)
 * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP
-* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 153) - from VEP
+* [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP
 * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP
-* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (April 2021)
-* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 34)
+* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (May 2021)
+* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 35)
 * [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_02, February 2021)
 * [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_02, April 2021)
 * [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v34.0, March 2021)
-* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (April 12th 2021)
+* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (May 19th 2021)
 
 
 ### Getting started
@@ -72,15 +71,15 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t
 
 #### STEP 2: Download *gvanno* and data bundle
 
-1. Clone the latest version in development
+1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.2) (gvanno run script, v1.4.2)
 2. Download and unpack the latest assembly-specific data bundle in the gvanno directory
-   * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210422.tgz) (approx 18Gb)
-   * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210422.tgz) (approx 20Gb)
+   * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20210523.tgz) (approx 19Gb)
+   * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20210523.tgz) (approx 20Gb)
    * *Unpacking*: `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -`
 
    A _data/_ folder within the _gvanno-X.X_ software folder should now have been produced
-3. Pull the [gvanno Docker image (dev)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
-   * `docker pull sigven/gvanno:dev` (gvanno annotation engine)
+3. Pull the [gvanno Docker image (1.4.2)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.4Gb):
+   * `docker pull sigven/gvanno:1.4.2` (gvanno annotation engine)
 
 #### STEP 3: Input preprocessing

@@ -126,7 +125,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 Number of forks for Variant Effect Predictor (VEP) processing, default: 4
 --vep_buffer_size VEP_BUFFER_SIZE
 Variant buffer size (variants read into memory simultaneously) for Variant Effect Predictor (VEP) processing
-- set lower to reduce memory usage, default: 5000
+- set lower to reduce memory usage, higher to increase speed, default: 500
 --vep_pick_order VEP_PICK_ORDER
 Comma-separated string of ordered transcript properties for primary variant pick in
 Variant Effect Predictor (VEP) processing, default: canonical,appris,biotype,ccds,rank,tsl,length,mane
@@ -145,10 +144,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt
 
 The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command:
 
-python ~/gvanno-dev/gvanno.py
---query_vcf ~/gvanno-dev/examples/example.grch37.vcf.gz
---gvanno_dir ~/gvanno-dev
---output_dir ~/gvanno-dev
+python ~/gvanno-1.4.2/gvanno.py
+--query_vcf ~/gvanno-1.4.2/examples/example.grch37.vcf.gz
+--gvanno_dir ~/gvanno-1.4.2
+--output_dir ~/gvanno-1.4.2
 --sample_id example
 --genome_assembly grch37
 --container docker
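
Reviewer note: the README prints the example invocation one flag per line, which cannot be pasted into a shell as-is. A minimal sketch of assembling the same command as an argument list is given here. The flags are the ones shown in the README hunk above; the helper name `build_gvanno_command` and its parameters are hypothetical and not part of gvanno.

```python
import os
import shlex


def build_gvanno_command(gvanno_dir, query_vcf, sample_id, genome_assembly,
                         vep_regulatory=False):
    """Assemble the README's example gvanno.py call as an argument list.

    Hypothetical helper for illustration only; it is not part of gvanno.
    """
    gvanno_dir = os.path.expanduser(gvanno_dir)
    cmd = [
        "python", os.path.join(gvanno_dir, "gvanno.py"),
        "--query_vcf", os.path.expanduser(query_vcf),
        "--gvanno_dir", gvanno_dir,
        "--output_dir", gvanno_dir,
        "--sample_id", sample_id,
        "--genome_assembly", genome_assembly,
        "--container", "docker",
    ]
    if vep_regulatory:
        # Option added in this release (see the News entry in the README).
        cmd.append("--vep_regulatory")
    return cmd


command = build_gvanno_command(
    "~/gvanno-1.4.2",
    "~/gvanno-1.4.2/examples/example.grch37.vcf.gz",
    sample_id="example",
    genome_assembly="grch37",
    vep_regulatory=True,
)
# Print a shell-safe, single-line version of the command.
print(" ".join(shlex.quote(arg) for arg in command))
```

The list form can be passed directly to `subprocess.run(command, check=True)` once the data bundle and Docker image from STEP 2 are in place.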

data-raw/RELEASE_NOTES

+14
@@ -0,0 +1,14 @@
+##GVANNO_SOFTWARE_VERSION = 1.4.2
+##GVANNO_DB_VERSION = 20210523
+pfam = v34.0 (March 2021)
+ncER = v1.0 (March 2019)
+uniprot = release 2021_02
+corum = release 3.0 (20180903)
+onekg = phase 3 (20130502)
+dbsnp = build 154/153
+dbnsfp = v4.2 (March 2021)
+gnomad = r2.1 (October 2018)
+gwas = May 2021 (20210519)
+clinvar = May 2021 (20210501)
+opentargets = 2021_02
+gencode = 38/19
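
Reviewer note: the release notes use a simple `key = value` layout, with `##`-prefixed lines for the software and data bundle versions. The sketch below shows one way such a file could be read into dictionaries. The function name `parse_release_notes` is hypothetical, and how gvanno itself consumes this file is not shown in this commit. The only link visible in the diff is that `DB_VERSION` in gvanno.py has the same `GVANNO_DB_VERSION = 20210523` form as the header line.

```python
def parse_release_notes(text):
    """Split RELEASE_NOTES-style text into (header, resources) dictionaries.

    Lines starting with '##' go into `header`; other 'key = value' lines
    go into `resources`. Hypothetical helper, not taken from gvanno.
    """
    header, resources = {}, {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = (part.strip() for part in line.split("=", 1))
        if key.startswith("##"):
            header[key[2:]] = value
        else:
            resources[key] = value
    return header, resources


# A few lines copied from the file added in this commit.
SAMPLE = """\
##GVANNO_DB_VERSION = 20210523
pfam = v34.0 (March 2021)
dbsnp = build 154/153
gencode = 38/19
"""

header, resources = parse_release_notes(SAMPLE)
# Rebuild the string form used by DB_VERSION in gvanno.py.
db_version_string = "GVANNO_DB_VERSION = " + header["GVANNO_DB_VERSION"]
print(db_version_string)
print(resources["gencode"])
```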

gvanno.py

+8 -7
@@ -11,10 +11,10 @@
 import platform
 from argparse import RawTextHelpFormatter
 
-GVANNO_VERSION = 'dev'
-DB_VERSION = 'GVANNO_DB_VERSION = 20210422'
-VEP_VERSION = '103'
-GENCODE_VERSION = '37'
+GVANNO_VERSION = '1.4.2'
+DB_VERSION = 'GVANNO_DB_VERSION = 20210523'
+VEP_VERSION = '104'
+GENCODE_VERSION = '38'
 VEP_ASSEMBLY = "GRCh38"
 DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION)
 
@@ -41,8 +41,8 @@ def __main__():
     optional_vep.add_argument('--vep_lof_prediction', action = "store_true", help = "Predict loss-of-function variants with Loftee plugin " + \
         "in Variant Effect Predictor (VEP), default: %(default)s")
     optional_vep.add_argument('--vep_n_forks', default = 4, help="Number of forks for Variant Effect Predictor (VEP) processing, default: %(default)s")
-    optional_vep.add_argument('--vep_buffer_size', default = 5000, help="Variant buffer size (variants read into memory simultaneously) " + \
-        "for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, default: %(default)s")
+    optional_vep.add_argument('--vep_buffer_size', default = 500, help="Variant buffer size (variants read into memory simultaneously) " + \
+        "for Variant Effect Predictor (VEP) processing\n- set lower to reduce memory usage, higher to increase speed, default: %(default)s")
     optional_vep.add_argument('--vep_pick_order', default = "canonical,appris,biotype,ccds,rank,tsl,length,mane", help="Comma-separated string " + \
         "of ordered transcript properties for primary variant pick in\nVariant Effect Predictor (VEP) processing, default: %(default)s")
     optional_vep.add_argument('--vep_skip_intergenic', action = "store_true", help="Skip intergenic variants in Variant Effect Predictor (VEP) processing, default: %(default)s")
@@ -384,7 +384,8 @@ def run_gvanno(arg_dict, host_directories):
     logger = getlogger("gvanno-summarise")
     logger.info("STEP 3: Summarise gene and variant annotations with gvanno-summarise")
     gvanno_summarise_command = str(container_command_run2) + "gvanno_summarise.py " + str(vep_vcfanno_vcf) + ".gz " + \
-        os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + docker_command_run_end
+        os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + " " + str(int(arg_dict['vep_lof_prediction'])) + \
+        " " + str(int(arg_dict['vep_regulatory'])) + docker_command_run_end
     check_subprocess(gvanno_summarise_command)
     logger.info("Finished")
390391

src/.DS_Store

0 Bytes
Binary file not shown.

src/Dockerfile

+2 -2
@@ -23,7 +23,7 @@ RUN apt-get update && apt-get -y install \
 ENV OPT /opt/vep
 ENV OPT_SRC $OPT/src
 ENV HTSLIB_DIR $OPT_SRC/htslib
-ENV BRANCH release/103
+ENV BRANCH release/104
 
 # Working directory
 WORKDIR $OPT_SRC
@@ -65,7 +65,7 @@ RUN if [ "$BRANCH" = "master" ]; \
     rm -rf kent-335_base_bak
 
 # Setup bioperl-ext
-WORKDIR bioperl-ext/Bio/Ext/Align/
+WORKDIR $OPT_SRC/bioperl-ext/Bio/Ext/Align/
 RUN perl -pi -e"s|(cd libs.+)CFLAGS=\\\'|\$1CFLAGS=\\\'-fPIC |" Makefile.PL
 
 # Install htslib binaries (for 'bgzip' and 'tabix')

src/buildDocker.sh

+1 -1
@@ -4,5 +4,5 @@ cp /Users/sigven/research/docker/pcgr/src/pcgr/lib/annoutils.py gvanno/lib/
 tar czvfh gvanno.tgz gvanno/
 echo "Build the Docker Image"
 TAG=`date "+%Y%m%d"`
-docker build --no-cache -t sigven/gvanno:$TAG --rm=true .
+docker build -t sigven/gvanno:$TAG --rm=true .

src/gvanno.tgz

960 Bytes
Binary file not shown.

src/gvanno/gvanno_summarise.py

+25 -9
@@ -18,11 +18,12 @@ def __main__():
     parser.add_argument('vcf_file', help='VCF file with VEP-annotated query variants (SNVs/InDels)')
     parser.add_argument('gvanno_db_dir',help='gvanno data directory')
     parser.add_argument('lof_prediction',default=0,type=int,help='VEP LoF prediction setting (0/1)')
+    parser.add_argument('regulatory_annotation',default=0,type=int,help='Inclusion of VEP regulatory annotations (0/1)')
     args = parser.parse_args()
 
-    extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction)
+    extend_vcf_annotations(args.vcf_file, args.gvanno_db_dir, args.lof_prediction, args.regulatory_annotation)
 
-def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
+def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, regulatory_annotation = 0):
     """
     Function that reads VEP/vcfanno-annotated VCF and extends the VCF INFO column with tags from
     1. CSQ elements within the primary transcript consequence picked by VEP, e.g. SYMBOL, Feature, Gene, Consequence etc.
@@ -40,13 +41,19 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
     vep_csq_fields_map = meta_vep_dbnsfp_info['vep_csq_fieldmap']
     vcf = VCF(query_vcf)
     for tag in vcf_infotags_meta:
-        if lof_prediction == 0:
+        if lof_prediction == 0 and regulatory_annotation == 0:
+            if not tag.startswith('LoF') and not tag.startswith('REGULATORY_'):
+                vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+        elif lof_prediction == 1 and regulatory_annotation == 0:
+            if not tag.startswith('REGULATORY_'):
+                vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
+        elif lof_prediction == 0 and regulatory_annotation == 1:
             if not tag.startswith('LoF'):
                 vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
         else:
             vcf.add_info_to_header({'ID': tag, 'Description': str(vcf_infotags_meta[tag]['description']),'Type':str(vcf_infotags_meta[tag]['type']), 'Number': str(vcf_infotags_meta[tag]['number'])})
 
-
+
     w = Writer(out_vcf, vcf)
     current_chrom = None
     num_chromosome_records_processed = 0
@@ -107,11 +114,20 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0):
         num_chromosome_records_processed += 1
         gvanno_xref = annoutils.make_transcript_xref_map(rec, gvanno_xref_map, xref_tag = "GVANNO_XREF")
 
-        csq_record_results = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
-        if 'vep_all_csq' in csq_record_results:
-            rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results['vep_all_csq'])
-        if 'vep_block' in csq_record_results:
-            vep_csq_records = csq_record_results['vep_block']
+        if regulatory_annotation == 1:
+            csq_record_results_all = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = False, csq_identifier = 'CSQ')
+
+            if 'vep_block' in csq_record_results_all:
+                vep_csq_records_all = csq_record_results_all['vep_block']
+                rec.INFO['REGULATORY_ANNOTATION'] = annoutils.map_regulatory_variant_annotations(vep_csq_records_all)
+
+        csq_record_results_pick = annoutils.parse_vep_csq(rec, gvanno_xref, vep_csq_fields_map, logger, pick_only = True, csq_identifier = 'CSQ')
+
+        if 'vep_all_csq' in csq_record_results_pick:
+            rec.INFO['VEP_ALL_CSQ'] = ','.join(csq_record_results_pick['vep_all_csq'])
+        if 'vep_block' in csq_record_results_pick:
+            vep_csq_records = csq_record_results_pick['vep_block']
+
             block_idx = 0
             record = vep_csq_records[block_idx]
             for k in record:
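
Reviewer note: the four-branch `if`/`elif` chain that decides which INFO tags go into the VCF header repeats the same `add_info_to_header` call four times. Assuming both flags only ever take the values 0 or 1 (as the argparse help text states), the same selection could be expressed as a single predicate. The sketch below is a possible refactoring, not the author's code. The name `keep_info_tag` is hypothetical, and the tag names in the usage example are illustrative apart from `REGULATORY_ANNOTATION`.

```python
def keep_info_tag(tag, lof_prediction=0, regulatory_annotation=0):
    """Return True if `tag` should be added to the VCF header.

    Assumes both flags are 0 or 1. 'LoF*' tags are kept only when LoF
    prediction is on; 'REGULATORY_*' tags only when regulatory
    annotation is on. All other tags are always kept.
    """
    if tag.startswith("LoF") and lof_prediction != 1:
        return False
    if tag.startswith("REGULATORY_") and regulatory_annotation != 1:
        return False
    return True


# Illustrative tag names; only REGULATORY_ANNOTATION appears in this commit.
example_tags = ["SYMBOL", "LoF", "LoF_filter", "REGULATORY_ANNOTATION"]
for lof in (0, 1):
    for reg in (0, 1):
        kept = [t for t in example_tags if keep_info_tag(t, lof, reg)]
        print(lof, reg, kept)
```

With such a predicate, the loop body would reduce to one guarded `vcf.add_info_to_header(...)` call, and adding a third optional tag family would need one more condition rather than doubling the number of branches.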
