diff --git a/README.md b/README.md index bdd7f29..41ed2f1 100755 --- a/README.md +++ b/README.md @@ -15,34 +15,29 @@ The germline variant annotator (*gvanno*) is a software package intended for ana *gvanno* accepts query files encoded in the VCF format, and can analyze both SNVs and short InDels. The workflow relies heavily upon [Ensembl’s Variant Effect Predictor (VEP)](http://www.ensembl.org/info/docs/tools/vep/index.html), and [vcfanno](https://github.com/brentp/vcfanno). It produces an annotated VCF file and a file of tab-separated values (.tsv), the latter listing all annotations pr. variant record. Note that if your input VCF contains data (genotypes) from multiple samples (i.e. a multisample VCF), the output TSV file will contain one line/record __per sample variant__. ### News +* September 24th 2022 - **1.5.0 release** + * Data updates: ClinVar, GENCODE GWAS catalog, CancerMine, Open Targets Platform + * Software updates: VEP 107 + * Excluded UniProt KB from annotation tracks * December 21st 2021 - **1.4.4 release** * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt KB, Open Targets Platform * Software updates: VEP (v105) * August 25th 2021 - **1.4.3 release** * Data updates: ClinVar, GWAS catalog, CancerMine, UniProt, Open Targets Platform -* May 24th 2021 - **1.4.2 release** - * Software update (VEP 104) - * Data updates: ClinVar, GWAS catalog, CancerMine, Pfam, dbNSFP, UniProt - * Two new options added: - * `--vep_regulatory` - annotates variants for overlap with regulatory regions (details below) - * `--docker-uid` - set Docker user id - * New variant annotations for enhanced non-coding interpretation: - * _REGULATORY_ANNOTATION_ : A comma-separated list of regulatory annotations from VEP's `--regulatory` option, i.e. __TF_binding_site__, overlap with __enhancer/promoter/open_chromatin__, __CTCF_binding_site__ etc. Included when the `--vep_regulatory` option is turned on in gvanno. - * _NCER_PERCENTILE_: A genome-wide percentile rank score from the ncER algorithm (**n**on-**c**oding **E**ssential **R**egulation), [Wells et al., Nat Comm. (2019)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6868241/). - -### Annotation resources (v1.4.4) - -* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v105 (GENCODE v39/v19 as the gene reference dataset) + +### Annotation resources (v1.5.0) + +* [VEP](http://www.ensembl.org/info/docs/tools/vep/index.html) - Variant Effect Predictor v107 (GENCODE v41/v19 as the gene reference dataset) * [dBNSFP](https://sites.google.com/site/jpopgen/dbNSFP) - Database of non-synonymous functional predictions (v4.2, March 2021) * [gnomAD](http://gnomad.broadinstitute.org/) - Germline variant frequencies exome-wide (release 2.1, October 2018) - from VEP * [dbSNP](http://www.ncbi.nlm.nih.gov/SNP/) - Database of short genetic variants (build 154) - from VEP * [1000 Genomes Project - phase3](ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/) - Germline variant frequencies genome-wide (May 2013) - from VEP -* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (December 2021) -* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 41, December 2021) -* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2021_11, Nocember 2021) -* [UniProt/SwissProt KnowledgeBase](http://www.uniprot.org) - Resource on protein sequence and functional information (2021_04, November 2021) +* [ClinVar](http://www.ncbi.nlm.nih.gov/clinvar/) - Database of variants related to human health/disease phenotypes (September 2022) +* [CancerMine](http://bionlp.bcgsc.ca/cancermine/) - literature-mined database of drivers, oncogenes and tumor suppressors in cancer (version 47, July 2022) +* [Open Targets Platform](https://targetvalidation.org) - Target-disease and target-drug associations (2022_06, June 2022) * [Pfam](http://pfam.xfam.org) - Database of protein families and domains (v35.0, November 2021) -* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (December 7th 2021) +* [Mutation hotspots](cancerhotspots.org) - Database of mutation hotspots in cancer +* [NHGRI-EBI GWAS Catalog](https://www.ebi.ac.uk/gwas/home) - Catalog of published genome-wide association studies (August 26th 2022) ### Getting started @@ -76,17 +71,17 @@ An installation of Python (version >=_3.6_) is required to run *gvanno*. Check t #### STEP 2: Download *gvanno* and data bundle -1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.4.4) (gvanno run script, v1.4.4) +1. [Download the latest version](https://github.com/sigven/gvanno/releases/tag/v1.5.0) (gvanno run script, v1.5.0) 2. Download (preferably using `wget`) and unpack the latest assembly-specific data bundle in the gvanno directory - * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz) (approx 18Gb) - * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20211221.tgz) (approx 20Gb) + * [grch37 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz) (approx 20Gb) + * [grch38 data bundle](http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch38.20220921.tgz) (approx 28Gb) * Example commands: - * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20211221.tgz` + * `wget http://insilico.hpc.uio.no/pcgr/gvanno/gvanno.databundle.grch37.20220921.tgz` * `gzip -dc gvanno.databundle.grch37.YYYYMMDD.tgz | tar xvf -` - A _data/_ folder within the _gvanno-1.4.4_ software folder should now have been produced -3. Pull the [gvanno Docker image (1.4.4)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb): - * `docker pull sigven/gvanno:1.4.4` (gvanno annotation engine) + A _data/_ folder within the _gvanno-1.5.0_ software folder should now have been produced +3. Pull the [gvanno Docker image (1.5.0)](https://hub.docker.com/r/sigven/gvanno/) from DockerHub (approx 2.2Gb): + * `docker pull sigven/gvanno:1.5.0` (gvanno annotation engine) #### STEP 3: Input preprocessing @@ -115,7 +110,7 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt --query_vcf QUERY_VCF VCF input file with germline query variants (SNVs/InDels). --gvanno_dir GVANNO_DIR - Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.4.4 + Directory that contains the gvanno data bundle, e.g. ~/gvanno-1.5.0 --output_dir OUTPUT_DIR Output directory --genome_assembly {grch37,grch38} @@ -152,10 +147,10 @@ Run the workflow with **gvanno.py**, which takes the following arguments and opt The _examples_ folder contains an example VCF file. Analysis of the example VCF can be performed by the following command: - python ~/gvanno-1.4.4/gvanno.py - --query_vcf ~/gvanno-1.4.4/examples/example.grch37.vcf.gz - --gvanno_dir ~/gvanno-1.4.4 - --output_dir ~/gvanno-1.4.4 + python ~/gvanno-1.5.0/gvanno.py + --query_vcf ~/gvanno-1.5.0/examples/example.grch37.vcf.gz + --gvanno_dir ~/gvanno-1.5.0 + --output_dir ~/gvanno-1.5.0 --sample_id example --genome_assembly grch37 --container docker diff --git a/data-raw/RELEASE_NOTES b/data-raw/RELEASE_NOTES index e498c37..c453903 100755 --- a/data-raw/RELEASE_NOTES +++ b/data-raw/RELEASE_NOTES @@ -1,14 +1,13 @@ -##GVANNO_SOFTWARE_VERSION = 1.4.4 -##GVANNO_DB_VERSION = 20211221 +##GVANNO_SOFTWARE_VERSION = 1.5.0 +##GVANNO_DB_VERSION = 20220913 pfam = v35.0 (November 2021) ncER = v1.0 (March 2019) -uniprot = release 2021_04 -corum = release 3.0 (20180903) +uniprot = release 2022_03 onekg = phase 3 (20130502) -dbsnp = build 154/153 +dbsnp = build 154 dbnsfp = v4.2 (April 2021) gnomad = r2.1 (October 2018) -gwas = December 2021 (20211207) -clinvar = December 2021 (20211130) -opentargets = 2021_11 -gencode = 39/19 +gwas = August 2022 (20220826) +clinvar = September 2022 (20220831) +opentargets = 2022_06 +gencode = 41/19 diff --git a/examples/example.grch37.vcf.gz b/examples/example.grch37.vcf.gz index a52cc8e..41d34ee 100755 Binary files a/examples/example.grch37.vcf.gz and b/examples/example.grch37.vcf.gz differ diff --git a/examples/example.grch37.vcf.gz.tbi b/examples/example.grch37.vcf.gz.tbi index f1c9f18..8639251 100755 Binary files a/examples/example.grch37.vcf.gz.tbi and b/examples/example.grch37.vcf.gz.tbi differ diff --git a/gvanno.py b/gvanno.py index d3ca476..bc51352 100755 --- a/gvanno.py +++ b/gvanno.py @@ -11,10 +11,10 @@ import platform from argparse import RawTextHelpFormatter -GVANNO_VERSION = '1.4.4' -DB_VERSION = 'GVANNO_DB_VERSION = 20211221' -VEP_VERSION = '105' -GENCODE_VERSION = 'v39' +GVANNO_VERSION = '1.5.0' +DB_VERSION = 'GVANNO_DB_VERSION = 20220921' +VEP_VERSION = '107' +GENCODE_VERSION = 'v41' VEP_ASSEMBLY = "GRCh38" DOCKER_IMAGE_VERSION = 'sigven/gvanno:' + str(GVANNO_VERSION) @@ -294,6 +294,9 @@ def run_gvanno(arg_dict, host_directories): container_command_run2 = container_command_run2 + " -W /workdir/output " + 'src/gvanno.sif' + " sh -c \"" docker_command_run_end = '\"' + #logger.info(container_command_run1) + #logger.info(container_command_run2) + ## GVANNO|start - Log key information about sample, options and assembly logger = getlogger("gvanno-start") logger.info("--- Germline variant annotation (gvanno) workflow ----") @@ -306,6 +309,7 @@ def run_gvanno(arg_dict, host_directories): logger.info("STEP 0: Validate input data") vcf_validate_command = str(container_command_run1) + "gvanno_validate_input.py " + str(data_dir) + " " + str(input_vcf_docker) + " " + \ str(vcf_validation) + " " + str(arg_dict['genome_assembly']) + docker_command_run_end + #logger.info(vcf_validate_command) check_subprocess(vcf_validate_command) logger.info('Finished') @@ -378,10 +382,10 @@ def run_gvanno(arg_dict, host_directories): ## GVANNO|vcfanno - annotate VCF against a number of variant annotation resources print() logger = getlogger('gvanno-vcfanno') - logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, UniProtKB, cancerhotspots.org)") + logger.info("STEP 2: Clinical/functional variant annotations with gvanno-vcfanno (ClinVar, ncER, dbNSFP, GWAS catalog, cancerhotspots.org)") logger.info('vcfanno configuration - number of processes (-p): ' + str(arg_dict['vcfanno_n_processes'])) gvanno_vcfanno_command = str(container_command_run2) + "gvanno_vcfanno.py --num_processes " + str(arg_dict['vcfanno_n_processes']) + \ - " --dbnsfp --clinvar --ncer --uniprot --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ + " --dbnsfp --clinvar --ncer --gvanno_xref --gwas --cancer_hotspots " + str(vep_vcf) + ".gz " + str(vep_vcfanno_vcf) + \ " " + os.path.join(data_dir, "data", str(arg_dict['genome_assembly'])) + docker_command_run_end check_subprocess(gvanno_vcfanno_command) logger.info("Finished") @@ -412,7 +416,7 @@ def run_gvanno(arg_dict, host_directories): print() ## GVANNO|vcf2tsv - convert VCF to TSV with https://github.com/sigven/vcf2tsv logger = getlogger("gvanno-vcf2tsv") - logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsv") + logger.info("STEP 4: Converting VCF to TSV with https://github.com/sigven/vcf2tsvpy") gvanno_vcf2tsv_command_pass = str(container_command_run2) + "vcf2tsv.py " + str(output_pass_vcf) + " --compress " + str(output_pass_tsv) + docker_command_run_end gvanno_vcf2tsv_command_all = str(container_command_run2) + "vcf2tsv.py " + str(output_vcf) + " --compress --keep_rejected " + str(output_tsv) + docker_command_run_end logger.info("Conversion of VCF variant data to records of tab-separated values - PASS variants only") diff --git a/src/Dockerfile b/src/Dockerfile index 07a7732..b228d05 100644 --- a/src/Dockerfile +++ b/src/Dockerfile @@ -22,7 +22,7 @@ RUN apt-get update && apt-get -y install \ ENV OPT /opt/vep ENV OPT_SRC $OPT/src ENV HTSLIB_DIR $OPT_SRC/htslib -ENV BRANCH release/105 +ENV BRANCH release/107 # Working directory WORKDIR $OPT_SRC @@ -166,7 +166,7 @@ ENV PERL5LIB $PERL5LIB_TMP WORKDIR / ADD loftee_1.0.3.tgz $OPT/src/ensembl-vep/modules ADD UTRannotator.tgz $OPT/src/ensembl-vep/modules -RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/105/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm +RUN wget -q "https://raw.githubusercontent.com/Ensembl/VEP_plugins/release/107/NearestExonJB.pm" -O $OPT/src/ensembl-vep/modules/NearestExonJB.pm # Final steps @@ -250,7 +250,9 @@ RUN rm miniconda.sh # update conda & install vt RUN /conda/bin/conda update conda +#RUN /conda/bin/conda update python RUN /conda/bin/conda install -c bioconda vt +#RUN /conda/bin/conda install -c bioconda vcf2tsvpy ## Clean Up RUN apt-get clean autoclean @@ -271,7 +273,7 @@ RUN rm -rf $HOME/src/ensembl-vep/t/ RUN rm -f $HOME/src/v335_base.tar.gz RUN rm -f $HOME/src/release-1-6-924.zip RUN rm -rf /samtools-1.10.tar.bz2 -RUN rm -f /conda/bin/python +#RUN rm -f /conda/bin/python ADD gvanno.tgz / ENV PATH=$PATH:/conda/bin:/gvanno diff --git a/src/buildDocker.sh b/src/buildDocker.sh index 43fe4f3..9a84197 100755 --- a/src/buildDocker.sh +++ b/src/buildDocker.sh @@ -4,5 +4,5 @@ cp /Users/sigven/research/software/vcf2tsv/vcf2tsv.py gvanno/ tar czvfh gvanno.tgz gvanno/ echo "Build the Docker Image" TAG=`date "+%Y%m%d"` -docker build --no-cache -t sigven/gvanno:$TAG --rm=true . +docker build -t sigven/gvanno:$TAG --rm=true . diff --git a/src/gvanno.tgz b/src/gvanno.tgz index 93280b7..ea3136e 100644 Binary files a/src/gvanno.tgz and b/src/gvanno.tgz differ diff --git a/src/gvanno/gvanno_summarise.py b/src/gvanno/gvanno_summarise.py index e87ab2b..d8bc1bd 100755 --- a/src/gvanno/gvanno_summarise.py +++ b/src/gvanno/gvanno_summarise.py @@ -33,8 +33,8 @@ def extend_vcf_annotations(query_vcf, gvanno_db_directory, lof_prediction = 0, r """ ## read VEP and PCGR tags to be appended to VCF file - vcf_infotags_meta = annoutils.read_infotag_file(os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) - gvanno_xref_map = annoutils.read_genexref_namemap(os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref.namemap.tsv')) + vcf_infotags_meta = annoutils.read_infotag_file(logger, os.path.join(gvanno_db_directory,'gvanno_infotags.tsv')) + gvanno_xref_map = annoutils.read_genexref_namemap(logger, os.path.join(gvanno_db_directory,'gvanno_xref', 'gvanno_xref_namemap.tsv')) out_vcf = re.sub(r'\.vcf(\.gz){0,}$','.annotated.vcf',query_vcf) meta_vep_dbnsfp_info = annoutils.vep_dbnsfp_meta_vcf(query_vcf, vcf_infotags_meta) diff --git a/src/gvanno/gvanno_validate_input.py b/src/gvanno/gvanno_validate_input.py index e71f74a..8701a64 100755 --- a/src/gvanno/gvanno_validate_input.py +++ b/src/gvanno/gvanno_validate_input.py @@ -34,7 +34,7 @@ def check_existing_vcf_info_tags(input_vcf, gvanno_directory, genome_assembly, l If any coinciding tags, an error will be returned """ - gvanno_infotags_desc = annoutils.read_infotag_file(os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) + gvanno_infotags_desc = annoutils.read_infotag_file(logger, os.path.join(gvanno_directory,'data',genome_assembly,'gvanno_infotags.tsv')) vcf = VCF(input_vcf) logger.info('Checking if existing INFO tags of query VCF file coincide with gvanno INFO tags') diff --git a/src/gvanno/gvanno_vcfanno.py b/src/gvanno/gvanno_vcfanno.py index 45a1a67..7840a21 100755 --- a/src/gvanno/gvanno_vcfanno.py +++ b/src/gvanno/gvanno_vcfanno.py @@ -20,7 +20,7 @@ def __main__(): parser.add_argument("--ncer",action = "store_true", help="Annotate VCF with ranking of variant deleteriousness in non-coding regions (ncER)") parser.add_argument("--clinvar",action = "store_true", help="Annotate VCF with annotations from ClinVar") parser.add_argument("--dbnsfp",action = "store_true", help="Annotate VCF with annotations from database of non-synonymous functional predictions") - parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from the UniProt Knowledgebase") + #parser.add_argument("--uniprot",action = "store_true", help="Annotate VCF with protein functional features from the UniProt Knowledgebase") parser.add_argument("--gvanno_xref",action = "store_true", help="Annotate VCF with transcript annotations from gvanno (protein complexes, disease associations, etc)") parser.add_argument("--gwas",action = "store_true", help="Annotate VCF with against known loci associated with cancer, as identified from genome-wide association studies (GWAS)") parser.add_argument("--cancer_hotspots",action = "store_true", help="Annotate VCF with mutation hotspots from cancerhotspots.org") @@ -31,7 +31,7 @@ def __main__(): vcfheader_file = args.out_vcf + '.tmp.' + str(random.randrange(0,10000000)) + '.header.txt' vcfanno_conf_fname = args.out_vcf + '.tmp.conf.toml' print_vcf_header(args.query_vcf, vcfheader_file, chromline_only = False) - run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.uniprot, args.gvanno_xref,args.gwas, args.cancer_hotspots) + run_vcfanno(args.num_processes, args.query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, args.gvanno_db_dir, args.out_vcf, args.ncer, args.clinvar, args.dbnsfp, args.gvanno_xref,args.gwas, args.cancer_hotspots) def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vcfheader_file, logger, datasource_info_tags, query_info_tags, datasource): @@ -41,15 +41,14 @@ def prepare_vcfanno_configuration(vcfanno_data_directory, vcfanno_conf_fname, vc append_to_conf_file(datasource, datasource_info_tags, vcfanno_data_directory, vcfanno_conf_fname) append_to_vcf_header(vcfanno_data_directory, datasource, vcfheader_file) -def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, uniprot, gvanno_xref,gwas, cancer_hotspots): +def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, vcfheader_file, gvanno_db_directory, output_vcf, ncer, clinvar, dbnsfp, gvanno_xref,gwas, cancer_hotspots): """ Function that annotates a VCF file with vcfanno against a user-defined set of germline and somatic VCF files """ clinvar_info_tags = ["CLINVAR_MSID","CLINVAR_PMID","CLINVAR_CLNSIG","CLINVAR_VARIANT_ORIGIN","CLINVAR_CONFLICTED","CLINVAR_UMLS_CUI","CLINVAR_HGVSP", "CLINVAR_UMLS_CUI_SOMATIC","CLINVAR_CLNSIG_SOMATIC","CLINVAR_PMID_SOMATIC","CLINVAR_ALLELE_ID","CLINVAR_MOLECULAR_EFFECT", - "CLINVAR_REVIEW_STATUS_STARS"] + "CLINVAR_REVIEW_STATUS_STARS","CLINVAR_CLASSIFICATION"] dbnsfp_info_tags = ["DBNSFP"] - uniprot_info_tags = ["UNIPROT_FEATURE"] gvanno_xref_info_tags = ["GVANNO_XREF"] gwas_info_tags = ["GWAS_HIT"] ncer_info_tags = ["NCER_PERCENTILE"] @@ -61,8 +60,6 @@ def run_vcfanno(num_processes, query_vcf, vcfanno_conf_fname, query_info_tags, v prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, clinvar_info_tags, query_info_tags, "clinvar") if dbnsfp is True: prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, dbnsfp_info_tags, query_info_tags, "dbnsfp") - if uniprot is True: - prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, uniprot_info_tags, query_info_tags, "uniprot") if gvanno_xref is True: prepare_vcfanno_configuration(gvanno_db_directory, vcfanno_conf_fname, vcfheader_file, logger, gvanno_xref_info_tags, query_info_tags, "gvanno_xref") if gwas is True: @@ -97,7 +94,7 @@ def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, v The datasource defines the set of tags that will be appended during annotation """ fh = open(vcfanno_conf_fname,'a') - if datasource != 'uniprot' and datasource != 'gvanno_xref' and datasource != 'ncer': + if datasource != 'gvanno_xref' and datasource != 'ncer': fh.write('[[annotation]]\n') fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.vcf.gz"\n') fields_string = 'fields = ["' + '","'.join(datasource_info_tags) + '"]' @@ -106,8 +103,8 @@ def append_to_conf_file(datasource, datasource_info_tags, gvanno_db_directory, v fh.write(fields_string + '\n') fh.write(ops_string + '\n\n') else: - if datasource == 'uniprot' or datasource == 'gvanno_xref' or datasource == 'ncer': - if datasource == 'uniprot' or datasource == 'gvanno_xref': + if datasource == 'gvanno_xref' or datasource == 'ncer': + if datasource == 'gvanno_xref': fh.write('[[annotation]]\n') fh.write('file="' + str(gvanno_db_directory) + '/' + str(datasource) + '/' + str(datasource) + '.bed.gz"\n') fh.write('columns=[4]\n') diff --git a/src/gvanno/lib/annoutils.py b/src/gvanno/lib/annoutils.py index 720b7ea..2b8362d 100755 --- a/src/gvanno/lib/annoutils.py +++ b/src/gvanno/lib/annoutils.py @@ -13,10 +13,12 @@ -def read_genexref_namemap(gene_xref_namemap_tsv): +def read_genexref_namemap(logger, gene_xref_namemap_tsv): namemap_xref = {} ##dictionary returned if not os.path.exists(gene_xref_namemap_tsv): - return namemap_xref + logger.info("ERROR: File '" + str(gene_xref_namemap_tsv) + "' does not exist - exiting") + exit(1) + #return namemap_xref tsvfile = open(gene_xref_namemap_tsv, 'r') reader = csv.DictReader(tsvfile, delimiter='\t') for row in reader: @@ -24,7 +26,7 @@ def read_genexref_namemap(gene_xref_namemap_tsv): return namemap_xref -def read_infotag_file(vcf_info_tags_tsv): +def read_infotag_file(logger, vcf_info_tags_tsv): """ Function that reads a file that lists VCF INFO tags produced by PCGR/CPSR/gvanno. An example line of the VCF info tag file is the following: @@ -36,6 +38,7 @@ def read_infotag_file(vcf_info_tags_tsv): """ info_tag_xref = {} ##dictionary returned if not os.path.exists(vcf_info_tags_tsv): + logger.info("ERROR: File '" + str(vcf_info_tags_tsv) + "' does not exist - exiting") return info_tag_xref tsvfile = open(vcf_info_tags_tsv, 'r') reader = csv.DictReader(tsvfile, delimiter='\t') @@ -629,7 +632,7 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl all_csq_pick = [] all_transcript_consequences = [] - + for csq in rec.INFO.get(csq_identifier).split(','): csq_fields = csq.split('|') @@ -654,15 +657,19 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl while(j < len(csq_fields)): if j in vep_csq_fields_map['index2field']: if csq_fields[j] != '': - #print(str(vep_csq_fields_map['index2field'][j]) + '\t' + str(csq_fields[j])) csq_record[vep_csq_fields_map['index2field'][j]] = str(csq_fields[j]) if vep_csq_fields_map['index2field'][j] == 'Feature': ensembl_transcript_id = str(csq_fields[j]) if ensembl_transcript_id in transcript_xref_map: for annotation in transcript_xref_map[ensembl_transcript_id].keys(): if annotation != 'SYMBOL': - ## assign additional gene/transcript annotations from the custom transcript xref map (PCGR/CPSR) as key,value pairs in the csq_record object + ## assign additional gene/transcript annotations from the custom transcript + ## xref map (PCGR/CPSR/gvanno) as key,value pairs in the csq_record object csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation] + else: + ## If not VEP provides SYMBOL, append SYMBOL provided by xref map + if csq_record['SYMBOL'] is None: + csq_record[annotation] = transcript_xref_map[ensembl_transcript_id][annotation] else: if re.match(r'ENST', ensembl_transcript_id): logger.warning('Could not find transcript xrefs for ' + str(ensembl_transcript_id)) @@ -698,13 +705,20 @@ def parse_vep_csq(rec, transcript_xref_map, vep_csq_fields_map, logger, pick_onl ## Assign coding status, protein change, coding sequence change, last exon/intron status etc assign_cds_exon_intron_annotations(csq_record) ## Append transcript consequence to all_csq_pick - #print(csq_record) all_csq_pick.append(csq_record) symbol = '.' + hgvsc = ".:." + hgvsp = ".:." if csq_fields[vep_csq_fields_map['field2index']['SYMBOL']] != "": symbol = str(csq_fields[vep_csq_fields_map['field2index']['SYMBOL']]) + if csq_fields[vep_csq_fields_map['field2index']['HGVSc']] != "": + hgvsc = str(csq_fields[vep_csq_fields_map['field2index']['HGVSc']]) + if csq_fields[vep_csq_fields_map['field2index']['HGVSp']] != "": + hgvsp = str(csq_fields[vep_csq_fields_map['field2index']['HGVSp']]) consequence_entry = (str(csq_fields[vep_csq_fields_map['field2index']['Consequence']]) + ':' + str(symbol) + ':' + + str(hgvsc) + ':' + + str(hgvsp) + ':' + str(csq_fields[vep_csq_fields_map['field2index']['Feature_type']]) + ':' + str(csq_fields[vep_csq_fields_map['field2index']['Feature']]) + ':' + str(csq_fields[vep_csq_fields_map['field2index']['BIOTYPE']])) diff --git a/src/gvanno/vcf2tsv.py b/src/gvanno/vcf2tsv.py index 522668b..7d5b544 100755 --- a/src/gvanno/vcf2tsv.py +++ b/src/gvanno/vcf2tsv.py @@ -7,7 +7,7 @@ import math import subprocess -version = '0.3.5' +version = '0.3.7.1' def __main__(): @@ -169,6 +169,7 @@ def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejecte #print(str(vcf_info_data)) #dictionary, with sample names as keys, values being genotype data (dictionary with format tags as keys) vcf_sample_genotype_data = {} + if len(samples) > 0 and skip_genotype_data is False: gt_cyvcf = rec.gt_types i = 0 @@ -200,7 +201,15 @@ def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejecte ## sample-wise while j < dim[0]: if sample_dat[j].size > 1: + d = ','.join(str(e) for e in np.ndarray.tolist(sample_dat[j])) + if column_types[format_tag] == 'Float': + d = ','.join(str(round(e, 4)) for e in np.ndarray.tolist(sample_dat[j])) + ## undefined/missing value + if '-2147483648' in d: + d = d.replace('-2147483648', '.') + if 'nan' in d.casefold(): + d = d.casefold().replace('nan', '.') if samples[j] in vcf_sample_genotype_data: vcf_sample_genotype_data[samples[j]][format_tag] = d else: @@ -212,11 +221,13 @@ def vcf2tsv(query_vcf, out_tsv, skip_info_data, skip_genotype_data, keep_rejecte d = str(sample_dat[j]) if column_types[format_tag] == 'Integer': d = str(sample_dat[j][0]) + ## undefined/missing value + if d == '-2147483648' or d.casefold() == 'nan': + d = '.' if samples[j] in vcf_sample_genotype_data: vcf_sample_genotype_data[samples[j]][format_tag] = d j = j + 1 - #print(str(vcf_sample_genotype_data)) tsv_elements = [] tsv_elements.append(fixed_fields_string) if skip_info_data is False: