Skip to content

Commit 624418a

Browse files
author
Manavalan Gajapathy
committed
Merge branch 'vep_annotation' into 'master'
VCF annotation using VEP See merge request center-for-computational-genomics-and-data-science/sciops/ditto!1
2 parents b615aad + 78a485c commit 624418a

File tree

10 files changed

+1077
-0
lines changed

10 files changed

+1077
-0
lines changed

.gitignore

Lines changed: 94 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,94 @@
1+
# Byte-compiled / optimized / DLL files
2+
__pycache__/
3+
*.py[cod]
4+
5+
# C extensions
6+
*.so
7+
8+
# Distribution / packaging
9+
.Python
10+
build/
11+
develop-eggs/
12+
dist/
13+
downloads/
14+
eggs/
15+
.eggs/
16+
lib/
17+
lib64/
18+
parts/
19+
sdist/
20+
var/
21+
*.egg-info/
22+
.installed.cfg
23+
*.egg
24+
25+
# PyInstaller
26+
# Usually these files are written by a python script from a template
27+
# before PyInstaller builds the exe, so as to inject date/other infos into it.
28+
*.manifest
29+
*.spec
30+
31+
# Installer logs
32+
pip-log.txt
33+
pip-delete-this-directory.txt
34+
35+
# Unit test / coverage reports
36+
htmlcov/
37+
.tox/
38+
.coverage
39+
.coverage.*
40+
.cache
41+
nosetests.xml
42+
coverage.xml
43+
*,cover
44+
45+
# Translations
46+
*.mo
47+
*.pot
48+
49+
# Django stuff:
50+
*.log
51+
52+
# Sphinx documentation
53+
docs/_build/
54+
55+
# PyBuilder
56+
target/
57+
58+
# DotEnv configuration
59+
.env
60+
61+
# conda
62+
.conda
63+
64+
# Database
65+
*.db
66+
*.rdb
67+
68+
# Pycharm
69+
.idea
70+
71+
# Jupyter NB Checkpoints
72+
.ipynb_checkpoints/
73+
74+
# exclude data from source control by default
75+
# data/
76+
variant_annotation/data/
77+
78+
#snakemake
79+
.snakemake/
80+
81+
82+
# exclude test data used for development
83+
to_be_deleted/test_data/data/ref
84+
to_be_deleted/test_data/data/reads
85+
86+
#logs
87+
logs/
88+
89+
# vscode
90+
.vscode/
91+
92+
# .java/fonts dir get created when creating fastqc conda env
93+
.java/
94+

.gitmodules

Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
[submodule "variant_annotation/configs/snakemake_profile"]
2+
path = variant_annotation/configs/snakemake_profile
3+
url = git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/pipelines/small_variant_caller_pipeline.git
4+
[submodule "variant_annotation/configs/snakemake_slurm_profile"]
5+
path = variant_annotation/configs/snakemake_slurm_profile
6+
url = git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/external-projects/snakemake_slurm_profile.git
Binary file not shown.

variant_annotation/.test/data/raw/testing_variants_hg38.vcf

Lines changed: 679 additions & 0 deletions
Large diffs are not rendered by default.

variant_annotation/README.md

Lines changed: 66 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,66 @@
1+
# Variant annotation
2+
3+
Annotated variants in VCF using Variant Effect Predictor (VEP).
4+
5+
Script [`src/run_pipeline.sh`](src/run_pipeline.sh) runs the snakemake workflow, which sets up VEP and then uses it for annotation.
6+
7+
## Setup
8+
9+
1. Create necessary directories to store log files
10+
11+
```sh
12+
cd variant_annotation
13+
mkdir -p logs/rule_logs
14+
```
15+
16+
2. Create dataset config YAML and populate with paths
17+
18+
```sh
19+
touch ~/.ditto_datasets.yaml
20+
```
21+
22+
Enter path info into the YAML file in the following format
23+
24+
```yml
25+
cadd_snv: "/path/to/data/cadd/hg38/v1.6/whole_genome_SNVs.tsv.gz"
26+
cadd_indel: "/path/to/data/cadd/raw/hg38/v1.6/gnomad.genomes.r3.0.indel.tsv.gz"
27+
gerp: "/path/to/data/gerp/processed/hg38/v1.6/gerp_score_hg38.bg.gz"
28+
gnomad_genomes: "/path/to/data/gnomad/v3.0/data/gnomad.genomes.r3.0.sites.vcf.bgz"
29+
clinvar: "/path/to/data/clinvar/data/grch38/20210119/clinvar_20210119.vcf.gz"
30+
dbNSFP: "/path/to/data/dbnsfp/processed/v4.1a_20200616/dbNSFP4.1a_variant.complete.bgz"
31+
```
32+
33+
## Datasets in custom format
34+
35+
Two of the datasets listed in the datasets YAML require custom formatting for use with the VEP annotator. The following
36+
describes that formatting process that will need to be performed.
37+
38+
**gerp:**
39+
40+
- GERP is extracted from the annotation database distributed by CADD found [here](https://cadd.gs.washington.edu/download)
41+
- Format GERP base-wise RS scores from extracted annotation file into final compressed BedGraph file
42+
43+
**dbNSFP:**
44+
45+
- dbNSFP data is extracted from dbNSFP zip found [here](https://sites.google.com/site/jpopgen/dbNSFP)
46+
- per chromosome tab-seperated value files are extracted from the zip, sorted by GRCh38/hg38 coordinates, joined
47+
into a single file, bgzipped and indexed.
48+
49+
All other dataset files listed in the config file are in usable in the format provided by their originating source.
50+
51+
## How to run
52+
53+
- To run in current session (Note: only runs main Snakemake process in current session, Snakemake will still send jobs
54+
to Slurm):
55+
56+
```sh
57+
cd variant_annotation
58+
./src/run_pipeline.sh -v .test/data/raw/testing_variants_hg38.vcf -o .test/data/processed/vep -d ~/.ditto_datasets.yaml
59+
```
60+
61+
- To run it as slurm job:
62+
63+
```sh
64+
cd variant_annotation
65+
./src/run_pipeline.sh -s -v .test/data/raw/testing_variants_hg38.vcf -o .test/data/processed/vep -d ~/.ditto_datasets.yaml
66+
```
Lines changed: 10 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,10 @@
1+
{
2+
"__default__": {
3+
"ntasks": 1,
4+
"partition": "short",
5+
"cpus-per-task": "{threads}",
6+
"mem": "20G",
7+
"output": "logs/rule_logs/{rule}-%j.log",
8+
"error": "logs/rule_logs/{rule}-%j.err"
9+
}
10+
}
Lines changed: 6 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,6 @@
1+
channels:
2+
- bioconda
3+
- conda-forge
4+
dependencies:
5+
- ensembl-vep =102
6+
- bcftools =1.10.2
Submodule snakemake_slurm_profile added at 4ecaf55

variant_annotation/src/Snakefile

Lines changed: 135 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,135 @@
1+
"""
2+
This pipeline annotates VCF using Variant Effect Predictor
3+
1. Sets up VEP cache and plugins
4+
2. Using cache, plugins and other locally available datasets, annoate variants in VCF
5+
"""
6+
7+
from pathlib import Path
8+
9+
# datasets to use for annotations
10+
configfile: config["datasets"]
11+
12+
13+
#### VEP parameters ####
14+
VEP_CACHE = 'homo_sapiens_refseq'
15+
SPECIES = 'homo_sapiens'
16+
REF_BUILD = "GRCh38"
17+
ENSEMBL_DATASET_VERSION = "102"
18+
HGVS = False
19+
STATS = False
20+
21+
### I/O parameters
22+
INPUT_VCF = config["vcf"]
23+
PROCESSED_DIR = Path(config["outdir"])
24+
EXTERNAL_DIR = Path("data/external")
25+
26+
if not (INPUT_VCF.endswith('vcf') or INPUT_VCF.endswith('vcf.gz')):
27+
print (f"Error: Input file extension not in expected format: found {INPUT_VCF}, expecting *.vcf or *.vcf.gz")
28+
raise SystemExit(1)
29+
30+
INPUT_VCF = Path(INPUT_VCF)
31+
OUTPUT_VCF = PROCESSED_DIR / ((INPUT_VCF.name).rstrip(".gz").rstrip(".vcf") + "_vep-annotated.vcf.gz")
32+
33+
34+
rule all:
35+
input:
36+
OUTPUT_VCF
37+
38+
39+
rule get_vep_cache:
40+
output:
41+
cache = directory(EXTERNAL_DIR / "vep" / "cache" / VEP_CACHE),
42+
params:
43+
species = VEP_CACHE,
44+
build = REF_BUILD,
45+
release = ENSEMBL_DATASET_VERSION,
46+
plugins = "CADD"
47+
message:
48+
"Retrieves VEP cache data"
49+
conda:
50+
"../configs/env/vep.yaml"
51+
shell:
52+
r"""
53+
vep_install --AUTO cfp \
54+
--SPECIES {params.species} \
55+
--ASSEMBLY {params.build} \
56+
--PLUGINS {params.plugins} \
57+
--CACHE_VERSION {params.release} \
58+
--CACHEDIR {output.cache} \
59+
--CONVERT \
60+
--NO_UPDATE
61+
"""
62+
63+
64+
rule get_vep_plugins:
65+
output:
66+
directory(EXTERNAL_DIR / "vep" / "plugins"),
67+
message:
68+
"Downloads VEP plugins"
69+
params:
70+
release = ENSEMBL_DATASET_VERSION
71+
wrapper:
72+
"0.59.2/bio/vep/plugins"
73+
74+
75+
rule annotate_variants:
76+
input:
77+
calls = INPUT_VCF,
78+
cache = EXTERNAL_DIR / "vep" / "cache" / VEP_CACHE,
79+
plugins = EXTERNAL_DIR / "vep" / "plugins",
80+
cadd_snv = config['cadd_snv'],
81+
cadd_indel = config['cadd_indel'],
82+
gerp = config['gerp'],
83+
gnomad_genomes = config['gnomad_genomes'],
84+
clinvar = config['clinvar'],
85+
dbNSFP = config['dbNSFP'],
86+
output:
87+
calls = OUTPUT_VCF,
88+
message:
89+
"Annotated vcf using VEP with CADD, gnomad-exomes, gnomad-genomes and GERP. "
90+
f"VEP cache used: {VEP_CACHE}, ref build: {REF_BUILD}, Ensemble version: {ENSEMBL_DATASET_VERSION}"
91+
params:
92+
release = ENSEMBL_DATASET_VERSION,
93+
species = SPECIES,
94+
build = REF_BUILD,
95+
refseq_flag = "--refseq" if 'refseq' in VEP_CACHE else "",
96+
hgvs_flag = "--hgvs" if HGVS else "",
97+
stats_flag = lambda wildcards, output: f"--stats_file {output.stats}" if STATS else "--no_stats",
98+
gnomad_fields = "AC,AN,AF,AF_afr,AF_afr_female,AF_afr_male,AF_ami,AF_ami_female,AF_ami_male,AF_amr,AF_amr_female,AF_amr_male,AF_asj,AF_asj_female," \
99+
"AF_asj_male,AF_eas,AF_eas_female,AF_eas_male,AF_female,AF_fin,AF_fin_female,AF_fin_male,AF_male,AF_nfe,AF_nfe_female,AF_nfe_male," \
100+
"AF_oth,AF_oth_female,AF_oth_male,AF_raw,AF_sas,AF_sas_female,AF_sas_male",
101+
clinvar_fields = "AF_ESP,AF_EXAC,AF_TGP,ALLELEID,CLNDN,CLNDNINCL,CLNDISDB,CLNDISDBINCL,CLNREVSTAT,CLNSIG,CLNSIGCONF,CLNSIGINCL,CLNVC,GENEINFO,MC,ORIGIN,RS,SSR",
102+
dbNSFP_fields = "LRT_score,MutationTaster_score,MutationAssessor_score,FATHMM_score,PROVEAN_score,VEST4_score,MetaSVM_score,MetaLR_score,M-CAP_score," \
103+
"CADD_phred,DANN_score,fathmm-MKL_coding_score,GenoCanyon_score,integrated_fitCons_score,GERP++_RS,phyloP100way_vertebrate,phyloP30way_mammalian," \
104+
"phastCons100way_vertebrate,phastCons30way_mammalian,SiPhy_29way_logOdds,Eigen-raw_coding,Eigen-raw_coding_rankscore,Eigen-phred_coding," \
105+
"Eigen-PC-raw_coding,Eigen-PC-raw_coding_rankscore,Eigen-PC-phred_coding",
106+
warnings_file = lambda wildcards, output: str(output.calls).replace('.vcf.gz', '_STDOUT_warnings.txt'),
107+
threads: 8
108+
conda:
109+
"../configs/env/vep.yaml"
110+
shell:
111+
r"""
112+
# using bcftools view as it might catch vcf-related errors (https://stackoverflow.com/a/63371639/3998252)
113+
bcftools view {input.calls} | \
114+
vep --fork {threads} \
115+
--format vcf \
116+
--vcf \
117+
--offline \
118+
--cache \
119+
--cache_version {params.release} \
120+
--species {params.species} \
121+
--assembly {params.build} \
122+
{params.refseq_flag} {params.hgvs_flag} \
123+
--sift s --polyphen s \
124+
--dir_cache {input.cache} \
125+
--dir_plugins {input.plugins} \
126+
--plugin CADD,{input.cadd_snv},{input.cadd_indel} \
127+
--plugin dbNSFP,{input.dbNSFP},{params.dbNSFP_fields} \
128+
--custom {input.gerp},GERP,bed \
129+
--custom {input.gnomad_genomes},gnomADv3,vcf,exact,0,{params.gnomad_fields} \
130+
--custom {input.clinvar},clinvar,vcf,exact,0,{params.clinvar_fields} \
131+
{params.stats_flag} \
132+
--warning_file {params.warnings_file} \
133+
--compress_output bgzip \
134+
--output_file {output.calls}
135+
"""

0 commit comments

Comments
 (0)