Commit bcf8822

Merge branch 'CAGI6-version' into 'master'

Ditto pipeline just after CAGI6 project

See merge request center-for-computational-genomics-and-data-science/sciops/ditto!3

2 parents: dd06512 + c665340

32 files changed: +2916 −35 lines

.gitignore

Lines changed: 16 additions & 3 deletions
```diff
@@ -7,6 +7,7 @@ __pycache__/
 
 # Distribution / packaging
 .Python
+#env/
 build/
 develop-eggs/
 dist/
@@ -18,9 +19,18 @@ lib64/
 parts/
 sdist/
 var/
+dask-worker-space/
 *.egg-info/
 .installed.cfg
 *.egg
+*.err
+*.out
+*.db
+*.py*.sh
+*.tsv
+*.csv
+*.gz*
+
 
 # PyInstaller
 # Usually these files are written by a python script from a template
@@ -41,6 +51,7 @@ htmlcov/
 nosetests.xml
 coverage.xml
 *,cover
+*.pdf
 
 # Translations
 *.mo
@@ -72,12 +83,13 @@ target/
 .ipynb_checkpoints/
 
 # exclude data from source control by default
-# data/
-variant_annotation/data/
+/data/
+cagi*/
 
 #snakemake
 .snakemake/
-
+# data/
+variant_annotation/data/
 
 # exclude test data used for development
 to_be_deleted/test_data/data/ref
@@ -92,3 +104,4 @@ logs/
 # .java/fonts dir get created when creating fastqc conda env
 .java/
 
+/.vscode/settings.json
```

README.md

Lines changed: 138 additions & 1 deletion
The previous one-line description ("Diagnosis prediction tool using AI") is replaced by the following content:

# DITTO

***!!! For research purposes only !!!***

- [DITTO](#ditto)
  - [Data](#data)
  - [Usage](#usage)
    - [Installation](#installation)
    - [Requirements](#requirements)
    - [Activate conda environment](#activate-conda-environment)
    - [Steps to run DITTO predictions](#steps-to-run-ditto-predictions)
      - [Run VEP annotation](#run-vep-annotation)
      - [Parse VEP annotations](#parse-vep-annotations)
      - [Filter variants for Ditto prediction](#filter-variants-for-ditto-prediction)
      - [DITTO prediction](#ditto-prediction)
      - [Combine with Exomiser scores](#combine-with-exomiser-scores)
    - [Cohort level analysis](#cohort-level-analysis)
  - [Contact information](#contact-information)

**Aim:** We aim to develop a pipeline for accurate and rapid prioritization of variants using a patient's genotype (VCF) and/or phenotype (HPO) information.

## Data

Input for this project is a single-sample VCF file. It is annotated using VEP and given to DITTO for predictions.

## Usage

### Installation

Installation simply requires fetching the source code. The following are required:

- Git

To fetch the source code, change into a directory of your choice and run:

```sh
git clone -b master \
    --recurse-submodules \
    git@gitlab.rc.uab.edu:center-for-computational-genomics-and-data-science/sciops/ditto.git
```

### Requirements

*OS:*

Currently works only on Linux. Docker versions may be explored later to make it usable on macOS (and potentially Windows).

*Tools:*

- Anaconda3
  - Tested with version: 2020.02

### Activate conda environment

Change into the root directory and run the commands below:

```sh
# create the conda environment (needed only the first time)
conda env create --file configs/envs/testing.yaml

# update an existing environment
conda env update --file configs/envs/testing.yaml

# activate the conda environment
conda activate testing
```

### Steps to run DITTO predictions

First, remove variants with `*` in the `ALT` allele column. These are "spanning or overlapping deletions", introduced in the VCF v4.3 specification; more on them [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-). The version of VEP we currently use does not support these variants; we will address this in a future release. Note that the command below also excludes non-SNP variants:

```sh
bcftools annotate -e'ALT="*" || type!="snp"' path/to/indexed_vcf.gz -Oz -o path/to/indexed_vcf_filtered.vcf.gz
```
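For readers without `bcftools`, the same exclusion can be sketched in plain Python. This is an illustrative, hypothetical snippet (not part of the repository) that drops records whose ALT contains `*` or that are not simple SNVs:

```python
def keep_record(vcf_line: str) -> bool:
    """Keep header lines and simple SNVs; drop spanning deletions and non-SNPs."""
    if vcf_line.startswith("#"):
        return True
    cols = vcf_line.rstrip("\n").split("\t")
    ref, alts = cols[3], cols[4].split(",")
    # ALT "*" marks a spanning/overlapping deletion (VCF v4.3)
    if "*" in alts:
        return False
    # a simple SNV has single-base REF and ALT alleles
    return all(len(ref) == 1 and len(alt) == 1 for alt in alts)

lines = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t100\t.\tA\tG",    # SNV: kept
    "1\t200\t.\tAT\tA",   # deletion: dropped
    "1\t300\t.\tC\t*",    # spanning deletion: dropped
]
filtered = [line for line in lines if keep_record(line)]
```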

#### Run VEP annotation

Please see the steps to run VEP [here](variant_annotation/README.md).

#### Parse VEP annotations

Please see the steps to parse VEP annotations [here](annotation_parsing/README.md).

#### Filter variants for Ditto prediction

The filtering step includes imputation and one-hot encoding of columns.

```sh
python src/Ditto/filter.py -i path/to/parsed_vcf_file.tsv -O path/to/output_directory
```

Output from this step includes:

```directory
output_directory/
├── data.csv             <--- used for DITTO predictions
├── Nulls.csv            - number of nulls in each column
├── stats_nssnv.csv      - variant stats from the VCF
├── correlation_plot.pdf - plot to check whether any columns are directly correlated (cutoff >0.95)
└── columns.csv          - columns before and after the filtering step
```
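The exact logic lives in `src/Ditto/filter.py`; as a rough, hypothetical illustration of what "imputation and one-hot encoding" means here (the column names are invented for the example, not the script's real columns):

```python
# hypothetical parsed-annotation rows; the real columns come from VEP
rows = [
    {"consequence": "missense_variant", "cadd": 22.1},
    {"consequence": "synonymous_variant", "cadd": None},  # missing value
]

# impute missing numeric values with the column median
known = sorted(r["cadd"] for r in rows if r["cadd"] is not None)
median = known[len(known) // 2]
for r in rows:
    if r["cadd"] is None:
        r["cadd"] = median

# one-hot encode the categorical column: one 0/1 column per category
categories = sorted({r["consequence"] for r in rows})
encoded = [
    {"cadd": r["cadd"],
     **{f"consequence_{c}": int(r["consequence"] == c) for c in categories}}
    for r in rows
]
```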

#### DITTO prediction

```sh
python src/Ditto/predict.py -i path/to/output_directory/data.csv --sample sample_name -o path/to/output_directory/ditto_predictions.csv -o100 path/to/output_directory/ditto_predictions_100.csv
```

#### Combine with Exomiser scores

If phenotype terms are available for the sample, Exomiser can be used to rank genes and then prioritize DITTO predictions according to the phenotype. Once you have Exomiser scores, run the following command to combine the Exomiser and DITTO scores:

```sh
python src/Ditto/combine_scores.py --raw path/to/parsed_vcf_file.tsv --sample sample_name --ditto path/to/output_directory/ditto_predictions.csv -ep path/to/exomiser_scores/directory -o path/to/output_directory/predictions_with_exomiser.csv -o100 path/to/output_directory/predictions_with_exomiser_100.csv
```
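As a hypothetical sketch of what combining the two score sets involves (the real `combine_scores.py` logic and file formats may differ, and the simple average below is illustrative only, not the pipeline's formula):

```python
# hypothetical inputs: DITTO scores per variant, Exomiser scores per gene
ditto_scores = {
    ("1", 100, "A", "G"): {"gene": "BRCA1", "ditto": 0.91},
    ("2", 500, "C", "T"): {"gene": "TP53", "ditto": 0.40},
}
exomiser_scores = {"BRCA1": 0.88, "TP53": 0.10}

combined = []
for variant, info in ditto_scores.items():
    exo = exomiser_scores.get(info["gene"], 0.0)  # genes Exomiser did not rank get 0
    combined.append({
        "variant": variant,
        "gene": info["gene"],
        "ditto": info["ditto"],
        "exomiser": exo,
        "combined": (info["ditto"] + exo) / 2,  # illustrative average only
    })

# rank variants by the combined score, best first
combined.sort(key=lambda row: row["combined"], reverse=True)
```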

### Cohort level analysis

Please refer to the [CAGI6-RGP](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6) project for filtering and annotating variants (as done above for a single-sample VCF), along with calculating Exomiser scores.

For predictions, make the necessary directory edits to the snakemake [workflow](workflow/Snakefile) and run the following command:

```sh
sbatch src/predict_variant_score.sh
```

**Note**: The commit used for the CAGI6 challenge pipeline is [be97cf5d](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/ditto/-/merge_requests/3/diffs?commit_id=be97cf5dbfcb099ac82ef28d5d8b0919f28aed99). It was used along with annotated VCFs and Exomiser scores obtained from the [rgp_cagi6 workflow](https://gitlab.rc.uab.edu/center-for-computational-genomics-and-data-science/sciops/mana/mini_projects/rgp_cagi6).

## Contact information

For issues, please send an email with a clear description to:

Tarun Mamidi - tmamidi@uab.edu

annotation_parsing/parse_annotated_vars.py

Lines changed: 23 additions & 22 deletions
This commit disables per-sample depth reporting: `parse_var_info` is no longer called and `write_parsed_variant` drops its `var_info` argument (the related code is commented out rather than removed). Several hunks change only whitespace; the original indentation is approximated below.

```diff
@@ -17,10 +17,10 @@ def parse_n_print(vcf, outfile):
             output_header = ["Chromosome", "Position", "Reference Allele", "Alternate Allele"] + \
                 line.replace(" Allele|", " VEP_Allele_Identifier|").split("Format: ")[1].rstrip(">").rstrip('"').split("|")
         elif line.startswith("#CHROM"):
-            vcf_header = line.split("\t")
+            vcf_header = line.split("\t")
         else:
             break
-
+
     for idx, sample in enumerate(vcf_header):
         if idx > 8:
             output_header.append(f"{sample} allele depth")
@@ -36,7 +36,8 @@ def parse_n_print(vcf, outfile):
         line = line.rstrip("\n")
         cols = line.split("\t")
         csq = parse_csq(next(filter(lambda info: info.startswith("CSQ="), cols[7].split(";"))).replace("CSQ=", ""))
-        var_info = parse_var_info(vcf_header, cols)
+        #print(line, file=open("var_info.txt", "w"))
+        #var_info = parse_var_info(vcf_header, cols)
         alt_alleles = cols[4].split(",")
         alt2csq = format_alts_for_csq_lookup(cols[3], alt_alleles)
         for alt_allele in alt_alleles:
@@ -45,14 +46,14 @@ def parse_n_print(vcf, outfile):
                 possible_alt_allele4lookup = alt_allele
             try:
                 write_parsed_variant(
-                    out,
-                    vcf_header,
-                    cols[0],
-                    cols[1],
-                    cols[3],
-                    alt_allele,
-                    csq[possible_alt_allele4lookup],
-                    var_info[alt_allele]
+                    out,
+                    vcf_header,
+                    cols[0],
+                    cols[1],
+                    cols[3],
+                    alt_allele,
+                    csq[possible_alt_allele4lookup]
+                    #,var_info[alt_allele]
                 )
             except KeyError:
                 print("Variant annotation matching based on allele failed!")
@@ -62,15 +63,15 @@ def parse_n_print(vcf, outfile):
             raise SystemExit(1)
 
 
-def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots, var_info):
+def write_parsed_variant(out_fp, vcf_header, chr, pos, ref, alt, annots):#, var_info):
     var_list = [chr, pos, ref, alt]
     for annot_info in annots:
         full_fmt_list = var_list + annot_info
-        for idx, sample in enumerate(vcf_header):
-            if idx > 8:
-                full_fmt_list.append(str(var_info[sample]["alt_depth"]))
-                full_fmt_list.append(str(var_info[sample]["total_depth"]))
-                full_fmt_list.append(str(var_info[sample]["prct_reads"]))
+        #for idx, sample in enumerate(vcf_header):
+        #    if idx > 8:
+        #        full_fmt_list.append(str(var_info[sample]["alt_depth"]))
+        #        full_fmt_list.append(str(var_info[sample]["total_depth"]))
+        #        full_fmt_list.append(str(var_info[sample]["prct_reads"]))
 
         out_fp.write("\t".join(full_fmt_list) + "\n")
 
@@ -103,9 +104,9 @@ def parse_csq(csq):
         parsed_annot = annot.split("|")
         if parsed_annot[0] not in csq_allele_dict:
             csq_allele_dict[parsed_annot[0]] = list()
-
+
         csq_allele_dict[parsed_annot[0]].append(parsed_annot)
-
+
     return csq_allele_dict
@@ -129,13 +130,13 @@ def parse_var_info(headers, cols):
         alt_depth = int(ad_info[alt_index + 1])
         total_depth = sum([int(dp) for dp in ad_info])
         prct_reads = (alt_depth / total_depth) * 100
-
+
         allele_dict[sample] = {
             "alt_depth": alt_depth,
            "total_depth": total_depth,
            "prct_reads": prct_reads
         }
-
+
         parsed_alleles[alt_allele] = allele_dict
 
     return parsed_alleles
@@ -184,5 +185,5 @@ def is_valid_file(p, arg):
 
 inputf = Path(ARGS.input_vcf)
 outputf = Path(ARGS.output) if ARGS.output else inputf.parent / inputf.stem.rstrip(".vcf") + ".tsv"
-
+
 parse_n_print(inputf, outputf)
```
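For context, the `parse_csq` helper in the hunks above groups comma-separated VEP `CSQ` entries by their first `|`-delimited field (the VEP allele identifier). A compact, standalone sketch of the same grouping, with a hypothetical `CSQ` string for illustration:

```python
def parse_csq(csq: str) -> dict:
    """Group VEP CSQ entries (comma-separated, '|'-delimited) by allele identifier."""
    csq_allele_dict = {}
    for annot in csq.split(","):
        parsed_annot = annot.split("|")
        # first field is the VEP allele identifier; collect annotations per allele
        csq_allele_dict.setdefault(parsed_annot[0], []).append(parsed_annot)
    return csq_allele_dict

# hypothetical CSQ string: two annotations for allele "G", one for "T"
example = "G|missense_variant|BRCA1,G|upstream_gene_variant|NBR2,T|synonymous_variant|BRCA1"
by_allele = parse_csq(example)
```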

configs/cluster_config.json

Lines changed: 16 additions & 0 deletions
New file; all 16 lines added:

```json
{
    "__default__": {
        "ntasks": 1,
        "partition": "express",
        "cpus-per-task": "{threads}",
        "mem-per-cpu": "4G",
        "output": "logs/rule_logs/{rule}-%j.log"
    },
    "ditto_filter": {
        "partition": "largemem",
        "mem-per-cpu": "200G"
    },
    "combine_scores": {
        "mem-per-cpu": "50G"
    }
}
```
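As a rough illustration of how such a cluster config is consumed (this sketch is not Snakemake itself): per-rule entries override `__default__`, and `{threads}`/`{rule}` placeholders are filled from job properties:

```python
# the cluster config from configs/cluster_config.json, inlined for the example
cluster_config = {
    "__default__": {
        "ntasks": 1,
        "partition": "express",
        "cpus-per-task": "{threads}",
        "mem-per-cpu": "4G",
        "output": "logs/rule_logs/{rule}-%j.log",
    },
    "ditto_filter": {"partition": "largemem", "mem-per-cpu": "200G"},
    "combine_scores": {"mem-per-cpu": "50G"},
}

def job_resources(rule: str, threads: int) -> dict:
    """Merge per-rule settings over the defaults and fill placeholders."""
    merged = {**cluster_config["__default__"], **cluster_config.get(rule, {})}
    return {key: str(val).format(threads=threads, rule=rule)
            for key, val in merged.items()}

res = job_resources("ditto_filter", threads=4)
```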
