uab-cgds-worthey · JmScherer · Dec 12, 2025
diff --git a/.gitignore b/.gitignore
@@ -81,6 +81,7 @@ work
 .nextflow/
 .nextflow*
 report*
+work_dir/
 
 #logs
 logs/
@@ -90,3 +91,6 @@ logs/
 
 # .java/fonts dir get created when creating fastqc conda env
 .java/
+
+# DITTO output
+data/output/
diff --git a/.test_data/file_list.txt b/.test_data/file_list.txt
@@ -1,2 +1,2 @@
 .test_data/oc_test_data.vcf.gz
-.test_data/testing_variants_hg38.vcf.gz
+.test_data/testing_variants_hg38.vcf.gz
diff --git a/README.md b/README.md
@@ -14,6 +14,50 @@ Markdown](https://github.com/uab-cgds-worthey/DITTO/actions/workflows/linting.ym
 DITTO is an explainable neural network that can be helpful for accurate and rapid interpretation of small
 genetic variants for pathogenicity using patient’s genotype (VCF) information.
 
+## Getting Started
+
+- [Prerequisites](#prerequisites)
+- [Using DITTO](#using-ditto)
+  - [Webapp](#webapp)
+  - [API](#api)
+  - [Prediction](#prediction)
+    - [Local Prediction](#local-prediction)
+    - [HPC Prediction with Cheaha](#hpc-prediction-with-cheaha)
+- [Reproducing the DITTO model](#reproducing-the-ditto-model)
+- [Download DITTO DB (Precomputed scores)](#download-ditto-db-precomputed-scores)
+- [How to cite?](#how-to-cite)
+- [Contact](#contact-information)
+
+## Prerequisites
+
+The following prerequisites must be installed in the target environment to deploy and run the DITTO prediction
+model.
+
+### Tools
+
+- [Python 3.10](https://www.python.org/) - [Install](https://www.python.org/downloads/)
+  - The specified OpenCravat version requires Python 3.10
+- [Anaconda3 25.7+](https://anaconda.com/) - [install](https://www.anaconda.com/docs/getting-started/anaconda/install)
+- [OpenCravat 2.4.1](https://www.opencravat.org/) - [install](https://github.com/KarchinLab/open-cravat/releases/tag/2.4.1)
+- [Git](https://git-scm.com/)
+  - Setup with your favorite git client. Here is a [GitHub Guide](https://github.com/git-guides/install-git)
+  for different platforms.
+- [Nextflow 22.10.7+](https://www.nextflow.io/) - [install](https://www.nextflow.io/docs/latest/install.html)
+
+> ***NOTE:*** Current version of OpenCravat that we're using doesn't support "Spanning or overlapping deletions"
+> variants i.e. variants with `*` in `ALT Allele` column. More on these variants
+<!-- markdown-link-check-disable -->
+> [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele).
+<!-- markdown-link-check-enable -->
+> These will be ignored when running the pipeline.
+
+### System Requirements
+
+- CPU: >2
+- RAM: ~25GB for a WGS VCF sample
+- Storage: 1TB
+  - Most of this storage hosts the OpenCravat annotators, which need ~600GB in total
+
 ## Using DITTO
 
 DITTO scores for variants can be obtained by the below 3 ways. Webapp and API are for single variant analysis and the
@@ -30,89 +74,129 @@ DITTO is available for public use at this [website](https://cgds-ditto.streamlit
 DITTO is not hosted as a public API but one can serve up locally to query DITTO scores. Please follow the instructions
 in this [GitHub repo](https://github.com/uab-cgds-worthey/DITTO-API).
 
-### Setting up to use locally
+### Prediction
+
+#### Installation
+
+To fetch the DITTO source code, change into a directory of your choice and run:
+
+```sh
+git clone https://github.com/uab-cgds-worthey/DITTO.git
+cd DITTO
+```
+
+### Local Prediction
 
 > ***NOTE:*** This setup will allow one to annotate a VCF sample and make DITTO predictions. Currently tested only in
 > Cheaha (UAB HPC) because of resource limitations to download datasets from OpenCRAVAT.
 > Docker versions may need to be explored later to make it useable in Mac and Windows.
 
-#### System Requirements
+#### Nextflow Conda vs. Mamba Setup
 
-*Tools:*
+***NOTE:*** If you use Mamba instead of Conda, Nextflow can be configured to use it by setting the **useMamba**
+parameter in the `conda` block of the `configs/nextflow/local.config` file to match your
+environment:
 
-- Anaconda3
-- OpenCravat-2.4.1
-- Git
+```groovy
+// Defaults to false; set to true if using Mamba
 
-*Resources:*
+conda {
+    enabled = true
+    useMamba = true
+}
+```
 
-- CPU: > 2
-- Storage: ~1TB
-- RAM: ~25GB for a WGS VCF sample
+#### Setup Steps
 
-#### Installation
+- ***Setup OpenCravat (only one-time installation)***
 
-Requirements:
+  Please follow the steps mentioned in [install_openCravat.md](docs/install_openCravat.md).
 
-- DITTO repo from GitHub
-- OpenCravat with databases to annotate
-- Nextflow >=22.10.7
+- ***Setup Nextflow***
 
-To fetch DITTO source code, change in to directory of your choice and run:
+  Create an environment via conda. Below is an example to install `nextflow`.
 
-```sh
-git clone https://github.com/uab-cgds-worthey/DITTO.git
-```
+  ```sh
+  # Create the environment (first time only). For conda itself, see the Anaconda install link under Prerequisites.
+  conda env create -f ./configs/conda/ditto-env.yaml
 
-#### Run DITTO pipeline on UAB Cheaha
+  conda activate ditto-env
+  ```
 
-To run on UAB cheaha, please update the `model.job` (outdir and samplesheet) and `.test_data/file_list.txt` (inout vcfs)
- files with complete file paths and submit a slurm job using the command below
+- ***Sample Sheet***
 
-```sh
-sbatch model.job
-```
+  Please make a samplesheet `.test_data/file_list.txt` listing the VCF files (incl. path), one per line. Paths to the
+  vcf.gz files can be relative or absolute. Relative paths are resolved against the directory that DITTO is launched
+  from (not the Nextflow `-work-dir`).
 
-#### Run DITTO pipeline outside of UAB Cheaha
+  Example `file_list.txt` with relative paths:
 
-***Setup OpenCravat (only one-time installation)***
+  ```bash
+  .test_data/oc_test_data.vcf.gz
+  .test_data/testing_variants_hg38.vcf.gz
 
-Please follow the steps mentioned in [install_openCravat.md](docs/install_openCravat.md).
+  # Example, will become: /Users/<username>/Workspace/DITTO/.test_data/oc_test_data.vcf.gz
+  ```
 
-> ***NOTE:*** Current version of OpenCravat that we're using doesn't support "Spanning or overlapping deletions"
-> variants i.e. variants with `*` in `ALT Allele` column. More on these variants
-<!-- markdown-link-check-disable -->
-> [here](https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele).
-<!-- markdown-link-check-enable -->
-> These will be ignored when running the pipeline.
+  Or absolute paths
 
-***Setup Nextflow***
+  ```bash
+  /Users/<username>/Desktop/test_data/oc_test_data.vcf.gz
+  /Users/<username>/Desktop/test_data/testing_variants_hg38.vcf.gz
 
-Create an environment via conda. Below is an example to install `nextflow`.
+  # Example is using MacOS Desktop folder with test_data directory
+  ```
 
-- [Anaconda virtual environment](https://docs.anaconda.com/free/anaconda/install/index.html)
+  This will run DITTO prediction for both vcf files in the `file_list.txt`.
 
-```sh
-# create environment. Needed only the first time. Please use the above link if you're not using Mac.
-conda create --name ditto-env
+- ***Run the NextFlow pipeline***
 
-conda activate ditto-env
+  Please make sure to edit the directory paths as needed and run the pipeline as shown below.
 
-# Install nextflow
-conda install bioconda::nextflow
-```
+  ```sh
+  # Note: the Nextflow work directory is set with `-work-dir` in the run command
+  # Note: `--outdir` cannot be relative; set an absolute path Nextflow can access, e.g. `/tmp/DITTO/output`
 
-Please make a samplesheet `.test_data/file_list.txt` with VCF files (incl. path).
-Please make sure to edit the directory paths as needed and run
-the pipeline as shown below.
+  nextflow run pipeline.nf \
+    -work-dir ./work_dir \
+    --build hg38 -c ./configs/nextflow/local.config -with-report \
+    --sample_sheet .test_data/file_list.txt \
+    --oc_modules /<path-to>/opencravat/modules \
+    --outdir $PWD/data/output
+  ```
+
+### HPC Prediction with Cheaha
+
+To run on UAB Cheaha, first clone the DITTO repository into a Cheaha directory (see [installation](#installation)).
+
+- Create a text file listing the path to VCF file(s) (1 path per line) with variants to score
+  - Paths can be absolute **or** relative to the directory the pipeline is run from (**note:** this is the directory
+    that contains `pipeline.nf`)
+- See the example input file [.test_data/file_list.txt](.test_data/file_list.txt) (lists 2 testing example input VCFs)
+  for reference or as an input file for testing (default behavior of `model.job`)
+  - Relative paths in the file are resolved against the directory that DITTO is launched from, not the Nextflow
+    `-work-dir`.
+
+  Example `file_list.txt` with relative paths:
+
+  ```bash
+  .test_data/oc_test_data.vcf.gz
+  .test_data/testing_variants_hg38.vcf.gz
+
+  # Example, will become: /home/<username>/Workspace/DITTO/.test_data/oc_test_data.vcf.gz
+  ```
+
+  Or absolute paths
+
+  ```bash
+  /home/<username>/test_data/oc_test_data.vcf.gz
+  /home/<username>/test_data/testing_variants_hg38.vcf.gz
+
+  # Example is using Linux home directory with a test_data directory
+  ```
+
+- Update `model.job` (change the `--sample_sheet` option to your input file with VCF path(s) and
+  `--outdir` to the desired output location of DITTO predictions)
 
 ```sh
-nextflow run pipeline.nf \
-  --outdir ./data/ \
-  -work-dir ./wor_dir \
-  --build hg38 -with-report \
-  --oc_modules /data/opencravat/modules \
-  --sample_sheet .test_data/file_list
+sbatch model.job
 ```
 
 ## Reproducing the DITTO model
@@ -125,17 +209,19 @@ Precomputed scores for all possible SNVs and known Indels from gnomADv3.0 in mai
 are available to download here - <https://s3.lts.rc.uab.edu/cgds-public/dittodb/dittodb.html>
 
 ## How to cite?
+
 <!-- markdown-link-check-disable -->
 Mamidi, T.K.K.; Wilk, B.M.; Gajapathy, M.; Worthey, E.A. DITTO: An Explainable Machine-Learning Model for
 Transcript-Specific Variant Pathogenicity Prediction. Preprints 2024, 2024040837. <https://doi.org/10.20944/preprints202404.0837.v1>
 <!-- markdown-link-check-enable -->
+
 ## Contact information
 
 For queries, please open a GitHub issue.
 
 For urgent queries, send an email with clear description to
 
-|Name | Email |
-|------|--------|
-|Tarun Mamidi | <[email protected]>|
-|Liz Worthey | <[email protected]>|
+|    Name      |        Email       |
+|--------------|--------------------|
+| Tarun Mamidi | <[email protected]>  |
+| Liz Worthey  | <[email protected]> |
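
Reviewer sketch (not part of this PR): the sample-sheet rules described in the README changes above (one VCF path per line, relative paths resolved from the launch directory) can be checked before starting a run. The helper name `check_sample_sheet` is hypothetical, and skipping blank and `#` lines is an assumption of this sketch; how `pipeline.nf` treats such lines is not shown in the diff.

```sh
# Hypothetical pre-flight check, not part of DITTO.
# Succeeds only if every path listed in the sample sheet is a readable file.
# Relative paths are tested against the current directory, so run this from
# the directory where `nextflow run pipeline.nf` will be launched.
check_sample_sheet() {
  css_sheet="$1"
  css_missing=0
  if [ ! -r "$css_sheet" ]; then
    echo "cannot read sample sheet: $css_sheet" >&2
    return 1
  fi
  while IFS= read -r css_path || [ -n "$css_path" ]; do
    case "$css_path" in
      '' | '#'*) continue ;;   # assumption: ignore blank and comment lines
    esac
    if [ ! -f "$css_path" ] || [ ! -r "$css_path" ]; then
      echo "missing or unreadable: $css_path" >&2
      css_missing=$((css_missing + 1))
    fi
  done < "$css_sheet"
  [ "$css_missing" -eq 0 ]
}
```

For example, `check_sample_sheet .test_data/file_list.txt && sbatch model.job` would submit the job only when all listed files are found.
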
diff --git a/configs/conda/ditto-env.yaml b/configs/conda/ditto-env.yaml
@@ -0,0 +1,9 @@
+name: ditto-env
+
+channels:
+  - bioconda
+  - conda-forge
+
+dependencies:
+  - nextflow=22.10
+  - conda=23.1
diff --git a/configs/envs/ditto-nf.yaml → configs/conda/ditto-nf.yaml b/configs/envs/ditto-nf.yaml → configs/conda/ditto-nf.yaml
@@ -1,10 +1,14 @@
+name: ditto-nf
+
 channels:
   - conda-forge
+
 dependencies:
   - python=3.10.11
   - pandas=2.0.1
   - numpy=1.23.5
   - pyaml=23.7.0
   - pip=23.2.1
   - pip:
+    - --only-binary h5py
     - tensorflow==2.11
diff --git a/configs/envs/environment.yaml → configs/conda/environment.yaml b/configs/envs/environment.yaml → configs/conda/environment.yaml
diff --git a/configs/envs/open-cravat.yaml → configs/conda/open-cravat.yaml b/configs/envs/open-cravat.yaml → configs/conda/open-cravat.yaml
@@ -1,7 +1,11 @@
+name: opencravat-env
+
 dependencies:
   - pip
+  - python=3.10
   - pip:
     - pytabix==0.1
     - open-cravat==2.4.1
+
     #- joblib==1.3.2
     #- git+https://github.com/tkmamidi/open-cravat.git
diff --git a/cheaha.config → configs/nextflow/cheaha.config b/cheaha.config → configs/nextflow/cheaha.config
diff --git a/configs/nextflow/local.config b/configs/nextflow/local.config
@@ -0,0 +1,8 @@
+conda {
+    enabled = true
+    useMamba = false
+}
+
+env {
+    TMPDIR="/tmp/DITTO/"
+}
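
As a hedged illustration of how the `useMamba` value above might be chosen for a given machine, the sketch below prints `true` when a `mamba` command is found on `PATH` and `false` otherwise. The function name `suggest_use_mamba` is hypothetical; editing `local.config` remains a manual step.

```sh
# Hypothetical helper, not part of DITTO: suggest a value for conda.useMamba
# based on whether a `mamba` command can be found on PATH.
suggest_use_mamba() {
  if command -v mamba >/dev/null 2>&1; then
    echo true
  else
    echo false
  fi
}
```

Running `suggest_use_mamba` in the shell that will launch Nextflow gives the value to place in the `conda` block.
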
diff --git a/configs/mypackage/mypackage.py → configs/opencravat/mypackage/mypackage.py b/configs/mypackage/mypackage.py → configs/opencravat/mypackage/mypackage.py
diff --git a/configs/mypackage/mypackage.yml → configs/opencravat/mypackage/mypackage.yml b/configs/mypackage/mypackage.yml → configs/opencravat/mypackage/mypackage.yml
diff --git a/configs/opencravat_test_config.json → ...gs/opencravat/opencravat_test_config.json b/configs/opencravat_test_config.json → ...gs/opencravat/opencravat_test_config.json
diff --git a/configs/opencravat_train_config.json → ...s/opencravat/opencravat_train_config.json b/configs/opencravat_train_config.json → ...s/opencravat/opencravat_train_config.json
diff --git a/docs/build_DITTO.md b/docs/build_DITTO.md
@@ -49,10 +49,10 @@ Create environment and install dependencies
 
 ```sh
 # create conda environment. Needed only the first time.
-conda env create --file configs/envs/environment.yaml
+conda env create --file configs/conda/environment.yaml
 
 # if you need to update existing environment
-conda env update --file configs/envs/environment.yaml
+conda env update --file configs/conda/environment.yaml
 
 # activate conda environment
 conda activate training
@@ -74,7 +74,7 @@ oc run clinvar.vcf.gz -l hg38 -t csv --package mypackage -d path/to/output/direc
 
 > ***NOTE:*** By default OpenCravat uses all available CPUs. Please specify the number of CPU cores using this parameter
 > in the above command `--mp 2`. Minimum number of CPUs to use is 2. Also, please make sure to setup `mypackage` from
-> `configs` directory to your modules directory. To learn more about it, please review [OpenCravat's package](https://open-cravat.readthedocs.io/en/latest/Package.html).
+> `configs/opencravat` directory to your modules directory. To learn more about it, please review [OpenCravat's package](https://open-cravat.readthedocs.io/en/latest/Package.html).
 
 ## Preprocessing
 
@@ -84,7 +84,7 @@ the below command
 
 ```sh
 python src/annotation_parsing/parse_single_sample.py -i clinvar.vcf.gz.variant.csv -e parse \
-    -o clinvar.vcf.gz.variant.csv_parsed.csv.gz -c configs/opencravat_train_config.json
+    -o clinvar.vcf.gz.variant.csv_parsed.csv.gz -c configs/opencravat/opencravat_train_config.json
 ```
 
 Filter and process the annotations as shown in this [python
@@ -110,9 +110,9 @@ Follow the below steps to install and add more databases for annotation and befo
 
 1. Install the annotator/database using OpenCravat.
 
-2. Add the annotator to `mypackage/mypackage.yml` and reannotate the clinvar VCF file.
+2. Add the annotator to `configs/opencravat/mypackage/mypackage.yml` and reannotate the clinvar VCF file.
 
-3. Add the annotator to the [train config](../configs/opencravat_train_config.json) and specify how to parse the
+3. Add the annotator to the [train config](../configs/opencravat/opencravat_train_config.json) and specify how to parse the
    annotation.
 
 4. Follow the steps from Preprocessing above to parse, filter, process, tune and train DITTO.

diff --git a/docs/install_openCravat.md b/docs/install_openCravat.md
@@ -62,7 +62,7 @@ oc module install vcfreporter csvreporter tsvreporter -y
 Package is a module which defines module installation and job parameters. To learn more about OpenCravat's package,
 please click [here](https://open-cravat.readthedocs.io/en/latest/Package.html).
 
-Here's the package for DITTO - `configs/mypackage/mypackage.yml`
+Here's the package for DITTO - `configs/opencravat/mypackage/mypackage.yml`
 
 Copy the package directory to the modules directory.
 
@@ -71,5 +71,5 @@ Copy the package directory to the modules directory.
 oc config md
 
 # copy the package to the modules directory
-cp -r configs/mypackage path/to/modules/directory/
+cp -r configs/opencravat/mypackage path/to/modules/directory/
 ```
diff --git a/shap_plots/NMD_SHAP.pdf → docs/shap_plots/NMD_SHAP.pdf b/shap_plots/NMD_SHAP.pdf → docs/shap_plots/NMD_SHAP.pdf
diff --git a/shap_plots/Neural_network_features.pdf → docs/shap_plots/Neural_network_features.pdf b/shap_plots/Neural_network_features.pdf → docs/shap_plots/Neural_network_features.pdf
diff --git a/shap_plots/intergenic_SHAP.pdf → docs/shap_plots/intergenic_SHAP.pdf b/shap_plots/intergenic_SHAP.pdf → docs/shap_plots/intergenic_SHAP.pdf
diff --git a/shap_plots/intron_SHAP.pdf → docs/shap_plots/intron_SHAP.pdf b/shap_plots/intron_SHAP.pdf → docs/shap_plots/intron_SHAP.pdf
diff --git a/shap_plots/missense_SHAP.pdf → docs/shap_plots/missense_SHAP.pdf b/shap_plots/missense_SHAP.pdf → docs/shap_plots/missense_SHAP.pdf
diff --git a/shap_plots/splice site_SHAP.pdf → docs/shap_plots/splice site_SHAP.pdf b/shap_plots/splice site_SHAP.pdf → docs/shap_plots/splice site_SHAP.pdf
diff --git a/model.job b/model.job
@@ -24,10 +24,11 @@ module load Anaconda3
 #conda activate nextflow
 
 #Modify paths to include full paths and run the pipeline here
+#Note: -work-dir takes a single dash because it is a Nextflow option; pipeline parameters such as --outdir take two
 /data/project/worthey_lab/tools/nextflow/nextflow-22.10.7/nextflow run pipeline.nf \
-  --outdir /data \
   -work-dir $USER_SCRATCH \
-  --build hg38 -c cheaha.config -with-report \
-  --sample_sheet .test_data/file_list.txt -resume
+  --build hg38 -c ./configs/nextflow/cheaha.config -with-report \
+  --sample_sheet .test_data/file_list.txt -resume \
+  --outdir $PWD/data/output
 
 #https://training.nextflow.io/basic_training/cache_and_resume/#how-to-organize-in-silico-experiments
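
The `--outdir $PWD/data/output` change follows the README note that the output directory cannot be relative. A minimal sketch of that check, assuming a hypothetical helper `is_absolute_path` that is not part of DITTO or Nextflow:

```sh
# Hypothetical helper, not part of DITTO or Nextflow.
# Succeeds when the argument starts with "/", i.e. is an absolute POSIX path.
is_absolute_path() {
  case "$1" in
    /*) return 0 ;;
    *)  return 1 ;;
  esac
}

# Usage: build the output directory from $PWD, as model.job does, then verify it.
outdir="$PWD/data/output"
if is_absolute_path "$outdir"; then
  echo "outdir ok: $outdir"
else
  echo "outdir must be absolute: $outdir" >&2
fi
```

Prefixing with `$PWD` keeps the command portable across users while still handing Nextflow an absolute path.
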