updates to run DITTO pipeline and instructions in readme

tkmamidi · tkmamidi · commit dd192717b679 · 2024-12-09T22:24:49.000-06:00
diff --git a/.test_data/README b/.test_data/README
@@ -1,7 +1,10 @@
+# Test Data Directory
+
 This directory has 3 files -
 
 `oc_test_data.vcf.gz` - test multi-sample VCF data from OpenCRAVAT
 
 `testing_variants_hg38.vcf.gz` - We custom made a test VCF file with few variants from every chromosome (1-22,X,Y)
 
-`file_list.txt` - contains list of above 2 test vcf files with relative path. This file is used to test nextflow pipeline
+`file_list.txt` - contains list of above 2 test vcf files with relative path. Please add the full paths to this file and
+test nextflow pipeline
diff --git a/README.md b/README.md
@@ -64,7 +64,18 @@ To fetch DITTO source code, change in to directory of your choice and run:
 git clone https://github.com/uab-cgds-worthey/DITTO.git
 ```
 
-#### Setup OpenCravat (only one-time installation)
+#### Run DITTO pipeline on UAB Cheaha
+
+To run on UAB cheaha, please update the `model.job` and `.test_data/file_list.txt` files with complete file paths for all
+necessary files and tools and submit a slurm job using the command below
+
+```sh
+sbatch model.job
+```
+
+#### Run DITTO pipeline outside of UAB Cheaha
+
+***Setup OpenCravat (only one-time installation)***
 
 Please follow the steps mentioned in [install_openCravat.md](docs/install_openCravat.md).
 
@@ -75,7 +86,7 @@ Please follow the steps mentioned in [install_openCravat.md](docs/install_openCr
 <!-- markdown-link-check-enable -->
 > These will be ignored when running the pipeline.
 
-#### Run DITTO pipeline
+***Setup Nextflow***
 
 Create an environment via conda. Below is an example to install `nextflow`.
 
@@ -91,7 +102,8 @@ conda activate ditto-env
 conda install bioconda::nextflow
 ```
 
-Please make a samplesheet with VCF files (incl. path). Please make sure to edit the directory paths as needed and run
+Please make a samplesheet `.test_data/file_list.txt` with VCF files (incl. path).
+Please make sure to edit the directory paths as needed and run
 the pipeline as shown below.
 
 ```sh
@@ -103,24 +115,18 @@ nextflow run pipeline.nf \
   --sample_sheet .test_data/file_list
 ```
 
-To run on UAB cheaha, please update the `model.job` file and submit a slurm job using the command below
-
-```sh
-sbatch model.job
-```
-
 ## Reproducing the DITTO model
 
 Detailed instructions on reproducing the model is explained in [build_DITTO.md](docs/build_DITTO.md)
 
 ## Download DITTO DB (Precomputed scores)
 
-Precomputed scores for all possible SNVs and known Indels from gnomADv3.0 in main chromosomes in hg38 reference genome 
+Precomputed scores for all possible SNVs and known Indels from gnomADv3.0 in main chromosomes in hg38 reference genome
 are available to download here - <https://s3.lts.rc.uab.edu/cgds-public/dittodb/dittodb.html>
 
 ## How to cite?
 <!-- markdown-link-check-disable -->
-Mamidi, T.K.K.; Wilk, B.M.; Gajapathy, M.; Worthey, E.A. DITTO: An Explainable Machine-Learning Model for 
+Mamidi, T.K.K.; Wilk, B.M.; Gajapathy, M.; Worthey, E.A. DITTO: An Explainable Machine-Learning Model for
 Transcript-Specific Variant Pathogenicity Prediction. Preprints 2024, 2024040837. <https://doi.org/10.20944/preprints202404.0837.v1>
 <!-- markdown-link-check-enable -->
 ## Contact information
diff --git a/cheaha.config b/cheaha.config
@@ -1,6 +1,6 @@
 conda {
     enabled = true
-    cacheDir = '/nextflow/nextflow-conda-env-cache/'
+    cacheDir = '/data/project/worthey_lab/tools/nextflow/nextflow-conda-env-cache/'
 }
 
 // Define the Scratch directory
diff --git a/model.job b/model.job
@@ -5,7 +5,7 @@
 #
 # Number of tasks needed for this job. Generally, used with MPI jobs
 #SBATCH --ntasks=1
-#SBATCH --partition=amd-hdr100-res
+#SBATCH --partition=amd-hdr100
 #SBATCH --time=06:00:00
 #
 # Number of CPUs allocated to each task.
@@ -23,10 +23,10 @@ module load Java/13.0.2
 module load Anaconda3
 #conda activate nextflow
 
-#Modify paths and run the pipeline here
-/data/project/worthey_lab/tools/nextflow/nextflow-22.10.7/nextflow run ../pipeline.nf \
-  --outdir /data/results \
-  -work-dir .work_dir/ \
+#Modify paths to include full paths and run the pipeline here
+/data/project/worthey_lab/tools/nextflow/nextflow-22.10.7/nextflow run pipeline.nf \
+  --outdir /data \
+  -work-dir $USER_SCRATCH \
   --build hg38 -c cheaha.config -with-report \
   --sample_sheet .test_data/file_list.txt -resume
 
diff --git a/pipeline.nf b/pipeline.nf
@@ -30,7 +30,7 @@ log.info """\
 process runOC {
 
   // Define the conda environment file to be used
-  conda '../configs/envs/open-cravat.yaml'
+  conda './configs/envs/open-cravat.yaml'
 
   input:
   path var_ch
@@ -72,7 +72,7 @@ process parseAnnotation {
 process prediction {
 
   // Define the conda environment file to be used
-  conda '../configs/envs/ditto-nf.yaml'
+  conda './configs/envs/ditto-nf.yaml'
 
   input:
   path var_parse_ch

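Note on the `file_list.txt` change: the updated README asks users to fill in full paths by hand. A minimal sketch of one way to generate such a sheet follows. It assumes the sample sheet holds one VCF path per line, which this diff does not spell out, and `make_sample_sheet` is a hypothetical helper name, not part of DITTO.

```sh
#!/bin/sh
# Hypothetical helper (not part of the DITTO repo): write the absolute path
# of every *.vcf.gz file in a directory to a sample sheet, one path per line.
# Assumption: the pipeline expects one VCF path per line in the sheet.
make_sample_sheet() {
  vcf_dir=$1    # directory containing the VCF files
  out_file=$2   # sample sheet to (over)write

  : > "$out_file"   # truncate or create the output file
  for vcf in "$vcf_dir"/*.vcf.gz; do
    [ -e "$vcf" ] || continue   # skip the literal pattern when nothing matches
    # Resolve the directory to an absolute path, then re-append the file name.
    printf '%s/%s\n' "$(cd "$(dirname "$vcf")" && pwd)" "$(basename "$vcf")" >> "$out_file"
  done
}
```

Possible usage from the repository root: `make_sample_sheet .test_data .test_data/file_list.txt`, followed by `sbatch model.job` or the `nextflow run` command from the README.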