Update input format and sampleid parsing (#89)
* Update submodule pipeline-Nextflow-module

* add submodule pipeline-Nextflow-config

* add YAML input

* remove CSV input

* add BAM parsing, parse sample id from BAM; add retry

* add schema validation

* remove extra line

* comment out schema validation and retry for next PR

* patient id to sample id in input YAML

* update template config

* update input validation

* update pipeval version

* update channels for processes

* update bam parsing function

* update publish dir

* update store dir for input validation

* Update Testing Section of PR template

* remove commented lines

* Replace CSV with YAML input

* Update CHANGELOG.md

* optimize sample classification

* Update CHANGELOG.md

* remove sample_id as it is not required

* update error message to be generic across input BAMs

* Update README

* define sample with def

* Update error message

* Add note in README that Tumor BAM can also be run

---------

Co-authored-by: Mootor <mmootor@ip-0A125250.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
Co-authored-by: Mootor <mmootor@ip-0A125213.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
Co-authored-by: Mootor <mmootor@ip-0A125217.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
4 people authored Sep 22, 2023
1 parent 74a2ae2 commit 82bffab
Showing 18 changed files with 143 additions and 103 deletions.
16 changes: 8 additions & 8 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -7,23 +7,23 @@
## Testing Results

- Manta
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Manta-1.6.0/ -->
- Delly - gSV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->
- Delly - gCNV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->
- Delly - gSV & gCNV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->

3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "external/pipeline-Nextflow-module"]
path = external/pipeline-Nextflow-module
url = [email protected]:uclahs-cds/pipeline-Nextflow-module.git
[submodule "external/pipeline-Nextflow-config"]
path = external/pipeline-Nextflow-config
url = [email protected]:uclahs-cds/pipeline-Nextflow-config.git
13 changes: 11 additions & 2 deletions CHANGELOG.md
@@ -4,6 +4,15 @@ All notable changes to the call-gSV pipeline.
---

## [Unreleased]
### Changed
- Update README to reflect YAML support
- Parse sample ID from input BAM for output directory naming

### Added
- Add YAML input

### Removed
- Remove CSV input

---

@@ -26,7 +35,7 @@ All notable changes to the call-gSV pipeline.

### Changed
- Update README.md for `4.0.0`
- Move `save_intermediate_files` from `default.config` to `template.config` and set it to `false`
- Move `save_intermediate_files` from `default.config` to `template.config` and set it to `false`
- Update BCFtools 1.12 to 1.15.1
- Update Delly 1.0.3 to 1.1.3
- Update Delly 0.9.1 to 1.0.3
@@ -46,7 +55,7 @@ All notable changes to the call-gSV pipeline.
- Fix Issue #33: should pass the mappability_map file instead of the exclusion file to regenotype_gCNV_Delly

### Changed
- Change the input file schema by removing variant_type,reference_fasta,reference_fasta_index, put them into template.config.
- Change the input file schema by removing variant_type,reference_fasta,reference_fasta_index, put them into template.config.
- Change partition types from lowmem/midmem/execute to F2/F32/F72/M64.
- Standardize the output structure.
- Standardize the configuration structure.
55 changes: 31 additions & 24 deletions README.md
@@ -53,9 +53,9 @@ Pipelines should be run **WITH A SINGLE SAMPLE AT TIME**. Otherwise resource all

* Do not directly modify the source `template.config`, but rather you should copy it from the pipeline release folder to your project-specific folder and modify it there

3. Create the input CSV using the [template](input/call-gSV-input.csv).See [Input CSV](#Input-CSV) for detailed description of each column. All columns must exist and should be comma separated in order to run the pipeline successfully.
* Again, do not directly modify the source template CSV file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.
3. Create the input YAML using the [template](input/call-gSV-input.yaml). See [Input YAML](#Input-YAML) for a detailed description.

* Again, do not directly modify the source template YAML file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.

4. The pipeline can be executed locally using the command below:

Expand All @@ -64,14 +64,16 @@ nextflow run path/to/main.nf -config path/to/sample-specific.config
```

* For example, `path/to/main.nf` could be: `/hot/software/pipeline/pipeline-call-gSV/Nextflow/release/4.0.0/main.nf`
* `path/to/sample-specific.config` is the path to where you saved your project-specific copy of [template.config](config/template.config)
* `path/to/sample-specific.config` is the path to where you saved your project-specific copy of [template.config](config/template.config)
* `path/to/input.yaml` is the path to where you saved your sample-specific copy of [call-gSV-input.yaml](input/call-gSV-input.yaml)

To submit to UCLAHS-CDS's Azure cloud, use the submission script [here](https://github.com/uclahs-cds/tool-submit-nf) with the command below:

```bash
python path/to/submit_nextflow_pipeline.py \
--nextflow_script path/to/main.nf \
--nextflow_config path/to/sample-specific.config \
--nextflow_yaml path/to/input.yaml \
--pipeline_run_name <sample_name> \
--partition_type F16 \
--email <your UCLA email, [email protected]>
@@ -117,7 +119,7 @@ Currently the following filters are applied by Delly when calling SVs. Parameter

### 2. Calling Copy Number Variants

The second step of the pipeline identifies any found CNVs. To do this, Delly requires an aligned and sorted BAM file and BAM index as an input, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
The second step of the pipeline identifies any found CNVs. To do this, Delly requires an aligned and sorted BAM file and BAM index as an input, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
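For illustration only — not the pipeline's exact invocation — a Delly CNV call wiring together the inputs described above might look like the sketch below; the flag choices and all file names are assumptions:

```
# Hypothetical sketch of a Delly CNV call; file names are placeholders.
# -g: reference FASTA; -m: mappability map (GC/mappability correction);
# -l: SV BCF from the first step, used to refine breakpoints;
# -o: annotated CNV output as a single BCF.
delly cnv -g reference.fa -m mappability.fa.gz \
    -l delly.sv.bcf -o delly.cnv.bcf sample.sorted.bam
```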

Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample specific nextflow [config](config/template.config) file.
<br>
Expand All @@ -144,7 +146,7 @@ For Delly, VCF files are generated from the BCFs to run the vcf-validate command

### Regenotyping

The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.
The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.

### 1. Regenotyping Structural Variants

@@ -160,15 +162,21 @@ The second possible step of the regenotyping pipeline requires an aligned and so

## Inputs

### Input CSV

The input CSV should have each of the input fields listed below as separate columns, using the same order and comma as column separator. An example of the input CSV can be found [here](input/call-gSV-input.csv).
### Input YAML

| Field | Type | Description |
|:------|:-----|:------------|
| patient | string | The patient name to be passed to final BCF/VCF. No white space is allowed. |
| sample | string | The sample name to be passed to final BCF/VCF. No white space is allowed. |
| input_bam | path | Absolute path to the BAM file for the sample. |
| sample_id | string | Sample ID |
| normal | path | Set to absolute path to input BAM |

```
---
input:
BAM:
normal:
- "/path/to/input/BAM"
```
> Note: The pipeline is intended for germline samples. However, if, as an exceptional case, a tumor sample needs to be run with this pipeline, it can be done by specifying `tumor` instead of `normal` in the input YAML, with a single corresponding tumor BAM path.
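For that exceptional tumor-only case, the input YAML would look along these lines (a sketch; the path is a placeholder):

```
---
input:
  BAM:
    tumor:
      - "/absolute/path/to/tumor/BAM"
```
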
### Nextflow Config File Parameters

@@ -179,7 +187,6 @@ The input CSV should have each of the input fields listed below as separate colu
| `run_discovery` | yes | boolean | Specifies whether or not to run the "discovery" branch of the pipeline. Default value is `true`. (either `run_discovery` or `run_regenotyping` must be `true`) |
| `run_regenotyping` | yes | boolean | Specifies whether or not to run the "regenotyping" branch of the pipeline. Default value is `false`. (either `run_discovery` or `run_regenotyping` must be `true`) |
| `merged_sites` | yes | path | The path to the merged sites.bcf file. Must be populated if running the regenotyping branch. |
| `input_csv` | yes | string | Absolute path to the input CSV file for the pipeline. |
| `reference_fasta` | yes | path | Absolute path to the reference genome `FASTA` file. The reference genome is used by Delly for SV calling. |
| `exclusion_file` | yes | path | Absolute path to the delly reference genome `exclusion` file utilized to remove suggested regions for SV calling. On Slurm, an HG38 exclusion file is located at `/hot/ref/tool-specific-input/Delly/hg38/human.hg38.excl.tsv` |
| `mappability_map` | yes | path | Absolute path to the delly mappability map to support GC and mappability fragment correction in CNV calling |
@@ -198,16 +205,16 @@ An example of the NextFlow Input Parameters Config file can be found [here](conf

## Outputs

| Output | Output Type | Description |
|:-------|:---------|:------------|
| `.bcf` | final | Binary VCF output format with SVs if found. |
| `.vcf` | intermediate | VCF output format with SVs if found. If output by Manta, these VCFs will be compressed. |
| `.bcf.csi` | final | CSI-format index for BAM files. |
| `.validate.txt` | final | output file from vcf-validator. |
| `.stats.txt` | final | output file from RTG Tools. |
| `report.html`, `timeline.html` and `trace.txt` | log | A Nextflow report, timeline and trace files. |
| `*.log.command.*` | log | Process and sample specific logging files created by nextflow. |
| `*.sha512` | checksum| generates SHA-512 hash to validate file integrity. |
| Output | Description |
|:-------|:------------|
| `.bcf` | Binary VCF output format with SVs if found. |
| `.vcf` | VCF output format with SVs if found. If output by Manta, these VCFs will be compressed. |
| `.bcf.csi` | CSI-format index for BAM files. |
| `.validate.txt` | output file from vcf-validator. |
| `.stats.txt` | output file from RTG Tools. |
| `report.html`, `timeline.html` and `trace.txt` | A Nextflow report, timeline and trace files. |
| `*.log.command.*` | Process and sample specific logging files created by nextflow. |
| `*.sha512` | generates SHA-512 hash to validate file integrity. |
---

## Testing and Validation
@@ -268,7 +275,7 @@ Metrics below are based on the integration of Delly v1.13 in the `call-gSV` pipe
| SV breakends | 0 | 219 | 1124 | 0 | `.stats.txt` |
| Symbolic SVs | 2 | 1559 | 12500 | 11156 | `.stats.txt` |
| Same as reference | 1 | 263 | 4595 | 1471 | `.stats.txt` |
| Missing Genotype | 0 | 8 | 38 | 31 | `.stats.txt` |
| Missing Genotype | 0 | 8 | 38 | 31 | `.stats.txt` |
| Total Het/Hom ratio | (2/0) | 1.00 (843/845) | 2.37 (9580/4044) | 1.86 (7251/3905) | `.stats.txt` |
| Breakend Het/Hom ratio | (0/0) | 0.84 (59/70) | 13.41 (1046/78) | (0/0) | `.stats.txt` |
| Symbolic SV Het/Hom ratio | (2/0) | 1.01 (784/775) | 2.15 (8534/3966) | 1.86 (7251/3905) | `.stats.txt` |
2 changes: 1 addition & 1 deletion config/default.config
@@ -25,7 +25,7 @@ params {
bcftools_version = "1.15.1"
vcftools_version = "0.1.16"
rtgtools_version = "3.12"
pipeval_version = "3.0.0"
pipeval_version = "4.0.0-rc.2"

// Docker tool versions
docker_image_delly = "${-> params.docker_container_registry}/delly:${params.delly_version}"
67 changes: 43 additions & 24 deletions config/methods.config
@@ -1,3 +1,6 @@
import nextflow.util.SysHelper
includeConfig "../external/pipeline-Nextflow-config/config/bam/bam_parser.config"

methods {
check_permissions = { path ->
def filePath = new File(path)
@@ -15,13 +18,35 @@ methods {
}
}

set_log_output_dir = {
set_ids_from_bams = {
params.sample_to_process = [] as Set
params.input.BAM.each { k, v ->
v.each { bam_path ->
def bam_header = bam_parser.parse_bam_header(bam_path)
def sm_tags = bam_header['read_group'].collect{ it['SM'] }.unique()

if (sm_tags.size() != 1) {
throw new Exception("${bam_path} contains multiple samples! Please run pipeline with a single sample BAM.")
}
params.sample_to_process.add(['id': sm_tags[0], 'path': bam_path, 'sample_type': k])
}
}
}
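The single-sample check in `set_ids_from_bams` above can be sketched in Python (a hypothetical helper, not pipeline code; it assumes read groups arrive as dicts with an `SM` key, mirroring what the submodule's `bam_parser.parse_bam_header` returns in the diff):

```python
# Hypothetical mirror of the Groovy SM-tag check in set_ids_from_bams.
# Assumption: each read group is a dict with an 'SM' (sample) key.
def extract_sample_id(read_groups):
    """Return the single SM tag shared by all read groups, else raise."""
    sm_tags = {rg['SM'] for rg in read_groups}
    if len(sm_tags) != 1:
        raise ValueError(
            'BAM contains multiple samples; '
            'please run the pipeline with a single-sample BAM.'
        )
    return sm_tags.pop()
```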

set_output_dir = {
def sample = params.sample_to_process
.collect{ it.id }

if (sample.size() != 1) {
throw new Exception("${params.sample_to_process}\n\n Multiple BAMs found in the input! Please run pipeline one sample at a time.")
}

params.sample = sample[0]

def sample
params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

// assumes that project and samples name are in the pipeline.config
def reader = new FileReader(params.input_csv)
reader.splitEachLine(',') { parts -> [sample = parts[1]] }
set_log_output_dir = {

tz = TimeZone.getTimeZone('UTC')
def date = new Date().format("yyyyMMdd'T'HHmmss'Z'", tz)
@@ -38,18 +63,13 @@ methods {
params.disease = "${disease}"
}
else {
params.log_output_dir = "${params.output_dir}/${manifest.name}-${manifest.version}/${sample}/log-${manifest.name}-${manifest.version}-${date}"
params.log_output_dir = "${params.output_dir_base}/log-${manifest.name}-${manifest.version}-${date}"
params.disease = null
}

params.sample = "${sample}"
params.date = "${date}"
}

set_output_dir = {
params.output_dir = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
check_max = { obj, type ->
@@ -59,28 +79,28 @@ methods {
return params.max_memory as nextflow.util.MemoryUnit
else
return obj
}
}
catch (all) {
println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj"
return obj
}
}
}
else if (type == 'time') {
try {
if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
return params.max_time as nextflow.util.Duration
else
return obj
}
}
catch (all) {
println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj"
return obj
}
}
}
else if (type == 'cpus') {
try {
return Math.min(obj, params.max_cpus as int)
}
}
catch (all) {
println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
return obj
@@ -90,7 +110,7 @@ methods {

set_resources_allocation = {
// Function to ensure that resource requirements don't go beyond
// a maximum limit
// a maximum limit
node_cpus = params.max_cpus
node_memory_GB = params.max_memory.toGiga()
// Load base.config by default for all pipelines
@@ -177,7 +197,7 @@ methods {

timeline.enabled = true
timeline.file = "${params.log_output_dir}/nextflow-log/timeline.html"

report.enabled = true
report.file = "${params.log_output_dir}/nextflow-log/report.html"
}
@@ -194,15 +214,14 @@ methods {

// Set up env, timeline, trace, and report above.
setup = {
methods.set_log_output_dir()
methods.set_output_dir()
methods.check_permissions(params.log_output_dir)

methods.set_env()
methods.set_ids_from_bams()
methods.set_resources_allocation()
methods.set_output_dir()
methods.set_log_output_dir()
methods.check_permissions(params.log_output_dir)
methods.set_pipeline_logs()
methods.set_process()
methods.set_docker_sudo()
methods.set_pipeline_logs()
}
}

1 change: 0 additions & 1 deletion config/template.config
@@ -26,7 +26,6 @@ params {

save_intermediate_files = false

input_csv = "path/to/input/csv/"
output_dir = "where/to/save/outputs/"

reference_fasta = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta"
1 change: 1 addition & 0 deletions external/pipeline-Nextflow-config
2 changes: 0 additions & 2 deletions input/call-gSV-input.csv

This file was deleted.

5 changes: 5 additions & 0 deletions input/call-gSV-input.yaml
@@ -0,0 +1,5 @@
---
input:
BAM:
normal:
- "/absolute/path/to/input/BAM"