Update input format and sampleid parsing (#89)
* Update submodule pipeline-Nextflow-module

* add submodule pipeline-Nextflow-config

* add YAML input

* remove CSV input

* add BAM parsing, parse sample id from BAM; add retry

* add schema validation

* remove extra line

* comment out schema validation and retry for next PR

* patient id to sample id in input YAML

* update template config

* update input validation

* update pipeval version

* update channels for processes

* update bam parsing function

* update publish dir

* update store dir for input validation

* Update Testing Section of PR template

* remove commented lines

* Replace CSV with YAML input

* Update CHANGELOG.md

* optimize sample classification

* Update CHANGELOG.md

* remove sample_id as it is not required

* update error message to be generic across input BAMs

* Update README

* define sample with def

* Update error message

* Add note in README that Tumor BAM can also be run

---------

Co-authored-by: Mootor <mmootor@ip-0A125250.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
Co-authored-by: Mootor <mmootor@ip-0A125213.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
Co-authored-by: Mootor <mmootor@ip-0A125217.rhxrlfvjyzbupc03cc22jkch3c.xx.internal.cloudapp.net>
4 people authored Sep 22, 2023
1 parent 74a2ae2 commit 82bffab
Showing 18 changed files with 143 additions and 103 deletions.
16 changes: 8 additions & 8 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -7,23 +7,23 @@
## Testing Results

- Manta
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Manta-1.6.0/ -->
- Delly - gSV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->
- Delly - gCNV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->
- Delly - gSV & gCNV
- sample: <!-- e.g. A-mini TWGSAMIN000001-T001-S01-F, TWGSAMIN000001-T002-S02-F -->
- input csv: <!-- path/to/input/call-gSV-inputs.csv -->
- sample: <!-- e.g. A-mini TWGSAMIN000001-N001-S01-F -->
- input YAML: <!-- path/to/input/call-gSV-inputs.yaml -->
- config: <!-- path/to/config/nextflow-test-amini.config -->
- output: <!-- path/to/output/Delly-0.8.7/ -->

3 changes: 3 additions & 0 deletions .gitmodules
@@ -1,3 +1,6 @@
[submodule "external/pipeline-Nextflow-module"]
path = external/pipeline-Nextflow-module
url = [email protected]:uclahs-cds/pipeline-Nextflow-module.git
[submodule "external/pipeline-Nextflow-config"]
path = external/pipeline-Nextflow-config
url = [email protected]:uclahs-cds/pipeline-Nextflow-config.git
13 changes: 11 additions & 2 deletions CHANGELOG.md
@@ -4,6 +4,15 @@ All notable changes to the call-gSV pipeline.
---

## [Unreleased]
### Changed
- Update README to reflect YAML support
- Parse sample ID from input BAM for output directory naming

### Added
- Add YAML input

### Removed
- Remove CSV input

---

@@ -26,7 +35,7 @@ All notable changes to the call-gSV pipeline.

### Changed
- Update README.md for `4.0.0`
- Move `save_intermediate_files` from `default.config` to `template.config` and set it to `false`
- Move `save_intermediate_files` from `default.config` to `template.config` and set it to `false`
- Update BCFtools 1.12 to 1.15.1
- Update Delly 1.0.3 to 1.1.3
- Update Delly 0.9.1 to 1.0.3
@@ -46,7 +55,7 @@ All notable changes to the call-gSV pipeline.
- Fix Issue #33: should pass the mappability_map file instead of the exclusion file to regenotype_gCNV_Delly

### Changed
- Change the input file schema by removing variant_type,reference_fasta,reference_fasta_index, put them into template.config.
- Change the input file schema by removing variant_type,reference_fasta,reference_fasta_index, put them into template.config.
- Change partition types from lowmem/midmem/execute to F2/F32/F72/M64.
- Standardize the output structure.
- Standardize the configuration structure.
55 changes: 31 additions & 24 deletions README.md
@@ -53,9 +53,9 @@ Pipelines should be run **WITH A SINGLE SAMPLE AT TIME**. Otherwise resource all

* Do not directly modify the source `template.config`, but rather you should copy it from the pipeline release folder to your project-specific folder and modify it there

3. Create the input CSV using the [template](input/call-gSV-input.csv).See [Input CSV](#Input-CSV) for detailed description of each column. All columns must exist and should be comma separated in order to run the pipeline successfully.
* Again, do not directly modify the source template CSV file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.
3. Create the input YAML using the [template](input/call-gSV-input.yaml). See [Input YAML](#Input-YAML) for a detailed description.

* Again, do not directly modify the source template YAML file. Instead, copy it from the pipeline release folder to your project-specific folder and modify it there.

4. The pipeline can be executed locally using the command below:

Expand All @@ -64,14 +64,16 @@ nextflow run path/to/main.nf -config path/to/sample-specific.config
```

* For example, `path/to/main.nf` could be: `/hot/software/pipeline/pipeline-call-gSV/Nextflow/release/4.0.0/main.nf`
* `path/to/sample-specific.config` is the path to where you saved your project-specific copy of [template.config](config/template.config)
* `path/to/sample-specific.config` is the path to where you saved your project-specific copy of [template.config](config/template.config)
* `path/to/input.yaml` is the path to where you saved your sample-specific copy of [call-gSV-input.yaml](input/call-gSV-input.yaml)

To submit to UCLAHS-CDS's Azure cloud, use the submission script [here](https://github.com/uclahs-cds/tool-submit-nf) with the command below:

```bash
python path/to/submit_nextflow_pipeline.py \
--nextflow_script path/to/main.nf \
--nextflow_config path/to/sample-specific.config \
--nextflow_yaml path/to/input.yaml \
--pipeline_run_name <sample_name> \
--partition_type F16 \
--email <your UCLA email, [email protected]>
@@ -117,7 +119,7 @@ Currently the following filters are applied by Delly when calling SVs. Parameter

### 2. Calling Copy Number Variants

The second step of the pipeline identifies any found CNVs. To do this, Delly requires an aligned and sorted BAM file and BAM index as an input, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
The second step of the pipeline identifies any found CNVs. To do this, Delly requires an aligned and sorted BAM file and BAM index as an input, as well as the BCF output from the initial SV calling (to refine breakpoints) and a mappability map. Any CNVs identified are annotated and output as a single BCF file.
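For illustration only — not the pipeline's exact invocation — a Delly CNV call wiring together the inputs described above might look like the sketch below; the flag choices and all file names are assumptions:

```
# Hypothetical sketch of a Delly CNV call; file names are placeholders.
# -g: reference FASTA; -m: mappability map (GC/mappability correction);
# -l: SV BCF from the first step, used to refine breakpoints;
# -o: annotated CNV output as a single BCF.
delly cnv -g reference.fa -m mappability.fa.gz \
    -l delly.sv.bcf -o delly.cnv.bcf sample.sorted.bam
```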

Currently the following filters are applied by Delly when calling CNVs. Parameters with a "call-gSV default" can be updated in the sample specific nextflow [config](config/template.config) file.
<br>
Expand All @@ -144,7 +146,7 @@ For Delly, VCF files are generated from the BCFs to run the vcf-validate command

### Regenotyping

The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.
The "regenotyping" branch of the call-gSV pipeline allows you to regenotype previously identified SVs or CNVs using Delly.

### 1. Regenotyping Structural Variants

@@ -160,15 +162,21 @@ The second possible step of the regenotyping pipeline requires an aligned and so

## Inputs

### Input CSV

The input CSV should have each of the input fields listed below as separate columns, using the same order and comma as column separator. An example of the input CSV can be found [here](input/call-gSV-input.csv).
### Input YAML

| Field | Type | Description |
|:------|:-----|:------------|
| patient | string | The patient name to be passed to final BCF/VCF. No white space is allowed. |
| sample | string | The sample name to be passed to final BCF/VCF. No white space is allowed. |
| input_bam | path | Absolute path to the BAM file for the sample. |
| sample_id | string | Sample ID |
| normal | path | Set to absolute path to input BAM |

```
---
input:
BAM:
normal:
- "/path/to/input/BAM"
```
> Note: The pipeline is intended for germline samples. However, if, as an exceptional case, a tumor sample needs to be run with this pipeline, it can be done by specifying `tumor` instead of `normal` in the input YAML, with a single corresponding tumor BAM path.
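For that exceptional tumor-only case, the input YAML would look along these lines (a sketch; the path is a placeholder):

```
---
input:
  BAM:
    tumor:
      - "/absolute/path/to/tumor/BAM"
```
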
### Nextflow Config File Parameters

@@ -179,7 +187,6 @@ The input CSV should have each of the input fields listed below as separate colu
| `run_discovery` | yes | boolean | Specifies whether or not to run the "discovery" branch of the pipeline. Default value is `true`. (either `run_discovery` or `run_regenotyping` must be `true`) |
| `run_regenotyping` | yes | boolean | Specifies whether or not to run the "regenotyping" branch of the pipeline. Default value is `false`. (either `run_discovery` or `run_regenotyping` must be `true`) |
| `merged_sites` | yes | path | The path to the merged sites.bcf file. Must be populated if running the regenotyping branch. |
| `input_csv` | yes | string | Absolute path to the input CSV file for the pipeline. |
| `reference_fasta` | yes | path | Absolute path to the reference genome `FASTA` file. The reference genome is used by Delly for SV calling. |
| `exclusion_file` | yes | path | Absolute path to the delly reference genome `exclusion` file utilized to remove suggested regions for SV calling. On Slurm, an HG38 exclusion file is located at `/hot/ref/tool-specific-input/Delly/hg38/human.hg38.excl.tsv` |
| `mappability_map` | yes | path | Absolute path to the delly mappability map to support GC and mappability fragment correction in CNV calling |
@@ -198,16 +205,16 @@ An example of the NextFlow Input Parameters Config file can be found [here](conf

## Outputs

| Output | Output Type | Description |
|:-------|:---------|:------------|
| `.bcf` | final | Binary VCF output format with SVs if found. |
| `.vcf` | intermediate | VCF output format with SVs if found. If output by Manta, these VCFs will be compressed. |
| `.bcf.csi` | final | CSI-format index for BAM files. |
| `.validate.txt` | final | output file from vcf-validator. |
| `.stats.txt` | final | output file from RTG Tools. |
| `report.html`, `timeline.html` and `trace.txt` | log | A Nextflow report, timeline and trace files. |
| `*.log.command.*` | log | Process and sample specific logging files created by nextflow. |
| `*.sha512` | checksum| generates SHA-512 hash to validate file integrity. |
| Output | Description |
|:-------|:------------|
| `.bcf` | Binary VCF output format with SVs if found. |
| `.vcf` | VCF output format with SVs if found. If output by Manta, these VCFs will be compressed. |
| `.bcf.csi` | CSI-format index for BAM files. |
| `.validate.txt` | output file from vcf-validator. |
| `.stats.txt` | output file from RTG Tools. |
| `report.html`, `timeline.html` and `trace.txt` | A Nextflow report, timeline and trace files. |
| `*.log.command.*` | Process and sample specific logging files created by nextflow. |
| `*.sha512` | generates SHA-512 hash to validate file integrity. |
---

## Testing and Validation
@@ -268,7 +275,7 @@ Metrics below are based on the integration of Delly v1.13 in the `call-gSV` pipe
| SV breakends | 0 | 219 | 1124 | 0 | `.stats.txt` |
| Symbolic SVs | 2 | 1559 | 12500 | 11156 | `.stats.txt` |
| Same as reference | 1 | 263 | 4595 | 1471 | `.stats.txt` |
| Missing Genotype | 0 | 8 | 38 | 31 | `.stats.txt` |
| Missing Genotype | 0 | 8 | 38 | 31 | `.stats.txt` |
| Total Het/Hom ratio | (2/0) | 1.00 (843/845) | 2.37 (9580/4044) | 1.86 (7251/3905) | `.stats.txt` |
| Breakend Het/Hom ratio | (0/0) | 0.84 (59/70) | 13.41 (1046/78) | (0/0) | `.stats.txt` |
| Symbolic SV Het/Hom ratio | (2/0) | 1.01 (784/775) | 2.15 (8534/3966) | 1.86 (7251/3905) | `.stats.txt` |
2 changes: 1 addition & 1 deletion config/default.config
@@ -25,7 +25,7 @@ params {
bcftools_version = "1.15.1"
vcftools_version = "0.1.16"
rtgtools_version = "3.12"
pipeval_version = "3.0.0"
pipeval_version = "4.0.0-rc.2"

// Docker tool versions
docker_image_delly = "${-> params.docker_container_registry}/delly:${params.delly_version}"
67 changes: 43 additions & 24 deletions config/methods.config
@@ -1,3 +1,6 @@
import nextflow.util.SysHelper
includeConfig "../external/pipeline-Nextflow-config/config/bam/bam_parser.config"

methods {
check_permissions = { path ->
def filePath = new File(path)
@@ -15,13 +18,35 @@ methods {
}
}

set_log_output_dir = {
set_ids_from_bams = {
params.sample_to_process = [] as Set
params.input.BAM.each { k, v ->
v.each { bam_path ->
def bam_header = bam_parser.parse_bam_header(bam_path)
def sm_tags = bam_header['read_group'].collect{ it['SM'] }.unique()

if (sm_tags.size() != 1) {
throw new Exception("${bam_path} contains multiple samples! Please run pipeline with a single sample BAM.")
}
params.sample_to_process.add(['id': sm_tags[0], 'path': bam_path, 'sample_type': k])
}
}
}
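The single-sample check in `set_ids_from_bams` above can be sketched in Python (a hypothetical helper, not pipeline code; it assumes read groups arrive as dicts with an `SM` key, mirroring what the submodule's `bam_parser.parse_bam_header` returns in the diff):

```python
# Hypothetical mirror of the Groovy SM-tag check in set_ids_from_bams.
# Assumption: each read group is a dict with an 'SM' (sample) key.
def extract_sample_id(read_groups):
    """Return the single SM tag shared by all read groups, else raise."""
    sm_tags = {rg['SM'] for rg in read_groups}
    if len(sm_tags) != 1:
        raise ValueError(
            'BAM contains multiple samples; '
            'please run the pipeline with a single-sample BAM.'
        )
    return sm_tags.pop()
```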

set_output_dir = {
def sample = params.sample_to_process
.collect{ it.id }

if (sample.size() != 1) {
throw new Exception("${params.sample_to_process}\n\n Multiple BAMs found in the input! Please run pipeline one sample at a time.")
}

params.sample = sample[0]

def sample
params.output_dir_base = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

// assumes that project and samples name are in the pipeline.config
def reader = new FileReader(params.input_csv)
reader.splitEachLine(',') { parts -> [sample = parts[1]] }
set_log_output_dir = {

tz = TimeZone.getTimeZone('UTC')
def date = new Date().format("yyyyMMdd'T'HHmmss'Z'", tz)
@@ -38,18 +63,13 @@ methods {
params.disease = "${disease}"
}
else {
params.log_output_dir = "${params.output_dir}/${manifest.name}-${manifest.version}/${sample}/log-${manifest.name}-${manifest.version}-${date}"
params.log_output_dir = "${params.output_dir_base}/log-${manifest.name}-${manifest.version}-${date}"
params.disease = null
}

params.sample = "${sample}"
params.date = "${date}"
}

set_output_dir = {
params.output_dir = "${params.output_dir}/${manifest.name}-${manifest.version}/${params.sample}"
}

// Function to ensure that resource requirements don't go beyond
// a maximum limit
check_max = { obj, type ->
@@ -59,28 +79,28 @@ methods {
return params.max_memory as nextflow.util.MemoryUnit
else
return obj
}
}
catch (all) {
println " ### ERROR ### Max memory '${params.max_memory}' is not valid! Using default value: $obj"
return obj
}
}
}
else if (type == 'time') {
try {
if (obj.compareTo(params.max_time as nextflow.util.Duration) == 1)
return params.max_time as nextflow.util.Duration
else
return obj
}
}
catch (all) {
println " ### ERROR ### Max time '${params.max_time}' is not valid! Using default value: $obj"
return obj
}
}
}
else if (type == 'cpus') {
try {
return Math.min(obj, params.max_cpus as int)
}
}
catch (all) {
println " ### ERROR ### Max cpus '${params.max_cpus}' is not valid! Using default value: $obj"
return obj
@@ -90,7 +110,7 @@ methods {

set_resources_allocation = {
// Function to ensure that resource requirements don't go beyond
// a maximum limit
// a maximum limit
node_cpus = params.max_cpus
node_memory_GB = params.max_memory.toGiga()
// Load base.config by default for all pipelines
@@ -177,7 +197,7 @@ methods {

timeline.enabled = true
timeline.file = "${params.log_output_dir}/nextflow-log/timeline.html"

report.enabled = true
report.file = "${params.log_output_dir}/nextflow-log/report.html"
}
@@ -194,15 +214,14 @@ methods {

// Set up env, timeline, trace, and report above.
setup = {
methods.set_log_output_dir()
methods.set_output_dir()
methods.check_permissions(params.log_output_dir)

methods.set_env()
methods.set_ids_from_bams()
methods.set_resources_allocation()
methods.set_output_dir()
methods.set_log_output_dir()
methods.check_permissions(params.log_output_dir)
methods.set_pipeline_logs()
methods.set_process()
methods.set_docker_sudo()
methods.set_pipeline_logs()
}
}

1 change: 0 additions & 1 deletion config/template.config
@@ -26,7 +26,6 @@ params {

save_intermediate_files = false

input_csv = "path/to/input/csv/"
output_dir = "where/to/save/outputs/"

reference_fasta = "/hot/ref/reference/GRCh38-BI-20160721/Homo_sapiens_assembly38.fasta"
1 change: 1 addition & 0 deletions external/pipeline-Nextflow-config
2 changes: 0 additions & 2 deletions input/call-gSV-input.csv

This file was deleted.

5 changes: 5 additions & 0 deletions input/call-gSV-input.yaml
@@ -0,0 +1,5 @@
---
input:
BAM:
normal:
- "/absolute/path/to/input/BAM"