This pipeline takes BAMs and runs selected Quality Control (QC) steps. Available algorithms are currently SAMtools stats, Picard CollectWgsMetrics, FastQC, Qualimap bamqc, mosdepth coverage, and mosdepth quantize. Generally, either Qualimap bamqc or SAMtools stats and Picard CollectWgsMetrics should be run, not both. Qualimap bamqc uses a lot of memory and should not be run within uclahs-cds/metapipeline-DNA. Input can include any combination of tumor and normal BAMs from a single donor. Each will be processed independently. RNA-specific QC is not yet implemented but is expected soon.
- Update the params section of the .config file (Example config).
- Update the input YAML (Template YAMLs).
- See the submission script to submit your pipeline.
Currently supported Nextflow versions: v23.04.2
Each of the below algorithms, if selected, will run in parallel subject to available resources.
samtools stats collects basic statistics from BAM files including read counts, qualities, GC content, insert sizes, read lengths, proper pairing, and duplicated bases.
picard CollectWgsMetrics collects coverage metrics from WGS BAM files.
qualimap bamqc collects basic statistics and coverage metrics from BAM files. Example output: html, pdf. Qualimap bamqc uses a lot of memory and should not be run within uclahs-cds/metapipeline-DNA.
FastQC aims to provide a QC report that can spot problems originating either in the sequencer or in the starting library material.
mosdepth by windows provides fast BAM/CRAM depth calculation.
mosdepth quantize creates a BED file labeling regions within specified coverage thresholds, similar to GATK's CallableLoci tool.
Example:
    ---
    patient_id: 'patient_id'
    dataset_id: 'dataset_id'
    input:
      normal:
        - path: /absolute/path/to/normal.bam
          read_length: length
      tumor:
        - path: /absolute/path/to/tumor.bam
          read_length: length
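As a rough illustration of the layout above, the following sketch renders an input YAML from lists of BAMs using only the Python standard library. The function name `render_input_yaml`, the IDs, and the paths in the usage lines are hypothetical and not part of the pipeline. The checks that at least one BAM is given and that paths are absolute are assumptions based on the example, not documented pipeline behavior.

```python
def render_input_yaml(patient_id, dataset_id, normal=(), tumor=()):
    """Render an input YAML string; each sample is a (path, read_length) pair."""
    if not normal and not tumor:
        # Assumption: an input with no BAMs at all is not useful.
        raise ValueError("at least one normal or tumor BAM is required")
    lines = [
        "---",
        f"patient_id: '{patient_id}'",
        f"dataset_id: '{dataset_id}'",
        "input:",
    ]
    for sample_type, samples in (("normal", normal), ("tumor", tumor)):
        if not samples:
            continue  # any combination is allowed, so a type may be absent
        lines.append(f"  {sample_type}:")
        for path, read_length in samples:
            if not str(path).startswith("/"):
                # Assumption: the example shows absolute paths only.
                raise ValueError(f"path must be absolute: {path}")
            lines.append(f"    - path: {path}")
            lines.append(f"      read_length: {read_length}")
    return "\n".join(lines) + "\n"


# Hypothetical usage: a donor with a tumor BAM only.
tumor_only = render_input_yaml("p1", "d1", tumor=[("/data/t.bam", 150)])
print(tumor_only)
```

The rendered text can be written to a file and compared against the Template YAMLs before submitting.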
Field | Type | Required | Description |
---|---|---|---|
algorithm | list | no | List of tools to be run: ['fastqc', 'samtools_stats', 'collectwgsmetrics', 'mosdepth_coverage', 'mosdepth_quantize', 'qualimap_bamqc'], default = ['stats', 'collectwgsmetrics'] |
reference | path | yes/no | Reference fasta is required only for CollectWgsMetrics |
output_dir | path | yes | Not required if blcds_registered_dataset = true |
blcds_registered_dataset | boolean | no | Default is false. Only uclahs_cds users should change this. When true, BLCDS folder structure is used |
work_dir | path | no | Path of working directory for Nextflow. When included, Nextflow intermediate files and logs will be saved to this directory. With uclahs_cds = true, the default is /scratch and should only be changed for testing/development. Changing this directory to /hot or /tmp can lead to high server latency and potential disk space limitations, respectively. |
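The selection rules described above (Qualimap bamqc overlaps with SAMtools stats plus Picard CollectWgsMetrics, and the reference fasta is needed only for CollectWgsMetrics) can be sketched as a small pre-flight check. This is an illustrative helper, not part of the pipeline: `check_algorithms` is a hypothetical name, and the accepted tool names are copied from the list in the table. The table's stated default uses the shorter name 'stats', which this sketch would report as unknown.

```python
VALID_ALGORITHMS = {
    "fastqc",
    "samtools_stats",
    "collectwgsmetrics",
    "mosdepth_coverage",
    "mosdepth_quantize",
    "qualimap_bamqc",
}


def check_algorithms(algorithms, reference=None):
    """Return a list of human-readable problems found in an algorithm selection."""
    selected = set(algorithms)
    problems = []
    unknown = sorted(selected - VALID_ALGORITHMS)
    if unknown:
        problems.append(f"unknown algorithms: {unknown}")
    # The README recommends one approach or the other, not both.
    if "qualimap_bamqc" in selected and selected & {"samtools_stats", "collectwgsmetrics"}:
        problems.append("qualimap_bamqc overlaps with samtools_stats/collectwgsmetrics")
    # The reference fasta is required only for CollectWgsMetrics.
    if "collectwgsmetrics" in selected and not reference:
        problems.append("collectwgsmetrics requires a reference fasta")
    return problems


# Hypothetical usage with a made-up reference path.
print(check_algorithms(["qualimap_bamqc", "samtools_stats"], reference="/ref/genome.fa"))
```

An empty list means the selection is consistent with the recommendations; the overlap entry is advisory rather than a hard error.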
Field | Type | Required | Description |
---|---|---|---|
stats_max_rgs_per_sample | integer | no | If a sample has more than this number of readgroups, SAMtools stats will not run per readgroup analysis. Default = 20 |
stats_max_libs_per_sample | integer | no | If a sample has more than this number of libraries, SAMtools stats will not run per library analysis. Default = 20 |
stats_remove_duplicates | boolean | no | Ignore reads marked as duplicate. Default = false |
stats_additional_options | string | no | Any additional options recognized by samtools stats |
Field | Type | Required | Description |
---|---|---|---|
cwm_coverage_cap | integer | no | Cap coverage at this value. Default = 250 |
cwm_minimum_mapping_quality | integer | no | Ignore reads with mapping quality below this value. Default = 20 |
cwm_minimum_base_quality | integer | no | Ignore bases with quality below this value. Default = 20 |
cwm_use_fast_algorithm | boolean | no | If true, the fast algorithm is used |
cwm_additional_options | string | no | Any additional options recognized by CollectWgsMetrics |
Field | Type | Required | Description |
---|---|---|---|
fastqc_level | string | yes | 'readgroup', 'library' or 'sample' |
fastqc_additional_options | string | no | Any additional options recognized by FastQC |
Field | Type | Required | Description |
---|---|---|---|
bamqc_output_format | string | no | Choice of 'pdf' or 'html', default = 'pdf' |
bamqc_additional_options | string | no | Any additional options recognized by bamqc |
Field | Type | Required | Description |
---|---|---|---|
mosdepth_use_fast_algorithm | boolean | no | The fast algorithm ignores read pair overlaps and CIGARs. It should not be used on libraries with small insert sizes. Default = false |
mosdepth_per_base_output | boolean | no | Output coverage for every base. Default = true |
mosdepth_window_size | integer | no | Size for mosdepth windows coverage calculations |
mosdepth_additional_options | string | no | Any additional options recognized by mosdepth. --mapq 20 recommended |
Field | Type | Required | Description |
---|---|---|---|
mosdepth_quantize_cutoffs | string | no | Cutoffs for coverage regions. Default = 0:1:5:150 |
mosdepth_quantize_use_fast_algorithm | boolean | no | The fast algorithm ignores read pair overlaps and CIGARs. It should not be used on libraries with small insert sizes. Default = false |
mosdepth_q0_label | string | no | Label for the lowest coverage regions. Default = Q0 |
mosdepth_q1_label | string | no | Label for the next coverage regions. Default = Q1 |
mosdepth_q2_label | string | no | Label for the next coverage regions. Default = Q2 |
mosdepth_q3_label | string | no | Label for the highest coverage regions. Default = Q3 |
mosdepth_quantize_additional_options | string | no | Any additional options recognized by mosdepth. --mapq 20 recommended |
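To make the relationship between the cutoff string and the Q0 to Q3 labels concrete, here is a small sketch that turns a colon-separated cutoff string into labeled coverage bins. It follows mosdepth's documented `--quantize` convention, where consecutive cutoffs form half-open bins and a trailing colon adds an open-ended top bin. Whether this pipeline appends that trailing colon to the default value is not stated here, so the sketch handles both forms. The function name `quantize_bins` is hypothetical.

```python
def quantize_bins(cutoffs, labels=("Q0", "Q1", "Q2", "Q3")):
    """Map a colon-separated cutoff string to (label, low, high) bins.

    Bins are half-open: low <= coverage < high. A high of None means
    the bin has no upper bound.
    """
    parts = cutoffs.split(":")
    open_ended = parts[-1] == ""  # a trailing ':' means "this value and above"
    edges = [int(p) for p in parts if p != ""]
    if not edges:
        raise ValueError("no cutoffs given")
    bins = list(zip(edges[:-1], edges[1:]))
    if open_ended:
        bins.append((edges[-1], None))
    if len(bins) > len(labels):
        raise ValueError("more bins than labels")
    return [(label, low, high) for label, (low, high) in zip(labels, bins)]


# The default cutoffs as written in the table, then with a trailing colon.
closed_bins = quantize_bins("0:1:5:150")
open_bins = quantize_bins("0:1:5:150:")
print(closed_bins)
print(open_bins)
```

With a trailing colon, the four default labels line up with four bins; without it, only three bins are produced and the last label goes unused.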
To update the base resource (cpus or memory) allocations for processes, use the following structure. The default allocations can be found in the node-specific config files
    base_resource_update {
        memory = [
            [['process_name', 'process_name2'], <multiplier for resource>],
            [['process_name3', 'process_name4'], <different multiplier for resource>]
        ]
        cpus = [
            [['process_name', 'process_name2'], <multiplier for resource>],
            [['process_name3', 'process_name4'], <different multiplier for resource>]
        ]
    }
Note: Resource updates are applied in the order they are provided, so if a process is included twice in the memory list, it is updated twice, in the order given.
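The ordering rule in the note can be illustrated with a short simulation. This is only a sketch of the described behavior, not the pipeline's implementation: `apply_resource_updates` is a hypothetical name, the base memory values are made up, and raising on an unknown process name is a choice made for this sketch. Any capping at node limits that the pipeline may perform is not modeled.

```python
def apply_resource_updates(base, updates):
    """Apply [process_names, multiplier] updates, in order, to a {process: value} mapping."""
    result = dict(base)  # leave the caller's mapping untouched
    for names, multiplier in updates:
        if isinstance(names, str):
            names = [names]  # a single process name may be given without a list
        # An empty list selects every process, as in the "all processes" example.
        targets = list(names) if names else list(result)
        for name in targets:
            if name not in result:
                raise KeyError(f"unknown process: {name}")  # sketch choice
            result[name] = result[name] * multiplier
    return result


# Hypothetical base allocations in GB.
base_memory_gb = {"run_CollectWgsMetrics_Picard": 4, "run_statsSamples_SAMtools": 2}
updated = apply_resource_updates(
    base_memory_gb,
    [[[], 2], ["run_CollectWgsMetrics_Picard", 3]],
)
print(updated)
```

Because `run_CollectWgsMetrics_Picard` is matched by both entries, its multipliers compound (2, then 3), while `run_statsSamples_SAMtools` is only doubled.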
Examples:
- To double memory of all processes:

      base_resource_update {
          memory = [
              [[], 2]
          ]
      }
- To double memory for run_CollectWgsMetrics_Picard and triple memory for run_statsSamples_SAMtools and run_bamqc_Qualimap:

      base_resource_update {
          memory = [
              ['run_CollectWgsMetrics_Picard', 2],
              [['run_statsSamples_SAMtools', 'run_bamqc_Qualimap'], 3]
          ]
      }
- To double CPUs and memory for run_CollectWgsMetrics_Picard and double memory for run_statsSamples_SAMtools:

      base_resource_update {
          cpus = [
              ['run_CollectWgsMetrics_Picard', 2]
          ]
          memory = [
              [['run_CollectWgsMetrics_Picard', 'run_statsSamples_SAMtools'], 2]
          ]
      }
Output | Description |
---|---|
{SAMtools-version}_{dataset_id}_{sample_id}_stats.txt | SAMtools stats sample level results |
{SAMtools-version}_{dataset_id}_{sample_id}-{library_id}_stats.txt | SAMtools stats library level results |
{SAMtools-version}_{dataset_id}_{sample_id}-{library_id}-{rg_id}_stats.txt | SAMtools stats readgroup level results |
{Picard-version}_{dataset_id}_{sample_id}_wgs-metrics.txt | Picard CollectWgsMetrics results |
{Qualimap-version}_{dataset_id}_{sample_id}_stats | Directory of Qualimap results, including genome_results.txt and either .pdf or .html and supporting directories |
{FastQC-version}_{dataset_id}_{sample_id}_fastqc | Directory of sample level FastQC results |
{FastQC-version}_{dataset_id}_{sample_id}-{library_id}_fastqc | Directory of library level FastQC results |
{FastQC-version}_{dataset_id}_{sample_id}-{library_id}-{rg_id}_fastqc | Directory of readgroup level FastQC results |
{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.summary.txt | mosdepth coverage results by region with a final line for total |
{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.global.dist.txt | mosdepth coverage cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value |
{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.mosdepth.region.dist.txt | mosdepth coverage cumulative distribution indicating the proportion of the windows that were covered for at least a given coverage value |
{mosdepth-version}_{dataset_id}_{sample_id}-{window_size}.regions.bed.gz | mosdepth coverage bedfile giving coverage for each window |
{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.mosdepth.summary.txt | mosdepth quantize coverage results by region with a final line for total |
{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.mosdepth.global.dist.txt | mosdepth quantize cumulative distribution indicating the proportion of total bases that were covered for at least a given coverage value |
{mosdepth-version}_{dataset_id}_{sample_id}-quantize-{q0}-{q1}-{q2}-{q3}.quantized.bed.gz | mosdepth quantize bed file |
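For downstream scripts that need to locate results, the sample, library, and readgroup naming patterns in the table can be sketched as follows. The helper names are hypothetical, and the version strings in the usage lines (such as `SAMtools-1.18`) are made-up placeholders, since the exact form of `{SAMtools-version}` and `{FastQC-version}` is not specified here.

```python
def analysis_unit(sample_id, library_id=None, rg_id=None):
    """Join the IDs with '-' at sample, library, or readgroup level."""
    if rg_id is not None and library_id is None:
        # Readgroup-level names in the table always include the library ID.
        raise ValueError("rg_id requires library_id")
    return "-".join(p for p in (sample_id, library_id, rg_id) if p is not None)


def samtools_stats_name(version, dataset_id, sample_id, library_id=None, rg_id=None):
    """Expected SAMtools stats output file name for one analysis unit."""
    return f"{version}_{dataset_id}_{analysis_unit(sample_id, library_id, rg_id)}_stats.txt"


def fastqc_dir_name(version, dataset_id, sample_id, library_id=None, rg_id=None):
    """Expected FastQC output directory name for one analysis unit."""
    return f"{version}_{dataset_id}_{analysis_unit(sample_id, library_id, rg_id)}_fastqc"


# Hypothetical version strings and IDs.
sample_level = samtools_stats_name("SAMtools-1.18", "d1", "s1")
rg_level = fastqc_dir_name("FastQC-0.12.1", "d1", "s1", "libA", "rg1")
print(sample_level)
print(rg_level)
```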
- Issue Tracker to report errors and enhancement ideas.
- Discussions can take place in generate-SQC-BAM Discussions
- generate-SQC-BAM Pull Requests are also open for discussion
Please see list of Contributors at GitHub.
Generate-SQC-BAM is licensed under the GNU General Public License version 2. See the file LICENSE for the terms of the GNU GPL license.
Generate-SQC-BAM takes BAM files and generates per sample QC metrics
Copyright (C) 2024 University of California Los Angeles ("Boutros Lab") All rights reserved.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.