Commit df9cde5

Author: Manavalan Gajapathy
Merge branch 'qualimap_bamqc-cluster-config' into 'master'

User configurable hardware resources

Closes #48 and #47

See merge request center-for-computational-genomics-and-data-science/sciops/pipelines/quac!4
2 parents fdbbd79 + 614a593 commit df9cde5

10 files changed: +151 -42 lines changed

Changelog.md

Lines changed: 5 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -35,3 +35,8 @@ YYYY-MM-DD John Doe
3535

3636
* Bugfix: Fixes error when there is only one sample in input ped file (#34)
3737
* Adds system-testing for such only-one-sample-in-input setup (#35).
38+
39+
2022-04-07 Manavalan Gajapathy
40+
41+
* Previously hardcoded hardware resources for snakemake rules can now be supplied via `configs/workflow.yaml` (closes #48)
42+
* Modified multiqc conda env config to use explicit dependencies to get around installation issues (closes #47)

README.md

Lines changed: 16 additions & 10 deletions
@@ -185,18 +185,24 @@ snakemake rules.
185185

186186
### Set up workflow config file
187187

188-
QuaC requires a workflow config file in yaml format (`configs/workflow.yaml`), which provides filepaths to necessary
189-
dependencies required by certain QC tools. Their format should look like:
188+
QuaC requires a workflow config file in yaml format ([`configs/workflow.yaml`](./configs/workflow.yaml)), which provides filepaths to necessary
189+
dataset dependencies required by certain QC tools. In addition, hardware resources can be configured (refer to [`configs/workflow.yaml`](./configs/workflow.y) for more info). File format should look like:
190190

191191
```yaml
192-
ref: "path to ref genome path"
193-
somalier:
194-
sites: "path to somalier's site file"
195-
labels_1kg: "path to somalier's ancestry-labels-1kg file"
196-
somalier_1kg: "dirpath to somalier's 1kg-somalier files"
197-
verifyBamID:
198-
svd_dat_wgs: "path to WGS resources .dat files"
199-
svd_dat_exome: "path to exome resources .dat files"
192+
datasets:
193+
ref: "path to ref genome path"
194+
somalier:
195+
sites: "path to somalier's site file"
196+
labels_1kg: "path to somalier's ancestry-labels-1kg file"
197+
somalier_1kg: "dirpath to somalier's 1kg-somalier files"
198+
verifyBamID:
199+
svd_dat_wgs: "path to WGS resources .dat files"
200+
svd_dat_exome: "path to exome resources .dat files"
201+
202+
#### hardware resources ####
203+
resources:
204+
...
205+
...
200206
```
201207
202208
#### Prepare verifybamid datasets for exome analysis

configs/cluster_config.json

Lines changed: 1 addition & 1 deletion
@@ -20,4 +20,4 @@
2020
"multiqc_aggregation_all_samples": {
2121
"mem-per-cpu": "24G"
2222
}
23-
}
23+
}

configs/env/multiqc.yaml

Lines changed: 81 additions & 4 deletions
@@ -1,7 +1,84 @@
11
channels:
2-
- conda-forge
3-
- anaconda
42
- bioconda
3+
- conda-forge
4+
- defaults
55
dependencies:
6-
- python =3.6
7-
- multiqc=1.9
6+
- python=3.6.13
7+
- multiqc==1.9
8+
- networkx=2.5
9+
- numpy=1.19.5
10+
- _libgcc_mutex=0.1
11+
- _openmp_mutex=4.5
12+
- brotlipy=0.7.0
13+
- ca-certificates=2021.5.30
14+
- certifi=2021.5.30
15+
- cffi=1.14.6
16+
- chardet=4.0.0
17+
- charset-normalizer=2.0.0
18+
- click=8.0.1
19+
- coloredlogs=15.0.1
20+
- colormath=3.0.0
21+
- cryptography=3.4.7
22+
- cycler=0.10.0
23+
- decorator=5.0.9
24+
- freetype=2.10.4
25+
- future=0.18.2
26+
- humanfriendly=9.2
27+
- idna=3.1
28+
- importlib-metadata=4.6.3
29+
- jbig=2.1
30+
- jinja2=3.0.1
31+
- jpeg=9d
32+
- kiwisolver=1.3.1
33+
- lcms2=2.12
34+
- ld_impl_linux-64=2.36.1
35+
- lerc=2.2.1
36+
- libblas=3.9.0
37+
- libcblas=3.9.0
38+
- libdeflate=1.7
39+
- libffi=3.3
40+
- libgcc-ng=11.1.0
41+
- libgfortran-ng=11.1.0
42+
- libgfortran5=11.1.0
43+
- libgomp=11.1.0
44+
- liblapack=3.9.0
45+
- libopenblas=0.3.17
46+
- libpng=1.6.37
47+
- libstdcxx-ng=11.1.0
48+
- libtiff=4.3.0
49+
- libwebp-base=1.2.0
50+
- lz4-c=1.9.3
51+
- lzstring=1.0.4
52+
- markdown=3.3.4
53+
- markupsafe=2.0.1
54+
- matplotlib-base=3.3.4
55+
- ncurses=6.2
56+
- olefile=0.46
57+
- openjpeg=2.4.0
58+
- openssl=1.1.1k
59+
- pillow=8.3.1
60+
- pip=21.2.3
61+
- pycparser=2.20
62+
- pyopenssl=20.0.1
63+
- pyparsing=2.4.7
64+
- pysocks=1.7.1
65+
- python-dateutil=2.8.2
66+
- python_abi=3.6
67+
- pyyaml=5.4.1
68+
- readline=8.1
69+
- requests=2.26.0
70+
- setuptools=49.6.0
71+
- simplejson=3.8.1
72+
- six=1.16.0
73+
- spectra=0.0.11
74+
- sqlite=3.36.0
75+
- tk=8.6.10
76+
- tornado=6.1
77+
- typing_extensions=3.10.0.0
78+
- urllib3=1.26.6
79+
- wheel=0.37.0
80+
- xz=5.2.5
81+
- yaml=0.2.5
82+
- zipp=3.5.0
83+
- zlib=1.2.11
84+
- zstd=1.5.0

configs/workflow.yaml

Lines changed: 19 additions & 8 deletions
@@ -1,8 +1,19 @@
1-
ref: "/data/project/worthey_lab/datasets_central/human_reference_genome/processed/GRCh38/no_alt_rel20190408/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
2-
somalier:
3-
sites: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/sites/sites.hg38.vcf.gz"
4-
labels_1kg: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/ancestry/ancestry-labels-1kg.tsv"
5-
somalier_1kg: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/ancestry/1kg-somalier/"
6-
verifyBamID:
7-
svd_dat_wgs: "/data/project/worthey_lab/manual_datasets_central/verifyBamID/2.0.1/resource/wgs/1000g.phase3.100k.b38.vcf.gz.dat"
8-
svd_dat_exome: "/data/project/worthey_lab/manual_datasets_central/verifyBamID/2.0.1/resource/exome/chr_added/1000g.phase3.10k.b38.exome.vcf.gz.dat"
1+
datasets:
2+
ref: "/data/project/worthey_lab/datasets_central/human_reference_genome/processed/GRCh38/no_alt_rel20190408/GCA_000001405.15_GRCh38_no_alt_analysis_set.fna"
3+
somalier:
4+
sites: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/sites/sites.hg38.vcf.gz"
5+
labels_1kg: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/ancestry/ancestry-labels-1kg.tsv"
6+
somalier_1kg: "/data/project/worthey_lab/manual_datasets_central/somalier/0.2.13/ancestry/1kg-somalier/"
7+
verifyBamID:
8+
svd_dat_wgs: "/data/project/worthey_lab/manual_datasets_central/verifyBamID/2.0.1/resource/wgs/1000g.phase3.100k.b38.vcf.gz.dat"
9+
svd_dat_exome: "/data/project/worthey_lab/manual_datasets_central/verifyBamID/2.0.1/resource/exome/chr_added/1000g.phase3.10k.b38.exome.vcf.gz.dat"
10+
11+
#### hardware resources ####
12+
resources:
13+
qualimap_bamqc:
14+
no_cpu: 2
15+
mem_per_cpu: "24G"
16+
mosdepth_coverage:
17+
no_cpu: 4
18+
verifybamid:
19+
no_cpu: 4
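
The new `resources` block is read by the rules as plain nested lookups such as `config["resources"]["qualimap_bamqc"]["no_cpu"]`. The sketch below illustrates that lookup outside of Snakemake. The `config` dict mirrors what the YAML above would parse to, and `get_resource` is a hypothetical helper that is not part of QuaC; it only shows one way a missing key could fall back to a default instead of raising `KeyError`.

```python
# Dict equivalent of the `resources` block in configs/workflow.yaml
# (written out by hand here so the example has no YAML dependency).
config = {
    "resources": {
        "qualimap_bamqc": {"no_cpu": 2, "mem_per_cpu": "24G"},
        "mosdepth_coverage": {"no_cpu": 4},
        "verifybamid": {"no_cpu": 4},
    }
}


def get_resource(config, rule_name, key, default=None):
    """Hypothetical helper: look up one resource value for a rule.

    Returns `default` when the rule or the key is absent, whereas the
    direct indexing used in the rules would raise KeyError.
    """
    return config.get("resources", {}).get(rule_name, {}).get(key, default)


threads = get_resource(config, "qualimap_bamqc", "no_cpu")
java_mem = get_resource(config, "qualimap_bamqc", "mem_per_cpu")
# mosdepth_coverage defines no memory setting, so the default is used.
mosdepth_mem = get_resource(config, "mosdepth_coverage", "mem_per_cpu", default="4G")
```

The commit itself uses direct indexing, so every rule listed under `resources` must be present in the config file; the fallback shown here is an assumption for illustration only.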

src/run_quac.py

Lines changed: 6 additions & 6 deletions
@@ -43,17 +43,17 @@ def read_workflow_config(workflow_config_fpath):
4343
data = yaml.safe_load(fh)
4444

4545
mount_paths = set()
46-
46+
datasets = data["datasets"]
4747
# ref genome
48-
mount_paths.add(Path(data["ref"]).parent)
48+
mount_paths.add(Path(datasets["ref"]).parent)
4949

5050
# somalier resource files
51-
for resource in data["somalier"]:
52-
mount_paths.add(Path(data["somalier"][resource]).parent)
51+
for resource in datasets["somalier"]:
52+
mount_paths.add(Path(datasets["somalier"][resource]).parent)
5353

5454
# verifyBamID resource files
55-
for resource in data["verifyBamID"]:
56-
mount_paths.add(Path(data["verifyBamID"][resource]).parent)
55+
for resource in datasets["verifyBamID"]:
56+
mount_paths.add(Path(datasets["verifyBamID"][resource]).parent)
5757

5858
return mount_paths
5959
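
To make the effect of the `datasets` nesting concrete, here is a minimal, self-contained sketch of the mount-path collection performed in `read_workflow_config`. It takes an already-parsed dict rather than a file path, the function name `collect_mount_paths` is hypothetical, and the example paths are made up for illustration.

```python
from pathlib import Path


def collect_mount_paths(data):
    """Collect parent directories of all dataset files in a parsed config.

    Mirrors the logic in the diff above: everything is now read from the
    top-level "datasets" key instead of the root of the config.
    """
    datasets = data["datasets"]
    mount_paths = set()

    # ref genome
    mount_paths.add(Path(datasets["ref"]).parent)

    # somalier and verifyBamID resource files
    for tool in ("somalier", "verifyBamID"):
        for resource in datasets[tool]:
            mount_paths.add(Path(datasets[tool][resource]).parent)

    return mount_paths


# Made-up example config; real paths live in configs/workflow.yaml.
example_config = {
    "datasets": {
        "ref": "/refs/genome/ref.fna",
        "somalier": {
            "sites": "/refs/somalier/sites/sites.vcf.gz",
            "labels_1kg": "/refs/somalier/ancestry/labels.tsv",
            "somalier_1kg": "/refs/somalier/ancestry/1kg-somalier/",
        },
        "verifyBamID": {
            "svd_dat_wgs": "/refs/verifybamid/wgs/wgs.dat",
            "svd_dat_exome": "/refs/verifybamid/exome/exome.dat",
        },
    },
    "resources": {"verifybamid": {"no_cpu": 4}},
}

mount_paths = collect_mount_paths(example_config)
```

Because the result is a set, the two somalier ancestry entries contribute a single shared parent directory, and the `resources` block is ignored entirely.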

workflow/rules/aggregate_results.smk

Lines changed: 10 additions & 0 deletions
@@ -50,6 +50,9 @@ rule multiqc_by_sample_initial_pass:
5050
# multiqc uses fastq's filenames to identify sample names. Rename them to in-house names,
5151
# using custom rename config file
5252
extra=lambda wildcards, input: f"--config {input.multiqc_config} --sample-names {input.rename_config}",
53+
conda:
54+
### see issue #47 on why local conda env is used to sidestep snakemake-wrapper's ###
55+
str(WORKFLOW_PATH / "configs/env/multiqc.yaml")
5356
wrapper:
5457
"0.64.0/bio/multiqc"
5558

@@ -133,10 +136,14 @@ rule multiqc_by_sample_final_pass:
133136
# multiqc uses fastq's filenames to identify sample names. Rename them to in-house names,
134137
# using custom rename config file
135138
extra=lambda wildcards, input: f"--config {input.multiqc_config} --sample-names {input.rename_config}",
139+
conda:
140+
### see issue #47 on why local conda env is used to sidestep snakemake-wrapper's ###
141+
str(WORKFLOW_PATH / "configs/env/multiqc.yaml")
136142
wrapper:
137143
"0.64.0/bio/multiqc"
138144

139145

146+
140147
########################## Multi-sample QC aggregation ##########################
141148
localrules:
142149
aggregate_sample_rename_configs,
@@ -192,5 +199,8 @@ rule multiqc_aggregation_all_samples:
192199
--sample-names {input.rename_config} \
193200
--cl_config "max_table_rows: 2000"'
194201
),
202+
conda:
203+
### see issue #47 on why local conda env is used to sidestep snakemake-wrapper's ###
204+
str(WORKFLOW_PATH / "configs/env/multiqc.yaml")
195205
wrapper:
196206
"0.64.0/bio/multiqc"

workflow/rules/coverage_analysis.smk

Lines changed: 5 additions & 5 deletions
@@ -24,11 +24,11 @@ rule qualimap_bamqc:
2424
"stats bam using qualimap. Sample: {wildcards.sample}"
2525
conda:
2626
str(WORKFLOW_PATH / "configs/env/qualimap.yaml")
27-
threads: 2
27+
threads: config["resources"]["qualimap_bamqc"]["no_cpu"]
2828
params:
2929
outdir=lambda wildcards, output: str(Path(output["html_report"]).parent),
3030
capture_bed=lambda wildcards, input: f"--feature-file {input.target_regions}" if input.target_regions else "",
31-
java_mem="24G",
31+
java_mem=config["resources"]["qualimap_bamqc"]["mem_per_cpu"],
3232
shell:
3333
r"""
3434
unset DISPLAY
@@ -49,7 +49,7 @@ rule picard_collect_multiple_metrics:
4949
input:
5050
bam=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam",
5151
index=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam.bai",
52-
ref=config["ref"],
52+
ref=config["datasets"]["ref"],
5353
output:
5454
multiext(
5555
str(OUT_DIR / "{sample}" / "qc" / "picard-stats" / "{sample}"),
@@ -68,7 +68,7 @@ rule picard_collect_wgs_metrics:
6868
input:
6969
bam=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam",
7070
index=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam.bai",
71-
ref=config["ref"],
71+
ref=config["datasets"]["ref"],
7272
output:
7373
OUT_DIR / "{sample}" / "qc" / "picard-stats" / "{sample}.collect_wgs_metrics",
7474
message:
@@ -97,7 +97,7 @@ rule mosdepth_coverage:
9797
"Running mosdepth for coverage. Sample: {wildcards.sample}"
9898
conda:
9999
str(WORKFLOW_PATH / "configs/env/mosdepth.yaml")
100-
threads: 4
100+
threads: config["resources"]["mosdepth_coverage"]["no_cpu"]
101101
params:
102102
out_prefix=lambda wildcards, output: output["summary"].replace(".mosdepth.summary.txt", ""),
103103
capture_bed=lambda wildcards, input: f"--by {input.target_regions}" if input.target_regions else "",

workflow/rules/relatedness_ancestry.smk

Lines changed: 4 additions & 4 deletions
@@ -2,8 +2,8 @@ rule somalier_extract:
22
input:
33
bam=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam",
44
bam_index=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam.bai",
5-
sites=config["somalier"]["sites"],
6-
ref_genome=config["ref"],
5+
sites=config["datasets"]["somalier"]["sites"],
6+
ref_genome=config["datasets"]["ref"],
77
output:
88
protected(OUT_DIR / "project_level_qc" / "somalier" / "extract" / "{sample}.somalier"),
99
message:
@@ -55,8 +55,8 @@ rule somalier_relate:
5555
rule somalier_ancestry:
5656
input:
5757
extracted=expand(OUT_DIR / "project_level_qc" / "somalier" / "extract" / "{sample}.somalier", sample=SAMPLES),
58-
labels_1kg=config["somalier"]["labels_1kg"],
59-
somalier_1kg_directory=config["somalier"]["somalier_1kg"],
58+
labels_1kg=config["datasets"]["somalier"]["labels_1kg"],
59+
somalier_1kg_directory=config["datasets"]["somalier"]["somalier_1kg"],
6060
output:
6161
out=protected(
6262
expand(

workflow/rules/within_species_contamintation.smk

Lines changed: 4 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,15 +1,15 @@
11
def get_svd(wildcards):
22
if EXOME_MODE:
3-
return expand(f"{config['verifyBamID']['svd_dat_exome']}.{{ext}}", ext=["bed", "mu", "UD"])
3+
return expand(f"{config['datasets']['verifyBamID']['svd_dat_exome']}.{{ext}}", ext=["bed", "mu", "UD"])
44
else:
5-
return expand(f"{config['verifyBamID']['svd_dat_wgs']}.{{ext}}", ext=["bed", "mu", "UD"])
5+
return expand(f"{config['datasets']['verifyBamID']['svd_dat_wgs']}.{{ext}}", ext=["bed", "mu", "UD"])
66

77

88
rule verifybamid:
99
input:
1010
bam=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam",
1111
bam_index=PROJECT_PATH / "{sample}" / "bam" / "{sample}.bam.bai",
12-
ref_genome=config["ref"],
12+
ref_genome=config["datasets"]["ref"],
1313
svd=get_svd,
1414
output:
1515
ancestry=protected(OUT_DIR / "{sample}" / "qc" / "verifyBamID" / "{sample}.Ancestry"),
@@ -22,7 +22,7 @@ rule verifybamid:
2222
svd_prefix=lambda wildcards, input: input["svd"][0].replace(Path(input["svd"][0]).suffix, ""),
2323
out_prefix=lambda wildcards, output: output["ancestry"].replace(".Ancestry", ""),
2424
sanity_check="--DisableSanityCheck" if is_testing_mode() else "",
25-
threads: 4
25+
threads: config["resources"]["verifybamid"]["no_cpu"]
2626
shell:
2727
r"""
2828
verifybamid2 {params.sanity_check} \
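
The `get_svd` change in this file can be illustrated without Snakemake. The sketch below is a plain-Python approximation of what the `expand(...)` call produces for the three extensions; `svd_files` is a hypothetical name, `exome_mode` stands in for the workflow's `EXOME_MODE` global, and the paths are made up.

```python
def svd_files(config, exome_mode):
    """Return the three SVD resource files for the chosen mode.

    Approximates expand(f"{prefix}.{{ext}}", ext=["bed", "mu", "UD"])
    with a list comprehension, reading the prefix from the new
    config["datasets"]["verifyBamID"] location.
    """
    key = "svd_dat_exome" if exome_mode else "svd_dat_wgs"
    prefix = config["datasets"]["verifyBamID"][key]
    return [f"{prefix}.{ext}" for ext in ("bed", "mu", "UD")]


# Made-up example config for illustration.
example_config = {
    "datasets": {
        "verifyBamID": {
            "svd_dat_wgs": "/refs/wgs/panel.dat",
            "svd_dat_exome": "/refs/exome/panel.dat",
        }
    }
}

wgs_inputs = svd_files(example_config, exome_mode=False)
exome_inputs = svd_files(example_config, exome_mode=True)
```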
