Workflows for (a) haplotype calling, (b) joint genotyping and (c) variant filtering and imputation have been developed with Snakemake and bash. The workflow for each step is named accordingly.
Snakemake pipelines are formed of three files:

- Snakefile - Python-based file with the core instructions, structured in concatenated modules/rules
- Config file - YAML file that contains the technical details of the workflow (e.g. input/output files, wildcards, software path and version)
- Cluster file - JSON file including the cluster details for each rule (e.g. memory and cores requested, log location)
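As a minimal sketch of such a config file, assuming illustrative sample names, wildcard values and paths (none of these values are taken from the actual workflow):

```yaml
# haplotype_caller_config.yaml - illustrative values only
samples: ["sample_A", "sample_B"]   # Python list of sample names
assemblies: ["UCD", "Angus"]        # assembly wildcard values
chromosomes: [1, 2, 3]              # chromosome wildcard values
gatk_path: "/path/to/gatk-4.x"      # software path and version
output_dir: "results/haplotype_caller"
```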
Bash pipelines are formed of two files:

- Core bash script
- Bash submission script - complementary details for the core bash script (similar to the Snakemake config and cluster files)
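A submission script for a bash pipeline could look roughly like the following sketch. The script name, resource requests and chromosome range are illustrative assumptions, not taken from the repository; the `bsub` command is echoed rather than executed so the generated calls can be inspected first.

```shell
#!/bin/bash
# Illustrative bash submission script (names and resources are assumptions).
CORE_SCRIPT="core_pipeline.sh"   # hypothetical core bash script
LOGDIR="logs"
NCORE=4                          # cores per job
MEM=8000                         # memory (MB) per job
WALLTIME="24:00"

mkdir -p "${LOGDIR}"
for chr in $(seq 1 3); do        # chromosome wildcard values
    cmd="bsub -J job_chr${chr} -n ${NCORE} -W ${WALLTIME} -oo ${LOGDIR}/chr${chr}.log -R \"rusage[mem=${MEM}]\" bash ${CORE_SCRIPT} ${chr}"
    echo "${cmd}"                # replace echo with eval to actually submit
done
```

Keeping the resource values at the top of the script plays the same role as the Snakemake config and cluster files: one place to edit per run.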
Snakemake workflows are submitted using the following bash script:

```bash
#!/bin/bash
module load python_gpu/3.7.4
snakemake --jobs 500 -rp --latency-wait 40 --keep-going --rerun-incomplete --cluster-config cluster.json --cluster "bsub -J {cluster.jobname} -n {cluster.ncore} -W {cluster.jobtime} -oo {cluster.logi} -R \"rusage[mem={cluster.memo}]\""
```
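The `{cluster.*}` placeholders in the `--cluster` string are resolved from `cluster.json`, where Snakemake looks up the entry for each rule and falls back to `__default__`. A minimal sketch, with illustrative resource values:

```json
{
    "__default__": {
        "jobname": "{rule}",
        "ncore": 4,
        "jobtime": "24:00",
        "memo": 8000,
        "logi": "log_folder/{rule}.log"
    }
}
```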
The bash workflow is submitted via the bash submission script.

- Input files need to be named following the wildcard patterns (`UCD`/`Angus` and chromosome numbers in our case)
- Sample names are provided as a Python list in the `haplotype_caller_config.yaml` file - an example is provided
- Log files can only be generated in the folders specified in `*cluster.json` if the relevant folders have been created within the `log_folder`
- PDF files with the Snakemake graph (DAG) can be created as follows:
```bash
#!/bin/bash
module load python_gpu/3.6.4
module load gcc/4.8.5 graphviz/2.40.1
name="workflow_name"
snakemake --forceall --dag | dot -Tpdf > ${name}_dag.pdf
```
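The wildcard naming convention above maps onto rules roughly like the following sketch. The rule name, file paths and tool invocation are illustrative assumptions, not taken from the actual Snakefile:

```python
# Illustrative Snakemake rule: the {assembly} and {chr} wildcards must match
# the input file naming pattern (e.g. UCD_1.bam, Angus_29.bam).
rule haplotype_caller:
    input:
        bam = "bams/{assembly}_{chr}.bam"
    output:
        gvcf = "gvcfs/{assembly}_{chr}.g.vcf.gz"
    shell:
        "gatk HaplotypeCaller -I {input.bam} -O {output.gvcf}"
```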