Snakemake rules

A Snakemake rule is a step of the pipeline that can be executed, depending on what the desired outputs of the pipeline are.

This page includes some additional context about the Snakemake rules defined in Snakefile.py, beyond what is already available in the comments. If applicable, links to external documentation for bioinformatics tools are included in the rule descriptions.

Rule graph

Here is a graph summarizing how the Snakemake rules relate to each other in the workflow:

Rule all

`rule_all`

Purpose: This is a special Snakemake rule where the input is the list of output files you want to have at the end of the workflow.

External documentation: snakemake

Data processing

`make_data_links`

Purpose: Makes symbolic links to data files

Notes:

The python functions called by this rule (and defined in scripts/gus_helper_functions.py) only handle a few cases of how the FileName (in samples.csv) relates to the precise names for forward and reverse raw read files.

External documentation: none

`cutadapt`

Purpose: Removes adapters from reads with cutadapt

Cutadapt flags:

-a CTGTCTCTTAT will need to change if you have a different adapter sequence.

External documentation: cutadapt

`sickle2050`

Purpose: Trims reads based on quality with sickle

Sickle flags:

pe - paired-end reads
-t sanger - specify type of quality value (background on why this flag is sanger not illumina)
-q 20 - quality threshold
-l 20 - discard trimmed reads smaller than this length
-x - do not trim from 5’ end (trim from 3’ end only)
-n - truncate N (and everything after it)

External documentation: sickle

Mapping step

`refGenome_index`

Purpose: Indexes reference genome for bowtie2

External documentation: bowtie2

`bowtie2`

Purpose: Aligns reads to the reference genome with bowtie2

Bowtie2 flags:

-X 2000 - minimum fragment length for valid paired-end alignment
--no-mixed - no alignments for individual mates
--dovetail - allow mates to extend past each other
--no-unal - suppress SAM records for reads that failed to align
Default alignment parameters are available in Bowtie2 documentation

External documentation: bowtie2

`bowtie2qc`

Purpose: Runs a QC script to summarize results of bowtie2 mapping

External documentation: none

`sam2bam`

Purpose: Compresses SAM file into BAM file (and removes duplicate reads)

Notes:

Executed in a Snakemake shadow directory to avoid leftover temp files or truncated files if job is aborted unexpectedly

External documentation: samtools

`sam2bam_cleanup`

Purpose: Deletes SAM file once BAM file is created (SAM files are very large)

Notes:

Snakemake rule priority set high to get rid of large SAM files as fast as possible

External documentation: none

`samtools_idx`

Purpose: Indexes reference genome for samtools

External documentation: samtools

`mpileup2vcf`

Purpose: Processes BAM file into VCF files

Notes:

Executed in a Snakemake shadow directory to avoid leftover temp files or truncated files if job is aborted unexpectedly

Samtools flags:

samtools mpileup - produces "pileup" textual format from an alignment
- -q30 - minimum mapping quality for an alignment to be used
- -x - turns off read-pair overlap detection and removal
- -s - record mapping quality in output file
- -O - output base positions on reads in orientation listed in the SAM file (left to right)
- -d3000 - maximum number of reads at a given position to include (skips reads over this coverage threshold)
- -t SP - for Phred-scaled strand bias P-value
bcftools call - SNP calling
- -c - use consensus caller method
- -Oz - output compressed VCF
bcftools view
- -Oz - output compressed VCF
- -v snps - variant SNVs only
- -q .75 - minimum allele frequency of sites to be printed
tabix - indexes position sorted files in TAB-delimited formats
- -p vcf - input format for indexing

External documentation: samtools

`vcf2quals`

Purpose: Parses VCF with python script

External documentation: none

`pileup2diversity_matrix`

Purpose: Parses pileup with python script

External documentation: none

Case step

`variants2positions`

Purpose: Generates a list of candidate SNV positions for a given sample

External documentation: none

`combine_positions_prep`

Purpose: Creates a list of files with candidate SNV positions from each sample

External documentation: none

`candidate_mutation_table_prep`

Purpose: Build input for candidate_mutation_table

External documentation: none

`combine_positions`

Purpose: Generates a list of candidate SNV positions based on candidate SNV positions across ingroup samples

Notes:

Ignores variant positions in samples marked as outgroups

External documentation: none

`candidate_mutation_table`

Purpose: Builds candidate mutation table (stats across candidate SNV positions)

Notes:

Option to build raw coverage matrix and normalized coverage matrix
See this page for a description of data objects generated

External documentation: none

Assembly (alternative workflow)

The assembly workflow generates annotated assemblies for each sample. Below is a description of snakemake rules specific to the assembly step (excluding the data processing steps already described above).

Rule graph

`spades`

Purpose: Assemble a genome from reads from a given sample using SPAdes

SPAdes flags:

-m 500 - memory limit in GB
-k 21,33,55,77 - k-mer lengths
--phred-offset 33 - PHRED quality offset for the input reads
--careful - tries to reduce the number of mismatches and short indels

External documentation: SPAdes

`prokka`

Purpose: Annotate assembly using prokka

Prokka flags:

--compliant - force Genbank/ENA/DDJB compliance
--force - force overwriting existing output folder

External documentation: prokka

`build_annotation_orthologs_input`

Purpose: Get two-column (caldeID,path2faa) input file for ortholog_inference script

External documentation: none

`infer_orthologs`

Purpose: Infer ortholog info for each identified gene (based on AA sequence) across all clades using CD-HIT

CD-HIT flags:

percent_identity="0.9" - percent identity threshold for clustering AA sequences

External documentation: CD-HIT

Bracken (alternative workflow)

The bracken workflow generates estimates of taxonomic abundance for each sample. Below is a description of snakemake rules specific to the assembly step (excluding the data processing steps already described above).

Rule graph

`FQ2FA`

Purpose: Turn fastq files into fasta files

External documentation: none

`kraken2`

Purpose: Run kraken (on forward read file only)

Notes:

The kraken database should be updated occasionally.

External documentation: Kraken2

`bracken`

Purpose: Run bracken

External documentation: Bracken

Files

readme_snake_rules.md

Latest commit

History

readme_snake_rules.md

File metadata and controls

Snakemake rules

Rule graph

Rule all

rule_all

Data processing

make_data_links

cutadapt

sickle2050

Mapping step

refGenome_index

bowtie2

bowtie2qc

sam2bam

sam2bam_cleanup

samtools_idx

mpileup2vcf

vcf2quals

pileup2diversity_matrix

Case step

variants2positions

combine_positions_prep

candidate_mutation_table_prep

combine_positions

candidate_mutation_table

Assembly (alternative workflow)

Rule graph

spades

prokka

build_annotation_orthologs_input

infer_orthologs

Bracken (alternative workflow)

Rule graph

FQ2FA

kraken2

bracken

Table of contents

`rule_all`

`make_data_links`

`cutadapt`

`sickle2050`

`refGenome_index`

`bowtie2`

`bowtie2qc`

`sam2bam`

`sam2bam_cleanup`

`samtools_idx`

`mpileup2vcf`

`vcf2quals`

`pileup2diversity_matrix`

`variants2positions`

`combine_positions_prep`

`candidate_mutation_table_prep`

`combine_positions`

`candidate_mutation_table`

`spades`

`prokka`

`build_annotation_orthologs_input`

`infer_orthologs`

`FQ2FA`

`kraken2`

`bracken`