RNAseq-pipeline-Bash-Snakemake-Nextflow

This bioinformatics pipeline performs a basic RNA-seq analysis. We start by downloading the FASTQ files and end with a list of differentially expressed genes (DEGs), followed by some downstream analysis. I'll be using three different methods to do this: Bash, Snakemake, and Nextflow.

In every method, the fundamental processing steps are the same; what differs is how each tool carries them out. Each method has its own strengths, which we will experience first-hand and document.

Objectives

This is purely a learning project to understand the workflow differences between Bash, Snakemake, and Nextflow. We are going to evaluate the usability and reproducibility of these workflows. Ideally, I would like to benchmark the runtimes as well, but this project was run on a modest laptop with 16 GB of RAM. That is not a realistic environment for evaluating performance, since I can't make use of parallel computing without the risk of melting my laptop. So we are going to focus on these criteria:

- Ease of use for new users (learning curve)
- Reproducibility (Can we trace the output?)
- Resumability (Can the pipeline recover from a crash without doing everything again?)
- Maintainability (Is it easy to update one step in a process without crashing the whole damn thing?)
- Container support (How easy is it to set up and use Docker/Singularity?)

Dataset

This pipeline analyzes RNA-seq data from the GEO dataset GSE37211, which investigates estrogen receptor signaling in parathyroid adenoma cells.

Organism: Homo sapiens
Platform: Illumina HiSeq 2000 (paired-end, 100 bp)
Samples: 23 total, across 6 conditions:
- Control (24h, 48h)
- DPN (24h, 48h)
- OHT (24h, 48h)
Experimental Focus: Transcriptomic response to DPN and Tamoxifen (OHT), targeting estrogen receptor beta.
Source: Haglund et al., J Clin Endocrinol Metab, 2012 (PubMed)

Why this dataset? It was one of the complete datasets under 50 GB suggested during one of my master's courses at NTNU, Bioinformatic Methods for Next Generation Sequencing Analysis.

For personal learning projects, people usually pick only a subset of a large sequencing dataset. That is fine for learning purposes, but it won't let you replicate the paper's figures or produce any meaningful results. Depending on your hardware (I used a laptop with 16 GB of RAM), the full dataset should not pose too big a problem as long as you run it sample by sample.
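As a rough illustration of what "sample by sample" can look like in Bash, here is a minimal sketch. The function names, the placeholder sample IDs, and the idea of deleting each sample's bulky intermediates before starting the next one are assumptions for illustration, not taken from the actual scripts in this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder for the real per-sample work (download, QC, alignment, ...).
# It only creates an empty file so the sketch runs without any tools installed.
process_one_sample() {
    local sample="$1" workdir="$2"
    touch "${workdir}/${sample}.fastq"
    echo "processed ${sample}"
}

# Handle samples strictly one at a time and remove each sample's large
# intermediate file before moving on, so peak disk use stays at roughly
# the footprint of a single sample.
run_sequentially() {
    local workdir="$1"; shift
    local sample
    for sample in "$@"; do
        process_one_sample "${sample}" "${workdir}"
        rm -f "${workdir}/${sample}.fastq"
    done
}

# Hypothetical usage with placeholder sample names.
workdir="$(mktemp -d)"
run_sequentially "${workdir}" SAMPLE_A SAMPLE_B
rm -rf "${workdir}"
```

In a real run, `process_one_sample` would call the actual tools, and only the small per-sample outputs (for example the sorted BAM or the counts) would be kept.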

Pipeline overview

All three implementations (Bash, Snakemake, and Nextflow) follow the same biological logic:

1. Download raw FASTQ files
   Use fasterq-dump to retrieve sequencing data from NCBI SRA.
2. Quality control
   Assess raw read quality using FastQC.
3. Trimming (if needed)
   Perform adapter and quality trimming using tools like Trimmomatic.
4. Alignment
   Align reads to the reference genome using STAR.
5. Read counting
   Quantify gene-level expression using featureCounts.
6. Differential expression analysis
   Use DESeq2 (in R) to identify significantly differentially expressed genes between conditions.
7. Optional downstream analysis
   Includes volcano plots and heatmaps to visualize sample variation and DEG patterns.
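The command-line steps above (download through read counting) can be sketched in Bash roughly as follows. This is a dry-run sketch under stated assumptions, not the repository's actual script: the directory layout (`raw`, `qc`, `trimmed`, `aligned`, `counts`), the `THREADS`, `GENOME_DIR`, `GTF`, and `ADAPTERS` defaults, the trimming thresholds, and the placeholder accession are all hypothetical. It also assumes a `trimmomatic` wrapper on the `PATH` (as installed by conda) and Subread 2.0.2 or newer, where `--countReadPairs` is needed to count fragments in paired-end data. The DESeq2 step runs in R and is left out.

```shell
#!/usr/bin/env bash
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands
THREADS="${THREADS:-4}"                      # hypothetical default
GENOME_DIR="${GENOME_DIR:-ref/star_index}"   # hypothetical STAR index path
GTF="${GTF:-ref/annotation.gtf}"             # hypothetical annotation path
ADAPTERS="${ADAPTERS:-TruSeq3-PE.fa}"        # assumed adapter file name

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "${DRY_RUN}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Steps 1-4 for a single paired-end sample (SRA run accession).
process_sample() {
    local s="$1"
    run mkdir -p raw qc trimmed aligned counts
    run fasterq-dump "${s}" --split-files -e "${THREADS}" -O raw
    run fastqc -t "${THREADS}" -o qc "raw/${s}_1.fastq" "raw/${s}_2.fastq"
    run trimmomatic PE -threads "${THREADS}" \
        "raw/${s}_1.fastq" "raw/${s}_2.fastq" \
        "trimmed/${s}_1P.fastq" "trimmed/${s}_1U.fastq" \
        "trimmed/${s}_2P.fastq" "trimmed/${s}_2U.fastq" \
        "ILLUMINACLIP:${ADAPTERS}:2:30:10" SLIDINGWINDOW:4:20 MINLEN:36
    run STAR --runThreadN "${THREADS}" --genomeDir "${GENOME_DIR}" \
        --readFilesIn "trimmed/${s}_1P.fastq" "trimmed/${s}_2P.fastq" \
        --outSAMtype BAM SortedByCoordinate \
        --outFileNamePrefix "aligned/${s}_"
}

# Step 5: count fragments per gene across all BAM files given as arguments.
count_reads() {
    run featureCounts -p --countReadPairs -T "${THREADS}" \
        -a "${GTF}" -o counts/gene_counts.txt "$@"
}

# Placeholder accession; replace with real SRR IDs from the dataset.
process_sample SRR_PLACEHOLDER
count_reads aligned/SRR_PLACEHOLDER_Aligned.sortedByCoord.out.bam
```

With `DRY_RUN=1` (the default here) the script only prints the commands, which makes it easy to check paths before committing to a long run; setting `DRY_RUN=0` executes them for real.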

Workflow Comparison

This table compares the Bash, Snakemake, and Nextflow implementations across the key evaluation criteria defined above.

| Feature | Bash-linear | Snakemake | Nextflow |
| --- | --- | --- | --- |
| Ease of use | ✅ Beginner-friendly (at first) | ⚠️ Some syntax learning required | ⚠️ More abstract and DSL-heavy |
| Reproducibility | ❌ Manual logging, fragile | ✅ Full dependency tracking + logs | ✅ Excellent reproducibility with containers |
| Resumability | ❌ Must restart manually | ✅ Can resume with `--rerun-incomplete` | ✅ Built-in checkpointing and caching |
| Maintainability | ❌ Hard to update safely | ✅ Modular rules make updates easy | ✅ Modular + reusable with process isolation |
| Container Support | ❌ None by default | ✅ Native `singularity:`/`conda` per rule | ✅ Native Docker/Singularity support |

✅ = Strong
⚠️ = Moderate / learning curve
❌ = Weak / missing

Interpretation:

- Bash is approachable but brittle. Great for quick one-offs, bad for scaling or sharing.
- Snakemake hits the sweet spot between usability and reproducibility, especially on local systems or HPC clusters.
- Nextflow is more powerful and flexible (especially for cloud/HPC), but has a steeper learning curve and more complex syntax.
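To make the resumability gap concrete, here is a minimal sketch of the kind of checkpointing a plain Bash pipeline has to implement by hand. The `run_step` helper, the `CHECKPOINT_DIR` variable, and the `.done` marker files are hypothetical names for illustration, not part of this repository.

```shell
#!/usr/bin/env bash
set -euo pipefail

CHECKPOINT_DIR="${CHECKPOINT_DIR:-.checkpoints}"   # hypothetical marker directory

# Run a named step once. A marker file is written only after the command
# succeeds, so a later run of the script skips steps that already finished.
run_step() {
    local name="$1"; shift
    local marker="${CHECKPOINT_DIR}/${name}.done"
    mkdir -p "${CHECKPOINT_DIR}"
    if [ -f "${marker}" ]; then
        echo "skip ${name}"
        return 0
    fi
    if ! "$@"; then
        echo "failed ${name}" >&2
        return 1
    fi
    touch "${marker}"
    echo "done ${name}"
}

# Hypothetical usage: the second call is skipped because the marker exists.
CHECKPOINT_DIR="$(mktemp -d)"
run_step align_demo true    # prints "done align_demo"
run_step align_demo true    # prints "skip align_demo"
```

This is much cruder than what the workflow managers do: the marker only records that a step once succeeded, and it does not notice when inputs or code change afterwards. Snakemake and Nextflow track that for you.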

Amar825/RNAseq-pipeline-Bash-Snakemake-Nextflow
