A comprehensive computational pipeline for simulating and analyzing coral reef transcriptomic responses to environmental stressors, implementing standardized RNA-seq differential expression workflows with interactive visualization capabilities.
The tool is designed for marine genomics researchers studying cnidarian transcriptional responses to environmental perturbations. It provides an analytical framework for coral gene expression profiling, built on established methods from marine molecular ecology and transcriptomics.
- Transcriptome Simulation Engine: Generates realistic coral gene expression count matrices with configurable parameters mimicking RNA-seq data characteristics
- Differential Expression Analysis: Implements Welch's t-test with Benjamini-Hochberg FDR correction for identifying stress-responsive genes
- CPM Normalization Pipeline: Counts per million (CPM) transformation with log₂ scaling for variance stabilization
- Principal Component Analysis: Dimensionality reduction for sample clustering and batch effect detection
- Interactive Volcano Plots: Statistical significance visualization with log₂ fold-change thresholds
- Expression Heatmaps: Hierarchical clustering of top differentially expressed genes (DEGs)
The simulation engine generates synthetic coral transcriptome data using Poisson-distributed count models:
- Baseline Expression: Control samples drawn from Poisson(λ=50) distribution
- Stress Response Modeling:
  - Upregulated genes (n=10): Additional Poisson(λ=40) counts
  - Downregulated genes (n=10): Reduced by Poisson(λ=20) counts
  - Background genes: Maintain baseline expression levels
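The count model above can be sketched with NumPy as follows. This is a minimal illustration rather than the application's actual code: the function name `simulate_counts`, the choice of the first 10 genes as upregulated and the next 10 as downregulated, the `Gene_001`-style identifiers, and the clipping of downregulated counts at zero are all assumptions.

```python
import numpy as np

def simulate_counts(n_genes=200, n_reps=3, n_up=10, n_down=10, seed=0):
    """Simulate a genes x samples count matrix from Poisson draws."""
    rng = np.random.default_rng(seed)
    # Baseline: every gene in every sample is drawn from Poisson(50)
    counts = rng.poisson(lam=50, size=(n_genes, 2 * n_reps))
    stress = slice(n_reps, 2 * n_reps)  # stress columns follow the controls

    # Upregulated genes (assumed to be the first n_up rows): add Poisson(40)
    counts[:n_up, stress] += rng.poisson(lam=40, size=(n_up, n_reps))

    # Downregulated genes (assumed to be the next n_down rows): subtract
    # Poisson(20); clipping at zero is an assumption to keep counts valid
    down = slice(n_up, n_up + n_down)
    reduced = counts[down, stress] - rng.poisson(lam=20, size=(n_down, n_reps))
    counts[down, stress] = np.clip(reduced, 0, None)

    samples = [f"Ctrl_{i + 1}" for i in range(n_reps)]
    samples += [f"Stress_{i + 1}" for i in range(n_reps)]
    genes = [f"Gene_{i + 1:03d}" for i in range(n_genes)]
    return counts, genes, samples

counts, genes, samples = simulate_counts()
```

With the defaults, `counts` has 200 rows and 6 columns, and all remaining genes keep their baseline Poisson(50) draws in both groups.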
- Count Matrix Loading: Import expression data in standard CSV format with genes as rows and samples as columns
- CPM Normalization:
  CPM = (raw_counts / library_size) × 10⁶
  log₂CPM = log₂(CPM + 1)
- Differential Expression Testing:
  - Statistical method: Welch's two-sample t-test (unequal variances)
  - Multiple testing correction: Benjamini-Hochberg FDR (α = 0.05)
  - Effect size: log₂ fold-change calculation with pseudocount adjustment
- Dimensionality Reduction:
  - Principal Component Analysis on log₂CPM-transformed data
  - Variance explained reporting for PC1 and PC2
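The normalization and testing steps can be illustrated with a short sketch. The function names `log2_cpm` and `welch_de` are hypothetical. Computing the fold-change as a difference of group means on the log₂(CPM + 1) scale is one plausible reading of "pseudocount adjustment", not necessarily what the application does. The first two rows of the example matrix come from the sample CSV in this README; the third row is made up for illustration.

```python
import numpy as np
from scipy import stats

def log2_cpm(counts):
    """Counts per million per sample (column), then log2(CPM + 1)."""
    counts = np.asarray(counts, dtype=float)
    library_size = counts.sum(axis=0, keepdims=True)  # total reads per sample
    cpm = counts / library_size * 1e6
    return np.log2(cpm + 1)

def welch_de(logcpm, is_stress):
    """Per-gene Welch t-test and log2 fold-change (stress minus control)."""
    is_stress = np.asarray(is_stress, dtype=bool)
    ctrl, stress = logcpm[:, ~is_stress], logcpm[:, is_stress]
    # Assumption: effect size is the difference of mean log2(CPM + 1) values,
    # where the +1 acts as the pseudocount
    log2fc = stress.mean(axis=1) - ctrl.mean(axis=1)
    # equal_var=False selects Welch's unequal-variance t-test
    _, p_val = stats.ttest_ind(stress, ctrl, axis=1, equal_var=False)
    return log2fc, p_val

counts = np.array([
    [45, 52, 48, 89, 76, 84],  # Gene_001 from the example CSV
    [67, 61, 59, 45, 52, 48],  # Gene_002 from the example CSV
    [50, 47, 53, 51, 49, 52],  # illustrative background gene
])
is_stress = [False, False, False, True, True, True]
logcpm = log2_cpm(counts)
log2fc, p_val = welch_de(logcpm, is_stress)
```

The resulting raw p-values would then go through the Benjamini-Hochberg correction before any significance call is made.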
- Volcano Plot: -log₁₀(p-value) vs log₂(fold-change) with significance thresholding
- PCA Biplot: Sample ordination colored by treatment group with variance contribution
- Expression Heatmap: Z-score normalized expression of top 25 DEGs with hierarchical clustering
- Python ≥ 3.8
- 4GB RAM minimum (recommended: 8GB for large datasets)
Install the dependencies:
pip install -r requirements.txt
Core Libraries:
- streamlit: Web application framework
- pandas: Data manipulation and analysis
- numpy: Numerical computing
- scipy: Statistical functions
- statsmodels: Advanced statistical modeling
- scikit-learn: Machine learning and PCA implementation
- plotly: Interactive plotting
- seaborn: Statistical data visualization
- matplotlib: Publication-quality figures
Launch the application:
streamlit run app.py
The application will launch in your default web browser at http://localhost:8501.
- Simulated Data Generation (Default):
  - Toggle "Use existing data" checkbox OFF
  - Generates a fresh mock expression matrix (200 genes × 6 samples)
  - Sample naming convention: Ctrl_1, Ctrl_2, Ctrl_3, Stress_1, Stress_2, Stress_3
- Custom Data Upload:
  - Toggle "Use existing data" checkbox ON
  - Requires CSV format with gene identifiers as row names
  - Column headers must follow the Ctrl_* and Stress_* naming pattern
Input CSV Structure:
gene_id,Ctrl_1,Ctrl_2,Ctrl_3,Stress_1,Stress_2,Stress_3
Gene_001,45,52,48,89,76,84
Gene_002,67,61,59,45,52,48
...
Requirements:
- Gene identifiers in first column
- Sample names containing "Ctrl" for control samples
- Sample names containing "Stress" for treatment samples
- Raw count data (non-negative integers)
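Before uploading a file, the requirements above can be checked with a few lines of pandas. This is a sketch under assumptions: `load_count_matrix` is a hypothetical helper, not a function exposed by the application, and the error messages are illustrative.

```python
import io
import pandas as pd

def load_count_matrix(source):
    """Read a genes x samples CSV and check it against the input requirements."""
    df = pd.read_csv(source, index_col=0)  # first column holds gene identifiers
    ctrl = [c for c in df.columns if "Ctrl" in c]
    stress = [c for c in df.columns if "Stress" in c]
    if not ctrl or not stress:
        raise ValueError("need at least one 'Ctrl' and one 'Stress' column")
    if not all(pd.api.types.is_integer_dtype(t) for t in df.dtypes):
        raise ValueError("counts must be integers")
    if (df < 0).any().any():
        raise ValueError("counts must be non-negative")
    return df, ctrl, stress

example_csv = io.StringIO(
    "gene_id,Ctrl_1,Ctrl_2,Ctrl_3,Stress_1,Stress_2,Stress_3\n"
    "Gene_001,45,52,48,89,76,84\n"
    "Gene_002,67,61,59,45,52,48\n"
)
df, ctrl, stress = load_count_matrix(example_csv)
```

Passing a file path instead of the `StringIO` object works the same way, since `pd.read_csv` accepts both.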
Volcano Plot:
- X-axis: log₂ fold-change (positive = upregulated in stress, negative = downregulated)
- Y-axis: -log₁₀(p-value) (higher = more statistically significant)
- Color coding:
- Red/highlighted points: FDR-adjusted p-value < 0.05
- Gray points: Non-significant genes
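The volcano plot coordinates and colors follow directly from the test results. The sketch below shows one way to derive them; `volcano_coordinates` is a hypothetical helper, the input values are illustrative rather than real results, and the lower clip on p-values is an assumption to keep the y-axis finite.

```python
import numpy as np

def volcano_coordinates(log2fc, p_values, fdr, alpha=0.05):
    """Return x, y, and a color label per gene for a volcano plot."""
    x = np.asarray(log2fc, dtype=float)
    # Clip p-values away from zero so that -log10 stays finite
    p = np.clip(np.asarray(p_values, dtype=float), 1e-300, 1.0)
    y = -np.log10(p)
    # Color by FDR-adjusted significance, not by the raw p-value
    significant = np.asarray(fdr, dtype=float) < alpha
    colors = np.where(significant, "red", "gray")
    return x, y, colors

x, y, colors = volcano_coordinates(
    log2fc=[1.8, -1.2, 0.1],
    p_values=[0.0001, 0.001, 0.5],
    fdr=[0.0003, 0.0015, 0.5],
)
```

Note that the y-axis uses the raw p-value while the coloring uses the adjusted one, so a point can sit fairly high on the plot and still be gray.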
PCA Biplot:
- Sample clustering: Groups of samples with similar expression profiles
- Treatment separation: Distance between control and stress sample clusters
- Variance explained: Percentage of total expression variance captured by each PC
- Quality assessment: Tight within-group clustering indicates good biological replication
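To make the variance-explained figures concrete, here is a minimal PCA sketch. It uses NumPy's SVD rather than scikit-learn's `PCA` class, which should give the same coordinates up to sign; `pca_2d` is a hypothetical name, and the input matrix is synthetic data generated only for this example.

```python
import numpy as np

def pca_2d(logcpm):
    """Project samples onto PC1/PC2; input is a genes x samples matrix."""
    X = np.asarray(logcpm, dtype=float).T    # samples x genes
    X = X - X.mean(axis=0)                   # center each gene
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :2] * S[:2]                # sample coordinates on PC1, PC2
    explained = S ** 2 / np.sum(S ** 2)      # fraction of variance per PC
    return scores, explained[:2]

# Synthetic log-expression: 50 genes x 6 samples, small noise around 5,
# with the first 10 genes shifted upward in the three stress samples
rng = np.random.default_rng(0)
logcpm = rng.normal(loc=5.0, scale=0.1, size=(50, 6))
logcpm[:10, 3:] += 2.0

scores, explained = pca_2d(logcpm)
```

Here PC1 captures the treatment shift, so the control and stress samples fall on opposite sides of zero along the first axis, and `explained[0]` is the share of total variance attributed to that separation.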
Expression Heatmap:
- Rows: Top 25 most significantly differentially expressed genes
- Columns: Individual sample replicates
- Color scale: Z-score normalized expression (red = high, blue = low)
- Clustering: Hierarchical clustering reveals gene co-expression patterns
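The heatmap values and row order can be reproduced in a few lines. This sketch assumes per-gene z-scores using the sample standard deviation and average-linkage clustering on Euclidean distance; the application's actual linkage settings are not stated here, and both function names are hypothetical. The expression values are invented for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def zscore_rows(expr):
    """Z-score each gene (row) across samples."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=1, keepdims=True)
    sd = expr.std(axis=1, ddof=1, keepdims=True)
    sd[sd == 0] = 1.0  # constant genes get z = 0 instead of dividing by zero
    return (expr - mean) / sd

def cluster_order(z):
    """Row order from average-linkage hierarchical clustering (assumed method)."""
    return leaves_list(linkage(z, method="average", metric="euclidean"))

# Illustrative log-expression: rows 0 and 2 go up under stress, rows 1 and 3 go down
expr = [
    [5.0, 5.2, 5.1, 7.0, 7.1, 6.9],
    [8.0, 8.1, 7.9, 6.0, 6.2, 6.1],
    [3.0, 3.1, 2.9, 4.9, 5.0, 5.1],
    [9.0, 9.2, 9.1, 7.2, 7.0, 7.1],
]
z = zscore_rows(expr)
order = cluster_order(z)
```

Because z-scoring removes differences in absolute expression level, the two upregulated genes end up adjacent in `order` even though their raw values differ, and likewise for the two downregulated genes.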
Statistical Metrics:
- p-value: Raw statistical significance from t-test
- FDR-adjusted p-value: Multiple testing corrected significance (α = 0.05)
- Fold-change: Biological effect size (typical threshold: |log₂FC| > 1)
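A small worked example shows why the FDR-adjusted value, not the raw p-value, should drive significance calls. The function below is a hand-written Benjamini-Hochberg adjustment for illustration; `multipletests(..., method="fdr_bh")` in statsmodels should yield the same adjusted values. The p-values are invented.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Scale the k-th smallest p-value by m / k
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted

p = [0.001, 0.008, 0.039, 0.041, 0.60]
adj = benjamini_hochberg(p)
# adj is [0.005, 0.02, 0.05125, 0.05125, 0.6]
```

Four of the five raw p-values fall below 0.05, but only the first two remain significant after adjustment at α = 0.05.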
The application generates several data products:
- Differential Expression Results Table:
  - Gene identifiers
  - log₂ fold-change values
  - Raw p-values
  - FDR-adjusted p-values
  - Ranked by statistical significance
- Interactive Visualizations:
  - Volcano plot (HTML/SVG export compatible)
  - PCA biplot with variance statistics
  - Expression heatmap with clustering dendrograms
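For downstream scripting, a results table with the fields listed above could be assembled as in this sketch. The column names `log2FC`, `p_value`, and `FDR`, and the helper `build_results_table`, are assumptions and may not match what the application exports; the values are illustrative.

```python
import pandas as pd

def build_results_table(genes, log2fc, p_values, fdr):
    """Assemble per-gene results and rank them by raw p-value."""
    table = pd.DataFrame(
        {"log2FC": log2fc, "p_value": p_values, "FDR": fdr},
        index=pd.Index(genes, name="gene_id"),
    )
    return table.sort_values("p_value")  # most significant genes first

table = build_results_table(
    genes=["Gene_001", "Gene_002", "Gene_003"],
    log2fc=[0.1, 1.8, -1.2],
    p_values=[0.5, 0.0001, 0.001],
    fdr=[0.5, 0.0003, 0.0015],
)
```

Calling `table.to_csv("de_results.csv")` would write the ranked table to disk; the file name is only an example.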
This project is released under the Creative Commons CC0 1.0 Universal License - see the LICENSE file for details.