
Sylvan Tutorial: Running with Toy Data

This tutorial walks you through running Sylvan on the included toy dataset (A. thaliana chromosome 4).

Prerequisites

  • Linux system with SLURM
  • Singularity 3.x+
  • Conda/Mamba
  • Git LFS

Step 1: Environment Setup

# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan

# Install Git LFS
git lfs install
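
A quick sanity check that the prerequisites resolve inside the activated environment (version numbers are illustrative; any release matching the prerequisites above should work):

# Confirm the required tools are on PATH
snakemake --version     # expect a 7.x release
singularity --version   # expect 3.x or newer
git lfs version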

Step 2: Download Containers

Sylvan Container

singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest

EDTA Container (for repeat library)

export SINGULARITY_CACHEDIR=$PWD
singularity pull EDTA.sif docker://quay.io/biocontainers/edta:2.2.0--hdfd78af_1

Clone Repository

git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan

# Verify LFS files are downloaded (not pointers)
git lfs pull
ls -la toydata/  # Files should be > 200 bytes
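
If you are unsure whether the LFS objects were actually fetched, look for pointer files: these are tiny text stubs that reference the LFS spec URL (the pattern below assumes the standard pointer format):

# Pointer files are ~130-byte text stubs; real data files are much larger
grep -rl "git-lfs.github.com" toydata/ \
  && echo "Pointer files present: run 'git lfs pull' again" \
  || echo "No pointer files detected"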

Step 3: Prepare Repeat Library (EDTA)

Note: The toy data includes a pre-computed repeat library. This step is for reference or if you need to regenerate it.

sbatch -c 16 --mem=68g --wrap="singularity exec --cleanenv --env PYTHONNOUSERSITE=1 \
  EDTA.sif EDTA.pl \
  --genome toydata/genome_input/genome.fasta \
  --cds toydata/cds_aa/neighbor.cds \
  --anno 1 --threads 16 --force 1"

Expected runtime: ~1.5 hours for the 18.6 Mb toy genome

Output: genome.fasta.mod.EDTA.TElib.fa
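
Before moving on, it can help to confirm the library exists and count its consensus sequences. This is a sketch, and the path assumes EDTA ran in the repository root; adjust it if your output landed elsewhere:

# Number of TE consensus sequences in the repeat library
grep -c ">" genome.fasta.mod.EDTA.TElib.fa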

EDTA Benchmark (Toy Data)

| Stage | Duration |
| --- | --- |
| LTR detection | ~3 min |
| SINE detection | ~6 min |
| LINE detection | ~45 min |
| TIR detection | ~8 min |
| Helitron detection | ~10 min |
| Filtering & annotation | ~8 min |
| Total | ~1.5 hours |

Step 4: Run Annotation Pipeline

Environment Variables

The pipeline uses several environment variables for configuration:

| Variable | Description | Example |
| --- | --- | --- |
| SYLVAN_CONFIG | Path to pipeline config file | toydata/config/config_annotate.yml |
| SYLVAN_CLUSTER_CONFIG | Path to SLURM cluster config (optional, auto-derived from SYLVAN_CONFIG) | toydata/config/cluster_annotate.yml |
| TMPDIR | Temporary directory for intermediate files | $(pwd)/results/TMP |
| SLURM_TMPDIR | SLURM job temporary directory (should match TMPDIR) | $TMPDIR |

Why set TMPDIR?

Setting TMPDIR explicitly is critical on many HPC systems:

  1. Memory-backed /tmp (tmpfs): Some HPC nodes have no local disk storage. The default /tmp is mounted as tmpfs, which stores files in RAM. Large temporary files from tools like STAR, RepeatMasker, or Augustus can quickly exhaust memory and crash your jobs with cryptic "out of memory", "no space left on device", or even segmentation fault errors—since the OS may kill processes or corrupt memory when tmpfs fills up.

  2. Quota limits: Shared /tmp partitions often have strict per-user quotas (e.g., 1-10 GB). Genome annotation tools easily exceed this.

  3. Job isolation: When TMPDIR points to your project directory, temp files persist after job completion for debugging. Cleanup is also straightforward with rm -rf results/TMP/*.

  4. Singularity compatibility: Containers inherit TMPDIR from the host. Setting it to a bound path ensures temp files are written to accessible storage.

Tip: Check if your cluster uses tmpfs with df -h /tmp. If it shows tmpfs as the filesystem type, you must set TMPDIR to avoid memory issues.
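
A small guard you can drop near the top of a submit script, assuming GNU df with the --output flag is available on your nodes:

# Warn early if TMPDIR resolves to a RAM-backed filesystem
fstype=$(df --output=fstype "${TMPDIR:-/tmp}" | tail -1)
if [ "$fstype" = "tmpfs" ]; then
  echo "WARNING: ${TMPDIR:-/tmp} is tmpfs (RAM-backed); point TMPDIR at project storage" >&2
fi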

Complete Environment Setup

Copy this block before running the pipeline:

# Required: Pipeline configuration
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"

# Required: Temp directory (create if not exists)
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# Optional: Bind additional paths for Singularity
# export SINGULARITY_BIND="/scratch,/data"

# Optional: Increase open file limit (some tools need this)
ulimit -n 65535 2>/dev/null || true

Additional Singularity Variables

| Variable | Description | When to Use |
| --- | --- | --- |
| SINGULARITY_BIND | Bind additional host paths into container | When input files are outside working directory |
| SINGULARITY_CACHEDIR | Location for Singularity cache | When home directory has quota limits |
| SINGULARITY_TMPDIR | Singularity's internal temp directory | Should match TMPDIR |
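
If your site needs these, a minimal block to pair with the setup above (the cache location is a placeholder; substitute your own):

# Keep Singularity's cache and temp files off quota-limited home storage
export SINGULARITY_CACHEDIR="$PWD/.singularity_cache"   # placeholder location
export SINGULARITY_TMPDIR="$TMPDIR"                      # keep in sync with TMPDIR
mkdir -p "$SINGULARITY_CACHEDIR"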

When do you need SINGULARITY_BIND?

Singularity automatically binds your current working directory, $HOME, and /tmp. However, you need explicit binding when:

  • Input files (genome, RNA-seq, proteins) are on a separate filesystem (e.g., /scratch, /project, /data)
  • Your home directory has quota limits and you store data elsewhere
  • Using shared lab storage mounted at non-standard paths
  • Config file references absolute paths outside the working directory
# Common scenarios:

# 1. Data on scratch space
export SINGULARITY_BIND="/scratch/$USER"

# 2. Multiple paths (comma-separated)
export SINGULARITY_BIND="/scratch,/project/shared_data,/data/genomes"

# 3. Bind with different container path (host:container)
export SINGULARITY_BIND="/long/path/on/host:/data"

# 4. Read-only binding (for shared reference data)
export SINGULARITY_BIND="/shared/databases:/databases:ro"

Diagnosing bind issues:

# Error: "file not found" inside container but exists on host
# → The path isn't bound. Add it to SINGULARITY_BIND

# Test if path is accessible inside container:
singularity exec sylvan.sif ls /your/data/path

# See what's currently bound:
singularity exec sylvan.sif cat /proc/mounts | grep -E "scratch|project|data"

Verifying Your Environment

Before submitting jobs, verify your setup:

# Check TMPDIR is on real disk (not tmpfs)
df -h $TMPDIR

# Verify Singularity can access paths
singularity exec sylvan.sif ls $TMPDIR

# Test config file is readable
cat $SYLVAN_CONFIG | head -5

Debugging Environment Issues

When jobs fail unexpectedly, environment variables are often the culprit. Use these techniques to diagnose:

1. Print all relevant variables:

# Add to your script or run interactively
echo "=== Environment Check ==="
echo "SYLVAN_CONFIG: $SYLVAN_CONFIG"
echo "TMPDIR: $TMPDIR"
echo "SLURM_TMPDIR: $SLURM_TMPDIR"
echo "SINGULARITY_BIND: $SINGULARITY_BIND"
echo "PWD: $PWD"
df -h $TMPDIR

2. Check what SLURM jobs actually see:

# Submit a diagnostic job
sbatch --wrap='env | grep -E "TMPDIR|SINGULARITY|SYLVAN" && df -h /tmp $TMPDIR'

3. Common environment-related errors:

| Error Message | Likely Cause | Solution |
| --- | --- | --- |
| No space left on device | TMPDIR on tmpfs or quota exceeded | Set TMPDIR to project storage |
| Segmentation fault | Memory exhausted (tmpfs full) | Set TMPDIR to disk-backed storage |
| file not found (in container) | Path not bound in Singularity | Add path to SINGULARITY_BIND |
| Permission denied | Singularity can't write to TMPDIR | Check directory permissions, ensure path is bound |
| cannot create temp file | TMPDIR doesn't exist or not writable | Run mkdir -p $TMPDIR && touch $TMPDIR/test |
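
The checks from this table can be bundled into a short preflight snippet at the top of your submit script; this is a sketch assuming the variables from the setup block above:

# Fail fast on the most common environment problems
: "${SYLVAN_CONFIG:?SYLVAN_CONFIG is not set}"
[ -r "$SYLVAN_CONFIG" ]     || { echo "Config not readable: $SYLVAN_CONFIG" >&2; exit 1; }
mkdir -p "$TMPDIR"          || { echo "Cannot create TMPDIR: $TMPDIR" >&2; exit 1; }
touch "$TMPDIR/.write_test" || { echo "TMPDIR not writable: $TMPDIR" >&2; exit 1; }
rm -f "$TMPDIR/.write_test"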

4. Interactive debugging inside container:

# Start interactive shell in container with same bindings
singularity shell --cleanenv sylvan.sif

# Inside container, verify paths exist
ls -la $TMPDIR
ls -la /path/to/your/data

5. Check if variables survive into SLURM jobs:

SLURM doesn't always pass environment variables. Ensure your submit script exports them:

# In your sbatch script or wrapper
#SBATCH --export=ALL    # Pass all environment variables

# Or explicitly export in the script:
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

Dry Run (Recommended)

Always do a dry run first to verify configuration:

export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate

Submit to SLURM

sbatch -A [account] -p [partition] -c 1 --mem=1g \
  -J annotate -o annotate.out -e annotate.err \
  --wrap="./bin/annotate_toydata.sh"

Output locations

  • All intermediate/final results are written under the repo root results/.
  • RepeatMasker/RepeatModeler run inside results/GETA/RepeatMasker/..., so .RepeatMaskerCache and RM_* temp folders also stay there.
  • EVM commands and outputs live in results/EVM/; no EVM -> results/EVM symlink is needed.
  • For the filter pipeline, set RexDB to a RepeatExplorer protein DB (e.g., Viridiplantae_v4.0.fasta from https://github.com/repeatexplorer/rexdb). You can download directly via:
    wget -O toydata/misc/Viridiplantae_v4.0.fasta https://raw.githubusercontent.com/repeatexplorer/rexdb/refs/heads/main/Viridiplantae_v4.0.fasta

Expected Runtime (Toy Data)

| Stage | Time |
| --- | --- |
| RepeatMasking | 15-30 min |
| RNA-seq alignment | 30-60 min |
| Transcript assembly | 20-40 min |
| Homology search | 30-60 min |
| Augustus training | 1-2 hours |
| Gene model combination | 30-60 min |
| EvidenceModeler | 30-60 min |
| Total | 4-8 hours |

Force Rerun

# Rerun all jobs
./bin/annotate_toydata.sh --forceall

# Rerun specific rule
./bin/annotate_toydata.sh --forcerun helixer

# Rerun incomplete jobs
./bin/rerun-incomplete_toydata.sh

Step 5: Run Filter Pipeline

After annotation completes:

sbatch -A [account] -p [partition] -c 1 --mem=4g \
  -J filter -o filter.out -e filter.err \
  --wrap="./bin/filter.sh"

Output: results/FILTER/filter.gff3

Step 6: Format Output (TidyGFF)

singularity exec sylvan.sif python bin/TidyGFF.py \
  Ath4 results/FILTER/filter.gff3 \
  --out Ath4_v1.0 \
  --splice-name t \
  --justify 5 \
  --sort \
  --chrom-regex "^Chr" \
  --source Sylvan
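
As a quick sanity check on the result, you can tally feature types in the tidied GFF3. The output filename below is assumed from --out Ath4_v1.0; adjust it to whatever TidyGFF.py actually writes:

# Count feature types (gene, mRNA, exon, CDS, ...) in the renamed annotation
awk -F'\t' '$0 !~ /^#/ {print $3}' Ath4_v1.0.gff3 | sort | uniq -c | sort -rn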

Monitoring and Debugging

Check Job Status

squeue -u $USER

View Logs

# Snakemake log
tail -f .snakemake/log/*.snakemake.log

# Find recent error logs
ls -lt results/logs/*.err | head -10

# Search for errors
grep -l 'Error\|Traceback' results/logs/*.err

# View specific log (pattern: {rule}_{wildcards}.err)
cat results/logs/liftoff_.err
cat results/logs/geneRegion2Genewise_seqid=group17400.err

Common Issues

| Issue | Solution |
| --- | --- |
| Out of memory | Increase memory in cluster_annotate.yml |
| LFS files are pointers | Run git lfs pull |
| Singularity bind error | Use paths within the working directory |
| Augustus training fails | Needs a minimum of 500 training genes |

Toy Data Details

Overview

The toy dataset contains Arabidopsis thaliana chromosome 4 split into 3 segments (~18.6 Mb total).

Directory Structure

toydata/
├── config/                      # Configuration files
│   ├── config_annotate.yml      # Pipeline config (pre-configured)
│   ├── cluster_annotate.yml     # SLURM resource config
│   └── evm_weights.txt          # EVM weights
├── genome_input/
│   └── genome.fasta.gz          # A. thaliana Chr4 (3 parts)
├── RNASeq/                      # 12 paired-end RNA-seq samples
│   ├── sub_SRR1019221_1.fastq.gz
│   └── ...
├── neighbor_genome/             # Neighbor species genomes
│   ├── aly4.fasta               # A. lyrata
│   ├── ath4.fasta               # A. thaliana
│   ├── chi4.fasta               # C. hirsuta
│   └── cru4.fasta               # C. rubella
├── neighbor_gff3/               # Neighbor annotations
│   ├── aly4.gff3
│   └── ...
├── cds_aa/
│   ├── neighbor.cds             # Combined CDS for EDTA
│   └── neighbor.aa              # Proteins for homology
└── EDTA/                        # Pre-computed repeat library
    └── genome.fasta.mod.EDTA.TElib.fa

Genome Statistics

| Segment | Length | GC (%) |
| --- | --- | --- |
| Chr4_1 | 6,195,060 bp | 36.66 |
| Chr4_2 | 6,195,060 bp | 35.24 |
| Chr4_3 | 6,195,018 bp | 36.69 |
| Total | 18,585,138 bp | 36.20 |
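
These numbers can be reproduced with seqkit (already used elsewhere in this tutorial); a sketch, assuming the toy genome lives at the path shown in the directory structure above:

# Per-sequence name, length, and GC content
seqkit fx2tab --name --length --gc toydata/genome_input/genome.fasta.gz

# Overall summary statistics
seqkit stats -a toydata/genome_input/genome.fasta.gz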

TAIR10 Reference Annotation (Chr4)

For comparison, the official TAIR10 annotation of Arabidopsis thaliana (Col-0) chromosome 4 contains:

| Feature Type | Count |
| --- | --- |
| Protein-coding genes | 4,124 |
| pre-tRNA genes | 79 |
| rRNA genes | 0 |
| snRNA genes | 0 |
| snoRNA genes | 11 |
| miRNA genes | 28 |
| Other RNA genes | 62 |
| Pseudogenes | 121 |
| Transposable element (TE) genes | 711 |
| Total annotated loci | 5,410 |

Note: TAIR10 is the current reference annotation standard for A. thaliana. The protein-coding gene count increased from 3,744 (original Chr4 paper) to 4,124 as annotation methods improved.

Source: Phoenix Bioinformatics - Genome Annotation at TAIR

Sylvan Pipeline Results (Toy Data)

Running the Sylvan pipeline on the Chr4 toy dataset produces the following results:

Annotation Phase (complete_draft.gff3):

| Metric | Count |
| --- | --- |
| Total genes | 5,720 |
| Total mRNA | 5,800 |

Filter Phase (filter.gff3):

| Metric | Count |
| --- | --- |
| Genes kept | 3,756 |
| mRNA kept | 3,834 |
| Genes discarded | 1,964 |
| mRNA discarded | 1,966 |

Output files:

  • results/FILTER/filter.gff3 - Filtered gene models
  • results/FILTER/discard.gff3 - Discarded gene models
  • results/FILTER/keep_data.tsv - Evidence data for kept genes
  • results/FILTER/discard_data.tsv - Evidence data for discarded genes
  • results/complete_draft.gff3.map - ID mapping between original and new IDs

Comparison: The 3,756 kept genes represent ~91% of TAIR10's 4,124 protein-coding genes on Chr4. The higher initial count (5,720) includes transposable elements (TAIR10 lists 711 TE genes) and low-confidence predictions that are removed during filtering.
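
The gene and mRNA counts above can be reproduced with a simple column-3 tally (assuming standard GFF3 feature types gene and mRNA):

# Tally gene and mRNA records in the kept and discarded sets
for f in results/FILTER/filter.gff3 results/FILTER/discard.gff3; do
  echo "$f"
  awk -F'\t' '$3=="gene"{g++} $3=="mRNA"{m++} END{printf "  genes: %d\n  mRNA:  %d\n", g, m}' "$f"
done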

Neighbor Species

| Code | Species | Common Name |
| --- | --- | --- |
| aly4 | Arabidopsis lyrata | Lyrate rockcress |
| cru4 | Capsella rubella | Pink shepherd's purse |
| chi4 | Cardamine hirsuta | Hairy bittercress |

RNA-seq Samples

| SRA | Tissue | Size |
| --- | --- | --- |
| SRR1019221 | Leaf (14-day) | 4.6 Gb |
| SRR1105822 | Rosette (19-day) | 3.1 Gb |
| SRR1105823 | Rosette (19-day) | 4.5 Gb |
| SRR1106559 | Rosette (19-day) | 3.6 Gb |
| SRR446027 | Whole plant | 2.5 Gb |
| SRR446028 | Whole plant | 5.4 Gb |
| SRR446033 | Whole plant | 5.3 Gb |
| SRR446034 | Whole plant | 5.4 Gb |
| SRR446039 | Whole plant | 2.5 Gb |
| SRR446040 | Whole plant | 6.3 Gb |
| SRR764885 | Leaf (4-week) | 4.6 Gb |
| SRR934391 | Whole plant | 4.0 Gb |

How the Toy Data Was Created

1. Genome segmentation

seqkit split2 -p 3 Chr4.fasta

2. Neighbor CDS extraction

Syntenic regions were identified using MCscan/jcvi:

python -m jcvi.compara.catalog ortholog ath aly --no_strip_names
python -m jcvi.compara.synteny mcscan ath.bed ath.aly.lifted.anchors --iter=1

3. RNA-seq subsetting

Reads mapping to Chr4 were extracted:

STAR --genomeDir star_index --readFilesCommand zcat \
  --readFilesIn sample_1.fq.gz sample_2.fq.gz --outSAMtype BAM Unsorted
samtools view -b -F 4 Aligned.out.bam | samtools sort -n - \
  | samtools fastq -1 out_1.fq -2 out_2.fq -

Test Environment

The toy data was tested on:

| Specification | Value |
| --- | --- |
| Nodes | 4 |
| Total CPUs | 256 |
| CPU | Intel Xeon E5-2683 v4 @ 2.10GHz |
| Cores per node | 64 (2 sockets × 16 cores × 2 threads) |
| Memory per node | 256 GB |
| Storage | GPFS |

Runtime Statistics

The following summarizes the runtime distribution across all pipeline rules when running on the toy dataset.

Key observations:

  • Most time-consuming steps: aggregate_CombineGeneModels (~50,000s), geneRegion2Genewise (~1,000s), and Sam2Transfrag (~100-200s) are the bottlenecks
  • Fast steps: Most preprocessing and formatting rules complete in under 10 seconds
  • Parallelizable rules: Rules like geneRegion2Genewise, gmapExon, and STAR_paired run as many parallel jobs, significantly reducing wall-clock time
  • GPU-accelerated: helixer benefits from GPU acceleration when available

Runtime variability:

Actual runtime will vary significantly depending on your hardware and cluster configuration:

| Factor | Impact |
| --- | --- |
| CPU speed | Faster clock speeds reduce single-threaded bottlenecks |
| Available nodes | More nodes = more parallel jobs = faster wall-clock time |
| Memory per node | Insufficient memory causes job failures or swapping |
| Storage I/O | GPFS/Lustre faster than NFS; SSD faster than HDD |
| Queue wait time | Busy clusters add significant delays between jobs |
| GPU availability | Helixer runs ~10x faster with GPU acceleration |

With the test environment above (4 nodes, 256 CPUs, 256 GB/node), the toy dataset completes in 4-8 hours wall-clock time. On smaller clusters or shared resources, expect longer runtimes.


Getting Help

Generate cluster config from config_annotate

python bin/generate_cluster_from_config.py \
  --config config/config_annotate.yml \
  --out config/cluster_annotate.yml \
  --account cpu-s1-pgl-0 --partition cpu-s1-pgl-0

(To regenerate toydata cluster config, point --config/--out to toydata paths.)

python bin/generate_cluster_from_config.py \
  --config toydata/config/config_annotate.yml \
  --out toydata/config/cluster_annotate.yml \
  --account cpu-s1-pgl-0 --partition cpu-s1-pgl-0
chmod 775 bin/generate_cluster_from_config.py

Feature Importance Analysis

After finishing the filter phase you will have FILTER/data.tsv (the feature matrix used by Filter.py) and a BUSCO run directory such as results/busco/eudicots_odb10. Reviewers often ask for a feature ablation study, so we provide an automated helper:

python bin/filter_feature_importance.py FILTER/data.tsv results/busco/<lineage>/full_table.tsv \
  --output-table FILTER/feature_importance.tsv
  • What is the BUSCO full table? Every BUSCO run writes a full_table.tsv inside its lineage-specific run folder. Each non-Missing BUSCO row lists the BUSCO ID, status (Complete/Duplicated/Fragmented), and the transcript/gene ID it matched. The feature-importance script reuses this file to count how many BUSCOs remain in the “keep” set during each iteration—no new BUSCO analysis is required.
  • Outputs: FILTER/feature_importance.tsv (table) plus FILTER/feature_importance.json (machine-readable). Both include the baseline run (all features) and each leave-one-feature-out run, along with final out-of-bag (OOB) error, BUSCO counts, and iteration counts.
  • Optional flags (an example invocation follows this list):
    • --features TPM COVERAGE PFAM ... restricts the analysis to specific columns from FILTER/data.tsv.
    • --ignore TPM_missing singleExon removes metadata columns so the script automatically uses every other feature column.
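
For example, a restricted run over a few evidence columns might look like this. The column names must match headers in FILTER/data.tsv, the lineage path is whatever your own BUSCO run produced, and the output filename is chosen here for illustration:

# Leave-one-feature-out importance restricted to three columns
python bin/filter_feature_importance.py FILTER/data.tsv \
  results/busco/eudicots_odb10/full_table.tsv \
  --output-table FILTER/feature_importance_subset.tsv \
  --features TPM COVERAGE PFAM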

Workflow summary:

  1. Run Filter.py as usual to create FILTER/data.tsv.
  2. Identify the BUSCO full_table.tsv path you already used for filter monitoring (e.g., results/busco/eudicots_odb10/full_table.tsv).
  3. Execute the command above. Inspect FILTER/feature_importance.tsv to see how dropping each feature affects OOB error (positive delta ⇒ feature is important).
  4. Incorporate the results (table/plot) into your manuscript or reviewer response.