This tutorial walks you through running Sylvan on the included toy dataset (A. thaliana chromosome 4).
- Prerequisites
- Step 1: Environment Setup
- Step 2: Download Containers
- Step 3: Prepare Repeat Library (EDTA)
- Step 4: Run Annotation Pipeline
- Step 5: Run Filter Pipeline
- Step 6: Format Output (TidyGFF)
- Monitoring and Debugging
- Toy Data Details
- Linux system with SLURM
- Singularity 3.x+
- Conda/Mamba
- Git LFS
# Create conda environment
conda create -n sylvan -c conda-forge -c bioconda python=3.11 snakemake=7 -y
conda activate sylvan
# Install Git LFS
git lfs install
singularity pull --arch amd64 sylvan.sif library://wyim/sylvan/sylvan:latest
export SINGULARITY_CACHEDIR=$PWD
singularity pull EDTA.sif docker://quay.io/biocontainers/edta:2.2.0--hdfd78af_1
git clone https://github.com/plantgenomicslab/Sylvan.git
cd Sylvan
# Verify LFS files are downloaded (not pointers)
git lfs pull
ls -la toydata/ # Files should be > 200 bytes
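If you want a more direct check that the LFS objects were fetched, here is a minimal sketch using standard Git LFS behavior (the genome path matches the toy data layout above):

```bash
# A fetched LFS object is real gzip data; an unfetched pointer is a small ASCII text stub
file toydata/genome_input/genome.fasta.gz

# git lfs ls-files marks fetched files with '*' and pointer-only files with '-'
git lfs ls-files | head
```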
Note: The toy data includes a pre-computed repeat library. This step is for reference or if you need to regenerate it.
sbatch -c 16 --mem=68g --wrap="singularity exec --cleanenv --env PYTHONNOUSERSITE=1 \
EDTA.sif EDTA.pl \
--genome toydata/genome_input/genome.fasta \
--cds toydata/cds_aa/neighbor.cds \
--anno 1 --threads 16 --force 1"
Expected runtime: ~1.5 hours for 18.6 Mb genome
Output: genome.fasta.mod.EDTA.TElib.fa
| Stage | Duration |
|---|---|
| LTR detection | ~3 min |
| SINE detection | ~6 min |
| LINE detection | ~45 min |
| TIR detection | ~8 min |
| Helitron detection | ~10 min |
| Filtering & annotation | ~8 min |
| Total | ~1.5 hours |
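As a quick sanity check after EDTA finishes (a minimal sketch; the file name follows the output listed above), confirm the repeat library exists and contains consensus sequences:

```bash
# Count TE consensus sequences in the EDTA library; a missing file or a count of zero indicates a failed run
grep -c ">" genome.fasta.mod.EDTA.TElib.fa
```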
The pipeline uses several environment variables for configuration:
| Variable | Description | Example |
|---|---|---|
| `SYLVAN_CONFIG` | Path to pipeline config file | `toydata/config/config_annotate.yml` |
| `SYLVAN_CLUSTER_CONFIG` | Path to SLURM cluster config (optional, auto-derived from `SYLVAN_CONFIG`) | `toydata/config/cluster_annotate.yml` |
| `TMPDIR` | Temporary directory for intermediate files | `$(pwd)/results/TMP` |
| `SLURM_TMPDIR` | SLURM job temporary directory (should match `TMPDIR`) | `$TMPDIR` |
Why set TMPDIR?
Setting TMPDIR explicitly is critical on many HPC systems:
- Memory-backed `/tmp` (tmpfs): Some HPC nodes have no local disk storage. The default `/tmp` is mounted as tmpfs, which stores files in RAM. Large temporary files from tools like STAR, RepeatMasker, or Augustus can quickly exhaust memory and crash your jobs with cryptic "out of memory", "no space left on device", or even segmentation fault errors, since the OS may kill processes or corrupt memory when tmpfs fills up.
- Quota limits: Shared `/tmp` partitions often have strict per-user quotas (e.g., 1-10 GB). Genome annotation tools easily exceed this.
- Job isolation: When `TMPDIR` points to your project directory, temp files persist after job completion for debugging. Cleanup is also straightforward with `rm -rf results/TMP/*`.
- Singularity compatibility: Containers inherit `TMPDIR` from the host. Setting it to a bound path ensures temp files are written to accessible storage.
Tip: Check if your cluster uses tmpfs with `df -h /tmp`. If it shows `tmpfs` as the filesystem type, you must set `TMPDIR` to avoid memory issues.
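A minimal sketch of automating that check (assuming GNU coreutils `df`); the copy-paste block below simply sets `TMPDIR` unconditionally, which is the easiest option:

```bash
# Redirect TMPDIR to project storage only when /tmp is memory-backed (tmpfs)
if [ "$(df --output=fstype /tmp | tail -n 1)" = "tmpfs" ]; then
    mkdir -p "$(pwd)/results/TMP"
    export TMPDIR="$(pwd)/results/TMP"
    export SLURM_TMPDIR="$TMPDIR"
fi
```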
Copy this block before running the pipeline:
# Required: Pipeline configuration
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
# Required: Temp directory (create if not exists)
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"
# Optional: Bind additional paths for Singularity
# export SINGULARITY_BIND="/scratch,/data"
# Optional: Increase open file limit (some tools need this)
ulimit -n 65535 2>/dev/null || true
| Variable | Description | When to Use |
|---|---|---|
| `SINGULARITY_BIND` | Bind additional host paths into container | When input files are outside working directory |
| `SINGULARITY_CACHEDIR` | Location for Singularity cache | When home directory has quota limits |
| `SINGULARITY_TMPDIR` | Singularity's internal temp directory | Should match `TMPDIR` |
When do you need SINGULARITY_BIND?
Singularity automatically binds your current working directory, `$HOME`, and `/tmp`. However, you need explicit binding when:
- Input files (genome, RNA-seq, proteins) are on a separate filesystem (e.g., `/scratch`, `/project`, `/data`)
- Your home directory has quota limits and you store data elsewhere
- Using shared lab storage mounted at non-standard paths
- Config file references absolute paths outside the working directory
# Common scenarios:
# 1. Data on scratch space
export SINGULARITY_BIND="/scratch/$USER"
# 2. Multiple paths (comma-separated)
export SINGULARITY_BIND="/scratch,/project/shared_data,/data/genomes"
# 3. Bind with different container path (host:container)
export SINGULARITY_BIND="/long/path/on/host:/data"
# 4. Read-only binding (for shared reference data)
export SINGULARITY_BIND="/shared/databases:/databases:ro"Diagnosing bind issues:
# Error: "file not found" inside container but exists on host
# → The path isn't bound. Add it to SINGULARITY_BIND
# Test if path is accessible inside container:
singularity exec sylvan.sif ls /your/data/path
# See what's currently bound:
singularity exec sylvan.sif cat /proc/mounts | grep -E "scratch|project|data"
Before submitting jobs, verify your setup:
# Check TMPDIR is on real disk (not tmpfs)
df -h $TMPDIR
# Verify Singularity can access paths
singularity exec sylvan.sif ls $TMPDIR
# Test config file is readable
cat $SYLVAN_CONFIG | head -5
When jobs fail unexpectedly, environment variables are often the culprit. Use these techniques to diagnose:
1. Print all relevant variables:
# Add to your script or run interactively
echo "=== Environment Check ==="
echo "SYLVAN_CONFIG: $SYLVAN_CONFIG"
echo "TMPDIR: $TMPDIR"
echo "SLURM_TMPDIR: $SLURM_TMPDIR"
echo "SINGULARITY_BIND: $SINGULARITY_BIND"
echo "PWD: $PWD"
df -h $TMPDIR
2. Check what SLURM jobs actually see:
# Submit a diagnostic job
sbatch --wrap='env | grep -E "TMPDIR|SINGULARITY|SYLVAN" && df -h /tmp $TMPDIR'
3. Common environment-related errors:
| Error Message | Likely Cause | Solution |
|---|---|---|
| `No space left on device` | TMPDIR on tmpfs or quota exceeded | Set TMPDIR to project storage |
| `Segmentation fault` | Memory exhausted (tmpfs full) | Set TMPDIR to disk-backed storage |
| `file not found` (in container) | Path not bound in Singularity | Add path to `SINGULARITY_BIND` |
| `Permission denied` | Singularity can't write to TMPDIR | Check directory permissions, ensure path is bound |
| `cannot create temp file` | TMPDIR doesn't exist or not writable | Run `mkdir -p $TMPDIR && touch $TMPDIR/test` |
4. Interactive debugging inside container:
# Start interactive shell in container with same bindings
singularity shell --cleanenv sylvan.sif
# Inside container, verify paths exist
ls -la $TMPDIR
ls -la /path/to/your/data
5. Check if variables survive into SLURM jobs:
SLURM doesn't always pass environment variables. Ensure your submit script exports them:
# In your sbatch script or wrapper
#SBATCH --export=ALL # Pass all environment variables
# Or explicitly export in the script:
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"
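Putting the pieces together, a minimal submit-script sketch (assuming the toy-data paths and the `bin/annotate_toydata.sh` wrapper used in this tutorial; adjust account and partition for your cluster):

```bash
#!/bin/bash
#SBATCH -A [account]
#SBATCH -p [partition]
#SBATCH -c 1
#SBATCH --mem=1g
#SBATCH -J annotate
#SBATCH -o annotate.out
#SBATCH -e annotate.err
#SBATCH --export=ALL            # pass the environment into the job

# Pipeline configuration and disk-backed temp directory
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
mkdir -p results/TMP
export TMPDIR="$(pwd)/results/TMP"
export SLURM_TMPDIR="$TMPDIR"

# Launch the annotation wrapper (the same script Step 4 submits with --wrap)
./bin/annotate_toydata.sh
```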
Always do a dry run first to verify configuration:
export SYLVAN_CONFIG="toydata/config/config_annotate.yml"
snakemake -n --snakefile bin/Snakefile_annotate
sbatch -A [account] -p [partition] -c 1 --mem=1g \
-J annotate -o annotate.out -e annotate.err \
--wrap="./bin/annotate_toydata.sh"Output locations
- All intermediate/final results are written under the repo root `results/`.
- RepeatMasker/RepeatModeler run inside `results/GETA/RepeatMasker/...`, so `.RepeatMaskerCache` and `RM_*` temp folders also stay there.
- EVM commands and outputs live in `results/EVM/`; no `EVM -> results/EVM` symlink is needed.
- For the filter pipeline, set `RexDB` to a RepeatExplorer protein DB (e.g., Viridiplantae_v4.0.fasta from https://github.com/repeatexplorer/rexdb). You can download it directly via:
wget -O toydata/misc/Viridiplantae_v4.0.fasta https://raw.githubusercontent.com/repeatexplorer/rexdb/refs/heads/main/Viridiplantae_v4.0.fasta
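A quick check that the download succeeded (a raw GitHub fetch can silently return an error page): the first line should be a FASTA header and the entry count should be non-zero, with the exact count depending on the RexDB release:

```bash
head -n 1 toydata/misc/Viridiplantae_v4.0.fasta   # should start with '>'
grep -c ">" toydata/misc/Viridiplantae_v4.0.fasta # number of protein entries
```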
| Stage | Time |
|---|---|
| RepeatMasking | 15-30 min |
| RNA-seq alignment | 30-60 min |
| Transcript assembly | 20-40 min |
| Homology search | 30-60 min |
| Augustus training | 1-2 hours |
| Gene model combination | 30-60 min |
| EvidenceModeler | 30-60 min |
| Total | 4-8 hours |
# Rerun all jobs
./bin/annotate_toydata.sh --forceall
# Rerun specific rule
./bin/annotate_toydata.sh --forcerun helixer
# Rerun incomplete jobs
./bin/rerun-incomplete_toydata.sh
After annotation completes:
sbatch -A [account] -p [partition] -c 1 --mem=4g \
-J filter -o filter.out -e filter.err \
--wrap="./bin/filter.sh"Output: results/FILTER/filter.gff3
singularity exec sylvan.sif python bin/TidyGFF.py \
Ath4 results/FILTER/filter.gff3 \
--out Ath4_v1.0 \
--splice-name t \
--justify 5 \
--sort \
--chrom-regex "^Chr" \
--source Sylvan
squeue -u $USER
# Snakemake log
tail -f .snakemake/log/*.snakemake.log
# Find recent error logs
ls -lt results/logs/*.err | head -10
# Search for errors
grep -l 'Error\|Traceback' results/logs/*.err
# View specific log (pattern: {rule}_{wildcards}.err)
cat results/logs/liftoff_.err
cat results/logs/geneRegion2Genewise_seqid=group17400.err
| Issue | Solution |
|---|---|
| Out of memory | Increase memory in `cluster_annotate.yml` |
| LFS files are pointers | Run `git lfs pull` |
| Singularity bind error | Use paths within working directory |
| Augustus training fails | Need minimum 500 training genes |
The toy dataset contains Arabidopsis thaliana chromosome 4 split into 3 segments (~18.6 Mb total).
toydata/
├── config/ # Configuration files
│ ├── config_annotate.yml # Pipeline config (pre-configured)
│ ├── cluster_annotate.yml # SLURM resource config
│ └── evm_weights.txt # EVM weights
├── genome_input/
│ └── genome.fasta.gz # A. thaliana Chr4 (3 parts)
├── RNASeq/ # 12 paired-end RNA-seq samples
│ ├── sub_SRR1019221_1.fastq.gz
│ └── ...
├── neighbor_genome/ # Neighbor species genomes
│ ├── aly4.fasta # A. lyrata
│ ├── ath4.fasta # A. thaliana
│ ├── chi4.fasta # C. hirsuta
│ └── cru4.fasta # C. rubella
├── neighbor_gff3/ # Neighbor annotations
│ ├── aly4.gff3
│ └── ...
├── cds_aa/
│ ├── neighbor.cds # Combined CDS for EDTA
│ └── neighbor.aa # Proteins for homology
└── EDTA/ # Pre-computed repeat library
└── genome.fasta.mod.EDTA.TElib.fa
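To confirm the toy genome on your system matches the segment statistics below, a small sketch assuming `seqkit` is installed (it reads gzipped FASTA directly):

```bash
# Print per-segment name, length, and GC content (compare with the table below)
seqkit fx2tab --name --length --gc toydata/genome_input/genome.fasta.gz
```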
| Segment | Length | GC (%) |
|---|---|---|
| Chr4_1 | 6,195,060 bp | 36.66 |
| Chr4_2 | 6,195,060 bp | 35.24 |
| Chr4_3 | 6,195,018 bp | 36.69 |
| Total | 18,585,138 bp | 36.20 |
For comparison, the official TAIR10 annotation of Arabidopsis thaliana (Col-0) chromosome 4 contains:
| Feature Type | Count |
|---|---|
| Protein-coding genes | 4,124 |
| pre-tRNA genes | 79 |
| rRNA genes | 0 |
| snRNA genes | 0 |
| snoRNA genes | 11 |
| miRNA genes | 28 |
| Other RNA genes | 62 |
| Pseudogenes | 121 |
| Transposable element (TE) genes | 711 |
| Total annotated loci | 5,410 |
Note: TAIR10 is the current reference annotation standard for A. thaliana. The protein-coding gene count increased from 3,744 (original Chr4 paper) to 4,124 as annotation methods improved.
Running the Sylvan pipeline on the Chr4 toy dataset produces the following results:
Annotation Phase (complete_draft.gff3):
| Metric | Count |
|---|---|
| Total genes | 5,720 |
| Total mRNA | 5,800 |
Filter Phase (filter.gff3):
| Metric | Count |
|---|---|
| Genes kept | 3,756 |
| mRNA kept | 3,834 |
| Genes discarded | 1,964 |
| mRNA discarded | 1,966 |
Output files:
- `results/FILTER/filter.gff3` - Filtered gene models
- `results/FILTER/discard.gff3` - Discarded gene models
- `results/FILTER/keep_data.tsv` - Evidence data for kept genes
- `results/FILTER/discard_data.tsv` - Evidence data for discarded genes
- `results/complete_draft.gff3.map` - ID mapping between original and new IDs
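To reproduce the counts above from your own run, the feature types in GFF3 column 3 can be tallied directly (a sketch assuming standard gene/mRNA feature rows):

```bash
# Count kept and discarded models straight from the filter outputs
awk '$3 == "gene"' results/FILTER/filter.gff3  | wc -l   # genes kept
awk '$3 == "mRNA"' results/FILTER/filter.gff3  | wc -l   # mRNA kept
awk '$3 == "gene"' results/FILTER/discard.gff3 | wc -l   # genes discarded
awk '$3 == "mRNA"' results/FILTER/discard.gff3 | wc -l   # mRNA discarded
```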
Comparison: The 3,756 kept genes represent ~91% of TAIR10's 4,124 protein-coding genes on Chr4. The higher initial count (5,720) includes transposable elements (TAIR10 has 711 TE genes) and low-confidence predictions that are filtered out.
| Code | Species | Common Name |
|---|---|---|
| aly4 | Arabidopsis lyrata | Lyrate rockcress |
| cru4 | Capsella rubella | Pink shepherd's purse |
| chi4 | Cardamine hirsuta | Hairy bittercress |
| SRA | Tissue | Size |
|---|---|---|
| SRR1019221 | Leaf (14-day) | 4.6 Gb |
| SRR1105822 | Rosette (19-day) | 3.1 Gb |
| SRR1105823 | Rosette (19-day) | 4.5 Gb |
| SRR1106559 | Rosette (19-day) | 3.6 Gb |
| SRR446027 | Whole plant | 2.5 Gb |
| SRR446028 | Whole plant | 5.4 Gb |
| SRR446033 | Whole plant | 5.3 Gb |
| SRR446034 | Whole plant | 5.4 Gb |
| SRR446039 | Whole plant | 2.5 Gb |
| SRR446040 | Whole plant | 6.3 Gb |
| SRR764885 | Leaf (4-week) | 4.6 Gb |
| SRR934391 | Whole plant | 4.0 Gb |
1. Genome segmentation
seqkit split2 -p 3 Chr4.fasta
2. Neighbor CDS extraction
Syntenic regions were identified using MCscan/jcvi:
python -m jcvi.compara.catalog ortholog ath aly --no_strip_names
python -m jcvi.compara.synteny mcscan ath.bed ath.aly.lifted.anchors --iter=1
3. RNA-seq subsetting
Reads mapping to Chr4 were extracted:
STAR --genomeDir star_index --readFilesIn sample_1.fq.gz sample_2.fq.gz
samtools view -b -F 4 Aligned.bam | samtools fastq -1 out_1.fq -2 out_2.fq -
The toy data was tested on:
| Specification | Value |
|---|---|
| Nodes | 4 |
| Total CPUs | 256 |
| CPU | Intel Xeon E5-2683 v4 @ 2.10GHz |
| Cores per node | 64 (2 sockets × 16 cores × 2 threads) |
| Memory per node | 256 GB |
| Storage | GPFS |
The following summarizes the runtime distribution across all pipeline rules when running on the toy dataset.
Key observations:
- Most time-consuming steps: `aggregate_CombineGeneModels` (~50,000s), `geneRegion2Genewise` (~1,000s), and `Sam2Transfrag` (~100-200s) are the bottlenecks
- Fast steps: Most preprocessing and formatting rules complete in under 10 seconds
- Parallelizable rules: Rules like `geneRegion2Genewise`, `gmapExon`, and `STAR_paired` run as multiple parallel jobs (shown as multiple dots), significantly reducing wall-clock time
- GPU-accelerated: `helixer` benefits from GPU acceleration when available
Runtime variability:
Actual runtime will vary significantly depending on your hardware and cluster configuration:
| Factor | Impact |
|---|---|
| CPU speed | Faster clock speeds reduce single-threaded bottlenecks |
| Available nodes | More nodes = more parallel jobs = faster wall-clock time |
| Memory per node | Insufficient memory causes job failures or swapping |
| Storage I/O | GPFS/Lustre faster than NFS; SSD faster than HDD |
| Queue wait time | Busy clusters add significant delays between jobs |
| GPU availability | Helixer runs ~10x faster with GPU acceleration |
With the test environment above (4 nodes, 256 CPUs, 256 GB/node), the toy dataset completes in 4-8 hours wall-clock time. On smaller clusters or shared resources, expect longer runtimes.
- Issues: https://github.com/plantgenomicslab/Sylvan/issues
- See also: README.md for configuration reference
python bin/generate_cluster_from_config.py --config config/config_annotate.yml --out config/cluster_annotate.yml --account cpu-s1-pgl-0 --partition cpu-s1-pgl-0
(To regenerate toydata cluster config, point --config/--out to toydata paths.)
python bin/generate_cluster_from_config.py --config toydata/config/config_annotate.yml --out toydata/config/cluster_annotate.yml --account cpu-s1-pgl-0 --partition cpu-s1-pgl-0
chmod 775 bin/generate_cluster_from_config.py
After finishing the filter phase you will have FILTER/data.tsv (the feature
matrix used by Filter.py) and a BUSCO run directory such as
results/busco/eudicots_odb10. Reviewers often ask for a feature ablation
study, so we provide an automated helper:
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/<lineage>/full_table.tsv \
--output-table FILTER/feature_importance.tsv
- What is the BUSCO full table? Every BUSCO run writes a `full_table.tsv` inside its lineage-specific run folder. Each non-Missing BUSCO row lists the BUSCO ID, status (Complete/Duplicated/Fragmented), and the transcript/gene ID it matched. The feature-importance script reuses this file to count how many BUSCOs remain in the "keep" set during each iteration; no new BUSCO analysis is required.
- Outputs: `FILTER/feature_importance.tsv` (table) plus `FILTER/feature_importance.json` (machine-readable). Both include the baseline run (all features) and each leave-one-feature-out run, along with final out-of-bag (OOB) error, BUSCO counts, and iteration counts.
- Optional flags:
  - `--features TPM COVERAGE PFAM ...` restricts the analysis to specific columns from `FILTER/data.tsv`.
  - `--ignore TPM_missing singleExon` removes metadata columns so the script automatically uses every other feature column.
- Workflow summary:
  - Run `Filter.py` as usual to create `FILTER/data.tsv`.
  - Identify the BUSCO `full_table.tsv` path you already used for filter monitoring (e.g., `results/busco/eudicots_odb10/full_table.tsv`).
  - Execute the command above. Inspect `FILTER/feature_importance.tsv` to see how dropping each feature affects OOB error (positive delta ⇒ feature is important).
  - Incorporate the results (table/plot) into your manuscript or reviewer response.
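For reference, the optional flags can be used like this (column names are the examples given above; substitute the headers present in your own FILTER/data.tsv):

```bash
# Option A: ablate only selected evidence columns
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/eudicots_odb10/full_table.tsv \
    --output-table FILTER/feature_importance.tsv \
    --features TPM COVERAGE PFAM

# Option B: use every feature column except the listed metadata columns
python bin/filter_feature_importance.py FILTER/data.tsv results/busco/eudicots_odb10/full_table.tsv \
    --output-table FILTER/feature_importance.tsv \
    --ignore TPM_missing singleExon
```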