FAQ
- Installation & Setup
- Input Data & File Formats
- Database Setup & Configuration
- Running the Pipeline
- Parameter Optimization
- Troubleshooting & Error Messages
- Output Interpretation
- System Requirements & Performance
A: EukFinder has substantial computational requirements:
Minimum:
- Disk Space: 200 GB total (Centrifuge DB: 70-100 GB, PLAST DB: 2-15 GB, temporary files: 5-50 GB per sample)
- RAM: 32 GB
- Threads: 10-20 recommended
Recommended:
- Disk Space: 300+ GB
- RAM: 64-128 GB (especially for metaSPAdes assembly and Centrifuge classification)
- Threads: 20-40+
Special Note: metaSPAdes and Centrifuge are particularly memory-intensive. Consider allocating extra memory for assembly steps.
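The tiers above can be checked from a script before a run is launched. The following is a minimal sketch, not part of EukFinder: the helper names `ram_status` and `total_ram_gb` are hypothetical, and the thresholds are simply the 32 GB and 64 GB figures listed above.

```shell
# Classify a RAM figure (in whole GB) against the tiers listed above:
# 32 GB minimum, 64 GB or more recommended.
ram_status() {
    gb=$1
    if [ "$gb" -lt 32 ]; then
        echo "below-minimum"
    elif [ "$gb" -lt 64 ]; then
        echo "minimum"
    else
        echo "recommended"
    fi
}

# Read total RAM from /proc/meminfo (Linux only) and convert kB to whole GB.
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Example (not executed here):
# ram_status "$(total_ram_gb)"
```

Note that MemTotal is usually slightly below the installed amount, so a machine sold as 32 GB may be reported as 31 GB by this check.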
A: Use conda for the easiest installation:
conda create -n eukfinder -c bioconda eukfinder
conda activate eukfinder
This automatically handles all dependencies (python 3.12.8, numpy, pandas, joblib, spades, trimmomatic, centrifuge, bowtie2, plast, etc.).
Manual Installation: Not recommended due to version compatibility issues, but possible if you install each dependency individually.
Important: macOS and other non-Linux operating systems are not explicitly supported by the developers.
A: If you encounter version conflicts:
- Create a fresh conda environment
- Install EukFinder from bioconda (preferred over pip)
- Verify installation with eukfinder -h
- Test with the provided test dataset before processing real data
A: Currently, unpaired reads are a required input for the short_seqs workflow. However, if you only have paired-end data, here are two workarounds:
Option 1: Create a dummy unpaired reads file
cat > dummy_unpaired.fastq << 'EOF'
@dummy_read_1
NNNNNNNNNN
+
IIIIIIIIII
@dummy_read_2
NNNNNNNNNN
+
IIIIIIIIII
EOF
gzip dummy_unpaired.fastq
eukfinder short_seqs --r1 S1_1.fastq.gz --r2 S1_2.fastq.gz --un dummy_unpaired.fastq.gz -o S1 ...
Option 2: Extract a subset from your R1 file
zcat paired_R1.fastq.gz | head -1000 > temp_unpaired.fastq
gzip temp_unpaired.fastq
eukfinder short_seqs --r1 paired_R1.fastq.gz --r2 paired_R2.fastq.gz --un temp_unpaired.fastq.gz -o output ...
The unpaired reads file doesn't need to be large; it's processed through the pipeline but typically contributes minimally to the final results.
A:
For read_prep and short_seqs:
- FASTQ format (.fastq or .fq extension)
- Important: Does NOT accept gzipped (.gz) files for raw reads
- Quality encoding: auto-detected or can be specified with --qenc
For long_seqs:
- FASTA format (.fasta or .fa extension)
- Does NOT accept gzipped files
- Input should be uncompressed single-line fasta files
Note: If you have gzipped files, decompress them first:
gunzip your_file.fastq.gz
A: The --taxonomy-update flag controls ETE3 taxonomy database updates:
- First time running EukFinder: Set to True (-t T)
  - Downloads and caches the NCBI taxonomy database (~200 MB)
  - Only needed once per installation
- Subsequent runs: Set to False (-t F)
  - Uses the cached taxonomy database
  - Faster, as it skips the download step
Example:
# First run
eukfinder short_seqs ... -t T
# Later runs
eukfinder short_seqs ... -t F
A: EukFinder provides pre-built databases. Download them before running:
# Create directory for databases
mkdir ~/eukfinder_db
cd ~/eukfinder_db
# Download individual databases (or all at once)
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/centrifuge_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/PlastDB_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/TrueSeq2_NexteraSE-PE.fa.tar.gz
# Extract
tar -xzf centrifuge_db.tar.gz
tar -xzf PlastDB_db.tar.gz
tar -xzf TrueSeq2_NexteraSE-PE.fa.tar.gz
Database Sizes:
- Centrifuge: 70-100 GB
- PLAST: 2-15 GB
- Human genome (optional): 0.92 GB
- Illumina adapters: 2.4 KB
- Total: ~150-200 GB
A:
Issue: Temporary space needed during extraction often exceeds final database size. Typical flow requires ~300 GB total space.
Solutions:
- Use external/additional storage: If on an HPC cluster, download to a larger RDS storage location:
# Instead of home directory
cd /mnt/rds/larger_storage/eukfinder_dbs
# Download and extract there
- Download incrementally: Only download databases you need:
  - Essential: Centrifuge database
  - Recommended: PLAST database
  - Optional: Human genome (for decontamination only)
- Extract and delete: Remove tar files after extraction to free space:
tar -xzf centrifuge_db.tar.gz && rm centrifuge_db.tar.gz
- Monitor space: Use df -h and du -sh to track usage during downloads
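The space check can also be done programmatically before any download starts. This is a rough sketch under the assumption that `df -P` is available; `free_gb` and `has_space` are hypothetical helper names, and the 300 GB figure in the example is the typical total quoted above.

```shell
# Print free space (whole GB) on the filesystem that holds directory $1.
# df -P prints a header plus one line whose 4th column is available 1K blocks.
free_gb() {
    df -Pk "$1" | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Succeed only if directory $1 has at least $2 GB free.
has_space() {
    [ "$(free_gb "$1")" -ge "$2" ]
}

# Example (not executed here): stop early without ~300 GB free.
# has_space /mnt/rds/larger_storage/eukfinder_dbs 300 || echo "not enough space" >&2
```

On cluster filesystems with per-user quotas, `df` may overstate what you can actually write, so treat the result as an upper bound.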
A: Critical for success: Always use the common basename prefix of Centrifuge files, NOT the full path to individual files.
Correct:
eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020
Incorrect:
# These will fail:
eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020.1.cf
eukfinder short_seqs ... --cdb /path/to/Centrifuge_NewDB_Sept2020.cf
The basename should resolve to exactly 4 files:
Centrifuge_NewDB_Sept2020.1.cf
Centrifuge_NewDB_Sept2020.2.cf
Centrifuge_NewDB_Sept2020.3.cf
Centrifuge_NewDB_Sept2020.4.cf
Verify with:
ls -la /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020* | wc -l
# Should return: 4
A: Yes. EukFinder supports custom Centrifuge and PLAST databases. See the wiki pages:
- Centrifuge DB: Use centrifuge-build with custom genomes
- PLAST DB: Protein sequences with corresponding taxonomy maps
Detailed instructions are in the Building_custom_DB/ folder of the repository, including scripts to:
- Download specific genomes from NCBI
- Build taxonomy maps
- Construct the database indices
A: EukFinder requires two sequential steps:
Step 1: Read Preparation
Removes adapters, low-quality reads, and optionally host DNA contamination. There are two variants:
Option A: read_prep (for samples with host DNA contamination)
Use when your samples contain host-derived sequences that need to be removed (e.g., clinical samples, cultured samples):
eukfinder read_prep \
--r1 raw_R1.fastq --r2 raw_R2.fastq \
-n 20 -i TrueSeq2_NexteraSE-PE.fa \
--hcrop 0 --qscore 20 --mlen 60 \
--hg host_genome.fasta \
-o sample_name
Option B: read_prep_env (for environmental samples without host contamination)
Use for environmental metagenomes where you don't need to remove host DNA (faster, skips bowtie2 mapping):
eukfinder read_prep_env \
--r1 raw_R1.fastq --r2 raw_R2.fastq \
-n 20 -i TrueSeq2_NexteraSE-PE.fa \
--hcrop 0 --qscore 20 --mlen 60 \
-o sample_name
Key difference: read_prep_env does NOT require the --hg (host genome) parameter and skips the bowtie2 contamination removal step, making it faster for samples that don't have host contamination.
Step 2: Sequence Classification (short_seqs)
Classifies reads, assembles, and identifies eukaryotic sequences:
eukfinder short_seqs \
--r1 sample_R1PT.fq --r2 sample_R2PT.fq \
--un sample_R1unPT.fq \
--pclass sample_centrifuge_P \
--uclass sample_centrifuge_UP \
-o sample -n 20 -z 2 -t F \
--mhlen 50 -e 0.05 --pid 60 --cov 30 --max_m 300
Important: Do NOT skip the read prep step. The short_seqs step requires the processed files from Step 1. Choose between read_prep or read_prep_env based on your sample type.
A: Use the long_seqs workflow for PacBio, Nanopore, or pre-assembled contigs:
eukfinder long_seqs \
-l assembled_contigs.fasta \
-o output_prefix \
--mhlen 50 \
-n 20 -z 2 -t T \
-e 0.05 --pid 60 --cov 30
Key differences from short_seqs:
- Input is FASTA (not FASTQ)
- Single classification round (no assembly step)
- No need for paired/unpaired reads
- Faster, as assembly is skipped
A: Both remove adapters and perform quality filtering, but they differ in host contamination handling:
read_prep (Standard, for samples with host DNA):
- Requires: --hg host_genome.fasta parameter
- Removes host-derived reads using bowtie2 mapping
- Use for: Clinical samples, cultured samples, or any sample where host contamination is present
- Slower: Includes bowtie2 host removal step
- Example:
eukfinder read_prep --r1 R1.fastq --r2 R2.fastq \
-n 20 -i TrueSeq2_NexteraSE-PE.fa \
--hg human_genome.fasta -o sample
read_prep_env (For environmental metagenomes):
- Does NOT require: --hg (host genome) parameter
- Skips bowtie2 host removal step
- Use for: Environmental samples (water, soil, sediment, etc.) where no host contamination is expected
- Faster: More efficient for pure environmental samples
- Example:
eukfinder read_prep_env --r1 R1.fastq --r2 R2.fastq \
-n 20 -i TrueSeq2_NexteraSE-PE.fa -o sample
Which should you use?
- Use read_prep if your samples may contain host DNA (e.g., human gut samples, fecal samples, cultured organisms)
- Use read_prep_env if your samples are from pure environmental sources without expected host contamination (saves time, same parameters otherwise)
Important: The output files from either workflow can be used with short_seqs identically. The choice only affects the preprocessing speed and efficiency.
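In a batch script that handles both sample types, the choice between the two workflows can be reduced to whether a host genome path is set. The sketch below is only an illustration: `prep_subcommand` and the `HOST_GENOME` variable are hypothetical names, and the flags in the usage comment are the ones shown in the examples above.

```shell
# Print the read-preparation subcommand for a sample.
# $1 = path to a host genome FASTA, or an empty string for environmental samples.
prep_subcommand() {
    if [ -n "$1" ]; then
        # Host genome supplied: use read_prep and pass it via --hg.
        printf 'read_prep --hg %s\n' "$1"
    else
        # No host genome: use the environmental variant, which skips bowtie2.
        printf 'read_prep_env\n'
    fi
}

# Example (not executed here; relies on word splitting, so the host genome
# path must not contain spaces):
# eukfinder $(prep_subcommand "$HOST_GENOME") --r1 R1.fastq --r2 R2.fastq \
#     -n 20 -i TrueSeq2_NexteraSE-PE.fa -o sample
```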
A: Important: EukFinder writes all output to the current working directory.
- Ensure the directory is writable
- Avoid running from read-only locations
- On HPC clusters, run from a project/scratch directory, not home directory
- Example:
cd /scratch/myproject/sample1
eukfinder short_seqs --r1 ... --r2 ... --un ... -o sample1
A: --mhlen controls the minimum sequence length required for Centrifuge classification matches.
- Default: 50 bp (works well for most datasets)
- Purpose: Filters out spurious short matches
- Lower values (e.g., 30-40): More sensitive, catches shorter reads, but higher false-positive rate
- Higher values (e.g., 60-100): More stringent, fewer false positives, but may miss real eukaryotic sequences
Important: Do not set excessively high values (>100) without testing, as this can cause unexpected downstream failures.
Recommendation: Start with --mhlen 50 and adjust based on your results.
A: EukFinder offers two parameter profiles:
"Strict" mode (high precision, lower recall):
eukfinder short_seqs ... --mhlen 50 -e 0.01 --pid 60 --cov 30
# Use the "Euk" output file only
- Captures ~95% of real eukaryotic reads
- High false-positive rate (~90%)
- Good if you want high confidence in identifications
"Lenient" mode (balanced precision/recall):
eukfinder short_seqs ... --mhlen 30 -e 0.05 --pid 50 --cov 10
# Use the "EUnk" output file (Euk + Unknown)
- Better balance between precision and recall
- Useful if you need comprehensive recovery
- Follow up with binning workflows to filter contaminants
Parameters:
- -e: E-value threshold (lower = more stringent)
- --pid: Percent identity (higher = more stringent)
- --cov: Coverage threshold (higher = more stringent)
A:
-n (number of threads):
- Controls parallelization
- Default: 20
- Set based on available CPU cores (don't exceed available cores)
- Higher values speed up processing but increase memory usage
- Recommendation: Use available cores on your HPC cluster
-z (number of chunks):
- Splits input for PLAST processing
- Default: 2
- Higher values reduce memory per chunk but increase overhead
- Use 2-4 for most datasets
- For very large samples (>100M reads), try 4-8
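The guidance for both flags can be sketched as two small helpers. These are hypothetical functions, not EukFinder options: `suggest_chunks` picks the lower end of each range given above (2 normally, 4 above 100M reads), and `cap_threads` keeps a requested -n value from exceeding the available cores.

```shell
# Suggest a -z value from the read count in millions:
# 2 for typical samples, 4 once the input exceeds 100M reads.
suggest_chunks() {
    if [ "$1" -gt 100 ]; then
        echo 4
    else
        echo 2
    fi
}

# Cap a requested thread count ($1) at the number of available cores ($2).
cap_threads() {
    requested=$1
    available=$2
    if [ "$requested" -gt "$available" ]; then
        echo "$available"
    else
        echo "$requested"
    fi
}

# Example (not executed here; nproc is part of GNU coreutils):
# eukfinder short_seqs ... -n "$(cap_threads 40 "$(nproc)")" -z "$(suggest_chunks 150)"
```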
Example for HPC with 40 cores:
eukfinder short_seqs ... -n 40 -z 4
A: --max_m controls the RAM (in GB) allocated to metaSPAdes assembly.
- Default: 300 GB
- Typical range: 100-400 GB
- Your system: Set to ~80% of available RAM
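The 80% rule is simple integer arithmetic. A minimal sketch follows; `max_m_for` is a hypothetical helper name, and on a cluster the input should be the memory granted to the job rather than the node total.

```shell
# Print 80% of a RAM figure given in whole GB, rounded down.
max_m_for() {
    echo $(( $1 * 80 / 100 ))
}

# Example (not executed here):
# eukfinder short_seqs ... --max_m "$(max_m_for 128)"
```

For a 128 GB system this gives 102, which you may prefer to round down further to leave headroom for the other pipeline steps.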
Example for 128 GB system:
eukfinder short_seqs ... --max_m 100
The assembly step is memory-intensive. If it crashes, reduce --max_m or reduce input size.
A: Centrifuge database path is incorrect or incomplete.
Causes:
- Wrong database prefix (see Q9)
- Database not fully extracted
- Database files corrupted or incomplete
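To tell these causes apart, it can help to test each expected index file individually instead of counting matches. The helper below is a sketch only: `check_cdb_prefix` is a hypothetical name, and it assumes the four-file `.1.cf` to `.4.cf` layout described in this FAQ.

```shell
# Report which of the four expected Centrifuge index files are missing or
# empty for the prefix given in $1. Returns non-zero if any problem is found.
check_cdb_prefix() {
    prefix=$1
    status=0
    for i in 1 2 3 4; do
        f="${prefix}.${i}.cf"
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            status=1
        elif [ ! -s "$f" ]; then
            echo "empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Example (not executed here):
# check_cdb_prefix /path/to/database/prefix && eukfinder short_seqs ... --cdb /path/to/database/prefix
```

Passing a full file name such as `prefix.1.cf` makes every lookup fail, which is the same mistake that breaks `--cdb`. A zero exit status does not rule out corruption inside a file; re-extracting the archive is still the fix for that case.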
Solution:
# Verify all 4 database files exist
ls -la /path/to/database/prefix*
# Should show 4 files with .1.cf, .2.cf, .3.cf, .4.cf extensions
# If not, re-extract the database
# Then verify the path is correct
eukfinder short_seqs ... --cdb /path/to/database/prefix
A: Common causes:
- Centrifuge database issue: See Q18
- Unpaired reads file missing or empty: Provide a valid --un file
- Temporary directory permissions: Ensure write permissions in current directory
- Memory exceeded: Reduce --max_m or increase available system RAM
Debugging:
- Check the log file (usually Class_DATE.log)
- Verify all input files exist and are readable
- Ensure sufficient disk space for temporary files
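The file checks in this list are easy to run up front, before a long job is queued. The following is a minimal sketch; `check_inputs` is a hypothetical helper and only tests that each file is readable and non-empty.

```shell
# Check that every file given as an argument exists, is readable, and is
# non-empty. Problems go to stderr; the return code is non-zero on any failure.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            echo "missing or not readable: $f" >&2
            status=1
        elif [ ! -s "$f" ]; then
            echo "empty: $f" >&2
            status=1
        fi
    done
    return $status
}

# Example (not executed here):
# check_inputs sample_R1PT.fq sample_R2PT.fq sample_R1unPT.fq || exit 1
```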
A: FastQC may detect partial adapter sequences in quality control, but this is normal.
Causes:
- Reads are truncated; FastQC sees partial adapter sequences
- Full-length adapters are in the adapter file, but reads are shorter
Solution:
- This is expected and usually not a problem
- Trimmomatic will remove full-length adapters during preprocessing
- Only investigate if overrepresented sequences are not adapter-related
To verify:
- BLAST the overrepresented sequence against NCBI nt database
- Check if it's a known adapter or actual contamination
- If contamination, consider quality filtering or additional preprocessing
A: EukFinder does NOT accept gzipped (.gz) files as input.
Solution:
# Decompress your files
gunzip your_file.fastq.gz
gunzip your_file.fasta.gz
# Then run EukFinder with uncompressed files
eukfinder short_seqs --r1 your_file.fastq --r2 ... --un ...
Alternatively, decompress into a temporary file:
zcat your_file.fastq.gz > temp_file.fastq
eukfinder short_seqs --r1 temp_file.fastq ...
A: EukFinder produces multiple output files with different classifications:
- Euk: High-confidence eukaryotic contigs
- Unk: Unknown/unclassified contigs
- EUnk: Combined Euk + Unk (more permissive)
- Bact: Bacterial contigs
- Arch: Archaeal contigs
- Misc: Miscellaneous/viral contigs
Which to use:
- For strict analysis: Use "Euk" file only
- For comprehensive recovery: Use "EUnk" file, then filter with downstream binning
- Most studies use "Euk" or "EUnk" depending on research goals
A:
Eukfinder_results/:
- Contains final classified assembled contigs (scaffolds)
- Files named: scf_output_prefix.Classification.fasta
- Main output for downstream analyses
Eukfinder_Temps/:
- Contains intermediate results from each stage
- Folders for temporary files from first classification round (tmps_output_prefix_DATE)
- Useful for debugging but can be deleted after successful completion to save space
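A quick way to sanity-check the results directory is to count the sequences in each classified FASTA file. This sketch assumes only that the files are ordinary FASTA with one `>` header per contig; `count_contigs` is a hypothetical helper name.

```shell
# Print "<file><TAB><count>" for each FASTA file given, counting header lines.
count_contigs() {
    for f in "$@"; do
        # grep -c exits non-zero when the count is 0, so the status is ignored.
        n=$(grep -c '^>' "$f" || true)
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Example (not executed here):
# count_contigs Eukfinder_results/*.fasta
```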
A: The summary_table.txt contains:
- Group: Classification type (Euk, Unk, Bact, etc.)
- #Seq: Number of contigs in that classification
- Total size(bp): Combined length of all sequences
Example interpretation:
Group   #Seq    Total size(bp)
Euk     234     5,234,567
Unk     156     2,654,321
Bact    1,200   45,678,900
Here, you recovered 234 eukaryotic contigs totaling ~5.2 Mbp.
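When many samples are processed, the same numbers can be pulled out with awk. This is a sketch under the assumption that the table is whitespace-separated with the group name in the first column, as in the example above; the real file layout may differ, and `group_totals` is a hypothetical helper name.

```shell
# Print the contig count and total bp for one group in a summary table.
# $1 = group name (e.g. Euk), $2 = path to the table.
# Thousands separators are stripped so the values can be used in arithmetic.
group_totals() {
    awk -v g="$1" '$1 == g { gsub(/,/, "", $2); gsub(/,/, "", $3); print $2, $3 }' "$2"
}

# Example (not executed here):
# group_totals Euk summary_table.txt
```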
A: Runtime depends heavily on input size and system resources:
Typical benchmarks (from publication, 40 CPUs, 188 GB RAM):
- Input: 30.2M reads, 6.0 Gbp total
- eukfinder_short: 557 minutes (~9 hours)
- eukfinder_long: 98 minutes (~1.6 hours) using assembled contigs
Factors affecting runtime:
- Input size (reads/contigs)
- Number of threads allocated
- Memory available (affects assembly speed)
- Database size (affects classification speed)
Optimization:
- Allocate more cores (via the -n parameter)
- Use more RAM (for metaSPAdes assembly)
- Process smaller samples in batches if needed
A: Temporary space varies by sample size:
Estimate: Allocate 2-5x the input file size for temporary files
Example:
- Input: 10 GB of reads
- Recommended temporary space: 20-50 GB
- Databases: 150 GB
- Total: ~200-250 GB minimum for this sample
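The 2-5x rule of thumb can be turned into a small calculation. The helpers below are a hedged sketch: `tmp_space_range` and `input_gb` are hypothetical names, and `du -ck` is assumed to be available (it is on GNU and BSD systems).

```shell
# Print the low and high temporary-space estimates (GB) for an input size
# given in whole GB, using the 2-5x rule of thumb.
tmp_space_range() {
    echo "$(( $1 * 2 ))-$(( $1 * 5 ))"
}

# Print the combined size of the given files in whole GB (rounded down).
# du -c appends a "total" line, which is the last line awk sees.
input_gb() {
    du -ck "$@" | awk 'END { printf "%d\n", $1 / 1024 / 1024 }'
}

# Example (not executed here):
# tmp_space_range "$(input_gb raw_R1.fastq raw_R2.fastq)"
```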
On HPC clusters:
- Use scratch/work directories, not home directories
- Monitor usage with du -sh
- Clean up temporary files after completion (can save 50+ GB per sample)
A: Use job arrays or parallel processing:
Option 1: Shell loop (simple)
# Runs all samples at once; make sure RAM and cores cover every job
for sample in S1 S2 S3 S4; do
eukfinder short_seqs --r1 "${sample}_R1.fastq" --r2 "${sample}_R2.fastq" \
--un "${sample}_un.fastq" -o "$sample" &
done
wait
Option 2: HPC job array (recommended)
#!/bin/bash
# SLURM example: run 4 jobs, at most 2 at a time
#SBATCH --array=1-4%2
SAMPLES=(S1 S2 S3 S4)
SAMPLE=${SAMPLES[$((SLURM_ARRAY_TASK_ID - 1))]}
eukfinder short_seqs --r1 "${SAMPLE}_R1.fastq" --r2 "${SAMPLE}_R2.fastq" \
--un "${SAMPLE}_un.fastq" -o "$SAMPLE"
Option 3: GNU Parallel
parallel eukfinder short_seqs --r1 {}_R1.fastq --r2 {}_R2.fastq \
--un {}_un.fastq -o {} ::: S1 S2 S3 S4
A: See the Building_custom_DB/ directory in the repository for detailed instructions. General steps:
- Select target genomes:
  - Identify organisms of interest using pilot data analysis
  - Download relevant genomes from NCBI using ncbi-genome-download
- Create taxonomy mapping:
  - Use provided scripts: Build_Centrifuge_map_from_assembly_report.py
  - Maps sequence IDs to taxonomic IDs
- Build database:
centrifuge-build -p 16 --bmax 1342177280 \
--conversion-table genome2taxid.map \
--taxonomy-tree taxonomy/nodes.dmp \
--name-table taxonomy/names.dmp \
genome.fasta database_name
A: EukFinder is primarily developed for Linux environments. macOS and other non-Linux systems are not explicitly supported. Issues may arise with:
- Conda package availability for your OS
- Compiler compatibility for dependencies
- Path handling differences
Recommendation: Use Docker, WSL2 (on Windows), or a Linux VM if neither is available on your system.
A: High-quality input improves results:
Recommended preprocessing:
- Quality score: ≥Q20 (99% base-call accuracy)
- Length: ≥60 bp after trimming
- Contamination: Remove obvious contaminants (host DNA, adapters)
EukFinder's read_prep step handles:
- Adapter trimming
- Quality filtering
- Host genome removal (optional)
Your inputs should:
- Be in FASTQ format (not gzipped)
- Have standard quality encoding (auto-detected)
- Be properly paired (for short_seqs)
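The last three points can be checked before a run with a few standard tools. This is a rough sketch, not an EukFinder feature: the helper names are hypothetical, gzip detection relies on the two-byte gzip magic number, and read counting assumes the usual four lines per FASTQ record.

```shell
# Succeed if file $1 starts with the gzip magic bytes (1f 8b).
is_gzipped() {
    [ "$(head -c 2 "$1" | od -An -tx1 | tr -d ' \n')" = "1f8b" ]
}

# Count reads in an uncompressed FASTQ file, assuming four lines per record.
count_reads() {
    echo $(( $(wc -l < "$1") / 4 ))
}

# Check that R1 ($1) and R2 ($2) are uncompressed and hold equal read counts.
check_pair() {
    if is_gzipped "$1" || is_gzipped "$2"; then
        echo "gzipped input: decompress first" >&2
        return 1
    fi
    if [ "$(count_reads "$1")" -ne "$(count_reads "$2")" ]; then
        echo "read counts differ between $1 and $2" >&2
        return 1
    fi
}

# Example (not executed here):
# check_pair raw_R1.fastq raw_R2.fastq && eukfinder read_prep --r1 raw_R1.fastq --r2 raw_R2.fastq ...
```

Equal read counts are a necessary condition for proper pairing, not a proof of it; this check does not compare read names between the two files.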
For additional help:
- GitHub Issues: https://github.com/RogerLab/Eukfinder/issues
- Publication: Zhao et al. 2025, mBio
- Contact: [email protected] or [email protected]
Last Updated: January 2025 EukFinder v1.2.4