EukFinder Frequently Asked Questions (FAQ)

Installation & Setup

Q1: What are the minimum system requirements for EukFinder?

A: EukFinder has substantial computational requirements:

Minimum:

Disk Space: 200 GB total (Centrifuge DB: 70-100 GB, PLAST DB: 2-15 GB, temporary files: 5-50 GB per sample)
RAM: 32 GB
Threads: 10-20 recommended

Recommended:

Disk Space: 300+ GB
RAM: 64-128 GB (especially for metaSPAdes assembly and Centrifuge classification)
Threads: 20-40+

Special Note: metaSPAdes and Centrifuge are particularly memory-intensive. Consider allocating extra memory for assembly steps.
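A quick way to compare a machine against the minimums above is sketched below. It assumes a Linux host (it reads /proc/meminfo and calls nproc); the helper names free_disk_gb, total_ram_gb, and check_min are illustrative and not part of EukFinder.

```shell
# Report free disk space (GB), total RAM (GB), and CPU count on Linux,
# and compare them with the minimums listed above.

free_disk_gb() {   # free space, in whole GB, on the filesystem holding $1
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

total_ram_gb() {   # total RAM in whole GB, read from /proc/meminfo (Linux only)
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

check_min() {      # check_min LABEL ACTUAL REQUIRED -> prints OK/LOW, returns 0/1
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (need >= $3)"
  else
    echo "LOW  $1: $2 (need >= $3)"
    return 1
  fi
}

check_min "disk_gb" "$(free_disk_gb .)" 200 || true
check_min "ram_gb"  "$(total_ram_gb)"   32  || true
check_min "threads" "$(nproc)"          10  || true
```

Run it from the directory where you plan to store the databases, since free disk space is measured on the filesystem that holds the current directory.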

Q2: How should I install EukFinder?

A: Use conda for the easiest installation:

conda create -n eukfinder -c bioconda eukfinder
conda activate eukfinder

This automatically handles all dependencies (python 3.12.8, numpy, pandas, joblib, spades, trimmomatic, centrifuge, bowtie2, plast, etc.).

Manual Installation: Not recommended due to version compatibility issues, but possible if you install each dependency individually.

Important: macOS and other non-Linux operating systems are not explicitly supported by the developers.

Q3: What should I do if dependencies conflict during installation?

A: If you encounter version conflicts:

Create a fresh conda environment
Install EukFinder from bioconda (preferred over pip)
Verify installation with eukfinder -h
Test with the provided test dataset before processing real data
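Beyond eukfinder -h, it can help to confirm that the dependencies are actually on your PATH inside the activated environment. The sketch below uses a hypothetical helper, missing_tools; the executable names passed to it are assumptions based on the dependency list in Q2 and may differ in your installation.

```shell
# Print every tool from the argument list that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Executable names below are assumptions; adjust them to match your environment.
missing=$(missing_tools eukfinder centrifuge bowtie2 plast trimmomatic metaspades.py)
if [ -n "$missing" ]; then
  echo "Not found on PATH:" $missing
else
  echo "All listed tools are on PATH."
fi
```

If anything is reported missing, recreating the environment from scratch is usually simpler than repairing it.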

Input Data & File Formats

Q4: Can I run EukFinder with paired-end reads only (no unpaired reads)?

A: Currently, unpaired reads are a required input for the short_seqs workflow. However, if you only have paired-end data, here are two workarounds:

Option 1: Create a dummy unpaired reads file

cat > dummy_unpaired.fastq << 'EOF'
@dummy_read_1
NNNNNNNNNN
+
IIIIIIIIII
@dummy_read_2
NNNNNNNNNN
+
IIIIIIIIII
EOF

# Inputs must be uncompressed (see Q5), so do not gzip the dummy file
eukfinder short_seqs --r1 S1_1.fastq --r2 S1_2.fastq --un dummy_unpaired.fastq -o S1 ...

Option 2: Extract a subset from your R1 file

# The first 1000 lines are 250 reads (4 lines per FASTQ record)
head -n 1000 paired_R1.fastq > temp_unpaired.fastq

eukfinder short_seqs --r1 paired_R1.fastq --r2 paired_R2.fastq --un temp_unpaired.fastq -o output ...

The unpaired reads file doesn't need to be large; it's processed through the pipeline but typically contributes minimally to the final results.
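Whichever option you choose, a truncated FASTQ record is an easy mistake to make when cutting files by line count. The sketch below is a simple sanity check on an uncompressed file; fastq_record_count is an illustrative helper, and it assumes the usual four-lines-per-record FASTQ layout.

```shell
# Print the number of records in an uncompressed FASTQ file,
# or fail if the line count is not a multiple of four.
fastq_record_count() {
  lines=$(wc -l < "$1")
  if [ $((lines % 4)) -ne 0 ]; then
    echo "error: $1 has $lines lines (not a multiple of 4)" >&2
    return 1
  fi
  echo $((lines / 4))
}
```

For the dummy file from Option 1, fastq_record_count dummy_unpaired.fastq should print 2.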

Q5: What file formats does EukFinder accept?

A:

For read_prep and short_seqs:

FASTQ format (.fastq or .fq extension)
Important: Does NOT accept gzipped (.gz) files for raw reads
Quality encoding: auto-detected or can be specified with --qenc

For long_seqs:

FASTA format (.fasta or .fa extension)
Does NOT accept gzipped files
Input should be uncompressed single-line fasta files

Note: If you have gzipped files, decompress them first:

gunzip your_file.fastq.gz
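If your FASTA file wraps sequences across several lines, it can be converted to the single-line layout that long_seqs expects. The awk sketch below is one way to do this; unwrap_fasta is an illustrative name, and the function assumes plain Unix line endings.

```shell
# Rewrite a multi-line (wrapped) FASTA so each sequence sits on one line.
unwrap_fasta() {
  awk '
    /^>/ { if (seq != "") print seq; seq = ""; print; next }  # header: flush previous sequence
    { seq = seq $0 }                                          # sequence line: append
    END { if (seq != "") print seq }                          # flush the last sequence
  ' "$1"
}
```

Usage: unwrap_fasta wrapped.fasta > single_line.fasta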

Q6: What is the `--taxonomy-update` flag and when should I use it?

A: The --taxonomy-update flag controls ETE3 taxonomy database updates:

First time running EukFinder: Set to True (-t T)
- Downloads and caches the NCBI taxonomy database (~200 MB)
- Only needed once per installation
Subsequent runs: Set to False (-t F)
- Uses the cached taxonomy database
- Faster, as it skips the download step

Example:

# First run
eukfinder short_seqs ... -t T

# Later runs
eukfinder short_seqs ... -t F
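In scripts, the choice between T and F can be made from whether a taxonomy cache already exists. The sketch below assumes the cache lives at ETE3's usual default location (~/.etetoolkit/taxa.sqlite); whether your EukFinder installation uses exactly that file is an assumption, and taxonomy_flag is an illustrative helper.

```shell
# Echo T if the taxonomy cache is missing (download needed), F otherwise.
# The default path is ETE3's usual cache location; whether EukFinder uses
# exactly this file is an assumption.
taxonomy_flag() {
  cache=${1:-"$HOME/.etetoolkit/taxa.sqlite"}
  if [ -s "$cache" ]; then echo F; else echo T; fi
}

echo "Suggested setting: -t $(taxonomy_flag)"
```

Pass a different path as the first argument if your cache is stored elsewhere.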

Database Setup & Configuration

Q7: How do I download and set up the reference databases?

A: EukFinder provides pre-built databases. Download them before running:

# Create directory for databases
mkdir ~/eukfinder_db
cd ~/eukfinder_db

# Download individual databases (or all at once)
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/centrifuge_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/PlastDB_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/TrueSeq2_NexteraSE-PE.fa.tar.gz

# Extract
tar -xzf centrifuge_db.tar.gz
tar -xzf PlastDB_db.tar.gz
tar -xzf TrueSeq2_NexteraSE-PE.fa.tar.gz

Database Sizes:

Centrifuge: 70-100 GB
PLAST: 2-15 GB
Human genome (optional): 0.92 GB
Illumina adapters: 2.4 KB
Total: ~150-200 GB

Q8: I'm running out of disk space during database installation. What can I do?

A:

Issue: Temporary space needed during extraction often exceeds final database size. Typical flow requires ~300 GB total space.

Solutions:

Use external/additional storage: If on an HPC cluster, download to a larger RDS storage location:

# Instead of home directory
cd /mnt/rds/larger_storage/eukfinder_dbs
# Download and extract there

Download incrementally: Only download the databases you need:
- Essential: Centrifuge database
- Recommended: PLAST database
- Optional: Human genome (for decontamination only)

Extract and delete: Remove tar files after extraction to free space:

```
tar -xzf centrifuge_db.tar.gz && rm centrifuge_db.tar.gz
```

Monitor space: Use df -h and du -sh to track usage during downloads
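The extract-then-delete pattern can be wrapped in a small function so that an archive is removed only when its extraction succeeded. This is a minimal sketch; extract_and_remove is an illustrative helper, not something EukFinder provides.

```shell
# Extract a .tar.gz archive into DEST (default: current directory) and
# delete the archive only if extraction succeeded.
extract_and_remove() {
  archive=$1
  dest=${2:-.}
  mkdir -p "$dest" && tar -xzf "$archive" -C "$dest" && rm -- "$archive"
}
```

For example, extract_and_remove centrifuge_db.tar.gz keeps the tarball in place if tar reports an error, so a partial extraction never costs you the download.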

Q9: How do I specify the database paths correctly?

A: Critical for success: Always use the common basename prefix of Centrifuge files, NOT the full path to individual files.

Correct:

eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020

Incorrect:

# These will fail:
eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020.1.cf
eukfinder short_seqs ... --cdb /path/to/Centrifuge_NewDB_Sept2020.cf

The basename should resolve to exactly 4 files:

Centrifuge_NewDB_Sept2020.1.cf
Centrifuge_NewDB_Sept2020.2.cf
Centrifuge_NewDB_Sept2020.3.cf
Centrifuge_NewDB_Sept2020.4.cf

Verify with:

ls -la /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020* | wc -l
# Should return: 4
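The same check can name the specific files that are missing, which is more helpful than a bare count. The sketch below assumes the four-file layout described above; check_centrifuge_prefix is an illustrative helper.

```shell
# Return 0 if PREFIX.1.cf through PREFIX.4.cf all exist; list any that are missing.
check_centrifuge_prefix() {
  prefix=$1
  status=0
  for i in 1 2 3 4; do
    if [ ! -f "$prefix.$i.cf" ]; then
      echo "missing: $prefix.$i.cf" >&2
      status=1
    fi
  done
  return $status
}
```

Passing a full file name such as .../Centrifuge_NewDB_Sept2020.1.cf fails this check, which mirrors the "Incorrect" examples above.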

Q10: Can I use custom databases instead of the pre-built ones?

A: Yes. EukFinder supports custom Centrifuge and PLAST databases. See the wiki pages:

Centrifuge DB: Use centrifuge-build with custom genomes
PLAST DB: Protein sequences with corresponding taxonomy maps

Detailed instructions are in the Building_custom_DB/ folder of the repository, including scripts to:

Download specific genomes from NCBI
Build taxonomy maps
Construct the database indices

Running the Pipeline

Q11: What are the two steps required for processing short-read Illumina data?

A: EukFinder requires two sequential steps:

Step 1: Read Preparation

Removes adapters, low-quality reads, and optionally host DNA contamination. There are two variants:

Option A: read_prep (for samples with host DNA contamination)

Use when your samples contain host-derived sequences that need to be removed (e.g., clinical samples, cultured samples):

eukfinder read_prep \
  --r1 raw_R1.fastq --r2 raw_R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa \
  --hcrop 0 --qscore 20 --mlen 60 \
  --hg host_genome.fasta \
  -o sample_name

Option B: read_prep_env (for environmental samples without host contamination)

Use for environmental metagenomes where you don't need to remove host DNA (faster, skips bowtie2 mapping):

eukfinder read_prep_env \
  --r1 raw_R1.fastq --r2 raw_R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa \
  --hcrop 0 --qscore 20 --mlen 60 \
  -o sample_name

Key difference: read_prep_env does NOT require the --hg (host genome) parameter and skips the bowtie2 contamination removal step, making it faster for samples that don't have host contamination.

Step 2: Sequence Classification (short_seqs)

Classifies reads, assembles, and identifies eukaryotic sequences:

eukfinder short_seqs \
  --r1 sample_R1PT.fq --r2 sample_R2PT.fq \
  --un sample_R1unPT.fq \
  --pclass sample_centrifuge_P \
  --uclass sample_centrifuge_UP \
  -o sample -n 20 -z 2 -t F \
  --mhlen 50 -e 0.05 --pid 60 --cov 30 --max_m 300

Important: Do NOT skip the read prep step. The short_seqs step requires the processed files from Step 1. Choose between read_prep or read_prep_env based on your sample type.

Q12: How do I run EukFinder with long reads or assembled contigs?

A: Use the long_seqs workflow for PacBio, Nanopore, or pre-assembled contigs:

eukfinder long_seqs \
  -l assembled_contigs.fasta \
  -o output_prefix \
  --mhlen 50 \
  -n 20 -z 2 -t T \
  -e 0.05 --pid 60 --cov 30

Key differences from short_seqs:

Input is FASTA (not FASTQ)
Single classification round (no assembly step)
No need for paired/unpaired reads
Faster, as assembly is skipped

Q13: What is the difference between `read_prep` and `read_prep_env`?

A: Both remove adapters and perform quality filtering, but they differ in host contamination handling:

read_prep (Standard, for samples with host DNA):

Requires: --hg host_genome.fasta parameter
Removes host-derived reads using bowtie2 mapping
Use for: Clinical samples, cultured samples, or any sample where host contamination is present
Slower: Includes bowtie2 host removal step

Example:

eukfinder read_prep --r1 R1.fastq --r2 R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa \
  --hg human_genome.fasta -o sample

read_prep_env (For environmental metagenomes):

Does NOT require: --hg (host genome) parameter
Skips bowtie2 host removal step
Use for: Environmental samples (water, soil, sediment, etc.) where no host contamination is expected
Faster: More efficient for pure environmental samples

Example:

eukfinder read_prep_env --r1 R1.fastq --r2 R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa -o sample

Which should you use?

Use read_prep if your samples may contain host DNA (e.g., human gut samples, fecal samples, cultured organisms)
Use read_prep_env if your samples are from pure environmental sources without expected host contamination (saves time, same parameters otherwise)

Important: The output files from either workflow can be used with short_seqs identically. The choice affects only whether host-derived reads are removed and how long preprocessing takes.
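In a wrapper script, the decision can follow directly from whether a host genome is supplied. The sketch below is one way to express that rule; choose_prep_mode is an illustrative helper, and it deliberately fails when a host genome path is given but the file does not exist, rather than silently skipping host removal.

```shell
# Echo the subcommand to use:
#   no argument        -> read_prep_env
#   existing FASTA     -> read_prep
#   missing FASTA path -> error (non-zero exit)
choose_prep_mode() {
  if [ -z "${1:-}" ]; then
    echo read_prep_env
  elif [ -f "$1" ]; then
    echo read_prep
  else
    echo "error: host genome not found: $1" >&2
    return 1
  fi
}
```

The printed subcommand can then be placed after eukfinder on the command line, with --hg added only in the read_prep case.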

Q14: What working directory should I use to run EukFinder?

A: Important: EukFinder writes all output to the current working directory.

Ensure the directory is writable
Avoid running from read-only locations
On HPC clusters, run from a project/scratch directory, not home directory

Example:

cd /scratch/myproject/sample1
eukfinder short_seqs --r1 ... --r2 ... --un ... -o sample1
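Before a long run, it is worth confirming that the working directory exists and is writable. A minimal sketch, using an illustrative helper named ensure_workdir:

```shell
# Create DIR if needed and confirm it is writable before starting a run.
ensure_workdir() {
  mkdir -p "$1" 2>/dev/null || { echo "error: cannot create $1" >&2; return 1; }
  [ -w "$1" ] || { echo "error: $1 is not writable" >&2; return 1; }
}
```

For example: ensure_workdir /scratch/myproject/sample1 && cd /scratch/myproject/sample1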

Parameter Optimization

Q14b: What does `--mhlen` (minimum hit length) do and how should I set it?

A: --mhlen controls the minimum sequence length required for Centrifuge classification matches.

Default: 50 bp (works well for most datasets)
Purpose: Filters out spurious short matches
Lower values (e.g., 30-40): More sensitive, catches shorter reads, but higher false-positive rate
Higher values (e.g., 60-100): More stringent, fewer false positives, but may miss real eukaryotic sequences

Important: Do not set excessively high values (>100) without testing, as this can cause unexpected downstream failures.

Recommendation: Start with --mhlen 50 and adjust based on your results.

Q15: How do I optimize parameters for precision vs. recall?

A: EukFinder offers two parameter profiles:

"Strict" mode (high precision, lower recall):

eukfinder short_seqs ... --mhlen 50 -e 0.01 --pid 60 --cov 30
# Use the "Euk" output file only

Captures ~95% of real eukaryotic reads
High precision (~90%), so few false positives
Good if you want high confidence in identifications

"Lenient" mode (balanced precision/recall):

eukfinder short_seqs ... --mhlen 30 -e 0.05 --pid 50 --cov 10
# Use the "EUnk" output file (Euk + Unknown)

Better balance between precision and recall
Useful if you need comprehensive recovery
Follow up with binning workflows to filter contaminants

Parameters:

-e: E-value threshold (lower = more stringent)
--pid: Percent identity (higher = more stringent)
--cov: Coverage threshold (higher = more stringent)

Q16: What are the `-n` (threads) and `-z` (chunks) parameters?

A:

-n (number of threads):

Controls parallelization
Default: 20
Set based on available CPU cores (don't exceed available cores)
Higher values speed up processing but increase memory usage
Recommendation: Use available cores on your HPC cluster

-z (number of chunks):

Splits input for PLAST processing
Default: 2
Higher values reduce memory per chunk but increase overhead
Use 2-4 for most datasets
For very large samples (>100M reads), try 4-8

Example for HPC with 40 cores:

eukfinder short_seqs ... -n 40 -z 4
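To avoid requesting more threads than the machine or job allocation provides, the value for -n can be capped automatically. The sketch below assumes nproc is available and, on SLURM, that SLURM_CPUS_PER_TASK reflects your allocation; pick_threads is an illustrative helper.

```shell
# Echo the smaller of the requested and available thread counts.
# With one argument, the available count comes from SLURM_CPUS_PER_TASK
# (if set) or from nproc.
pick_threads() {
  requested=$1
  available=${2:-${SLURM_CPUS_PER_TASK:-$(nproc)}}
  if [ "$requested" -le "$available" ]; then
    echo "$requested"
  else
    echo "$available"
  fi
}
```

For example: eukfinder short_seqs ... -n "$(pick_threads 40)" -z 4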

Q17: What should `--max_m` (maximum memory for assembly) be set to?

A: Controls RAM allocated to metaSPAdes assembly.

Default: 300 GB
Typical range: 100-400 GB
Your system: Set to ~80% of available RAM

Example for 128 GB system:

eukfinder short_seqs ... --max_m 100

The assembly step is memory-intensive. If it crashes, reduce --max_m or reduce input size.
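The 80% rule of thumb can be computed from the machine's total RAM. The sketch below assumes a Linux host with /proc/meminfo; max_m_from_gb and total_ram_gb are illustrative helpers. Note that the kernel usually reports slightly less than the nominal RAM size, so the suggestion may be a little lower than you expect.

```shell
# Echo 80% of a RAM total (in GB), rounded down.
max_m_from_gb() {
  echo $(( $1 * 80 / 100 ))
}

# Total RAM in whole GB from /proc/meminfo (Linux only).
total_ram_gb() {
  awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo
}

ram=$(total_ram_gb 2>/dev/null)
if [ -n "$ram" ]; then
  echo "Suggested --max_m: $(max_m_from_gb "$ram")"
fi
```

On shared HPC nodes, base the calculation on the memory granted to your job rather than the node total.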

Troubleshooting & Error Messages

Q18: What does the error "centrifuge database cannot be found" mean?

A: The Centrifuge database path is incorrect or incomplete.

Causes:

Wrong database prefix (see Q9)
Database not fully extracted
Database files corrupted or incomplete

Solution:

# Verify all 4 database files exist
ls -la /path/to/database/prefix*

# Should show 4 files with .1.cf, .2.cf, .3.cf, .4.cf extensions
# If not, re-extract the database

# Then verify the path is correct
eukfinder short_seqs ... --cdb /path/to/database/prefix

Q19: Why did my run terminate during validation?

A: Common causes:

Centrifuge database issue: See Q18
Unpaired reads file missing or empty: Provide a valid --un file
Temporary directory permissions: Ensure write permissions in current directory
Memory exceeded: Reduce --max_m or increase available system RAM

Debugging:

Check the log file (usually Class_DATE.log)
Verify all input files exist and are readable
Ensure sufficient disk space for temporary files
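Several of these causes can be ruled out before launching a run with a simple pre-flight check on the input files. This is a sketch only; check_inputs is an illustrative helper and does not replicate EukFinder's own validation.

```shell
# Return 0 only if every listed file exists, is readable, and is non-empty.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "missing or unreadable: $f" >&2
      status=1
    elif [ ! -s "$f" ]; then
      echo "empty: $f" >&2
      status=1
    fi
  done
  return $status
}
```

For example: check_inputs sample_R1PT.fq sample_R2PT.fq sample_R1unPT.fq && eukfinder short_seqs ...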

Q20: What should I do if FastQC reports overrepresented sequences?

A: FastQC may detect partial adapter sequences in quality control, but this is normal.

Causes:

Reads are truncated; FastQC sees partial adapter sequences
Full-length adapters are in the adapter file, but reads are shorter

Solution:

This is expected and usually not a problem
Trimmomatic will remove full-length adapters during preprocessing
Only investigate if overrepresented sequences are not adapter-related

To verify:

BLAST the overrepresented sequence against NCBI nt database
Check if it's a known adapter or actual contamination
If contamination, consider quality filtering or additional preprocessing

Q21: How do I handle gzipped input files?

A: EukFinder does NOT accept gzipped (.gz) files as input.

Solution:

# Decompress your files
gunzip your_file.fastq.gz
gunzip your_file.fasta.gz

# Then run EukFinder with uncompressed files
eukfinder short_seqs --r1 your_file.fastq --r2 ... --un ...

Alternatively, decompress into a temporary file:

zcat your_file.fastq.gz > temp_file.fastq
eukfinder short_seqs --r1 temp_file.fastq ...
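File extensions are not always reliable, so a script can test for compression by content instead. The sketch below checks for the gzip magic bytes (1f 8b) and writes an uncompressed copy only when needed; is_gzipped and uncompressed_path are illustrative helpers.

```shell
# Return 0 if FILE starts with the gzip magic bytes (1f 8b).
is_gzipped() {
  [ "$(od -An -tx1 -N2 "$1" | tr -d ' \n')" = "1f8b" ]
}

# Echo the path of an uncompressed version of FILE, creating it with
# `gzip -dc` when FILE is compressed. The original file is left in place.
uncompressed_path() {
  if is_gzipped "$1"; then
    out=${1%.gz}
    [ "$out" = "$1" ] && out="$1.unzipped"   # compressed but no .gz suffix
    gzip -dc "$1" > "$out" && echo "$out"
  else
    echo "$1"
  fi
}
```

For example: r1=$(uncompressed_path your_file.fastq.gz) and then pass "$r1" to --r1. Remember that the uncompressed copy needs extra disk space.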

Output Interpretation

Q22: What do the different output classification files mean?

A: EukFinder produces multiple output files with different classifications:

Euk: High-confidence eukaryotic contigs
Unk: Unknown/unclassified contigs
EUnk: Combined Euk + Unk (more permissive)
Bact: Bacterial contigs
Arch: Archaeal contigs
Misc: Miscellaneous/viral contigs

Which to use:

For strict analysis: Use "Euk" file only
For comprehensive recovery: Use "EUnk" file, then filter with downstream binning
Most studies use "Euk" or "EUnk" depending on research goals

Q23: What is the `Eukfinder_results/` and `Eukfinder_Temps/` folder structure?

A:

Eukfinder_results/:

Contains final classified assembled contigs (scaffolds)
Files named: scf_output_prefix.Classification.fasta
Main output for downstream analyses

Eukfinder_Temps/:

Contains intermediate results from each stage
Folders for temporary files from first classification round (tmps_output_prefix_DATE)
Useful for debugging but can be deleted after successful completion to save space

Q24: How do I interpret the summary table?

A: The summary_table.txt contains:

Group: Classification type (Euk, Unk, Bact, etc.)
#Seq: Number of contigs in that classification
Total size(bp): Combined length of all sequences

Example interpretation:

Group      #Seq    Total size(bp)
Euk        234     5,234,567
Unk        156     2,654,321
Bact       1,200   45,678,900

Here, you recovered 234 eukaryotic contigs totaling ~5.2 Mbp.
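To pull a single value out of the table in a script, awk is usually enough. The sketch below assumes the whitespace-separated layout shown in the example above, with the group name in the first column and the total size in the last; group_total_bp is an illustrative helper.

```shell
# Echo the total size in bp for GROUP, with thousands separators removed.
# Usage: group_total_bp TABLE GROUP
group_total_bp() {
  awk -v g="$2" '$1 == g { gsub(/,/, "", $NF); print $NF }' "$1"
}
```

With the example table, group_total_bp summary_table.txt Euk prints 5234567.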

System Requirements & Performance

Q25: How long does EukFinder take to process a sample?

A: Runtime depends heavily on input size and system resources:

Typical benchmarks (from publication, 40 CPUs, 188 GB RAM):

Input: 30.2M reads, 6.0 Gbp total
eukfinder_short: 557 minutes (~9 hours)
eukfinder_long: 98 minutes (~1.6 hours) using assembled contigs

Factors affecting runtime:

Input size (reads/contigs)
Number of threads allocated
Memory available (affects assembly speed)
Database size (affects classification speed)

Optimization:

Allocate more cores (via the -n parameter)
Use more RAM (for metaSPAdes assembly)
Process smaller samples in batches if needed

Q26: How much temporary disk space do I need?

A: Temporary space varies by sample size:

Estimate: Allocate 2-5x the input file size for temporary files

Example:

Input: 10 GB of reads
Recommended temporary space: 20-50 GB
Databases: 150 GB
Total: ~180-210 GB for this sample (databases + input + temporary files); budgeting 200-250 GB leaves some headroom

On HPC clusters:

Use scratch/work directories, not home directories
Monitor usage with du -sh
Clean up temporary files after completion (can save 50+ GB per sample)
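The 2-5x rule of thumb is easy to script when sizing scratch allocations for many samples. A minimal sketch, with temp_space_range_gb as an illustrative helper that takes the input size in whole GB:

```shell
# Echo the low-high temporary-space estimate (GB) for an input size in GB,
# using the 2-5x rule of thumb.
temp_space_range_gb() {
  echo "$(( $1 * 2 ))-$(( $1 * 5 ))"
}
```

For the 10 GB example above, temp_space_range_gb 10 prints 20-50.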

Q27: How do I run multiple samples in parallel?

A: Use job arrays or parallel processing:

Option 1: Shell loop (simple)

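# Launches all samples at once; make sure cores and RAM cover every job
for sample in S1 S2 S3 S4; do
  eukfinder short_seqs --r1 "${sample}_R1.fastq" --r2 "${sample}_R2.fastq" \
    --un "${sample}_un.fastq" -o "$sample" &
done
wait   # block until every background job has finished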

Option 2: HPC job array (recommended)

#!/bin/bash
# SLURM example: run 4 array tasks, at most 2 at a time
#SBATCH --array=1-4%2

SAMPLES=(S1 S2 S3 S4)
# Task IDs start at 1 here, while bash arrays are 0-indexed
SAMPLE=${SAMPLES[$((SLURM_ARRAY_TASK_ID - 1))]}

eukfinder short_seqs --r1 "${SAMPLE}_R1.fastq" --r2 "${SAMPLE}_R2.fastq" \
  --un "${SAMPLE}_un.fastq" -o "$SAMPLE"

Option 3: GNU Parallel

parallel eukfinder short_seqs --r1 {}_R1.fastq --r2 {}_R2.fastq \
  --un {}_un.fastq -o {} ::: S1 S2 S3 S4

Advanced Topics

Q28: How do I build a custom Centrifuge database for my specific environment?

A: See the Building_custom_DB/ directory in the repository for detailed instructions. General steps:

Select target genomes:
- Identify organisms of interest using pilot data analysis
- Download relevant genomes from NCBI using ncbi-genome-download

Create taxonomy mapping:
- Use the provided script: Build_Centrifuge_map_from_assembly_report.py
- Maps sequence IDs to taxonomic IDs

Build database:

centrifuge-build -p 16 --bmax 1342177280 \
  --conversion-table genome2taxid.map \
  --taxonomy-tree taxonomy/nodes.dmp \
  --name-table taxonomy/names.dmp \
  genome.fasta database_name

Q29: Can I use EukFinder on non-Linux systems?

A: EukFinder is primarily developed for Linux environments. macOS and other non-Linux systems are not explicitly supported. Issues may arise with:

Conda package availability for your OS
Compiler compatibility for dependencies
Path handling differences

Recommendation: On Windows, use WSL2 or Docker. On other non-Linux systems, use Docker or a Linux virtual machine.

Data Quality & Preprocessing

Q30: What read quality should I target before running EukFinder?

A: High-quality input improves results:

Recommended preprocessing:

Quality score: ≥Q20 (99% base-call accuracy)
Length: ≥60 bp after trimming
Contamination: Remove obvious contaminants (host DNA, adapters)

EukFinder's read_prep step handles:

Adapter trimming
Quality filtering
Host genome removal (optional)

Your inputs should:

Be in FASTQ format (not gzipped)
Have standard quality encoding (auto-detected)
Be properly paired (for short_seqs)
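Phred quality scores map to base-call accuracy through the error probability P = 10^(-Q/10), which is useful when choosing a --qscore threshold. The awk sketch below performs that conversion; phred_accuracy is an illustrative helper.

```shell
# Convert a Phred quality score Q to base-call accuracy in percent,
# using P_error = 10^(-Q/10).
phred_accuracy() {
  awk -v q="$1" 'BEGIN { printf "%.2f\n", (1 - 10 ^ (-q / 10)) * 100 }'
}
```

phred_accuracy 20 prints 99.00 and phred_accuracy 30 prints 99.90, so each 10-point increase in Q cuts the expected error rate tenfold.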

Contact & Support

For additional help:

GitHub Issues: https://github.com/RogerLab/Eukfinder/issues
Publication: Zhao et al. 2025, mBio

Last Updated: January 2025
EukFinder v1.2.4
