dzhao2019 edited this page Jan 14, 2026 · 6 revisions

EukFinder Frequently Asked Questions (FAQ)

Table of Contents

  1. Installation & Setup
  2. Input Data & File Formats
  3. Database Setup & Configuration
  4. Running the Pipeline
  5. Parameter Optimization
  6. Troubleshooting & Error Messages
  7. Output Interpretation
  8. System Requirements & Performance
  9. Advanced Topics
  10. Data Quality & Preprocessing
  11. Contact & Support

Installation & Setup

Q1: What are the minimum system requirements for EukFinder?

A: EukFinder has substantial computational requirements:

Minimum:

  • Disk Space: 200 GB total (Centrifuge DB: 70-100 GB, PLAST DB: 2-15 GB, temporary files: 5-50 GB per sample)
  • RAM: 32 GB
  • Threads: 10-20 recommended

Recommended:

  • Disk Space: 300+ GB
  • RAM: 64-128 GB (especially for metaSPAdes assembly and Centrifuge classification)
  • Threads: 20-40+

Special Note: metaSPAdes and Centrifuge are particularly memory-intensive. Consider allocating extra memory for assembly steps.

Q2: How should I install EukFinder?

A: Use conda for the easiest installation:

conda create -n eukfinder -c bioconda eukfinder
conda activate eukfinder

This automatically handles all dependencies (python 3.12.8, numpy, pandas, joblib, spades, trimmomatic, centrifuge, bowtie2, plast, etc.).

Manual Installation: Not recommended due to version compatibility issues, but possible if you install each dependency individually.

Important: macOS and other non-Linux operating systems are not explicitly supported by the developers.

Q3: What should I do if dependencies conflict during installation?

A: If you encounter version conflicts:

  1. Create a fresh conda environment
  2. Install EukFinder from bioconda (preferred over pip)
  3. Verify installation with eukfinder -h
  4. Test with the provided test dataset before processing real data

Input Data & File Formats

Q4: Can I run EukFinder with paired-end reads only (no unpaired reads)?

A: Currently, unpaired reads are a required input for the short_seqs workflow. However, if you only have paired-end data, here are two workarounds:

Option 1: Create a dummy unpaired reads file

cat > dummy_unpaired.fastq << 'EOF'
@dummy_read_1
NNNNNNNNNN
+
IIIIIIIIII
@dummy_read_2
NNNNNNNNNN
+
IIIIIIIIII
EOF

# Leave the dummy file uncompressed: EukFinder does not accept gzipped input (see Q5)
eukfinder short_seqs --r1 S1_1.fastq --r2 S1_2.fastq --un dummy_unpaired.fastq -o S1 ...

Option 2: Extract a subset from your R1 file

# 1000 lines = 250 FASTQ records (4 lines per record); inputs must be uncompressed (see Q5)
head -n 1000 paired_R1.fastq > temp_unpaired.fastq

eukfinder short_seqs --r1 paired_R1.fastq --r2 paired_R2.fastq --un temp_unpaired.fastq -o output ...

The unpaired reads file doesn't need to be large; it's processed through the pipeline but typically contributes minimally to the final results.

Q5: What file formats does EukFinder accept?

A:

For read_prep and short_seqs:

  • FASTQ format (.fastq or .fq extension)
  • Important: Does NOT accept gzipped (.gz) files for raw reads
  • Quality encoding: auto-detected or can be specified with --qenc

For long_seqs:

  • FASTA format (.fasta or .fa extension)
  • Does NOT accept gzipped files
  • Input should be uncompressed single-line fasta files

Note: If you have gzipped files, decompress them first:

gunzip your_file.fastq.gz
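
The single-line FASTA requirement for long_seqs can be met with a short awk filter. The following is a minimal sketch, assuming ordinary multi-line FASTA input; linearize_fasta is an illustrative helper name, not an EukFinder command:

```shell
# linearize_fasta FILE: print FILE with each sequence joined onto one line.
# Hypothetical helper for illustration; not part of EukFinder.
linearize_fasta() {
  awk '
    /^>/ {
      if (seq != "") print seq   # flush the previous record
      print                      # header line is kept unchanged
      seq = ""
      next
    }
    {
      gsub(/[ \t\r]+$/, "")      # strip trailing whitespace and Windows line ends
      seq = seq $0
    }
    END { if (seq != "") print seq }
  ' "$1"
}

# Usage: linearize_fasta contigs.fasta > contigs.single_line.fasta
```

Headers are passed through untouched, so sequence IDs in downstream output still match the original file.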

Q6: What is the --taxonomy-update flag and when should I use it?

A: The --taxonomy-update flag controls ETE3 taxonomy database updates:

  • First time running EukFinder: Set to True (-t T)

    • Downloads and caches the NCBI taxonomy database (~200 MB)
    • Only needed once per installation
  • Subsequent runs: Set to False (-t F)

    • Uses the cached taxonomy database
    • Faster, as it skips the download step

Example:

# First run
eukfinder short_seqs ... -t T

# Later runs
eukfinder short_seqs ... -t F

Database Setup & Configuration

Q7: How do I download and set up the reference databases?

A: EukFinder provides pre-built databases. Download them before running:

# Create directory for databases (-p avoids an error if it already exists)
mkdir -p ~/eukfinder_db
cd ~/eukfinder_db

# Download individual databases (or all at once)
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/centrifuge_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/PlastDB_db.tar.gz
wget https://perun.biochem.dal.ca/Eukfinder/compressed_db/TrueSeq2_NexteraSE-PE.fa.tar.gz

# Extract
tar -xzf centrifuge_db.tar.gz
tar -xzf PlastDB_db.tar.gz
tar -xzf TrueSeq2_NexteraSE-PE.fa.tar.gz

Database Sizes:

  • Centrifuge: 70-100 GB
  • PLAST: 2-15 GB
  • Human genome (optional): 0.92 GB
  • Illumina adapters: 2.4 KB
  • Total: ~150-200 GB

Q8: I'm running out of disk space during database installation. What can I do?

A:

Issue: During extraction the compressed archive and the extracted files exist side by side, so peak disk usage exceeds the final database size. Plan for ~300 GB of total space for a typical setup.

Solutions:

  1. Use external/additional storage: If on an HPC cluster, download to a larger RDS storage location:

    # Instead of home directory
    cd /mnt/rds/larger_storage/eukfinder_dbs
    # Download and extract there
  2. Download incrementally: Only download databases you need:

    • Essential: Centrifuge database
    • Recommended: PLAST database
    • Optional: Human genome (for decontamination only)
  3. Extract and delete: Remove tar files after extraction to free space:

    tar -xzf centrifuge_db.tar.gz && rm centrifuge_db.tar.gz
  4. Monitor space: Use df -h and du -sh to track usage during downloads
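
Free space can also be checked programmatically before starting a download. This is a minimal sketch using POSIX df; free_gb and has_space are illustrative helper names, and the 300 GB figure in the usage comment is simply the estimate quoted above:

```shell
# free_gb DIR: print free space, in whole GB, on the filesystem holding DIR.
# -P forces one line per filesystem, -k reports 1024-byte blocks.
free_gb() {
  df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

# has_space DIR NEEDED_GB: succeed only if at least NEEDED_GB are free.
has_space() {
  [ "$(free_gb "$1")" -ge "$2" ]
}

# Usage:
#   has_space ~/eukfinder_db 300 || echo "Less than 300 GB free; pick another location"
```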

Q9: How do I specify the database paths correctly?

A: Critical for success: Always use the common basename prefix of Centrifuge files, NOT the full path to individual files.

Correct:

eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020

Incorrect:

# These will fail:
eukfinder short_seqs ... --cdb /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020.1.cf
eukfinder short_seqs ... --cdb /path/to/Centrifuge_NewDB_Sept2020.cf

The basename should resolve to exactly 4 files:

Centrifuge_NewDB_Sept2020.1.cf
Centrifuge_NewDB_Sept2020.2.cf
Centrifuge_NewDB_Sept2020.3.cf
Centrifuge_NewDB_Sept2020.4.cf

Verify with:

ls -la /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020* | wc -l
# Should return: 4
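
The same check can be made stricter by testing for each of the four index files by name. This is a sketch only; check_centrifuge_prefix is a hypothetical helper, and the .1.cf to .4.cf naming is taken from the listing above:

```shell
# check_centrifuge_prefix PREFIX: succeed if PREFIX.1.cf .. PREFIX.4.cf
# all exist and are non-empty; report each missing file on stderr.
check_centrifuge_prefix() {
  prefix=$1
  missing=0
  for i in 1 2 3 4; do
    if [ ! -s "${prefix}.${i}.cf" ]; then
      echo "missing or empty: ${prefix}.${i}.cf" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage:
#   check_centrifuge_prefix /path/to/Centrifuge_DB/Centrifuge_NewDB_Sept2020 \
#     && echo "prefix looks valid"
```

Unlike counting glob matches, this also catches a truncated (zero-byte) index file and names the file that is missing.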

Q10: Can I use custom databases instead of the pre-built ones?

A: Yes. EukFinder supports custom Centrifuge and PLAST databases. See the wiki pages:

  • Centrifuge DB: Use centrifuge-build with custom genomes
  • PLAST DB: Protein sequences with corresponding taxonomy maps

Detailed instructions are in the Building_custom_DB/ folder of the repository, including scripts to:

  1. Download specific genomes from NCBI
  2. Build taxonomy maps
  3. Construct the database indices

Running the Pipeline

Q11: What are the two steps required for processing short-read Illumina data?

A: EukFinder requires two sequential steps:

Step 1: Read Preparation

Removes adapters, low-quality reads, and optionally host DNA contamination. There are two variants:

Option A: read_prep (for samples with host DNA contamination)

Use when your samples contain host-derived sequences that need to be removed (e.g., clinical samples, cultured samples):

eukfinder read_prep \
  --r1 raw_R1.fastq --r2 raw_R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa \
  --hcrop 0 --qscore 20 --mlen 60 \
  --hg host_genome.fasta \
  -o sample_name

Option B: read_prep_env (for environmental samples without host contamination)

Use for environmental metagenomes where you don't need to remove host DNA (faster, skips bowtie2 mapping):

eukfinder read_prep_env \
  --r1 raw_R1.fastq --r2 raw_R2.fastq \
  -n 20 -i TrueSeq2_NexteraSE-PE.fa \
  --hcrop 0 --qscore 20 --mlen 60 \
  -o sample_name

Key difference: read_prep_env does NOT require the --hg (host genome) parameter and skips the bowtie2 contamination removal step, making it faster for samples that don't have host contamination.

Step 2: Sequence Classification (short_seqs)

Classifies reads, assembles, and identifies eukaryotic sequences:

eukfinder short_seqs \
  --r1 sample_R1PT.fq --r2 sample_R2PT.fq \
  --un sample_R1unPT.fq \
  --pclass sample_centrifuge_P \
  --uclass sample_centrifuge_UP \
  -o sample -n 20 -z 2 -t F \
  --mhlen 50 -e 0.05 --pid 60 --cov 30 --max_m 300

Important: Do NOT skip the read prep step. The short_seqs step requires the processed files from Step 1. Choose between read_prep or read_prep_env based on your sample type.

Q12: How do I run EukFinder with long reads or assembled contigs?

A: Use the long_seqs workflow for PacBio, Nanopore, or pre-assembled contigs:

eukfinder long_seqs \
  -l assembled_contigs.fasta \
  -o output_prefix \
  --mhlen 50 \
  -n 20 -z 2 -t T \
  -e 0.05 --pid 60 --cov 30

Key differences from short_seqs:

  • Input is FASTA (not FASTQ)
  • Single classification round (no assembly step)
  • No need for paired/unpaired reads
  • Faster, as assembly is skipped

Q13: What is the difference between read_prep and read_prep_env?

A: Both remove adapters and perform quality filtering, but they differ in host contamination handling:

read_prep (Standard, for samples with host DNA):

  • Requires: --hg host_genome.fasta parameter
  • Removes host-derived reads using bowtie2 mapping
  • Use for: Clinical samples, cultured samples, or any sample where host contamination is present
  • Slower: Includes bowtie2 host removal step
  • Example:
    eukfinder read_prep --r1 R1.fastq --r2 R2.fastq \
      -n 20 -i TrueSeq2_NexteraSE-PE.fa \
      --hg human_genome.fasta -o sample

read_prep_env (For environmental metagenomes):

  • Does NOT require: --hg (host genome) parameter
  • Skips bowtie2 host removal step
  • Use for: Environmental samples (water, soil, sediment, etc.) where no host contamination is expected
  • Faster: More efficient for pure environmental samples
  • Example:
    eukfinder read_prep_env --r1 R1.fastq --r2 R2.fastq \
      -n 20 -i TrueSeq2_NexteraSE-PE.fa -o sample

Which should you use?

  • Use read_prep if your samples may contain host DNA (e.g., human gut samples, fecal samples, cultured organisms)
  • Use read_prep_env if your samples are from pure environmental sources without expected host contamination (saves time, same parameters otherwise)

Important: The output files from either workflow can be used with short_seqs identically. The choice affects only whether host reads are removed and how long preprocessing takes.
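
The choice between the two variants can be scripted. The sketch below only prints the command it would run, so it can be inspected first; read_prep_command is a hypothetical helper, and the flag values are the ones used in the Q11 examples rather than required settings:

```shell
# read_prep_command SAMPLE R1 R2 [HOST_GENOME]
# Prints (does not run) the preprocessing command for one sample:
# read_prep when a host genome is given, read_prep_env otherwise.
read_prep_command() {
  sample=$1
  r1=$2
  r2=$3
  host=${4:-}
  common="--r1 $r1 --r2 $r2 -n 20 -i TrueSeq2_NexteraSE-PE.fa --hcrop 0 --qscore 20 --mlen 60"
  if [ -n "$host" ]; then
    echo "eukfinder read_prep $common --hg $host -o $sample"
  else
    echo "eukfinder read_prep_env $common -o $sample"
  fi
}

# Usage:
#   read_prep_command gut1 gut1_R1.fastq gut1_R2.fastq host_genome.fasta
#   read_prep_command soil1 soil1_R1.fastq soil1_R2.fastq
```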


Q14: What working directory should I use to run EukFinder?

A: Important: EukFinder writes all output to the current working directory.

  • Ensure the directory is writable
  • Avoid running from read-only locations
  • On HPC clusters, run from a project/scratch directory, not home directory
  • Example:
    cd /scratch/myproject/sample1
    eukfinder short_seqs --r1 ... --r2 ... --un ... -o sample1

Parameter Optimization

Q14: What does --mhlen (minimum hit length) do and how should I set it?

A: --mhlen controls the minimum sequence length required for Centrifuge classification matches.

  • Default: 50 bp (works well for most datasets)
  • Purpose: Filters out spurious short matches
  • Lower values (e.g., 30-40): More sensitive, catches shorter reads, but higher false-positive rate
  • Higher values (e.g., 60-100): More stringent, fewer false positives, but may miss real eukaryotic sequences

Important: Do not set excessively high values (>100) without testing, as this can cause unexpected downstream failures.

Recommendation: Start with --mhlen 50 and adjust based on your results.

Q15: How do I optimize parameters for precision vs. recall?

A: EukFinder offers two parameter profiles:

"Strict" mode (high precision, lower recall):

eukfinder short_seqs ... --mhlen 50 -e 0.01 --pid 60 --cov 30
# Use the "Euk" output file only

  • Captures ~95% of real eukaryotic reads
  • High precision (~90%), i.e., a low false-positive rate
  • Good if you want high confidence in identifications

"Lenient" mode (balanced precision/recall):

eukfinder short_seqs ... --mhlen 30 -e 0.05 --pid 50 --cov 10
# Use the "EUnk" output file (Euk + Unknown)

  • Better balance between precision and recall
  • Useful if you need comprehensive recovery
  • Follow up with binning workflows to filter contaminants

Parameters:

  • -e: E-value threshold (lower = more stringent)
  • --pid: Percent identity (higher = more stringent)
  • --cov: Coverage threshold (higher = more stringent)

Q16: What are the -n (threads) and -z (chunks) parameters?

A:

-n (number of threads):

  • Controls parallelization
  • Default: 20
  • Set based on available CPU cores (don't exceed available cores)
  • Higher values speed up processing but increase memory usage
  • Recommendation: Use available cores on your HPC cluster

-z (number of chunks):

  • Splits input for PLAST processing
  • Default: 2
  • Higher values reduce memory per chunk but increase overhead
  • Use 2-4 for most datasets
  • For very large samples (>100M reads), try 4-8

Example for HPC with 40 cores:

eukfinder short_seqs ... -n 40 -z 4
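
Both values can be derived rather than hard-coded. This is a minimal sketch, assuming a Linux system with nproc; pick_threads and pick_chunks are hypothetical helpers, SLURM_CPUS_PER_TASK is only set when the job requested --cpus-per-task, and the values 4 and 2 are one choice from the ranges suggested above:

```shell
# pick_threads: use the SLURM CPU allocation if one is set,
# otherwise the number of cores visible on this machine.
pick_threads() {
  echo "${SLURM_CPUS_PER_TASK:-$(nproc)}"
}

# pick_chunks FASTQ: print 4 for samples above 100M reads, else the default 2.
# Read count is estimated as line count / 4 (uncompressed FASTQ).
pick_chunks() {
  reads=$(( $(wc -l < "$1") / 4 ))
  if [ "$reads" -gt 100000000 ]; then
    echo 4
  else
    echo 2
  fi
}

# Usage:
#   eukfinder short_seqs ... -n "$(pick_threads)" -z "$(pick_chunks sample_R1PT.fq)"
```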

Q17: What should --max_m (maximum memory for assembly) be set to?

A: Controls RAM allocated to metaSPAdes assembly.

  • Default: 300 GB
  • Typical range: 100-400 GB
  • Your system: Set to ~80% of available RAM

Example for 128 GB system:

eukfinder short_seqs ... --max_m 100

The assembly step is memory-intensive. If it crashes, reduce --max_m or reduce input size.
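
The 80% rule can be computed directly. This is a sketch under the assumption of a Linux host exposing /proc/meminfo; suggest_max_m is a hypothetical helper, and an explicit total in GB can be passed to override detection:

```shell
# suggest_max_m [TOTAL_GB]: print ~80% of total RAM in whole GB (rounded down).
# With no argument, total RAM is read from /proc/meminfo (MemTotal is in kB).
suggest_max_m() {
  total_gb=${1:-$(awk '/^MemTotal:/ { print int($2 / 1024 / 1024) }' /proc/meminfo)}
  echo $(( total_gb * 80 / 100 ))
}

# Usage:
#   eukfinder short_seqs ... --max_m "$(suggest_max_m)"
```

For a 128 GB system this gives 102, which the example above rounds to 100. Inside a scheduler job, pass the memory granted to the job rather than relying on the node total.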


Troubleshooting & Error Messages

Q18: What does the error "centrifuge database cannot be found" mean?

A: Centrifuge database path is incorrect or incomplete.

Causes:

  1. Wrong database prefix (see Q9)
  2. Database not fully extracted
  3. Database files corrupted or incomplete

Solution:

# Verify all 4 database files exist
ls -la /path/to/database/prefix*

# Should show 4 files with .1.cf, .2.cf, .3.cf, .4.cf extensions
# If not, re-extract the database

# Then verify the path is correct
eukfinder short_seqs ... --cdb /path/to/database/prefix

Q19: Why did my run terminate during validation?

A: Common causes:

  1. Centrifuge database issue: See Q18
  2. Unpaired reads file missing or empty: Provide a valid --un file
  3. Temporary directory permissions: Ensure write permissions in current directory
  4. Memory exceeded: Reduce --max_m or increase available system RAM

Debugging:

  • Check the log file (usually Class_DATE.log)
  • Verify all input files exist and are readable
  • Ensure sufficient disk space for temporary files
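
Several of these causes can be ruled out before launching a run. The following pre-flight check is a sketch only; check_inputs is a hypothetical helper, not an EukFinder feature:

```shell
# check_inputs FILE...: succeed only if every file is readable and non-empty.
# Problems are reported on stderr, one line per file.
check_inputs() {
  status=0
  for f in "$@"; do
    if [ ! -r "$f" ]; then
      echo "missing or not readable: $f" >&2
      status=1
    elif [ ! -s "$f" ]; then
      echo "empty file: $f" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_inputs sample_R1PT.fq sample_R2PT.fq sample_R1unPT.fq || exit 1
```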

Q20: What should I do if FastQC reports overrepresented sequences?

A: FastQC may detect partial adapter sequences in quality control, but this is normal.

Causes:

  • Reads are truncated; FastQC sees partial adapter sequences
  • Full-length adapters are in the adapter file, but reads are shorter

Solution:

  • This is expected and usually not a problem
  • Trimmomatic will remove full-length adapters during preprocessing
  • Only investigate if overrepresented sequences are not adapter-related

To verify:

  1. BLAST the overrepresented sequence against NCBI nt database
  2. Check if it's a known adapter or actual contamination
  3. If contamination, consider quality filtering or additional preprocessing

Q21: How do I handle gzipped input files?

A: EukFinder does NOT accept gzipped (.gz) files as input.

Solution:

# Decompress your files
gunzip your_file.fastq.gz
gunzip your_file.fasta.gz

# Then run EukFinder with uncompressed files
eukfinder short_seqs --r1 your_file.fastq --r2 ... --un ...

Alternatively, decompress into a temporary file:

zcat your_file.fastq.gz > temp_file.fastq
eukfinder short_seqs --r1 temp_file.fastq ...
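
When a sample list mixes compressed and uncompressed files, the decision can be wrapped in a small function. This is a sketch, assuming files are identified by a .gz suffix; decompress_if_gz is a hypothetical helper:

```shell
# decompress_if_gz FILE: print the path of an uncompressed version of FILE.
# A .gz input is expanded next to the original (which is kept);
# any other input is returned unchanged.
decompress_if_gz() {
  in=$1
  case "$in" in
    *.gz)
      out=${in%.gz}
      if [ ! -e "$out" ]; then
        # gzip -dc writes to stdout, so the .gz file is left untouched
        gzip -dc "$in" > "$out" || { rm -f "$out"; return 1; }
      fi
      echo "$out"
      ;;
    *)
      echo "$in"
      ;;
  esac
}

# Usage:
#   r1=$(decompress_if_gz raw_R1.fastq.gz)
#   r2=$(decompress_if_gz raw_R2.fastq.gz)
```

Keeping the compressed original costs disk space but makes it safe to delete the expanded copy once the run has finished.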

Output Interpretation

Q22: What do the different output classification files mean?

A: EukFinder produces multiple output files with different classifications:

  • Euk: High-confidence eukaryotic contigs
  • Unk: Unknown/unclassified contigs
  • EUnk: Combined Euk + Unk (more permissive)
  • Bact: Bacterial contigs
  • Arch: Archaeal contigs
  • Misc: Miscellaneous/viral contigs

Which to use:

  • For strict analysis: Use "Euk" file only
  • For comprehensive recovery: Use "EUnk" file, then filter with downstream binning
  • Most studies use "Euk" or "EUnk" depending on research goals

Q23: What is the Eukfinder_results/ and Eukfinder_Temps/ folder structure?

A:

Eukfinder_results/:

  • Contains final classified assembled contigs (scaffolds)
  • Files named: scf_output_prefix.Classification.fasta
  • Main output for downstream analyses

Eukfinder_Temps/:

  • Contains intermediate results from each stage
  • Folders for temporary files from first classification round (tmps_output_prefix_DATE)
  • Useful for debugging but can be deleted after successful completion to save space

Q24: How do I interpret the summary table?

A: The summary_table.txt contains:

  • Group: Classification type (Euk, Unk, Bact, etc.)
  • #Seq: Number of contigs in that classification
  • Total size(bp): Combined length of all sequences

Example interpretation:

Group      #Seq    Total size(bp)
Euk        234     5,234,567
Unk        156     2,654,321
Bact       1,200   45,678,900

Here, you recovered 234 eukaryotic contigs totaling ~5.2 Mbp.
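
The same figures can be pulled out with awk. This is a sketch that assumes the whitespace-separated layout shown in the example, with the group name in the first column and the size in the last; group_mbp is a hypothetical helper, and the real file's delimiter should be checked before relying on it:

```shell
# group_mbp TABLE GROUP: print "<#Seq> <total size in Mbp>" for one group.
group_mbp() {
  awk -v g="$2" '$1 == g {
    n = $2
    size = $NF
    gsub(/,/, "", n)       # drop thousands separators if present
    gsub(/,/, "", size)
    printf "%s %.2f\n", n, size / 1000000
  }' "$1"
}

# Usage:
#   group_mbp summary_table.txt Euk
```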


System Requirements & Performance

Q25: How long does EukFinder take to process a sample?

A: Runtime depends heavily on input size and system resources:

Typical benchmarks (from publication, 40 CPUs, 188 GB RAM):

  • Input: 30.2M reads, 6.0 Gbp total
  • eukfinder_short: 557 minutes (~9 hours)
  • eukfinder_long: 98 minutes (~1.6 hours) using assembled contigs

Factors affecting runtime:

  • Input size (reads/contigs)
  • Number of threads allocated
  • Memory available (affects assembly speed)
  • Database size (affects classification speed)

Optimization:

  • Allocate more cores (via the -n parameter)
  • Use more RAM (for metaSPAdes assembly)
  • Process smaller samples in batches if needed

Q26: How much temporary disk space do I need?

A: Temporary space varies by sample size:

Estimate: Allocate 2-5x the input file size for temporary files

Example:

  • Input: 10 GB of reads
  • Recommended temporary space: 20-50 GB
  • Databases: 150 GB
  • Total: ~180-210 GB minimum for this sample (input + temporary space + databases)

On HPC clusters:

  • Use scratch/work directories, not home directories
  • Monitor usage with du -sh
  • Clean up temporary files after completion (can save 50+ GB per sample)
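
The 2-5x rule of thumb is easy to script when planning scratch allocations for many samples. This is a minimal sketch; temp_space_range is a hypothetical helper that takes the input size in whole GB:

```shell
# temp_space_range INPUT_GB: print "<low> <high>" temporary space in GB,
# i.e. 2x and 5x the input size.
temp_space_range() {
  echo "$(( $1 * 2 )) $(( $1 * 5 ))"
}

# Usage:
#   set -- $(temp_space_range 10)
#   echo "Reserve between $1 and $2 GB of scratch space"
```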

Q27: How do I run multiple samples in parallel?

A: Use job arrays or parallel processing:

Option 1: Shell loop (simple)

for sample in S1 S2 S3 S4; do
  eukfinder short_seqs --r1 "${sample}_R1.fastq" --r2 "${sample}_R2.fastq" \
    --un "${sample}_un.fastq" -o "$sample" &
done
wait  # block until all background jobs have finished

Option 2: HPC job array (recommended)

#!/bin/bash
# SLURM example: run 4 array tasks, at most 2 at a time
#SBATCH --array=1-4%2

SAMPLES=(S1 S2 S3 S4)
SAMPLE=${SAMPLES[$((SLURM_ARRAY_TASK_ID - 1))]}

eukfinder short_seqs --r1 "${SAMPLE}_R1.fastq" --r2 "${SAMPLE}_R2.fastq" \
  --un "${SAMPLE}_un.fastq" -o "$SAMPLE"

Option 3: GNU Parallel

parallel eukfinder short_seqs --r1 {}_R1.fastq --r2 {}_R2.fastq \
  --un {}_un.fastq -o {} ::: S1 S2 S3 S4
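
When several samples run at once on one machine, each job's -n value should be scaled down so the jobs together do not exceed the available cores. This is a sketch of that split; threads_per_job is a hypothetical helper:

```shell
# threads_per_job TOTAL_CORES CONCURRENT_JOBS: print the -n value for each job.
threads_per_job() {
  t=$(( $1 / $2 ))
  [ "$t" -ge 1 ] || t=1   # never hand a job zero threads
  echo "$t"
}

# Usage sketch (commented out; requires EukFinder and real input files):
#   n=$(threads_per_job 40 4)
#   for sample in S1 S2 S3 S4; do
#     eukfinder short_seqs ... -n "$n" -o "$sample" &
#   done
#   wait
```

Memory needs the same treatment: concurrent assemblies each use up to --max_m, so the per-job value should be divided accordingly.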

Advanced Topics

Q28: How do I build a custom Centrifuge database for my specific environment?

A: See the Building_custom_DB/ directory in the repository for detailed instructions. General steps:

  1. Select target genomes:

    • Identify organisms of interest using pilot data analysis
    • Download relevant genomes from NCBI using ncbi-genome-download
  2. Create taxonomy mapping:

    • Use provided scripts: Build_Centrifuge_map_from_assembly_report.py
    • Maps sequence IDs to taxonomic IDs
  3. Build database:

    centrifuge-build -p 16 --bmax 1342177280 \
      --conversion-table genome2taxid.map \
      --taxonomy-tree taxonomy/nodes.dmp \
      --name-table taxonomy/names.dmp \
      genome.fasta database_name

Q29: Can I use EukFinder on non-Linux systems?

A: EukFinder is developed primarily for Linux environments. macOS and other non-Linux systems are not explicitly supported. Issues may arise with:

  • Conda package availability for your OS
  • Compiler compatibility for dependencies
  • Path handling differences

Recommendation: If you do not have a native Linux system, use Docker, WSL2 (on Windows), or a Linux virtual machine.


Data Quality & Preprocessing

Q30: What read quality should I target before running EukFinder?

A: High-quality input improves results:

Recommended preprocessing:

  • Quality score: ≥Q20 (99% base-call accuracy)
  • Length: ≥60 bp after trimming
  • Contamination: Remove obvious contaminants (host DNA, adapters)

EukFinder's read_prep step handles:

  • Adapter trimming
  • Quality filtering
  • Host genome removal (optional)

Your inputs should:

  • Be in FASTQ format (not gzipped)
  • Have standard quality encoding (auto-detected)
  • Be properly paired (for short_seqs)

Contact & Support

For additional help:


Last Updated: January 2025
EukFinder v1.2.4
