AnnoTater

Introduction

AnnoTater is a whole or partial genome functional annotation workflow built using Nextflow. It takes a set of protein-coding gene sequences (in either nucleotide or protein FASTA format) and runs InterProScan, plus BLAST searches against UniProt Swiss-Prot, NCBI NR, NCBI RefSeq, OrthoDB and STRING, to provide a first-pass set of annotations for genes.

AnnoTater is constructed using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.

AnnoTater provides the following steps:

  1. Homology searching against specified databases using Diamond BLAST (Diamond). Supported databases include:
    • NCBI nr
    • NCBI RefSeq
    • ExPASy SwissProt
    • ExPASy TrEMBL
    • STRING database
  2. Execution of InterProScan

Usage

  1. Install Nextflow (>=25.04.07)

  2. Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Conda is currently not supported; see docs).

  3. Download the pipeline to access the database download script:

    nextflow pull systemsgenetics/annotater
  4. Download databases. AnnoTater requires reference databases which can take quite a while to download and consume large amounts of storage. Use the included download.py script to retrieve and index the databases:

    # List available datasets
    python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --list --outdir /path/to/databases
    
    # Download specific datasets (comma-separated)
    python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,uniprot_sprot,nr
    
    # Download all datasets
    python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,panther,nr,refseq_plant,orthodb,string-db,uniprot_sprot,uniprot_trembl
    
    # Force re-download if datasets already exist
    python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan --reset

    Available datasets include:

    • interproscan - InterProScan 5.75-106.0 with protein domain databases
    • panther - PANTHER database for InterProScan
    • nr - NCBI Non-Redundant protein database
    • refseq_plant - NCBI RefSeq plant protein sequences
    • orthodb - OrthoDB orthologous protein groups (analysis under construction)
    • string-db - STRING database protein interactions (analysis under construction)
    • uniprot_sprot - UniProt Swiss-Prot curated proteins
    • uniprot_trembl - UniProt TrEMBL unreviewed proteins

    Note: Database downloads can be very large (tens to hundreds of GB) and may take hours to complete. The script supports resuming interrupted downloads.

    Best Practice: On multi-user systems, it is recommended to download databases to a shared location (e.g., /shared/databases/annotater/ or /opt/databases/annotater/) that is accessible to all users. This prevents duplicate downloads and saves significant storage space, as multiple users can reference the same database files in their workflow runs.

  5. Start running your own analysis!

    nextflow run systemsgenetics/annotater \
        -profile <docker/singularity/podman/shifter/charliecloud/institute> \
        --batch_size 100 \
        --input <fasta file> \
        --data_sprot <directory with swissprot diamond index> \
        --data_refseq <directory with refseq diamond index> \
        --data_ipr <directory with InterProScan data>
    

    See the Parameters section below for a complete list of all available options.
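When the same run has to be launched for several FASTA files, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of AnnoTater: `build_annotater_command` is a hypothetical helper name, and it only uses flags that appear in this README (`-profile`, `--input`, `--outdir`, `--batch_size` and the `--data_*` options).

```python
import shlex
from typing import Dict, List, Optional


def build_annotater_command(
    fasta: str,
    outdir: str,
    profile: str = "docker",
    batch_size: int = 100,
    databases: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argv list for a `nextflow run` call (hypothetical helper)."""
    cmd = [
        "nextflow", "run", "systemsgenetics/annotater",
        "-profile", profile,
        "--input", fasta,
        "--outdir", outdir,
        "--batch_size", str(batch_size),
    ]
    # Append only the database flags that were supplied, for example
    # {"data_sprot": "/path/to/databases/sprot"}. Sorting keeps the
    # resulting command stable between runs.
    for flag, path in sorted((databases or {}).items()):
        cmd += [f"--{flag}", path]
    return cmd


if __name__ == "__main__":
    argv = build_annotater_command(
        "proteins.faa",
        "results",
        databases={"data_sprot": "/path/to/sprot", "data_ipr": "/path/to/ipr"},
    )
    # Print a copy-pastable shell command; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Returning a list rather than a single string avoids shell-quoting problems with paths that contain spaces.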

Parameters

Required Parameters

  • --input <path> - Path to input FASTA file containing nucleotide or protein sequences. Accepted formats: .fa, .faa, .fna, .fasta
  • --outdir <path> - Output directory where results will be saved

Database Parameters

Add any combination of the databases you've downloaded:

  • --data_ipr <path> - InterProScan database directory
  • --data_sprot <path> - UniProt Swiss-Prot database directory (Diamond index)
  • --data_trembl <path> - UniProt TrEMBL database directory (Diamond index)
  • --data_nr <path> - NCBI NR database directory (Diamond index)
  • --data_refseq <path> - NCBI RefSeq database directory (Diamond index)
  • --data_orthodb <path> - OrthoDB database directory (Diamond index) (experimental)
  • --data_string <path> - STRING database directory (Diamond index) (experimental)
  • --enzyme_dat <path> - Enzyme.dat file from ExPASy for EC number extraction (used with SwissProt)

Processing Parameters

  • --batch_size <number> - Number of sequences to process per batch (default: 5)
  • --seq_type <pep|nuc> - Input sequence type: pep for protein or nuc for nucleotide (default: pep)
  • --taxonomy_ID <number> - NCBI Taxonomy ID for the species (used by STRING database analysis)
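To make the meaning of `--batch_size` concrete, the sketch below splits FASTA records into groups of at most N sequences. It is an illustration only: how AnnoTater splits the input internally is not described here, and `read_fasta` and `batch_records` are hypothetical names.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def batch_records(records: Iterable[Record], batch_size: int = 5) -> Iterator[List[Record]]:
    """Group records into lists of at most `batch_size` items."""
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch
```

With the default of 5, an input of 12 sequences would be grouped as 5, 5 and 2 under this scheme.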

Output Parameters

  • --email <email> - Email address for completion summary
  • --publish_dir_mode <mode> - Method to save results: symlink, link, copy, copyNoFollow (default: link)

Standard Nextflow Parameters

  • --help - Display help message and exit
  • --version - Display version and exit
  • -profile <profile> - Configuration profile: docker, singularity, podman, etc.
  • -resume - Resume previous run from cached results
  • -work-dir <path> - Specify work directory for temporary files

Output Format

AnnoTater produces several types of output files for functional annotation:

Directory Structure

The pipeline outputs are organized in the specified output directory:

results/
├── DIAMOND/
│   ├── blastp_{prefix}_uniprot_sprot.out
│   ├── blastp_{prefix}_uniprot_trembl.out
│   ├── blastp_{prefix}_nr.out
│   └── blastp_{prefix}_refseq_plant.out
├── InterProScan/
│   ├── {prefix}.IPR_mappings.txt
│   ├── {prefix}.GO_mappings.txt
│   └── {prefix}.tsv
├── ECnumbers/
│   └── {prefix}_EC_mappings.txt (if using SwissProt with enzyme.dat)
├── orthodb/
│   └── ortholog_results/ (experimental)
├── string/
│   └── interaction_results/ (experimental)
└── pipeline_info/
    ├── execution_report.html
    ├── execution_timeline.html
    └── annotater_software_versions.yml
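A quick way to confirm that a run produced the DIAMOND files listed in the tree above is to build the expected paths and check which ones are missing. This is a minimal sketch with hypothetical helper names. It assumes that the `blastp_` prefix is used for protein input and `blastx_` for nucleotide input, which this README implies but does not state.

```python
from pathlib import Path
from typing import Iterable, List


def expected_diamond_outputs(
    outdir: str, prefix: str, databases: Iterable[str], seq_type: str = "pep"
) -> List[Path]:
    """Build the expected DIAMOND result paths for the given databases."""
    # Assumption: protein input yields blastp_* files, nucleotide input blastx_*.
    program = "blastp" if seq_type == "pep" else "blastx"
    return [
        Path(outdir) / "DIAMOND" / f"{program}_{prefix}_{db}.out"
        for db in databases
    ]


def missing_outputs(paths: Iterable[Path]) -> List[Path]:
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not p.is_file()]


if __name__ == "__main__":
    paths = expected_diamond_outputs("results", "myprefix", ["uniprot_sprot", "nr"])
    for path in missing_outputs(paths):
        print(f"missing: {path}")
```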

DIAMOND BLAST Results

For each database searched (SwissProt, TrEMBL, NR, RefSeq), the pipeline generates:

  • Combined DIAMOND results: blastp_{prefix}_{database}.out or blastx_{prefix}_{database}.out - Tab-separated file containing all BLAST hits with detailed alignment information
  • Raw DIAMOND XML: Individual XML files for each batch, later combined
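The combined tab-separated file can be reduced to one best hit per query with a few lines of Python. The sketch below assumes the file uses the standard 12-column BLAST tabular layout (`qseqid` through `bitscore`). The exact columns written by AnnoTater are not listed in this README, so check a real output file and adjust `BLAST_TAB_COLUMNS` before relying on it. `best_hits` is a hypothetical helper name.

```python
import csv
from typing import Dict, Iterable

# Assumption: standard BLAST tabular (outfmt 6) column order.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep the highest-bitscore hit for each query sequence."""
    best: Dict[str, dict] = {}
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comments and rows that are too short to parse.
        if not row or row[0].startswith("#") or len(row) < len(BLAST_TAB_COLUMNS):
            continue
        hit = dict(zip(BLAST_TAB_COLUMNS, row))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `best_hits(open("results/DIAMOND/blastp_myprefix_nr.out"))`, where the file name is a placeholder.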

InterProScan Results

InterProScan analysis produces comprehensive functional annotation files:

  • IPR mappings: {prefix}.IPR_mappings.txt - Tab-separated file mapping genes to InterPro entries

    Gene            IPR         Description
    LOC_Os01g01010  IPR001841   Zinc finger, RING-type
    LOC_Os01g01010  IPR013083   Zinc finger, RING/FYVE/PHD-type
    
  • GO mappings: {prefix}.GO_mappings.txt - Tab-separated file mapping genes to Gene Ontology terms

    Gene            GO
    LOC_Os01g01010  GO:0005507
    LOC_Os01g01010  GO:0016491
    
  • Combined TSV: {prefix}.tsv - Complete InterProScan output with all annotations including domains, signatures, and functional classifications
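The IPR and GO mapping files share a simple gene-first layout, so one small reader can load either into a dictionary keyed by gene. This is a minimal sketch with a hypothetical function name. It assumes that the files are tab-separated as described above, and it skips a header row starting with `Gene` if one is present, since whether the file contains that row is not stated here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def read_mapping(lines: Iterable[str], header_first_field: str = "Gene") -> Dict[str, List[Tuple[str, ...]]]:
    """Map each gene to the list of annotation tuples found for it."""
    mapping: Dict[str, List[Tuple[str, ...]]] = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines and an optional header row.
        if len(fields) < 2 or fields[0] == header_first_field:
            continue
        gene, values = fields[0], tuple(fields[1:])
        mapping[gene].append(values)
    return dict(mapping)


if __name__ == "__main__":
    example = [
        "Gene\tIPR\tDescription",
        "LOC_Os01g01010\tIPR001841\tZinc finger, RING-type",
        "LOC_Os01g01010\tIPR013083\tZinc finger, RING/FYVE/PHD-type",
    ]
    for gene, entries in read_mapping(example).items():
        print(gene, [ipr for ipr, _description in entries])
```

For a GO mapping file each tuple holds a single element, the GO term.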

Database-Specific Notes

  • OrthoDB analysis: Currently under construction
  • STRING database analysis: Currently under construction
  • EC number extraction: Generated when using SwissProt database with enzyme.dat file

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
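Since Nextflow's `-params-file` option accepts JSON as well as YAML, a parameters file can be generated from Python. The sketch below is one way to do that. `write_params_file` is a hypothetical helper, the keys are the parameter names documented above without the leading `--`, and the paths are placeholders.

```python
import json
from pathlib import Path
from typing import Any, Dict


def write_params_file(path: str, **params: Any) -> Dict[str, Any]:
    """Write pipeline parameters to a JSON file, dropping unset (None) values."""
    cleaned = {key: value for key, value in params.items() if value is not None}
    Path(path).write_text(json.dumps(cleaned, indent=2) + "\n")
    return cleaned


if __name__ == "__main__":
    write_params_file(
        "params.json",
        input="proteins.faa",            # placeholder FASTA path
        outdir="results",
        batch_size=100,
        seq_type="pep",
        data_sprot="/path/to/databases/sprot",  # placeholder directory
        data_nr=None,                     # omitted from the file
    )
```

The resulting file can then be passed to the pipeline with `nextflow run systemsgenetics/annotater -profile docker -params-file params.json`.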

Credits

AnnoTater was written by the Ficklin Computational Biology Team at Washington State University. Development of AnnoTater was initially funded by the U.S. National Science Foundation (NSF) Award #1659300.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

AnnoTater is currently unpublished. For now, please use the GitHub URL when referencing. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
