AnnoTater

Introduction

AnnoTater is a whole or partial genome functional annotation workflow built using Nextflow. It takes a set of protein-coding gene sequences (in either nucleotide or protein FASTA format) and runs InterProScan plus BLAST-style homology searches against UniProt Swiss-Prot, NCBI NR, NCBI RefSeq, OrthoDB and STRING to provide a first-pass set of annotations for genes.

Nextflow is a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. AnnoTater uses Docker/Singularity containers, which makes installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

AnnoTater provides the following steps:

Homology searching against the specified databases using DIAMOND in BLAST mode. Supported databases include:
- NCBI nr
- NCBI RefSeq
- ExPASy SwissProt
- ExPASy TrEMBL
- STRING database
Execution of InterProScan

Usage

Install Nextflow (>=25.04.07)
Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Conda is currently not supported; see docs).
Download the pipeline to access the database download script:
```
nextflow pull systemsgenetics/annotater
```
Download databases. AnnoTater requires reference databases which can take quite a while to download and consume large amounts of storage. Use the included download.py script to retrieve and index the databases:
```
# List available datasets
python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --list --outdir /path/to/databases

# Download specific datasets (comma-separated)
python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,uniprot_sprot,nr

# Download all datasets
python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,panther,nr,refseq_plant,orthodb,string-db,uniprot_sprot,uniprot_trembl

# Force re-download if datasets already exist
python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan --reset
```
Available datasets include:
- interproscan - InterProScan 5.75-106.0 with protein domain databases
- panther - PANTHER database for InterProScan
- nr - NCBI Non-Redundant protein database
- refseq_plant - NCBI RefSeq plant protein sequences
- orthodb - OrthoDB orthologous protein groups (analysis under construction)
- string-db - STRING database protein interactions (analysis under construction)
- uniprot_sprot - UniProt Swiss-Prot curated proteins
- uniprot_trembl - UniProt TrEMBL unreviewed proteins
Note: Database downloads can be very large (tens to hundreds of GB) and may take hours to complete. The script supports resuming interrupted downloads.

Best Practice: On multi-user systems, it is recommended to download databases to a shared location (e.g., /shared/databases/annotater/ or /opt/databases/annotater/) that is accessible to all users. This prevents duplicate downloads and saves significant storage space, as multiple users can reference the same database files in their workflow runs.
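
Because the downloads run to tens or hundreds of GB, it can help to confirm that the target filesystem has room before starting. The following is a minimal sketch, not part of AnnoTater: the helper names `free_gib` and `has_room` are hypothetical, and the space threshold is something you choose yourself, since the exact size of each dataset is not stated here.

```python
import shutil
from pathlib import Path

GIB = 1024 ** 3  # bytes per GiB


def free_gib(path):
    """Return the free space, in GiB, on the filesystem that holds `path`.

    Walks up to the nearest existing parent directory, so the check also
    works before the database directory has been created.
    """
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / GIB


def has_room(path, needed_gib):
    """True if at least `needed_gib` GiB are free at `path`.

    `needed_gib` is a user-supplied estimate (hypothetical value), not a
    figure published by the pipeline.
    """
    return free_gib(path) >= needed_gib
```

For example, `has_room("/shared/databases/annotater", 500)` would report whether roughly 500 GiB are available at the shared location before `download.py` is launched.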

Start running your own analysis!

nextflow run systemsgenetics/annotater \
    -profile <docker/singularity/podman/shifter/charliecloud/institute> \
    --batch_size 100 \
    --input <fasta file> \
    --outdir <output directory> \
    --data_sprot <directory with swissprot diamond index> \
    --data_refseq <directory with refseq diamond index> \
    --data_ipr <directory with InterProScan data>

See the Parameters section below for a complete list of all available options.

Parameters

Required Parameters

--input <path> - Path to input FASTA file containing nucleotide or protein sequences. Accepted formats: .fa, .faa, .fna, .fasta
--outdir <path> - Output directory where results will be saved

Database Parameters

Add any combination of the databases you've downloaded:

--data_ipr <path> - InterProScan database directory
--data_sprot <path> - UniProt Swiss-Prot database directory (Diamond index)
--data_trembl <path> - UniProt TrEMBL database directory (Diamond index)
--data_nr <path> - NCBI NR database directory (Diamond index)
--data_refseq <path> - NCBI RefSeq database directory (Diamond index)
--data_orthodb <path> - OrthoDB database directory (Diamond index) (experimental)
--data_string <path> - STRING database directory (Diamond index) (experimental)
--enzyme_dat <path> - Enzyme.dat file from ExPASy for EC number extraction (used with SwissProt)
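
Since any combination of databases can be passed, a small helper can assemble the command line from whichever databases are available locally. This is a sketch under stated assumptions: `build_command` is a hypothetical helper, the paths are placeholders, and only the flags listed above are used.

```python
import shlex

# Database flags exactly as listed in this README (without the leading "--").
KNOWN_DB_FLAGS = {
    "data_ipr", "data_sprot", "data_trembl", "data_nr",
    "data_refseq", "data_orthodb", "data_string", "enzyme_dat",
}


def build_command(fasta, outdir, databases, profile="docker", batch_size=None):
    """Return the `nextflow run` argument list for the given databases.

    `databases` maps a flag name (e.g. "data_sprot") to a local path.
    Entries whose path is None or empty are skipped, so only databases
    that were actually downloaded end up on the command line.
    """
    cmd = [
        "nextflow", "run", "systemsgenetics/annotater",
        "-profile", profile,
        "--input", str(fasta),
        "--outdir", str(outdir),
    ]
    if batch_size is not None:
        cmd += ["--batch_size", str(batch_size)]
    for flag, path in databases.items():
        if flag not in KNOWN_DB_FLAGS:
            raise ValueError(f"unknown database flag: {flag}")
        if path:
            cmd += [f"--{flag}", str(path)]
    return cmd


def as_shell(cmd):
    """Render the argument list as a copy-pasteable shell command."""
    return shlex.join(cmd)
```

Calling `as_shell(build_command("genes.faa", "results", {"data_sprot": "/path/to/databases/sprot"}))` yields a single command string that can be pasted into a terminal or passed to a job scheduler.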

Processing Parameters

--batch_size <number> - Number of sequences to process per batch (default: 5)
--seq_type <pep|nuc> - Input sequence type: pep for protein or nuc for nucleotide (default: pep)
--taxonomy_ID <number> - NCBI Taxonomy ID for the species (used by STRING database analysis)
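
To make the effect of `--batch_size` concrete, the sketch below groups FASTA records into batches of at most N sequences. It illustrates the idea only and is not the pipeline's own splitting code; `read_fasta` and `batches` are hypothetical names.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batches(records, batch_size=5):
    """Group records into lists of at most `batch_size` items.

    The default of 5 mirrors the documented default of --batch_size.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    it = iter(records)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk
```

A larger batch size means fewer, longer-running tasks; a smaller one means more tasks that can be spread across more compute nodes.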

Output Parameters

--email <email> - Email address for completion summary
--publish_dir_mode <mode> - Method to save results: symlink, link, copy, copyNoFollow (default: link)

Standard Nextflow Parameters

--help - Display help message and exit
--version - Display version and exit
-profile <profile> - Configuration profile: docker, singularity, podman, etc.
-resume - Resume previous run from cached results
-work-dir <path> - Specify work directory for temporary files

Output Format

AnnoTater produces several types of output files for functional annotation:

Directory Structure

The pipeline outputs are organized in the specified output directory:

results/
├── DIAMOND/
│   ├── blastp_{prefix}_uniprot_sprot.out
│   ├── blastp_{prefix}_uniprot_trembl.out
│   ├── blastp_{prefix}_nr.out
│   └── blastp_{prefix}_refseq_plant.out
├── InterProScan/
│   ├── {prefix}.IPR_mappings.txt
│   ├── {prefix}.GO_mappings.txt
│   └── {prefix}.tsv
├── ECnumbers/
│   └── {prefix}_EC_mappings.txt (if using SwissProt with enzyme.dat)
├── orthodb/
│   └── ortholog_results/ (experimental)
├── string/
│   └── interaction_results/ (experimental)
└── pipeline_info/
    ├── execution_report.html
    ├── execution_timeline.html
    └── annotater_software_versions.yml

DIAMOND BLAST Results

For each database searched (SwissProt, TrEMBL, NR, RefSeq), the pipeline generates:

Combined DIAMOND results: blastp_{prefix}_{database}.out or blastx_{prefix}_{database}.out - Tab-separated file containing all BLAST hits with detailed alignment information
Raw DIAMOND XML: Individual XML files for each batch, later combined
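
A common first step with the combined tab-separated file is to keep the best-scoring hit per query. The sketch below assumes the file uses the 12 standard BLAST tabular columns. This README does not state the column layout, so check your own output first. `best_hits` is a hypothetical helper, and it raises an error when the column count does not match rather than guessing.

```python
# ASSUMPTION: the standard 12-column BLAST tabular layout. Pass your own
# `columns` list if the pipeline's combined file is laid out differently.
BLAST_TAB_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines, columns=BLAST_TAB_COLUMNS):
    """Return {query_id: row_dict} with the highest-bitscore hit per query.

    `lines` is any iterable of tab-separated text lines, for example an
    open file handle. `columns` must include "qseqid" and "bitscore".
    """
    best = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) != len(columns):
            raise ValueError(
                "expected %d columns, found %d" % (len(columns), len(fields))
            )
        row = dict(zip(columns, fields))
        query = row["qseqid"]
        if query not in best or float(row["bitscore"]) > float(best[query]["bitscore"]):
            best[query] = row
    return best
```

With a results file on disk this would be used as `with open("results/DIAMOND/blastp_{prefix}_uniprot_sprot.out") as fh: hits = best_hits(fh)`, substituting the real prefix.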

InterProScan Results

InterProScan analysis produces comprehensive functional annotation files:

IPR mappings: {prefix}.IPR_mappings.txt - Tab-separated file mapping genes to InterPro entries

Gene            IPR         Description
LOC_Os01g01010  IPR001841   Zinc finger, RING-type
LOC_Os01g01010  IPR013083   Zinc finger, RING/FYVE/PHD-type

GO mappings: {prefix}.GO_mappings.txt - Tab-separated file mapping genes to Gene Ontology terms
```
Gene            GO
LOC_Os01g01010  GO:0005507
LOC_Os01g01010  GO:0016491
```
Combined TSV: {prefix}.tsv - Complete InterProScan output with all annotations including domains, signatures, and functional classifications
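
The IPR and GO mapping files have one gene-to-term pair per line, so they can be grouped by gene with a few lines of code. This is a minimal sketch, assuming a tab-separated file with a header row beginning with `Gene`, as in the samples above; `read_mapping` is a hypothetical helper.

```python
from collections import defaultdict


def read_mapping(lines, key_col=0, value_col=1):
    """Group a tab-separated mapping file by gene.

    Returns {gene: [value, ...]} in file order. The header row (first
    field "Gene") and lines with too few columns are skipped.
    """
    mapping = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= max(key_col, value_col):
            continue
        gene = fields[key_col].strip()
        if gene == "Gene":
            continue  # header row, as shown in the sample files
        mapping[gene].append(fields[value_col].strip())
    return dict(mapping)
```

For the GO file the defaults apply as they are. For the IPR file, `value_col=1` returns the InterPro accession and `value_col=2` returns the description.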

Database-Specific Notes

OrthoDB analysis: Currently under construction
STRING database analysis: Currently under construction
EC number extraction: Generated when using SwissProt database with enzyme.dat file

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
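
Nextflow's `-params-file` option accepts a JSON or YAML file whose keys are the parameter names without the leading `--`. The sketch below writes such a file from Python. The helper names `render_params` and `write_params` and the file name `params.json` are hypothetical; the parameter names are the ones documented in the Parameters section.

```python
import json
from pathlib import Path


def render_params(**params):
    """Return a JSON document of pipeline parameters.

    Keys are parameter names without the leading "--" (e.g. input,
    outdir, data_sprot). Parameters set to None are left out.
    """
    clean = {key: value for key, value in params.items() if value is not None}
    return json.dumps(clean, indent=2) + "\n"


def write_params(path, **params):
    """Write the parameters to `path` for use with `-params-file`."""
    Path(path).write_text(render_params(**params))


# Example (paths are placeholders):
#   write_params("params.json", input="genes.faa", outdir="results",
#                batch_size=100, data_sprot="/path/to/databases/sprot")
# Then run:
#   nextflow run systemsgenetics/annotater -profile docker -params-file params.json
```

Keeping parameters in a file makes a run easier to repeat and to share than a long command line.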

Credits

AnnoTater was written by the Ficklin Computational Biology Team at Washington State University. Development of AnnoTater was initially funded by the U.S. National Science Foundation (NSF) Award #1659300.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

AnnoTater is currently unpublished. For now, please use the GitHub URL when referencing. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
