AnnoTater
AnnoTater is a whole or partial genome functional annotation workflow built using Nextflow. It takes a set of protein-coding gene sequences (in either nucleotide or protein FASTA format) and runs InterProScan and BLAST searches against UniProt SwissProt, NCBI NR, NCBI RefSeq, OrthoDB and StringDB to provide a first-pass set of annotations for genes.
AnnoTater is constructed using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies.
AnnoTater provides the following steps:
- Homology searching against specified databases using Diamond BLAST (Diamond). Supported databases include:
  - NCBI nr
  - NCBI RefSeq
  - ExPASy SwissProt
  - ExPASy TrEMBL
  - STRING database
- Execution of InterProScan
- Install Nextflow (>=25.04.07)
- Install any of Docker, Singularity, Podman, Shifter or Charliecloud for full pipeline reproducibility (Conda is currently not supported; see docs).
- Download the pipeline to access the database download script:

      nextflow pull systemsgenetics/annotater

- Download databases. AnnoTater requires reference databases, which can take quite a while to download and consume large amounts of storage. Use the included download.py script to retrieve and index the databases:

      # List available datasets
      python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --list --outdir /path/to/databases

      # Download specific datasets (comma-separated)
      python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,uniprot_sprot,nr

      # Download all datasets
      python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan,panther,nr,refseq_plant,orthodb,string-db,uniprot_sprot,uniprot_trembl

      # Force re-download if datasets already exist
      python3 ~/.nextflow/assets/systemsgenetics/annotater/bin/download.py --outdir /path/to/databases --datasets interproscan --reset

Available datasets include:
- interproscan - InterProScan 5.75-106.0 with protein domain databases
- panther - PANTHER database for InterProScan
- nr - NCBI Non-Redundant protein database
- refseq_plant - NCBI RefSeq plant protein sequences
- orthodb - OrthoDB orthologous protein groups (analysis under construction)
- string-db - STRING database protein interactions (analysis under construction)
- uniprot_sprot - UniProt Swiss-Prot curated proteins
- uniprot_trembl - UniProt TrEMBL unreviewed proteins
Note: Database downloads can be very large (tens to hundreds of GB) and may take hours to complete. The script supports resuming interrupted downloads.
Best Practice: On multi-user systems, it is recommended to download databases to a shared location (e.g., /shared/databases/annotater/ or /opt/databases/annotater/) that is accessible to all users. This prevents duplicate downloads and saves significant storage space, as multiple users can reference the same database files in their workflow runs.
- Start running your own analysis!

      nextflow run systemsgenetics/annotater \
        -profile <docker/singularity/podman/shifter/charliecloud/conda/institute> \
        --batch_size 100 \
        --input <fasta file> \
        --data_sprot <directory with swissprot diamond index> \
        --data_refseq <directory with refseq diamond index> \
        --data_ipr <directory with InterProScan data>

See the Parameters section below for a complete list of all available options.
- --input <path> - Path to input FASTA file containing nucleotide or protein sequences. Accepted formats: .fa, .faa, .fna, .fasta
- --outdir <path> - Output directory where results will be saved
Add any combination of the databases you've downloaded:
- --data_ipr <path> - InterProScan database directory
- --data_sprot <path> - UniProt Swiss-Prot database directory (Diamond index)
- --data_trembl <path> - UniProt TrEMBL database directory (Diamond index)
- --data_nr <path> - NCBI NR database directory (Diamond index)
- --data_refseq <path> - NCBI RefSeq database directory (Diamond index)
- --data_orthodb <path> - OrthoDB database directory (Diamond index) (experimental)
- --data_string <path> - STRING database directory (Diamond index) (experimental)
- --enzyme_dat <path> - Enzyme.dat file from ExPASy for EC number extraction (used with SwissProt)

- --batch_size <number> - Number of sequences to process per batch (default: 5)
- --seq_type <pep|nuc> - Input sequence type: pep for protein or nuc for nucleotide (default: pep)
- --taxonomy_ID <number> - NCBI Taxonomy ID for the species (used by STRING database analysis)

- --email <email> - Email address for completion summary
- --publish_dir_mode <mode> - Method to save results: symlink, link, copy, copyNoFollow (default: link)

- --help - Display help message and exit
- --version - Display version and exit
- -profile <profile> - Configuration profile: docker, singularity, conda, etc.
- -resume - Resume previous run from cached results
- -work-dir <path> - Specify work directory for temporary files
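
To illustrate what --batch_size controls, here is a minimal Python sketch of splitting a FASTA file into batches of N sequences. It is not AnnoTater's own implementation; the function names read_fasta and batched are hypothetical and exist only for this example.

```python
from itertools import islice


def read_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def batched(records, batch_size):
    """Group records into lists of at most batch_size items."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    # Tiny made-up input, for illustration only.
    example = [">g1", "MKT", ">g2", "MAA", "LLV", ">g3", "MQQ"]
    for number, batch in enumerate(batched(read_fasta(example), 2), start=1):
        print(number, [header for header, _ in batch])
```

With the default batch size of 5, an input of 12 sequences would be grouped as 5, 5 and 2 under this scheme.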
AnnoTater produces several types of output files for functional annotation. They are organized in the specified output directory as follows:
results/
├── DIAMOND/
│   ├── blastp_{prefix}_uniprot_sprot.out
│   ├── blastp_{prefix}_uniprot_trembl.out
│   ├── blastp_{prefix}_nr.out
│   └── blastp_{prefix}_refseq_plant.out
├── InterProScan/
│   ├── {prefix}.IPR_mappings.txt
│   ├── {prefix}.GO_mappings.txt
│   └── {prefix}.tsv
├── ECnumbers/
│   └── {prefix}_EC_mappings.txt (if using SwissProt with enzyme.dat)
├── orthodb/
│   └── ortholog_results/ (experimental)
├── string/
│   └── interaction_results/ (experimental)
└── pipeline_info/
    ├── execution_report.html
    ├── execution_timeline.html
    └── annotater_software_versions.yml
For each database searched (SwissProt, TrEMBL, NR, RefSeq), the pipeline generates:
- Combined DIAMOND results: blastp_{prefix}_{database}.out or blastx_{prefix}_{database}.out - Tab-separated file containing all BLAST hits with detailed alignment information
- Raw DIAMOND XML: Individual XML files for each batch, later combined
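
As a starting point for working with the combined results, the sketch below keeps the highest-scoring hit per query sequence. It assumes the .out files use the standard 12-column BLAST tabular layout (DIAMOND's default for tabular output); that layout is an assumption here and should be checked against your own files. The function name best_hits is hypothetical.

```python
# ASSUMPTION: standard 12-column BLAST tabular layout. Verify against your own
# .out files before relying on these column names.
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines):
    """Return {query id: hit dict}, keeping the highest-bitscore hit per query."""
    best = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip comments, blank lines and rows that do not match the assumed layout.
        if line.startswith("#") or len(fields) < len(COLUMNS):
            continue
        hit = dict(zip(COLUMNS, fields))
        hit["evalue"] = float(hit["evalue"])
        hit["bitscore"] = float(hit["bitscore"])
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


if __name__ == "__main__":
    # Made-up rows for illustration; real input would come from
    # results/DIAMOND/blastp_{prefix}_{database}.out
    example = [
        "gene1\tsubjectA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0\n",
        "gene1\tsubjectB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t250.0\n",
    ]
    for query, hit in best_hits(example).items():
        print(query, hit["sseqid"], hit["bitscore"])
```

To use it on a real file, pass an open file handle instead of the example list, since the function only needs an iterable of lines.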
InterProScan analysis produces comprehensive functional annotation files:
- IPR mappings: {prefix}.IPR_mappings.txt - Tab-separated file mapping genes to InterPro entries

      Gene            IPR        Description
      LOC_Os01g01010  IPR001841  Zinc finger, RING-type
      LOC_Os01g01010  IPR013083  Zinc finger, RING/FYVE/PHD-type

- GO mappings: {prefix}.GO_mappings.txt - Tab-separated file mapping genes to Gene Ontology terms

      Gene            GO
      LOC_Os01g01010  GO:0005507
      LOC_Os01g01010  GO:0016491

- Combined TSV: {prefix}.tsv - Complete InterProScan output with all annotations including domains, signatures, and functional classifications
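
The GO mappings file is simple enough to load with a few lines of Python. The sketch below assumes the file is tab-separated with a Gene/GO header row, as in the example above; the function name load_go_mappings is hypothetical.

```python
from collections import defaultdict


def load_go_mappings(lines):
    """Return {gene: [GO terms]} from the lines of a GO_mappings.txt file."""
    mapping = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank or malformed lines and the assumed "Gene<TAB>GO" header row.
        if len(fields) < 2 or fields[0] == "Gene":
            continue
        gene, go_term = fields[0], fields[1]
        if go_term not in mapping[gene]:
            mapping[gene].append(go_term)
    return dict(mapping)


if __name__ == "__main__":
    # Rows taken from the GO mappings example above.
    example = [
        "Gene\tGO\n",
        "LOC_Os01g01010\tGO:0005507\n",
        "LOC_Os01g01010\tGO:0016491\n",
    ]
    print(load_go_mappings(example))
```

For a real run, open results/InterProScan/{prefix}.GO_mappings.txt and pass the file handle to the function.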
- OrthoDB analysis: Currently under construction
- STRING database analysis: Currently under construction
- EC number extraction: Generated when using SwissProt database with enzyme.dat file
Warning
Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.
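
A params file can be written by hand or generated. As a minimal sketch, the Python below builds a JSON params file from the parameter names documented in this README. The helper names build_params and write_params_file, the input file name and the database subdirectory paths are all hypothetical.

```python
import json


def build_params(input_fasta, outdir, databases, batch_size=5, seq_type="pep"):
    """Collect pipeline parameters into a dict (keys are the option names without '--')."""
    params = {
        "input": input_fasta,
        "outdir": outdir,
        "batch_size": batch_size,
        "seq_type": seq_type,
    }
    # Only include databases that were actually downloaded (non-empty paths).
    params.update({key: path for key, path in databases.items() if path})
    return params


def write_params_file(path, params):
    """Write the parameters as JSON for use with the Nextflow -params-file option."""
    with open(path, "w") as handle:
        json.dump(params, handle, indent=2)


if __name__ == "__main__":
    params = build_params(
        input_fasta="proteins.faa",  # hypothetical input file
        outdir="results",
        databases={
            # Hypothetical locations; use the directories created by download.py.
            "data_sprot": "/path/to/databases/uniprot_sprot",
            "data_ipr": "/path/to/databases/interproscan",
            "data_nr": None,  # not downloaded, so it is left out
        },
        batch_size=100,
    )
    print(json.dumps(params, indent=2))
```

After saving the result with write_params_file("params.json", params), the run could be started with, for example, nextflow run systemsgenetics/annotater -profile docker -params-file params.json.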
AnnoTater was written by the Ficklin Computational Biology Team at Washington State University. Development of AnnoTater was initially funded by the U.S. National Science Foundation (NSF) Award #1659300.
If you would like to contribute to this pipeline, please see the contributing guidelines.
AnnoTater is currently unpublished. For now, please use the GitHub URL when referencing. An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.
This pipeline uses code and infrastructure developed and maintained by the nf-core community, reused here under the MIT license.
The nf-core framework for community-curated bioinformatics pipelines.
Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.
