Read mapping pipeline for detection and measurement of human and animal virus pathogens from short read metagenomic environmental or clinical samples.
This approach is sensitive, specific, and ideal for exploring virus presence/absence/diversity within and between metagenomic or clinical samples. Interactive reports make it easy to see the breadth of read coverage for each detected virus. This tool should reliably detect virus genomes with 90% ANI or greater to reference genomes.
NOTE: The database used by Esviritu
should cover all human and animal viruses in GenBank as of May 2025 (EsViritu DB v3.1.1). However, the genomes are dereplicated at 95% ANI so that only one genome from a nearly identical group is used. Please open an issue to report any omissions.
As of EsViritu v1.0.0
:
-
Highly curated sequence-to-taxonomy database of known human, animal, and plant viruses.
-
Single read sensitivity for detection (requires reads >= 100 nt).
-
Assembly-aware genome reconstructions for segmented viruses.
-
Expression of uncertainty (i.e. average read identity to reference, nucleotide diversity measurement).
-
Attractive HTML reports.
Logo by Adrien Assie
18,203 high quality virus genome assemblies across 44 taxonomic families:
NOTE: 2025-06-24 EsViritu v1.0.2 is available on bioconda.
1) Create conda environment. mamba
is preferable to conda
for environment creation.
mamba create -n EsViritu bioconda::esviritu
2) Activate the environment with conda
conda activate EsViritu
you should be able to run help menu:
EsViritu -h
3) Download the database (~400 MB when decompressed). EsViritu v1.0.0 or higher requires DB v3.1.0 or higher!
cd
where you want the database to reside
mkdir esviritu_DB && cd esviritu_DB
Download the tarball of DB v3.1.1
(most recent version) from Zenodo:
wget https://zenodo.org/records/15723755/files/esviritu_db_v3.1.1.tar.gz
Check that the download was successful:
md5sum esviritu_db_v3.1.1.tar.gz
should return 509e875db51a335ccce5663d126d1fcf esviritu_db_v3.1.1.tar.gz
Unpack and remove the tarball:
tar -xvf esviritu_db_v3.1.1.tar.gz
rm esviritu_db_v3.1.1.tar.gz
DB files should be in v3.1.1
4) Set the database path (optional but recommended):
conda env config vars set ESVIRITU_DB=/path/to/esviritu_DB/v3.1.1
5) (OPTIONAL BUT RECOMMENDED) Install the R
package dataui
manually in an R session. Without dataui
reports won't show genome coverage sparklines.
R
then:
remotes::install_github("timelyportfolio/dataui")
Detailed Instructions
- Clone repo
git clone https://github.com/cmmr/EsViritu.git
- Go to
EsViritu
directory.
cd EsViritu
- use the file
environment/EsViritu.yml
withmamba create
to generate the environment used with this tool
mamba env create --file environment/EsViritu.yml
- Activate the environment
conda activate EsViritu
- Make it command-line executable. From repo directory:
pip install .
Now follow the database set up instructions above
Basic Instructions
Please note that, while I WAS able to get this to run using Docker
/Docker Desktop
on my Mac, I am not a Docker
expert, and I may be unable to troubleshoot issues.
- Pull Docker image (v1.0.2 shown below)
docker pull quay.io/biocontainers/esviritu:1.0.2--pyhdfd78af_0
Notes:
* be sure to mount your volumes/directories with the `EsViritu` database as well as those with input read files
* I believe you can save environmental variables like ESVIRITU_DB in `Docker` containers
You could filter unwanted sequences out upstream of this tool, but this will allow you to do it within EsViritu
using minimap2
. The pipeline script will look for a file at filter_seqs/filter_seqs.fna
which could be any fasta-formatted sequence file you want to use to remove matching reads (e.g. from host or spike-in).
Here are instructions for downloading and formatting the human genome and phiX spike-in (3 GB decompressed).
NOTE: When analyzing sequences from human tissues processed via hybrid capture virome sequencing, quantification may be more accurate if human reads are NOT removed
cd EsViritu ### or `cd` where you want the filter_seqs to reside
mkdir filter_seqs && cd filter_seqs
## download phiX genome and gunzip
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/viral/Sinsheimervirus_phiX174/latest_assembly_versions/GCF_000819615.1_ViralProj14015/GCF_000819615.1_ViralProj14015_genomic.fna.gz
gunzip GCF_000819615.1_ViralProj14015_genomic.fna.gz
## download human genome and gunzip
wget https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/latest_assembly_versions/GCF_009914755.1_T2T-CHM13v2.0/GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
gunzip GCF_009914755.1_T2T-CHM13v2.0_genomic.fna.gz
## concatenate files
cat GCF_000819615.1_ViralProj14015_genomic.fna GCF_009914755.1_T2T-CHM13v2.0_genomic.fna > filter_seqs.fna
## optionally delete separate files
rm GCF_000819615.1_ViralProj14015_genomic.fna GCF_009914755.1_T2T-CHM13v2.0_genomic.fna
## set the filter_seqs directory as an environmental variable
conda env config vars set ESVIRITU_FILTER=/path/to/filter_seqs
Remember to set -f True
to run the filtering step.
- inputs are .fastq files
- (OPTIONAL) Reads are filtered for quality and length, adapters are removed, then reads mapping to human genome or phiX spike in are removed. Must set flags
-q True -f True
. - Filtered reads are aligned to a dereplicated database of human, animal, and plant virus genomes/segments. (read alignment: >= 80% ANI, >= 100 nt aligned, and >= 90% read coverage)
- Candidate reference genomes are dereplicated. First, a network of references (binned at the assembly level) sharing >= 33% of reads aligned (union of read IDs) is generated. Then, local maxima references are determined by total reads aligned. These references are carried forward for final quantification.
- The reads from the original alignment are re-aligned to the dereplicated references.
- breadth, depth, abundance, average read identity, and nucleotide diversity (Pi) is determined for each detected genome/segment.
- Summary tables are generated for the contig level
*.detected_virus.info.tsv
and the assembly level*.detected_virus.assembly_summary.tsv
- Consensus sequences are determined for each detected dereplicated genome/segment.
*_final_consensus.fasta
- A reactable (interactive table) with a visualization of read coverage profile
*.reactable.html
is generated.
You might run this as part of a bash script, do your own upstream read processing, etc, but these are the basic instructions.
Required inputs:
-r reads file (.fastq)
-s sample name
-o output directory (may be shared with other samples)
Activate the conda environment:
conda activate EsViritu
Individual samples can be run with the python script. E.g.:
Basic run with 1 .fastq file:
EsViritu -r /path/to/reads/myreads.fastq -s sample_ABC -o myproject_EsViritu1 -p unpaired
Using paired end input .fastq files. Must be exactly 2 files.
EsViritu -r /path/to/reads/myreads.R1.fastq /path/to/reads/myreads.R2.fastq -s sample_ABC -o myproject_EsViritu1 -p paired
With pre-filtering steps:
EsViritu -r /path/to/reads/myreads.fastq -s sample_ABC -o myproject_EsViritu1 -q True -f True -p unpaired
Help menu
EsViritu -h
Run the batch summary script to collate reports from several sequencing libraries in a project:
Example:
Activate conda environment: conda activate EsViritu
Then run the summarize_esv_runs
command with the relative path to the output directory as the only argument:
summarize_esv_runs myproject_EsViritu1
This command will generate the tables myproject_EsViritu1.detected_virus.info.tsv
, myproject_EsViritu1.detected_virus.assembly_summary.tsv
and the reactable myproject_EsViritu1.batch_detected_viruses.html
, which summarize information about all the samples in the given directory.
Because this tool is based on mapping to reference genomes, recombination and reassortment can cause some issues.
For example, strains of picornaviruses naturally recombine with other strains. So, if the virus genome in your sample is a recombinant between the 5' half of "Picornavirus Strain A" and the 3' half of "Picornavirus Strain B", the output from Esviritu
will indicate that you have both "Picornavirus Strain A" and "Picornavirus Strain B" in your sample. In these cases, it's useful to look at the html reports to analyze coverage across the genome(s). Based on our testing, this only happens with recombinant viruses that have no records in GenBank.
RPKMF, the abundance metric used in EsViritu
is:
(Reads Per Kilobase of reference genome)/(Million reads passing Filtering)
From v0.2.3
to v1.0.0
the tool, while having the same goal, has undergone very extensive code refactoring, logic, and database updates. These include:
- The database is completely redone and increased in size by ~2.5X.
- Subspecies taxonomy labels are added (not in GenBank records) for SARS-CoV-2, Norovirus, Mpox, Rotavirus.
- Segmented virus genomes are treated as coherent assemblies by the tool rather than unaffiliated contigs.
- Genbank records that appeared incomplete based on length and/or number of segments were removed.
- Genbank records with sequences highly similar to human genomic DNA, vectors, or other common contaminants were removed.
- Genbank records without complete taxonomy information (e.g. a virus called picornaviridae sp. lacks genus and species labels) were excluded.
- All code (other than the .hmtl report) is rewritten in python, making it more robust and reducing dependencies.
- Read ANI and nucleotide diversity (Pi) calculations are now reported.
Wastewater sequencing reveals community and variant dynamics of the collective human virome
Michael Tisza, Sara Javornik Cregeen, Vasanthi Avadhanula, Ping Zhang, Tulin Ayvaz, Karen Feliz, Kristi L. Hoffman, Justin R. Clark, Austen Terwilliger, Matthew C. Ross, Juwan Cormier, David Henke, Catherine Troisi, Fuqing Wu, Janelle Rios, Jennifer Deegan, Blake Hansen, John Balliew, Anna Gitter, Kehe Zhang, Runze Li, Cici X. Bauer, Kristina D. Mena, Pedro A. Piedra, Joseph F. Petrosino, Eric Boerwinkle, Anthony W. Maresso