Nextclade Workflow for Enterovirus D68

This repository contains a robust, reproducible workflow for building a custom Nextclade dataset for Enterovirus D68 (EV-D68). It enables you to generate reference and annotation files, download and process sequence data, infer an ancestral sequence, and create all files needed for Nextclade analyses and visualization.

Citation

If you use this dataset in your research, please cite:

Neuner-Jehle, N., González Sánchez, A., Hodcroft, E. B., & European Non-Polio Enterovirus Network (ENPEN). (2025). enterovirus-phylo/nextclade_d68: Enterovirus D68 Nextclade Dataset v1.0.0 (v1.0.0--2025-11-18). Zenodo. https://doi.org/10.5281/zenodo.17642338

Quick Start

# 1. Set up folders
mkdir -p dataset data ingest resources results scripts

# 2. Generate reference files
python3 scripts/generate_from_genbank.py --reference "AY426531.1" --output-dir dataset/

# 3. Configure pathogen.json (edit manually)

# 4. If first time, enable inference in Snakefile:
# Set INFERRENCE_RERUN = True

# 5. Run workflow
snakemake --cores 9 all --config static_inference_confirmed=true

See detailed instructions below for each step.

Folder Structure

Follow the Nextclade example workflow or use the structure below:

mkdir -p dataset data ingest resources results scripts

Workflow Overview

This workflow is composed of several modular steps:

Reference Generation
Extracts relevant reference and annotation files from GenBank.
Dataset Ingest
Downloads and processes sequences and metadata from NCBI Virus.
Inferred Ancestral Root (Recommended)
Uses outgroup rooting to infer a dataset-specific ancestral sequence. The dataset tree is then rooted on this Static Inferred Ancestor, a phylogenetically reconstructed sequence at the most recent common ancestor (MRCA) of the ingroup, which gives a stable, biologically plausible reference point for mutation and clade assignments. This addresses the problem that the Fermon reference (1962) differs substantially from currently circulating strains.
Augur Phylogenetics & Nextclade Preparation
Builds trees rooted on the inferred ancestor, prepares multiple sequence alignments, and generates all files required for Nextclade and Auspice.
Visualization & Analysis
Enables both command-line and web-based Nextclade analyses, including local dataset hosting.

Setup Instructions

1. Generate Reference Files

Run the script to extract the reference FASTA and genome annotation from GenBank:

python3 scripts/generate_from_genbank.py --reference "AY426531.1" --output-dir dataset/

During the script execution, follow the prompts for CDS annotation selection.

1. [0]
2. [product] or [leave empty for manual choice] to select proteins.
3. [2]

Outputs:

dataset/reference.fasta
dataset/genome_annotation.gff3

2. Configure `pathogen.json`

Edit pathogen.json to:

- Reference your generated files (reference.fasta, genome_annotation.gff3)
- Update metadata and QC settings as needed

Warning

If QC is not set, Nextclade will skip quality checks.

See the Nextclade pathogen config documentation for details.
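
As a quick sanity check before building, the sketch below loads pathogen.json, reports referenced files that are missing next to it, and warns when no QC block is present. It assumes the Nextclade v3 layout in which a top-level "files" object maps roles to file names and QC rules live under a top-level "qc" key. The function name check_pathogen_config is hypothetical and not part of this repository.

```python
import json
from pathlib import Path


def check_pathogen_config(config_path):
    """Return a list of human-readable warnings for a pathogen.json file."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    warnings = []

    # Assumed layout: "files" maps a role (e.g. "reference") to a file name
    # that is expected to sit in the same directory as pathogen.json.
    for role, name in config.get("files", {}).items():
        if not (config_path.parent / name).exists():
            warnings.append(f"files.{role}: {name} not found")

    # An absent or empty "qc" block means no quality checks are configured.
    if not config.get("qc"):
        warnings.append("qc: no QC rules configured")

    return warnings
```

Calling check_pathogen_config("dataset/pathogen.json") returns an empty list when nothing looks wrong. Files that are only produced later in the build will show up as warnings until the workflow has run.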

3. Prepare GenBank Reference

Copy your GenBank file to resources/reference.gb and edit it to ensure compatibility with the workflow.

Important requirements:

- Each coding sequence (CDS) must have either a product or gene name present
- The annotation keys must match exactly between reference.gb and genome_annotation.gff3
- Use simple, consistent names (e.g., product="VP1" instead of product="VP1_protein")
- Remove any genes that are not relevant for your dataset

Warning

Mismatched or inconsistent gene names will cause augur ancestral to fail, as it cannot match features across files. Ensure your protein names match those defined in the GENES list in the Snakefile.
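
One way to catch naming mismatches before running the workflow is to compare the names found in both files against the expected gene list. The sketch below is a rough check using only the standard library: it assumes names appear as /product or /gene qualifiers in the GenBank file and under the Name, gene, or product attributes in the GFF3 file, and it does not handle qualifier values wrapped over several lines or URL-encoded GFF3 values. All function names are hypothetical.

```python
import re


def genbank_feature_names(genbank_text):
    """Collect /product and /gene qualifier values from GenBank flat-file text."""
    return set(re.findall(r'/(?:product|gene)="([^"]+)"', genbank_text))


def gff3_names(gff3_text, keys=("Name", "gene", "product")):
    """Collect values of the given attribute keys from column 9 of a GFF3 file."""
    names = set()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        columns = line.split("\t")
        if len(columns) < 9:
            continue  # not a feature line
        for attribute in columns[8].split(";"):
            key, _, value = attribute.partition("=")
            if key.strip() in keys and value:
                names.add(value.strip())
    return names


def compare_names(genbank_text, gff3_text, genes):
    """Report expected gene names that are absent from either file."""
    expected = set(genes)
    return {
        "missing_in_genbank": sorted(expected - genbank_feature_names(genbank_text)),
        "missing_in_gff3": sorted(expected - gff3_names(gff3_text)),
    }
```

Pass the contents of resources/reference.gb and dataset/genome_annotation.gff3 together with the GENES list from the Snakefile. Two empty lists suggest the names line up.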

4. Update the `Snakefile`

Adjust the workflow parameters and file paths as needed for your dataset.
Ensure required files are available:
- data/sequences.fasta
- data/metadata.tsv
- resources/auspice_config.json

Sequences and metadata can be downloaded automatically via the ingest process (see below).
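
A small pre-flight check can confirm these inputs exist before Snakemake is started. The following is a minimal sketch: the three paths come from the list above, while the function name missing_inputs and the choice to treat empty files as missing are assumptions.

```python
from pathlib import Path

# Inputs the main Snakefile expects, relative to the repository root.
REQUIRED_FILES = (
    "data/sequences.fasta",
    "data/metadata.tsv",
    "resources/auspice_config.json",
)


def missing_inputs(root=".", required=REQUIRED_FILES):
    """Return the required paths that are absent or empty under root."""
    root = Path(root)
    missing = []
    for relative in required:
        path = root / relative
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(relative)
    return missing
```

Running missing_inputs() from the repository root returns an empty list once ingest has produced the sequences and metadata and the Auspice config is in place.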

Subprocesses

Ingest

Automates downloading of EV-D68 sequences and metadata from NCBI Virus.
See ingest/README.md for specifics.

Required packages:
csvtk, nextclade, tsv-utils, seqkit, zip, unzip, entrez-direct, ncbi-datasets-cli (installable via conda-forge/bioconda)
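
To see which of these packages still need installing, one option is to look for a representative executable from each on the PATH. The mapping below reflects which command each package is understood to provide (for example esearch for entrez-direct and datasets for ncbi-datasets-cli). Treat it as an assumption and adjust it if your installation differs.

```python
import shutil

# Package name -> one executable it is expected to install (assumed mapping).
EXECUTABLES = {
    "csvtk": "csvtk",
    "nextclade": "nextclade",
    "tsv-utils": "tsv-filter",
    "seqkit": "seqkit",
    "zip": "zip",
    "unzip": "unzip",
    "entrez-direct": "esearch",
    "ncbi-datasets-cli": "datasets",
}


def missing_packages(which=shutil.which):
    """Return package names whose representative executable is not on PATH."""
    return sorted(pkg for pkg, exe in EXECUTABLES.items() if which(exe) is None)
```

Calling missing_packages() with no arguments checks the current environment. The which parameter exists only so the lookup can be replaced in tests.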

Inferred Ancestral Root with Outgroup Rooting (Recommended)

The inferred-root/ directory contains a reproducible pipeline that uses outgroup rooting to infer a dataset-specific ancestral sequence for EV-D68. This method:

- Builds a phylogenetic tree including both EV-D68 sequences (ingroup) and related enterovirus sequences (outgroup)
- Roots the tree on the outgroup to establish correct evolutionary directionality
- Extracts the ancestral sequence at the MRCA of all EV-D68 sequences
- Fills gaps with reference nucleotides to ensure a complete, biologically plausible genome

This Static Inferred Ancestor serves as the root of your Nextclade dataset, providing:

- More accurate mutation calls relative to a realistic EV-D68 ancestor
- A stable reference that better represents EV-D68 diversity than the distant Fermon sequence (1962)
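
The gap-filling step of the method can be illustrated with a short sketch. It assumes the reconstructed ancestor and the reference are already aligned to the same coordinates and that both "-" and "N" count as gaps. The actual pipeline in inferred-root/ may handle this differently, and fill_gaps is a hypothetical name.

```python
def fill_gaps(ancestral, reference, gap_chars="-Nn"):
    """Replace gap or unknown positions in ancestral with the reference base.

    Both sequences must be aligned, i.e. have the same length.
    """
    if len(ancestral) != len(reference):
        raise ValueError("sequences must be aligned to the same length")
    return "".join(
        ref if anc in gap_chars else anc
        for anc, ref in zip(ancestral, reference)
    )


# Positions 3 and 5 of the ancestor are unresolved and take the reference base.
print(fill_gaps("AT-GNC", "ACGGTC"))  # prints ATGGTC
```

Resolved positions in the ancestor are kept even where they differ from the reference (position 2 in the example), so only missing information is borrowed.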

Configuration

The workflow has two key parameters in the main Snakefile:

- STATIC_ANCESTRAL_INFERRENCE = True: enables using the inferred root (default: True)
- INFERRENCE_RERUN = False: controls whether to regenerate the inferred root (default: False)

For Regular Dataset Builds

Use the existing inferred root:

snakemake --cores 9 all

To Regenerate the Inferred Root

When you need to regenerate with new data or updated outgroups:

1. Set INFERRENCE_RERUN = True in the Snakefile.
2. Run the workflow:

   snakemake --cores 9 all --config static_inference_confirmed=true

   The workflow will:
   - Clean previous results in inferred-root/results/
   - Run the full inference pipeline with your current sequences
   - Generate a new resources/inferred-root.fasta
   - Incorporate it into the dataset build
3. After successful regeneration, set INFERRENCE_RERUN = False for future runs.

Warning

Setting INFERRENCE_RERUN = True will overwrite your existing resources/inferred-root.fasta file and clear inferred-root/results/. Only use this when you want to regenerate the root with updated data.

Note

- First-time users: if resources/inferred-root.fasta does not exist yet, you must set INFERRENCE_RERUN = True for the first run.
- To disable this feature: set STATIC_ANCESTRAL_INFERRENCE = False and change the ROOTING parameter (e.g., ROOTING="mid_point").
- Outgroup configuration: outgroup sequences are stored in resources/outgroup/; update the OUTGROUP list in inferred-root/Snakefile to change which species are used.

See: inferred-root/README.md for technical details and the complete workflow.
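
The interaction between the two flags, the existing root file, and the static_inference_confirmed config value can be summarized as a small decision function. This is an illustrative sketch, not code from the Snakefile: the function name and return labels are hypothetical, and treating static_inference_confirmed as a guard against accidental overwrites is an assumption based on it being passed only when regenerating.

```python
def plan_inferred_root(static_inference, rerun, root_exists, confirmed=False):
    """Describe what a build would do with the inferred root.

    Returns one of: "disabled", "regenerate", "needs-confirmation",
    "missing-root", "reuse".
    """
    if not static_inference:
        # STATIC_ANCESTRAL_INFERRENCE = False: root the tree another way.
        return "disabled"
    if rerun:
        # INFERRENCE_RERUN = True overwrites resources/inferred-root.fasta.
        return "regenerate" if confirmed else "needs-confirmation"
    if not root_exists:
        # First-time case: there is no root to reuse yet.
        return "missing-root"
    return "reuse"
```

For a regular build with an existing root, plan_inferred_root(True, False, True) returns "reuse". On a fresh checkout without resources/inferred-root.fasta, the same flags give "missing-root", which corresponds to the first-time note above.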

Template for Other Enteroviruses

If you want to apply this approach to other enterovirus types (e.g., EV-A71, CVA16), a Nextclade Dataset Template for Inferred Root is available and recommended for reuse.

Running the Workflow

To generate the Auspice JSON and Nextclade dataset:

snakemake --cores 9 all

This will use the existing inferred root (see Inferred Ancestral Root section above for regeneration instructions).

The workflow will:

- Build the reference tree rooted on the inferred ancestor
- Produce the Nextclade dataset in out-dataset/
- Run Nextclade on example sequences
- Output results to test_out/ (alignment, translations, summary TSV)
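
To get a quick overview of the test run, the summary TSV can be tallied by clade and QC status. The sketch below assumes the standard Nextclade column names clade and qc.overallStatus. The exact file name inside test_out/ depends on how the workflow invokes Nextclade, so the path in the usage note is an assumption.

```python
import csv
from collections import Counter


def summarize_nextclade_tsv(handle):
    """Count sequences per clade and per overall QC status.

    handle is an open text file (or any iterable of lines) containing a
    tab-separated Nextclade results table with a header row.
    """
    clades = Counter()
    qc_status = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        # Empty or absent values are grouped under a placeholder label.
        clades[row.get("clade") or "unassigned"] += 1
        qc_status[row.get("qc.overallStatus") or "unknown"] += 1
    return clades, qc_status
```

A typical call would open the table first, for example with open("test_out/nextclade.tsv", newline="") as handle, and then print the two counters.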

Key Snakefile parameters:

- ROOTING = "ancestral_sequence": roots the tree on the inferred ancestor
- STATIC_ANCESTRAL_INFERRENCE = True: enables the inferred root in the dataset (default)
- INFERRENCE_RERUN = False: set to True only when regenerating the root (default: False)

Labeling Mutations of Interest

To label mutations of interest, run the mutLabels rule on its own. The labels are added to the out-dataset/pathogen.json file.

Visualizing Your Custom Nextclade Dataset

To use the dataset in Nextclade Web, serve it locally:

serve --cors out-dataset -l 3000

Then open:

https://master.clades.nextstrain.org/?dataset-url=http://localhost:3000

- Click "Load example", then "Run"
- You may want to reduce "Max. nucleotide markers" to 500 under "Settings" → "Sequence view" to optimize performance
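
If the serve command is not installed, a CORS-enabled static server can be approximated with the Python standard library. This is a minimal sketch and not part of the repository: the class and function names are hypothetical, and it sends a permissive Access-Control-Allow-Origin header so that Nextclade Web, loaded from another origin, can fetch the dataset files. The plain python3 -m http.server command does not add this header.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class CORSRequestHandler(SimpleHTTPRequestHandler):
    """Static file handler that allows cross-origin reads from any site."""

    def end_headers(self):
        # Added to every response just before the header block is closed.
        self.send_header("Access-Control-Allow-Origin", "*")
        super().end_headers()


def make_server(directory="out-dataset", port=3000):
    """Create (but do not start) a local server for the dataset directory."""
    handler = partial(CORSRequestHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)


def nextclade_web_url(port=3000, base="https://master.clades.nextstrain.org/"):
    """Build the Nextclade Web URL that points at the local dataset server."""
    return f"{base}?dataset-url=http://localhost:{port}"
```

Start it with make_server().serve_forever() from the repository root and open the address returned by nextclade_web_url(). The server binds to the loopback interface only, so the dataset is not exposed to the network.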

Author & Contact

Maintainers: Nadia Neuner-Jehle, Alejandra González-Sánchez and Emma B. Hodcroft (eve-lab.org)
For questions or suggestions, please open an issue or email: eve-group[at]swisstph.ch

Troubleshooting and Further Help

For issues, see the official Nextclade documentation or open an issue.
For details on the inferred root workflow, see inferred-root/README.md.
For adapting to other enteroviruses, see the dataset-template-inferred-root.

This guide provides a structured, scalable approach to building and using high-quality Nextclade datasets for EV-D68 — and can be adapted for other enterovirus types as well.

Task List

Completed:

- Integrate ancestral inferred-root into workflow (#2)
- Validate clade assignment of fragmented sequences in Nextclade (testing/)
- Ensure novel recombinants get assigned to the root (issue #3) → recombinant feature in testing; QC label
- Review and validate EV-D68 nomenclature, including robustness with recombinant sequences
- Integrate epitope mutation information as tree coloring and/or display in the Nextclade results table
- Generate the inferred ancestral sequence from the outgroup-rooted tree (see mpox for technical details); commit 605c4db
- Create test dataset: small example demonstrating the full inferred-root workflow end-to-end

Documentation & Visualization:

- Document outgroup selection and validation: explain which enterovirus species are used as outgroups and the phylogenetic justification
- Add workflow diagram: visual representation showing when INFERRENCE_RERUN triggers the inferred-root sub-workflow
- Add troubleshooting for INFERRENCE_RERUN: common errors when regenerating (missing outgroups, alignment failures, etc.)

Analysis & Validation:

- Document when to regenerate the inferred root: guidelines on how often to rerun with new data
- Compare mutation profiles: quantify the difference in mutation calls between Fermon-rooted and inferred-root datasets
- Validate rooting stability: test sensitivity of the inferred root to outgroup selection and subsampling
