This repository houses all scripts, snakefiles, and configuration files for the Pekosz Lab Nextstrain builds in JH-CEIRR.

- Currently, builds are maintained for all 8 segments of circulating H1N1, H3N2, and B/Vic viruses detected through the Johns Hopkins Hospital (JHH) Network.
- 3 concatenated genome builds are also maintained for H1N1, H3N2, and B/Vic viruses.
- As of 2024-11-26, all builds are constructed using a simplified Snakemake pipeline.
> [!WARNING]
> For this tutorial, all scripts must be run from the `nextstrain/` home directory.
> [!NOTE]
> Dependencies for this build are maintained through conda. Download the latest version here. A brief introduction to conda and conda environments can be found here.
1. Clone the repository and build the conda environment.

Clone:

```shell
git clone https://github.com/Pekosz-Lab/nextstrain.git
```

Navigate to the head directory:

```shell
cd nextstrain
```

Build the environment and base dependencies:

```shell
conda env create -f workflow/envs/environment.yml
```
> [!NOTE]
> The included `environment.yml` attempts to install `blastn` and `iqtree2` through conda-forge. If you encounter issues with these packages, please install them manually. `blastn` installation instructions can be found here and here via bioconda; `iqtree2` installation instructions can be found here.
Activate the environment:

```shell
conda activate pekosz-nextstrain
```
2. Access genome and metadata from JHH and GISAID (vaccine strains) and place them in a directory called `source/`.
Create the `source`, `data`, and `results` directories within the `nextstrain/` directory:

```shell
mkdir source data results
```

Populate the `source/` folder with the following files:

- `JHH_sequences.fasta`
- `JHH_metadata.tsv`
- `vaccines.fasta`
Contact Dr. Heba Mostafa and Dr. Andy Pekosz to access the source folder data. The `vaccines.fasta` file is manually downloaded and updated directly from GISAID.

Download all data into the `source/` folder, or overwrite your source folder and move it to the repo head directory `nextstrain/`. Your `nextstrain` directory will now contain the following additional folders:
```
nextstrain/
├── source/
│   ├── GISAID_metadata.xls
│   ├── GISAID_sequences.fasta
│   ├── JHH_metadata.tsv
│   ├── JHH_sequences.fasta
│   ├── vaccines.fasta
│   └── vaccines.tsv
├── data/     # This will be empty
└── results/  # This will be empty
```
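Before kicking off a build, it can help to confirm the `source/` inputs are all in place. A minimal sketch (the helper name `missing_source_files` is made up for illustration; it is not part of the repository):

```python
from pathlib import Path

# Files expected in source/ before running the pipeline (per the layout above).
REQUIRED = [
    "GISAID_metadata.xls",
    "GISAID_sequences.fasta",
    "JHH_metadata.tsv",
    "JHH_sequences.fasta",
    "vaccines.fasta",
    "vaccines.tsv",
]

def missing_source_files(source_dir: str) -> list[str]:
    """Return the required input files that are absent from source_dir."""
    src = Path(source_dir)
    return [name for name in REQUIRED if not (src / name).is_file()]
```

Running `missing_source_files("source")` before invoking `snakemake` surfaces absent inputs early instead of partway through the pipeline.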
> [!WARNING]
> These builds are designed to ingest influenza genome data and metadata obtained from sources with regulated access. Access to all GISAID data requires individual user credentials, which cannot be shared publicly. Additionally, influenza genome data from the Johns Hopkins Hospital (JHH) network that were accessed prior to their release on GISAID are private and cannot be shared publicly through this repository.
As of 9478cb9, manual curation and organization of input `sequence.fasta` and `metadata.tsv` files is no longer necessary. An ingest snakemake rule has been constructed to automate the following tasks for all 24 segment builds for H1N1, H3N2, and B/Victoria:

- Segment typing and subtyping using flusort
- Uploading all genomes and metadata to fludb, including vaccine strains
- Generation of all build datasets in the `data/` directory using `download.py`
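The last step, extracting per-build datasets from fludb, could look roughly like the sketch below. This is an illustration only: the table name `sequences` and its columns (`strain`, `subtype`, `segment`, `sequence`) are assumptions, and the real `download.py` and fludb schema may differ.

```python
import sqlite3

def write_build_fasta(db_path: str, subtype: str, segment: str, out_path: str) -> int:
    """Write one build dataset (all sequences for a subtype/segment) as FASTA.

    Assumes a hypothetical `sequences` table with strain, subtype, segment,
    and sequence columns; the actual fludb schema may differ.
    """
    con = sqlite3.connect(db_path)
    rows = con.execute(
        "SELECT strain, sequence FROM sequences WHERE subtype = ? AND segment = ?",
        (subtype, segment),
    ).fetchall()
    con.close()
    with open(out_path, "w") as fh:
        for strain, seq in rows:
            fh.write(f">{strain}\n{seq}\n")
    return len(rows)
```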
From the `nextstrain/` directory, execute the following to initiate the construction of all builds:

```shell
snakemake --cores 8
```

Once the builds complete, inspect them locally:

```shell
auspice view --datasetDir auspice/h1n1
auspice view --datasetDir auspice/h3n2
auspice view --datasetDir auspice/vic
```
If all looks good, proceed to upload the builds to nextstrain.org/groups/PekoszLab:

```shell
python scripts/nextstrain_upload_private_genomes.py
```

To list the builds currently deployed to the group:

```shell
nextstrain remote list nextstrain.org/groups/PekoszLab
```

Replace `${YOUR_BUILD_NAME}` with the file name of the build along with any additional sidecar files you desire to upload:

```shell
nextstrain remote upload \
    nextstrain.org/groups/PekoszLab/${YOUR_BUILD_NAME} \
    auspice/${YOUR_BUILD_NAME}.json
```

> [!NOTE]
> You can safely generate reports before running the snapshot_clean rule; the reports/ folder will be archived automatically during the snapshot process.
Following the execution of all builds, summary reports can be generated using the following commands:

```shell
python scripts/build-reports.py \
    -i fludb.db \
    -o reports/report.tsv \
    -e reports/report.xlsx \
    -h1 results/h1n1/ha/metadata.tsv \
    -h3 results/h3n2/ha/metadata.tsv \
    -b results/vic/ha/metadata.tsv
```
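The internals of `build-reports.py` are not documented here; purely as an illustration of the kind of per-build summary such a script might produce from the HA metadata tables, here is a sketch that counts sequences per build (the function names are made up, not the script's actual API):

```python
import csv

def count_records(metadata_tsv: str) -> int:
    """Count data rows (sequences) in a tab-delimited metadata file."""
    with open(metadata_tsv, newline="") as fh:
        return sum(1 for _ in csv.DictReader(fh, delimiter="\t"))

def write_summary(paths: dict[str, str], out_tsv: str) -> None:
    """Write a build -> sequence-count table, one row per build."""
    with open(out_tsv, "w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["build", "n_sequences"])
        for build, path in paths.items():
            writer.writerow([build, count_records(path)])
```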
Once the summary data are generated, you can render the formatted HTML report using Quarto:
```shell
quarto render scripts/report-html-pdf.qmd --to html --output-dir ../reports/
```
The rendered report will be saved in the reports/ folder and can be viewed in any web browser.
> [!WARNING]
> Before starting a new build with updated data, you must run this rule. It will archive your current results and clean the workspace to prepare for a fresh build.

When executed, the rule:

- Creates a timestamped backup folder inside `snapshots/` (e.g., `snapshots/20251111T163000/`).
- Copies the following directories (if they exist): `auspice/`, `logs/`, `reports/`, and `source/`.
- Compresses the snapshot into a `.tar.gz` archive.
- Deletes temporary working folders (`data/`, `results/`, `logs/`, `reports/`) and removes the database file `fludb.db`.
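The steps above can be sketched with the standard library. This is a simplified approximation of what the snapshot_clean rule does, not the rule itself:

```python
import shutil
import tarfile
from datetime import datetime
from pathlib import Path

def snapshot_clean(workdir: str = ".") -> Path:
    """Archive build outputs into snapshots/<timestamp>.tar.gz, then clean up."""
    root = Path(workdir)
    stamp = datetime.now().strftime("%Y%m%dT%H%M%S")
    staging = root / "snapshots" / stamp
    staging.mkdir(parents=True)

    # Copy the directories worth keeping (if they exist).
    for name in ("auspice", "logs", "reports", "source"):
        if (root / name).is_dir():
            shutil.copytree(root / name, staging / name)

    # Compress the staged snapshot into a .tar.gz archive.
    archive = staging.parent / f"{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(staging, arcname=stamp)
    shutil.rmtree(staging)

    # Delete temporary working folders and the database file.
    for name in ("data", "results", "logs", "reports"):
        shutil.rmtree(root / name, ignore_errors=True)
    (root / "fludb.db").unlink(missing_ok=True)
    return archive
```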
You have two ways to trigger the snapshot_clean rule.

Use this when you only want to clean and archive the current build:

```shell
snakemake snapshot_clean
```

Use this if you've added snapshot_clean to the main pipeline and want it to run automatically at the end:

```shell
snakemake --configfile config.yaml --config run_snapshot_clean=true
```
This approach is recommended if you want every completed build to automatically save a snapshot before cleanup.
📁 Example output structure

After running the rule, you'll find a compressed snapshot of your previous build in the `snapshots/` folder:

```
snapshots/
└── 20251111T163000.tar.gz
```
- Automated report generation for all builds.
- Add t-SNE implementation for all builds using pathogen-embed.
  - Manuscript: https://bedford.io/papers/nanduri-cartography/
- Automated concatenated genome builds for h1n1 and h3n2 (implemented in 9478cb9).
```shell
snakemake --rulegraph | dot -Tpng > rulegraph.png
```
Below is a simplified representation of the rules implemented for each build. Because wildcard constraints have been defined by subtype and segment, this general rulegraph is executed for h1n1, h3n2, and vic for all 8 segments along with the 3 concatenated genome builds.
The pipeline automatically downloads the latest Nextclade dataset with each build, ensuring clade and subclade assignments stay current.
Why use Nextclade instead of augur clades?
This approach differs from the standard Nextstrain workflow and may be controversial. However, Nextclade offers several advantages:
- **Comprehensive QC metrics**: Provides an appendable table with multiple quality control metrics, including:
  - Missing data
  - Problematic sites
  - Private mutations
  - Mutation clusters
  - Frameshifts
  - Stop codons
- **Metadata-centric workflow**: Stores clade information directly in metadata, eliminating the need for a separate `augur clades` step
- **Built-in glycosylation prediction**: Includes glycosylation site prediction functionality
- **Flexible filtering**: QC metrics can be filtered downstream using `augur filter` and inform qualitative decisions during analysis

This approach is open for discussion. See the `augur clades` documentation.
HA clade assignments are appended to metadata for all segments.
Quality control metrics are added to each segment's metadata.
Sequences are filtered by coverage, QC status, and length using augur filter.
"--query", "(coverage >= 0.9) & (`qc.overallStatus` == 'good')", # Add qc_overallStatus == 'mediocre' if needed
"--min-length", str(min_length),- Align
- Build Raw Tree
- Refine branches
- Annotate
- Infer ancestral sequences
- Translate sequences
- Export (auspice V2)
- Upload and deploy the builds to Nextstrain
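The `augur filter` query shown earlier keeps sequences with coverage >= 0.9 and a "good" overall Nextclade QC status, plus a minimum length. The same predicate can be checked in plain Python for quick sanity checks; this is a toy sketch over an in-memory metadata table (the example strains and the simple `length` column are illustrations, not pipeline data):

```python
import csv
import io

# Toy metadata table mimicking the Nextclade-annotated columns used in the query.
METADATA_TSV = """\
strain\tcoverage\tqc.overallStatus\tlength
A/example/1\t0.95\tgood\t1700
A/example/2\t0.80\tgood\t1701
A/example/3\t0.97\tbad\t1702
A/example/4\t0.92\tgood\t900
"""

def passes_filter(row: dict, min_length: int = 1500) -> bool:
    """Mirror the filter: coverage >= 0.9, QC status good, minimum length."""
    return (
        float(row["coverage"]) >= 0.9
        and row["qc.overallStatus"] == "good"
        and int(row["length"]) >= min_length
    )

kept = [
    row["strain"]
    for row in csv.DictReader(io.StringIO(METADATA_TSV), delimiter="\t")
    if passes_filter(row)
]
print(kept)  # only A/example/1 survives all three checks
```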
Automation by Snakemake has increased the efficiency of these builds. The structure, scripts, and configuration files herein are inspired tremendously by the seasonal-flu build maintained by the Nextstrain team.
