This repository provides a comprehensive Nextstrain analysis of "your virus". You can choose to perform either a shorter run with specific proteins or a full genome run.
For those unfamiliar with Nextstrain or needing installation guidance, please refer to the Nextstrain documentation.
- Prerequisites
- Nextstrain Environment
- Repository Organization
- Usage Examples
- Ingest
- Acknowledgments
- Contact
Ensure you have the following installed:
- Python 3.8 or higher
- Micromamba or Conda
- Snakemake 7
- Nextstrain CLI
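A quick way to confirm the prerequisites above is a small check script. The following is a minimal sketch, not part of this repository: the function name `check_prerequisites` is hypothetical, and it assumes each tool is exposed on `PATH` under the executable names `snakemake`, `nextstrain`, `micromamba`, and `conda`. It does not verify the Snakemake version.

```python
import shutil
import sys

# Executable names are assumptions about how each tool appears on PATH;
# this repository does not define them.
REQUIRED_TOOLS = ["snakemake", "nextstrain"]
# Either one of these package managers is enough.
PACKAGE_MANAGERS = ["micromamba", "conda"]


def check_prerequisites(which=shutil.which, version=None):
    """Return a list of problems; an empty list means every check passed."""
    if version is None:
        version = tuple(sys.version_info[:2])
    problems = []
    if tuple(version) < (3, 8):
        problems.append(
            f"Python 3.8 or higher is required, found {version[0]}.{version[1]}"
        )
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"'{tool}' was not found on PATH")
    if not any(which(pm) is not None for pm in PACKAGE_MANAGERS):
        problems.append("neither 'micromamba' nor 'conda' was found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("MISSING:", problem)
```

The `which` and `version` parameters exist only so the check can be exercised without the tools installed; calling `check_prerequisites()` with no arguments inspects the current environment.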
Install the Nextstrain environment by following these instructions.
- Clone the repository:
    git clone git@github.com:hodcroftlab/template_nextstrain.git
    cd template_nextstrain
- Install the Nextstrain environment:
    micromamba create -n nextstrain \
        --override-channels --strict-channel-priority \
        -c conda-forge -c bioconda --yes \
        augur auspice nextclade \
        snakemake=7 git ncbi-datasets-cli
    micromamba activate nextstrain
- Update/install additional dependencies:
    sudo apt-get update
    sudo apt-get install -y unzip
    micromamba install -c conda-forge -c bioconda csvtk seqkit tsv-utils ipdb entrez-direct
    micromamba install -c conda-forge fuzzywuzzy python-dotenv ipykernel
The data for this analysis is available from NCBI Virus. Instructions for downloading sequences are provided under Sequences.
This repository includes the following directories and files:
- scripts: Custom Python scripts called by the snakefile.
- snakefile: The entire computational pipeline, managed using Snakemake. See the Snakemake documentation for details.
- ingest: Contains Python scripts and the snakefile for automatic downloading of <your_virus> sequences and metadata.
- <protein_xy>: Sequences and configuration files for the specific protein_xy run.
- whole_genome: Sequences and configuration files for the whole genome run.
The config, protein_xy/config, and whole_genome/config directories contain necessary configuration files:
- config.yaml: Parameters and options for the analysis
- colors.tsv: Color scheme
- geo_regions.tsv: Geographical locations
- lat_longs.tsv: Latitude and longitude data
- dropped_strains.txt: Accessions to exclude during augur filter
- clades_genome.tsv: Manual clade labels for the Nextstrain tree (see the Nextstrain documentation on labeling clades)
- reference_sequence.gb: Reference sequence (add manually)
- auspice_config.json: Auspice configuration file; must be present in every data folder
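Because a missing configuration file tends to surface only partway through a build, it can help to check the directories up front. The sketch below is illustrative and not part of the pipeline: `missing_config_files` is a hypothetical helper, and it assumes every listed file is expected in each config directory, which may not hold for every build.

```python
from pathlib import Path

# File names are taken from the list above; whether every build needs all of
# them is an assumption of this sketch.
EXPECTED_CONFIG_FILES = [
    "config.yaml",
    "colors.tsv",
    "geo_regions.tsv",
    "lat_longs.tsv",
    "dropped_strains.txt",
    "clades_genome.tsv",
    "reference_sequence.gb",
    "auspice_config.json",
]


def missing_config_files(config_dir, expected=EXPECTED_CONFIG_FILES):
    """Return the expected file names that are absent from config_dir."""
    config_dir = Path(config_dir)
    return [name for name in expected if not (config_dir / name).is_file()]


if __name__ == "__main__":
    # Directory names follow the layout described in this README.
    for directory in ["config", "protein_xy/config", "whole_genome/config"]:
        missing = missing_config_files(directory)
        status = "ok" if not missing else "missing: " + ", ".join(missing)
        print(f"{directory}: {status}")
```

Run from the repository root, this prints one status line per config directory.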
The reference sequence used is XYZ, accession number, sampled in 19XX.
Activate the Nextstrain environment:
    micromamba activate nextstrain
To perform a build, run:
    snakemake --cores 9 all
For specific builds:
- protein_xy build:
    snakemake auspice/<your_virus>_protein_xy.json --cores 9
- Whole genome build:
    snakemake auspice/<your_virus>_whole-genome.json --cores 9
To visualize the build, use Auspice:
    auspice view --datasetDir auspice
To run two visualizations simultaneously, you may need to set the port:
    export PORT=4001
For more information on how to run the ingest, please refer to the README in the ingest folder.
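The build commands in the usage section follow one pattern: the target is either `all` or `auspice/<your_virus>_<build>.json`. As a minimal sketch of that pattern, the helpers below compose the command line for a given build. The function names and the virus name `my_virus` are hypothetical placeholders, not names used by the snakefile.

```python
import shlex


def build_target(virus, build):
    """Return the Auspice JSON path for a build, following the usage commands."""
    # Note: the whole genome target is spelled "whole-genome" (hyphen) in the
    # usage commands, while the directory is named whole_genome (underscore).
    return f"auspice/{virus}_{build}.json"


def snakemake_command(virus, build=None, cores=9):
    """Compose the snakemake invocation as an argument list.

    With build=None the target is "all", matching the full-build command.
    """
    target = "all" if build is None else build_target(virus, build)
    return ["snakemake", target, "--cores", str(cores)]


if __name__ == "__main__":
    # "my_virus" is a placeholder for <your_virus>.
    for build in (None, "protein_xy", "whole-genome"):
        print(shlex.join(snakemake_command("my_virus", build)))
```

The argument list can be passed directly to `subprocess.run` if the build should be launched from Python rather than printed.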
Sequences can be downloaded manually or automatically.
- Manual Download: Visit NCBI Virus, search for <your_virus> or Taxid XXXXXX, and download the sequences.
- Automated Download: The ingest functionality, included in the main snakefile, handles automatic downloading.
The ingest pipeline is based on the Nextstrain RSV ingest workflow. Running the ingest pipeline produces data/metadata.tsv and data/sequences.fasta.
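After an ingest run, one useful sanity check is that the two output files describe the same set of records. The sketch below compares identifiers between them. It is an illustration under stated assumptions: the metadata ID column is assumed to be named `accession`, and the FASTA record ID is assumed to be the first whitespace-delimited token of each header. Neither is confirmed by this README, so adjust `id_column` to match the actual metadata.

```python
import csv


def read_fasta_ids(fasta_path):
    """Collect record IDs: the first token after '>' on each header line."""
    ids = set()
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                if header:
                    ids.add(header.split()[0])
    return ids


def read_metadata_ids(metadata_path, id_column="accession"):
    """Collect non-empty values of id_column from a tab-separated file."""
    with open(metadata_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        if id_column not in (reader.fieldnames or []):
            raise KeyError(f"column '{id_column}' not found in {metadata_path}")
        return {row[id_column] for row in reader if row[id_column]}


def compare_ingest_outputs(
    metadata_path="data/metadata.tsv",
    fasta_path="data/sequences.fasta",
    id_column="accession",  # assumed column name
):
    """Report IDs present in one ingest output but not the other."""
    metadata_ids = read_metadata_ids(metadata_path, id_column)
    fasta_ids = read_fasta_ids(fasta_path)
    return {
        "missing_sequence": sorted(metadata_ids - fasta_ids),
        "missing_metadata": sorted(fasta_ids - metadata_ids),
    }
```

Both lists being empty means every metadata row has a sequence and every sequence has a metadata row.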
For questions or support, please contact [[email protected]].
Feel free to adjust the content according to your project's specifics.