Augur is a Python package to track (and eventually forecast) flu evolution. It currently:
- imports public sequence data
- subsamples, cleans and aligns sequences
- builds a phylogenetic tree from this data
The program is live on Amazon EC2 with results pushed to Amazon S3. The latest JSON-formatted flu tree is available as `tree_streamline.json`. This tree is visualized at blab.github.io/auspice/.
You can run augur across platforms using Docker. An image is available on Docker Hub as `trvrb/augur`. With this public image, you can immediately run augur with:

```
docker pull trvrb/augur
docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" -e "S3_KEY=$S3_KEY" -e "S3_SECRET=$S3_SECRET" -e "S3_BUCKET=$S3_BUCKET" --privileged trvrb/augur
```
This starts up Supervisor to keep augur and its helper programs running, using `supervisord.conf` as the control file.
To run augur, you will need a GISAID account (to pull sequences) and an Amazon S3 account (to push results). Account information is stored in environment variables:
- `GISAID_USER`: GISAID user name
- `GISAID_PASS`: GISAID password
- `S3_KEY`: Amazon S3 key
- `S3_SECRET`: Amazon S3 secret
- `S3_BUCKET`: Amazon S3 bucket
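Inside Python these settings might be read with `os.environ`; a minimal sketch (the `load_credentials` helper and its error message are illustrative, not part of augur):

```python
import os

REQUIRED_VARS = ["GISAID_USER", "GISAID_PASS", "S3_KEY", "S3_SECRET", "S3_BUCKET"]

def load_credentials(env=os.environ):
    """Collect required account settings, failing fast if any are missing."""
    missing = [v for v in REQUIRED_VARS if v not in env]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return {v: env[v] for v in REQUIRED_VARS}
```

Failing fast on startup gives a clearer error than a login failure deep inside the pipeline.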
Full dependency information can be seen in the Dockerfile. To run locally, pull the Docker image with:

```
docker pull trvrb/augur
```

and start up a bash session with:

```
docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" trvrb/augur /bin/bash
```

From here, the build pipeline can be run with:

```
python augur/run.py
```
Selenium is used to automate downloads from GISAID, which requires login access. User credentials are read from the environment variables `GISAID_USER` and `GISAID_PASS`.
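A sketch of what such an automated login might look like. The URL and form-field names below are placeholders (the real GISAID form differs), and the function simply drives whatever Selenium-style `driver` object it is given:

```python
LOGIN_URL = "https://platform.gisaid.org/"  # placeholder URL, not the real endpoint

def gisaid_login(driver, user, password):
    """Fill and submit a login form via a Selenium-style driver.

    The field names ("login", "password") are illustrative placeholders.
    """
    driver.get(LOGIN_URL)
    driver.find_element("name", "login").send_keys(user)
    driver.find_element("name", "password").send_keys(password)
    driver.find_element("name", "password").submit()
```

Taking the driver as a parameter keeps the login flow testable without launching a browser.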
Keeps viruses with full HA1 sequences, fully specified collection dates and cell passage, and retains only one sequence per strain name. Subsamples to at most 100 sequences per month for the 3 years before present.
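The dedup-and-subsample logic can be sketched as follows; the record format and the `subsample` helper are illustrative, assuming each virus carries a strain name and a (year, month) collection date:

```python
from collections import defaultdict

def subsample(viruses, per_month=100):
    """Keep one record per strain name, then cap each (year, month) bin."""
    seen = set()
    bins = defaultdict(list)
    for v in viruses:
        if v["strain"] in seen:
            continue  # only one sequence per strain name
        seen.add(v["strain"])
        bins[(v["year"], v["month"])].append(v)
    out = []
    for key in sorted(bins):
        out.extend(bins[key][:per_month])  # at most per_month per bin
    return out
```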
Align sequences with mafft. In testing, mafft showed a much lower memory footprint than muscle.
Keep only sequences that have the full 1701 bases of HA in the alignment.
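The length check can be expressed as a simple filter over the aligned HA region. This sketch assumes the alignment has been trimmed to the HA coding region, so a full sequence is one with 1701 non-gap characters:

```python
HA_LENGTH = 1701  # full HA coding region, in bases

def has_full_ha(aligned_seq, length=HA_LENGTH):
    """True if the aligned sequence has a real base (not a gap) at every HA column."""
    return sum(1 for c in aligned_seq if c != "-") == length
```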
FastTree is used to get a starting tree; it will build a tree for ~5000 sequences in a few minutes. RAxML is then used to refine this initial tree. A full RAxML run on a tree with ~5000 sequences could take days or weeks, so RAxML is instead run for a fixed 1 hour and the best tree found during this search is kept. This always improves on the FastTree starting tree.
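One way to implement the fixed-time refinement is to run RAxML under a subprocess timeout and keep whatever result it has written so far. The flags, binary name, and file names below are illustrative, not augur's exact invocation:

```python
import subprocess

def build_raxml_command(alignment, starting_tree, run_name="refine"):
    """Assemble an illustrative RAxML command line from a starting tree."""
    return [
        "raxml",
        "-f", "d",           # hill-climbing ML search
        "-s", alignment,      # input alignment
        "-t", starting_tree,  # FastTree result as the starting point
        "-n", run_name,
    ]

def refine_tree(alignment, starting_tree, time_limit=3600):
    """Run RAxML, killing it after time_limit seconds and keeping the best tree so far."""
    try:
        subprocess.run(build_raxml_command(alignment, starting_tree),
                       timeout=time_limit, check=True)
    except subprocess.TimeoutExpired:
        pass  # expected: keep the best tree written before the cutoff
    return "RAxML_result.refine"  # illustrative output file name
```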
Reroot the tree based on outgroup strain, collapse nodes with zero-length branches and ladderize the tree.
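The collapse and ladderize steps can be sketched on a minimal nested-dict tree. This toy structure, with `children` and `length` keys, is illustrative only; augur works on a real phylogenetics library's tree objects:

```python
def collapse_zero_branches(node):
    """Merge internal children attached by zero-length branches into their parent."""
    new_children = []
    for child in node["children"]:
        collapse_zero_branches(child)
        if child["children"] and child["length"] == 0:
            new_children.extend(child["children"])  # promote grandchildren
        else:
            new_children.append(child)
    node["children"] = new_children

def tip_count(node):
    """Number of tips (leaves) under this node."""
    if not node["children"]:
        return 1
    return sum(tip_count(c) for c in node["children"])

def ladderize(node):
    """Sort children so smaller subtrees come first, recursively."""
    for child in node["children"]:
        ladderize(child)
    node["children"].sort(key=tip_count)
```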