Add 16 NSPs #1081

I looked into the feasibility of adding the 16 NSPs to the exported (Auspice) dataset. This will need Nextclade v3, because RdRp spans the ribosomal slip site, so it's perhaps a good time to make some bigger changes too. (We've decided not to modify the ORF1a / ORF1b annotations; see the discussion [on slack](https://bedfordlab.slack.com/archives/CSKMU6YUC/p1692135824094549).)

* Nextclade does the translations, so we need to update the `genemap.gff` for Nextclade's 'sars-cov-2' dataset.
* Our ancestral reconstruction of the translations (`rule translate`) is what creates the annotations block in the JSON. It currently reads the annotations from `defaults/reference_seq.gb`, and nothing else uses that file.
  * We can shift the reconstruction to `augur ancestral`, and either keep the script to generate the JSON annotations or (preferred) keep a JSON representation of the annotations block in the repo and use that directly. (We'll want more than just the coordinates in the JSON - we'll want to add extra display names / colours / descriptions, the descriptions being important to explain why we use ORF1a + ORF1b!)
  * This would allow us to remove the GenBank file.
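
To make the "JSON representation of the annotations block" idea concrete, here is a rough sketch of how such a file could be generated and sanity-checked. The key names (`segments`, `display_name`, `description`) are assumptions and need checking against the Auspice / augur export schema, the helper names are made up for illustration, and the coordinates are Wuhan-Hu-1 values that should be verified against the reference before use. RdRp is written as two segments which share position 13468, which is how the slip site shows up in the coordinates.

```python
import json


def cds_length(segments):
    """Total nucleotide length of a CDS given 1-based, inclusive segments."""
    return sum(seg["end"] - seg["start"] + 1 for seg in segments)


def make_annotation(segments, strand="+", **extra):
    """Build one annotation entry (hypothetical helper, not part of augur).

    Single-segment CDSs use plain start/end; multi-segment CDSs use a
    'segments' list (assumed key name). Extra keys such as display_name
    or description are passed through unchanged.
    """
    length = cds_length(segments)
    if length % 3 != 0:
        raise ValueError(f"CDS length {length} is not a multiple of 3")
    if len(segments) == 1:
        entry = {"start": segments[0]["start"], "end": segments[0]["end"]}
    else:
        entry = {"segments": [dict(seg) for seg in segments]}
    entry["strand"] = strand
    entry.update(extra)
    return entry


annotations = {
    # 'nuc' describes the whole genome, so no codon-length check applies.
    "nuc": {"start": 1, "end": 29903, "strand": "+"},
    "ORF1a": make_annotation(
        [{"start": 266, "end": 13468}],
        description="Kept alongside ORF1b rather than a single ORF1ab",
    ),
    # The second segment re-reads position 13468 (the -1 frameshift).
    "RdRp": make_annotation(
        [{"start": 13442, "end": 13468}, {"start": 13468, "end": 16236}],
        display_name="RdRp (nsp12)",
    ),
}

if __name__ == "__main__":
    print(json.dumps(annotations, indent=2))
```

Keeping the length check next to the data means a typo in a hand-edited coordinate fails loudly rather than producing a silently frame-shifted translation.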


Other things noticed / improvements we could make:
* The `workflow-config-file.rst` has fallen out of date. This is seemingly inevitable with documentation, but this is a good chance to improve it.
* We don't use any Nextclade datasets other than 'sars-cov-2'. I assumed we'd use the 'sars-cov-2-21L' dataset for our 21L builds, and we have config settings that allow this, but I don't think we actually do.
* `rule align` uses Nextalign, with a fasta + gff from the ncov repo. Why don't we replace the fasta + gff with the Nextclade dataset we fetch later on in the process?
  * My understanding of Nextclade v3 is that we'll replace Nextalign with Nextclade in this step anyway.
* `rule build_mutation_summary` and `rule mutation_summary` seem unused. If these can be removed, we could then remove `defaults/reference_seq.fasta` (`alignment_reference`) and `defaults/annotation.gff` (`annotation`). If the rules are still in use, we may want to use the Nextclade dataset files anyway.
  * The second rule here (`rule mutation_summary`) is the only place we use the translations from `rule align`, so we may be able to avoid translating every genome.
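
One quick way to get a first read on the "seem unused" point is to list rules that nothing refers to via `rules.<name>`. The sketch below is a heuristic only: Snakemake also links rules by matching file paths, and target rules are never referenced by other rules, so anything it reports still needs checking by hand. The function name and the example rule bodies and file paths are made up for illustration.

```python
import re

# A rule definition at the start of a line, e.g. "rule align:".
RULE_DEF = re.compile(r"^rule\s+(\w+)\s*:", re.MULTILINE)
# A reference to another rule, e.g. "rules.align.output.translations".
RULE_REF = re.compile(r"\brules\.(\w+)\.")


def unreferenced_rules(snakefile_text):
    """Return rule names (in definition order) never used as rules.<name>."""
    defined = RULE_DEF.findall(snakefile_text)
    referenced = set(RULE_REF.findall(snakefile_text))
    return [name for name in defined if name not in referenced]


# Hypothetical, trimmed-down workflow text; not the real ncov rules.
example = '''
rule align:
    output:
        translations = "results/translations.done"

rule build_mutation_summary:
    output:
        summary = "results/mutation_summary_raw.tsv"

rule mutation_summary:
    input:
        translations = rules.align.output.translations
    output:
        summary = "results/mutation_summary.tsv"
'''

if __name__ == "__main__":
    print(unreferenced_rules(example))
```

For the real workflow, the text of all the included Snakemake files would need to be concatenated before being passed in, since references can cross files.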



