Open
Description
I looked into the feasibility of adding the 16 NSPs to the exported (Auspice) dataset. This will need Nextclade v3, since RdRp includes the slip site, so it may be a good time to make some bigger changes too. (We've decided not to modify the ORF1a and ORF1b annotations; see the discussion on Slack.)
- Nextclade does the translations, so we need to update the `genemap.gff` for Nextclade's 'sars-cov-2' dataset.
- Our ancestral reconstruction of the translations (`rule translate`) is what creates the annotations block in the JSON. This currently uses `defaults/reference_seq.gb` for the annotations, and nothing else uses this.
  - We can shift the reconstruction to `augur ancestral`, and either keep the script to generate the JSON annotations, or (preferred) just keep a JSON representation of the annotations block in the repo and use this. (We'll want to have more than just the coordinates in the JSON: we'll want to add some extra display names / colours / descriptions; the latter being important to explain why we use ORF1a + ORF1b!)
  - This will allow us to remove this GenBank file.
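As a rough illustration of the preferred option, here is a minimal Python sketch of what a hand-maintained annotations block could look like, plus a sanity check that each CDS is a whole number of codons. The key names (`segments`, `display_name`, `description`, `color`) reflect my understanding of the Auspice dataset schema and are assumptions to verify against the schema. The coordinates are illustrative values for the reference genome and should be checked against the reference annotation. The colour and description are placeholders, and `cds_length` / `out_of_frame` are hypothetical helpers, not existing code in the repo.

```python
import json

# Hypothetical annotations block. Key names and coordinates are assumptions
# and must be checked against the Auspice schema and the reference annotation.
ANNOTATIONS = {
    "nuc": {"start": 1, "end": 29903, "strand": "+"},
    "ORF1a": {
        "start": 266,
        "end": 13468,
        "strand": "+",
        "display_name": "ORF1a",
        "description": "Placeholder: explain here why ORF1a + ORF1b are used.",
    },
    "ORF1b": {"start": 13468, "end": 21555, "strand": "+"},
    # RdRp (nsp12) includes the slip site, so it is written as two segments
    # that share one nucleotide rather than as a single start/end pair.
    "nsp12": {
        "strand": "+",
        "display_name": "RdRp (nsp12)",
        "color": "#4c78a8",  # placeholder colour
        "segments": [
            {"start": 13442, "end": 13468},
            {"start": 13468, "end": 16236},
        ],
    },
}


def cds_length(feature):
    """Total nucleotide length of a feature, summed over its segments."""
    segments = feature.get("segments") or [feature]
    return sum(seg["end"] - seg["start"] + 1 for seg in segments)


def out_of_frame(annotations):
    """Names of CDS features whose length is not a multiple of three."""
    return [
        name
        for name, feature in annotations.items()
        if name != "nuc" and cds_length(feature) % 3 != 0
    ]


if __name__ == "__main__":
    problems = out_of_frame(ANNOTATIONS)
    if problems:
        raise SystemExit(f"CDS not a multiple of 3 nt: {problems}")
    print(json.dumps(ANNOTATIONS, indent=2))
```

A check like this could run in CI, so that edits to the checked-in JSON which break the reading frame across the slip site are caught before export.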
Other things noticed / improvements we could make:
- The `workflow-config-file.rst` has fallen out of date. This is seemingly inevitable with documentation, but this is a good chance to improve it.
- We don't use any Nextclade datasets other than 'sars-cov-2'; I assumed we'd use the 'sars-cov-2-21L' dataset for our 21L builds, and we have config settings to allow this, but I don't think we do.
- `rule align` uses Nextalign, with a fasta + gff from the ncov repo. Why don't we replace the fasta + gff with the Nextclade dataset we fetch later on in the process?
  - My understanding of Nextclade v3 is we'll replace Nextalign with Nextclade in this step anyway.
- `rule build_mutation_summary` and `rule mutation_summary` seem unused. If these can be removed, we could then remove `defaults/reference.seq.fasta` (`alignment_reference`) and `defaults/annotation.gff` (`annotation`). If the rules are still in use, we may want to use the Nextclade dataset files anyway.
  - The 2nd rule here is the only place we use the translations from `rule align`, so we may be able to avoid translating every genome.