Add Dengue virus DENVx genotypes dataset #203
base: master
Cool! I'll just drop short links for testing: Sadly I don't have any example sequences to run :( Are there any sequences with permissive licenses that could be added to the dataset as example sequences? It would be nice to fill in some info in the readme if you have a second. The readme is an optional file though. |
I can add some example sequences, it may take me a moment (aka. not in the next hour). |
@j23414 No worries at all. I will not be able to assess the coolness of it anyway, because I lack the required science knowledge. But I'll happily test how it runs and whether any bugs manifest themselves :) |
FWIW here are some arbitrarily chosen dengue sequences that I use as examples on dev.usher.bio: https://www.ncbi.nlm.nih.gov/nuccore/OQ605998.1 And NCBI Virus can provide a bunch: https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/virus?SeqType_s=Nucleotide&VirusLineage_ss=Dengue%20virus,%20taxid:12637 |
Haha, I also lack the required science knowledge. I'm mostly wandering in the dark. |
We need that! @jamessiqueirap and I proposed a lineage system for dengue. Our work utilizes the genotype mutations table from Nextstrain Dengue. |
A small technical suggestion. These sequences seem to contain many mutations - too many for the browser SVG engine to render efficiently in Nextclade's sequence views. If there's a clear "main" gene/CDS of interest for this virus, then one workaround would be to set the default CDS in pathogen.json, so that the sequence view automatically switches to it when first rendering:

    "defaultCds": "S",

The nucleotide sequence will of course still be available in the dropdown, but users will pay the associated performance price only if they switch to it.

On a related note, if you need to customize the order of genes in the dropdown, you could also add:

    "cdsOrderPreference": [
      "S",
      "N",
      "M",
      "E"
    ],

Both are just eye-candy features, so no rush. The ultimate solution will be to implement a more performant sequence viewer in Nextclade, but this is quite far away. |
Thanks @ivan-aksamentov! I think the "main" gene/CDS of interest for dengue is the E gene, and sometimes it's the only portion of the genome sequenced based on this user comment. I can add this to the pathogen.json files following this pattern: `"defaultCds": "E",`

Good to know about customizing the order of genes in the dropdown! Right now, the dropdown menu matches the gene/CDS order in the genome, which feels logical and straightforward to me. However, I welcome differing perspectives on this matter from others in the field. Open to alternatives or potential improvements. |
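The `defaultCds` change discussed above could be applied to each serotype's pathogen.json with a small helper. This is only a sketch: the function name `set_default_cds` and the minimal example mapping are hypothetical, and only the `defaultCds` and `cdsOrderPreference` keys come from the comments in this thread.

```python
import json
from typing import Optional, Sequence


def set_default_cds(pathogen: dict, default_cds: str,
                    order: Optional[Sequence[str]] = None) -> dict:
    """Return a shallow copy of a pathogen.json mapping with the default CDS set.

    `order` is optional; when omitted, the dropdown keeps its existing order
    (in this thread: the gene/CDS order in the genome).
    """
    patched = dict(pathogen)  # copy so the caller's mapping is not modified
    patched["defaultCds"] = default_cds
    if order is not None:
        patched["cdsOrderPreference"] = list(order)
    return patched


# Hypothetical, heavily trimmed pathogen.json content, for illustration only.
example = {"files": {"reference": "reference.fasta"}}
patched = set_default_cds(example, "E")
print(json.dumps(patched, indent=2))
```

Leaving `order` unset matches the choice made here of keeping the dropdown in genome order.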
Since dengue sequences seem to contain many mutations - too many for the browser SVG engine to render efficiently in Nextclade's sequence views - we will set the default CDS to the E gene as the "main" gene of interest. The full genome and other gene/CDS regions can still be displayed by selecting them from the dropdown menu at the top. Flagged by the following comment: nextstrain/nextclade_data#203 (comment)
A few additional remarks:
|
Incorporated some changes suggested by comment: #203 (comment)
* Fixed pathogen.json for genotype-level dataset to include the example sequences fasta nextstrain/dengue@c029f1d
* Enabled stop and frameshift QC nextstrain/dengue@90523a7
* Include reconstructed ancestor for the genotype-level dataset nextstrain/dengue@616979c
Thanks @rneher! I tried to incorporate your suggested changes in 610e3f5
An oversight on my part, fixed.
I had turned off several QC during development since dengue sequences seemed very divergent. I agree with adding stop and frameshift QC back in, done.
For genotype-level datasets (denv1-4), I swapped in the inferred ancestral root as both the reference and the root of the tree. Done, although I could use help in evaluating the genotype-level datasets or any suggested next steps. I thought about blasting a serotype's sequences against the other 3 serotypes to find the nearest cross-serotype outgroup, but wasn't sure if that would be more or less effective than the inferred ancestral root. Another option would be using the other 3 serotypes' inferred ancestral roots as outgroups. I wasn't sure, but suggestions welcome. |
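Re-enabling the stop codon and frameshift QC rules mentioned above amounts to flipping two flags in the `qc` section of pathogen.json. The sketch below assumes the rules live under `qc.stopCodons.enabled` and `qc.frameShifts.enabled`; those key names are an assumption about the Nextclade QC config layout and are not shown in this thread, and `enable_qc_rules` is a hypothetical helper.

```python
import copy


def enable_qc_rules(pathogen, rules=("stopCodons", "frameShifts")):
    """Return a deep copy of a pathogen.json mapping with the given QC rules enabled.

    Assumption: each rule is an object under "qc" with a boolean "enabled" field.
    """
    patched = copy.deepcopy(pathogen)
    qc = patched.setdefault("qc", {})
    for rule in rules:
        # Create the rule entry if it is missing, then switch it on.
        qc.setdefault(rule, {})["enabled"] = True
    return patched


# Hypothetical config with both rules turned off, as during development.
example = {"qc": {"stopCodons": {"enabled": False},
                  "frameShifts": {"enabled": False}}}
patched = enable_qc_rules(example)
print(patched["qc"])
```

Other fields of each rule, if present, are left untouched; only `enabled` is changed.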
I think the root should be given a clade |
Gah, I must have copied in the wrong dataset files. I was experimenting with using the "dengue/all" reconstructed root for all 4 serotypes. However, as you observed, it was giving me weird genotype calls (e.g. DENV2 genotypes in the DENV1 tree). I'll copy the correct ones (and double check this time) in a moment. |
I allowed myself to resolve the merge conflict that appeared after merging measles #202 |
Thanks, Jennifer. The dataset also contains the genotype annotations. If these are good, you could enable them by adding them to the … Also, the example data contain two sequences that don't align. That is not a problem per se if these sequences are very weird (and having examples of bad sequences is fine), but if this is unexpected, then one could maybe tune parameters. |
Thanks for the question @rneher! Some clarification that the genotype annotations (named The more concerning problem occurs when we zoom into individual serotype trees (e.g. DENV2) where the
I believe @trvrb was going to explore modifying aa-mut defining mutations in clades_genotypes.tsv to apply to the "all" tree. Currently the aa-coords are by serotype reference (e.g. against the DENV1 reference, against the DENV3 reference which has a two amino acid deletion in E gene, etc). |
Thanks for flagging! I assume it's an example sequence with |
@j23414 I think the reason Richard is asking about the unalignable or otherwise "broken" (from the point of view of Nextclade results) example sequences is that we had a situation with the SC2 dataset where users came to us confused after trying Nextclade with the example sequences and receiving error or warning messages. They thought they did something wrong or that there was a bug. So we have tried to keep examples nice and high quality since then. On the one hand, a "broken" sequence might tell a story about some interesting science fact or just showcase how Nextclade handles that particular situation technically - which is interesting. On the other hand, without context it might be unclear for the target audience. If you plan on keeping these samples, then perhaps you could explain the details in the readme. Alternatively, there might be a sciency solution, as you mentioned, to make them "good". Otherwise you could just delete the bad examples to avoid the trouble. |
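One way to spot such "broken" examples before publishing a dataset is to scan Nextclade's TSV results for rows with a non-empty error field. The sketch below assumes the results table has `seqName` and `errors` columns; the helper name, the sample rows, and the error text are all hypothetical and for illustration only.

```python
import csv
import io


def sequences_with_errors(tsv_text):
    """Return the names of sequences whose `errors` column is non-empty."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row["seqName"] for row in reader
            if (row.get("errors") or "").strip()]


# Hypothetical results table: one clean sequence and one that failed to align.
sample_tsv = (
    "seqName\terrors\n"
    "example_ok\t\n"
    "example_unaligned\tUnable to align\n"
)
flagged = sequences_with_errors(sample_tsv)
print(flagged)
```

Sequences flagged this way could then be either documented in the readme or dropped from the examples, as suggested above.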
d9db8ce to b3cc967
Add a dengue dataset to Nextclade.