-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Dengue virus All serotypes dataset #208
Conversation
Thanks @ivan-aksamentov ! Fixed the README to resolve
|
all
) dengue dataset
I noticed that in gene "E" at position I am wondering why reference sequence has a stop codon in the middle of a gene (the gene is supposed to end here?), but none of the examples have it (so they keep going?). And then the genome annotation also disregards this fact and continues on to 495 amino acids total in the gene. Is this something that's expected? Perhaps specifics of this particular virus? (again, I have virtually no knowledge about it, I am sorry). Could there be an error in ref translation (i.e. incorrect frame - all genes seems to be annotated in frame 3 currently - if you switch to nuc sequence and look at the visualization of the annotation at the bottom of results table - it's completely flat) or something else? |
The reference in this case is the "Reconstructed_root_sequence_of_https_nextstrain_org_dengue/all/genome". I wonder if the reconstructed root of a tree with DENV1-DENV4 viruses which has varying length E genes results in an average inferred root E gene with a stop codon in the middle. I'm not sure...how to fix this. Open to other interpretations, or discussions. The align rule for creating a nextclade dataset is here, the ancestral root was created in commit nextstrain/dengue@82007bc, and gff inferred in commit nextstrain/dengue@9a3f203 |
Good catch on the stop codon in the reference, @ivan-aksamentov! I wouldn't expect the reference here to necessarily be a valid coding sequence, since it is an ancestral sequence reconstruction across four highly divergent virus serotypes. For the purposes of this "all serotypes" dataset, the main goal is to coarsely classify the input sequences as DENV1, 2, 3, 4, or none of the above. This dataset seems to accomplish that general goal with maybe one slight issue. When I ran the example dengue sequences with this dataset, I noticed that most of the DENV2 samples got flagged as "mediocre" QC because of one "unexpected frameshift" and one of the DENV2 samples got a "bad" QC because of two unexpected frameshifts. I would bet that this frameshift is related to the use of an ancestral sequence as the reference and does not reflect an actual issue with the sequences, especially since it occurs in all of the sequences for a specific serotype. I know @rneher recommended enabling the frameshifts QC in the original PR with genotype-level classification, but for coarser serotype-level classification where the amino acid sequence is less important than the nucleotide sequence I wonder whether we could disable this frameshift QC? Another option would be to add these few frameshifts from the example data to a list of known frameshifts (NS5:632-640, NS5:633-638, NS4B:14-20, NS4B:17-23). Fixing this QC reporting issue could prevent people from accidentally throwing out valid classification of an entire serotype, if they have a QC filter in their workflows that only accepts "good" sequences. |
Dropped the frameShift QC as suggested above, thanks all! I ran a quick check to make sure the dataset is classifying as expected: I believe this is a reasonable dataset for the below:
And would provide the basis for future improvements if any. This is ready for review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks great, @j23414!
The DENV-2 sequences have a compensated frameshift. The 5 base deletion is followed by a two base insertion. It could be that with an even stronger gap penalty, these gaps would disappear. One could 'rescue' the frameshift QC by adding this as a known frameshift. |
Hello all, thanks for all the feedback and fixes! I may need more explicit guidance regarding this PR. @rneher, could you please clarify if you are requesting the addition of known frameshifts somewhere in the pathogen.json file within this pull request, or if this is intended as potential future work to be addressed after the current pull request? If the addition of known frameshifts is required, I have conducted a code search for "frameshift". Should I follow the same template as the one provided in the following link? Alternatively, is there any documentation available that outlines the proper procedure for identifying and adding known frameshifts? Guidance greatly appreciated on this matter to ensure I proceed correctly. |
Sorry, no clear docs, other than this small and outdated example, but, to clarify, you can add a frame shift to the list of ignored in
and then it will be disregarded by the QC score. Note that the ranges here are 0-based half-open (right boundary is not included into the range), like in most programming languages, but for example in Web UI and TSV you will see 1-based closed ranges - the general convention in bioinformatics. Not sure about identification though, and whether this is required or can be added later. P.S. In this repo it is the most useful to search in the |
Flagged by #208 (comment) With help from #208 (comment)
I added one frameshift to test the instructions (2e21708). Thanks @ivan-aksamentov! But dropped it as I quickly realized that identifying known frameshifts would be a complex task... without documentation on the process. |
Is there anything blocking this from being merged? Also, who has merge-power in this PR? Based on commit history, mostly people on the Nextclade side? |
Fix typo in the title of the README file. Match the dataset name to pathogen.json.
smof grep -v "?" sequences.fasta > temp.fasta mv temp.fasta sequences.fasta
911a5cd
to
714be98
Compare
I don't have any technical reasons to hold it. If scientists are happy, then we are ready to go.
I think it is configured equally for core, so basically anyone. I answered on Slack with some more details about different branches and deploymens: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1718427802913589?thread_ts=1716943598.732809&cid=C01LCTT7JNN
That's because it's rarely that someone else contributes a dataset 😆 You are one of the early explorers of this process, so it might feel a little frustrating at times, sorry. |
Adding a serotype-level (
all
) dengue dataset, since it is functioning as expected.Splitting this off of #203 since genotype-level datasets require further improvement to meet the desired standards.
https://clades.nextstrain.org/?dataset-server=gh:@add-serotype-level-dengue-dataset@&dataset-name=nextstrain/dengue/all