Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Dengue virus All serotypes dataset #208

Merged
merged 8 commits into from
Jun 16, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jun 7, 2024

Adding a serotype-level (all) dengue dataset, since it is functioning as expected.

Splitting this off of #203 since genotype-level datasets require further improvement to meet the desired standards.

https://clades.nextstrain.org/?dataset-server=gh:@add-serotype-level-dengue-dataset@&dataset-name=nextstrain/dengue/all

@j23414 j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 06:18 — with GitHub Actions Inactive
@j23414 j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 16:09 — with GitHub Actions Inactive
@j23414
Copy link
Contributor Author

j23414 commented Jun 7, 2024

Thanks @ivan-aksamentov ! Fixed the README to resolve

  • typo in name
  • match dataset name in README to pathogen.json
  • and plan to go forward with "Dengue virus All serotypes" ... "Dengue virus DENVx genotypes" <- I agree there should be consistency here : )

@j23414 j23414 changed the title Add serotype-level (all) dengue dataset Add Dengue virus All serotypes dataset Jun 7, 2024
@j23414 j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 21:19 — with GitHub Actions Inactive
@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jun 10, 2024

I noticed that in gene "E" at position 227 there is always (in example sequences at least) an amino acid mutation from * (stop codon) to either E, Q or S.

I am wondering why reference sequence has a stop codon in the middle of a gene (the gene is supposed to end here?), but none of the examples have it (so they keep going?). And then the genome annotation also disregards this fact and continues on to 495 amino acids total in the gene. Is this something that's expected? Perhaps specifics of this particular virus? (again, I have virtually no knowledge about it, I am sorry). Could there be an error in ref translation (i.e. incorrect frame - all genes seems to be annotated in frame 3 currently - if you switch to nuc sequence and look at the visualization of the annotation at the bottom of results table - it's completely flat) or something else?

@j23414
Copy link
Contributor Author

j23414 commented Jun 11, 2024

why reference sequence has a stop codon in the middle of a gene

The reference in this case is the "Reconstructed_root_sequence_of_https_nextstrain_org_dengue/all/genome". I wonder if the reconstructed root of a tree with DENV1-DENV4 viruses which has varying length E genes results in an average inferred root E gene with a stop codon in the middle. I'm not sure...how to fix this. Open to other interpretations, or discussions.

The align rule for creating a nextclade dataset is here, the ancestral root was created in commit nextstrain/dengue@82007bc, and gff inferred in commit nextstrain/dengue@9a3f203

@j23414
Copy link
Contributor Author

j23414 commented Jun 11, 2024

Just an update that my current hypothesis is that the average of DENV1-DENV4 nucleotide codons is creating a stop codon.

Screenshot 2024-06-11 at 11 31 00 AM

@huddlej
Copy link
Contributor

huddlej commented Jun 12, 2024

Good catch on the stop codon in the reference, @ivan-aksamentov! I wouldn't expect the reference here to necessarily be a valid coding sequence, since it is an ancestral sequence reconstruction across four highly divergent virus serotypes. For the purposes of this "all serotypes" dataset, the main goal is to coarsely classify the input sequences as DENV1, 2, 3, 4, or none of the above. This dataset seems to accomplish that general goal with maybe one slight issue.

When I ran the example dengue sequences with this dataset, I noticed that most of the DENV2 samples got flagged as "mediocre" QC because of one "unexpected frameshift" and one of the DENV2 samples got a "bad" QC because of two unexpected frameshifts. I would bet that this frameshift is related to the use of an ancestral sequence as the reference and does not reflect an actual issue with the sequences, especially since it occurs in all of the sequences for a specific serotype.

QC status of example dengue sequences

I know @rneher recommended enabling the frameshifts QC in the original PR with genotype-level classification, but for coarser serotype-level classification where the amino acid sequence is less important than the nucleotide sequence I wonder whether we could disable this frameshift QC? Another option would be to add these few frameshifts from the example data to a list of known frameshifts (NS5:632-640, NS5:633-638, NS4B:14-20, NS4B:17-23).

Fixing this QC reporting issue could prevent people from accidentally throwing out valid classification of an entire serotype, if they have a QC filter in their workflows that only accepts "good" sequences.

@huddlej
Copy link
Contributor

huddlej commented Jun 12, 2024

I just looked into this a bit more by aligning the sequences.fasta in this dataset to the reference.fasta with augur align and inspecting in AliView. The most common frameshift I mentioned above (NS5:632-640) appears in the screenshot below where there is a 5 nucleotide deletion in all of the DENV2 sequences relative to the reference and a 3 nucleotide deletion in the DENV3 sequences (coloring of triplets here is by amino acid codon).

image

When I aligned just the subset of DENV2 sequences, they don't have any frameshift at that position. There are some still some lower quality example sequences like OR389325_DENV2 (3168 Ns) and MZ312931_DENV2 (333 Ns), but it does seem like the frameshift you get from Nextclade alignment to this PR's dataset is a reference-specific issue.

@j23414 j23414 temporarily deployed to refs/pull/208/merge June 12, 2024 22:57 — with GitHub Actions Inactive
@j23414
Copy link
Contributor Author

j23414 commented Jun 13, 2024

Dropped the frameShift QC as suggested above, thanks all! I ran a quick check to make sure the dataset is classifying as expected:

Screenshot 2024-06-12 at 5 16 28 PM

I believe this is a reasonable dataset for the below:

main goal is to coarsely classify the input sequences as DENV1, 2, 3, 4,

And would provide the basis for future improvements if any. This is ready for review.

Copy link
Contributor

@huddlej huddlej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great, @j23414!

@rneher
Copy link
Member

rneher commented Jun 13, 2024

The DENV-2 sequences have a compensated frameshift. The 5 base deletion is followed by a two base insertion. It could be that with an even stronger gap penalty, these gaps would disappear.

One could 'rescue' the frameshift QC by adding this as a known frameshift.

@j23414
Copy link
Contributor Author

j23414 commented Jun 14, 2024

Hello all, thanks for all the feedback and fixes! I may need more explicit guidance regarding this PR.

@rneher, could you please clarify if you are requesting the addition of known frameshifts somewhere in the pathogen.json file within this pull request, or if this is intended as potential future work to be addressed after the current pull request?

If the addition of known frameshifts is required, I have conducted a code search for "frameshift". Should I follow the same template as the one provided in the following link?

Alternatively, is there any documentation available that outlines the proper procedure for identifying and adding known frameshifts?

Guidance greatly appreciated on this matter to ensure I proceed correctly.

@ivan-aksamentov
Copy link
Member

ivan-aksamentov commented Jun 14, 2024

@j23414

Alternatively, is there any documentation available that outlines the proper procedure for identifying and adding known frameshifts?

Sorry, no clear docs, other than this small and outdated example, but, to clarify, you can add a frame shift to the list of ignored in qc.frameShifts.ignoredFrameShifts in this format:

{
"codonRange": {
"begin": 141,
"end": 143
},
"cdsName": "S"
},

and then it will be disregarded by the QC score. Note that the ranges here are 0-based half-open (right boundary is not included into the range), like in most programming languages, but for example in Web UI and TSV you will see 1-based closed ranges - the general convention in bioinformatics.

Not sure about identification though, and whether this is required or can be added later.

P.S. In this repo it is the most useful to search in the data/ directory where the sources are (the data_output/ is auto-generated), and SC2 datasets are one of the most fully-featured: repo:nextstrain/nextclade_data path:data/**/sars*/** frameshift

j23414 added a commit that referenced this pull request Jun 14, 2024
@j23414 j23414 deployed to refs/pull/208/merge June 14, 2024 17:56 — with GitHub Actions Active
@j23414
Copy link
Contributor Author

j23414 commented Jun 14, 2024

I added one frameshift to test the instructions (2e21708). Thanks @ivan-aksamentov!

But dropped it as I quickly realized that identifying known frameshifts would be a complex task... without documentation on the process.

@j23414
Copy link
Contributor Author

j23414 commented Jun 14, 2024

Is there anything blocking this from being merged?

Also, who has merge-power in this PR? Based on commit history, mostly people on the Nextclade side?

j23414 and others added 4 commits June 14, 2024 17:06
Fix typo in the title of the README file.
Match the dataset name to pathogen.json.
@j23414 j23414 force-pushed the add-serotype-level-dengue-dataset branch from 911a5cd to 714be98 Compare June 15, 2024 00:07
@ivan-aksamentov
Copy link
Member

@j23414

Is there anything blocking this from being merged?

I don't have any technical reasons to hold it. If scientists are happy, then we are ready to go.

Also, who has merge-power in this PR?

I think it is configured equally for core, so basically anyone. I answered on Slack with some more details about different branches and deploymens: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1718427802913589?thread_ts=1716943598.732809&cid=C01LCTT7JNN

Based on commit history, mostly people on the Nextclade side?

That's because it's rarely that someone else contributes a dataset 😆

You are one of the early explorers of this process, so it might feel a little frustrating at times, sorry.

@rneher rneher merged commit cb3f8de into master Jun 16, 2024
@rneher rneher deleted the add-serotype-level-dengue-dataset branch June 16, 2024 16:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants