Add Dengue virus All serotypes dataset #208

j23414 · 2024-06-07T06:18:52Z

Adding a serotype-level (all) dengue dataset, since it is functioning as expected.

Splitting this off of #203 since genotype-level datasets require further improvement to meet the desired standards.

https://clades.nextstrain.org/?dataset-server=gh:@add-serotype-level-dengue-dataset@&dataset-name=nextstrain/dengue/all

data/nextstrain/dengue/all/README.md

data/nextstrain/dengue/all/pathogen.json

j23414 · 2024-06-07T16:11:42Z

Thanks @ivan-aksamentov ! Fixed the README to resolve

typo in name
match dataset name in README to pathogen.json
and plan to go forward with "Dengue virus All serotypes" ... "Dengue virus DENVx genotypes" <- I agree there should be consistency here : )

ivan-aksamentov · 2024-06-10T23:51:49Z

I noticed that in gene "E" at position 227 there is always (in example sequences at least) an amino acid mutation from * (stop codon) to either E, Q or S.

I am wondering why reference sequence has a stop codon in the middle of a gene (the gene is supposed to end here?), but none of the examples have it (so they keep going?). And then the genome annotation also disregards this fact and continues on to 495 amino acids total in the gene. Is this something that's expected? Perhaps specifics of this particular virus? (again, I have virtually no knowledge about it, I am sorry). Could there be an error in ref translation (i.e. incorrect frame - all genes seems to be annotated in frame 3 currently - if you switch to nuc sequence and look at the visualization of the annotation at the bottom of results table - it's completely flat) or something else?

j23414 · 2024-06-11T17:01:10Z

why reference sequence has a stop codon in the middle of a gene

The reference in this case is the "Reconstructed_root_sequence_of_https_nextstrain_org_dengue/all/genome". I wonder if the reconstructed root of a tree with DENV1-DENV4 viruses which has varying length E genes results in an average inferred root E gene with a stop codon in the middle. I'm not sure...how to fix this. Open to other interpretations, or discussions.

The align rule for creating a nextclade dataset is here, the ancestral root was created in commit nextstrain/dengue@82007bc, and gff inferred in commit nextstrain/dengue@9a3f203

j23414 · 2024-06-11T18:53:55Z

Just an update that my current hypothesis is that the average of DENV1-DENV4 nucleotide codons is creating a stop codon.

huddlej · 2024-06-12T17:52:09Z

Good catch on the stop codon in the reference, @ivan-aksamentov! I wouldn't expect the reference here to necessarily be a valid coding sequence, since it is an ancestral sequence reconstruction across four highly divergent virus serotypes. For the purposes of this "all serotypes" dataset, the main goal is to coarsely classify the input sequences as DENV1, 2, 3, 4, or none of the above. This dataset seems to accomplish that general goal with maybe one slight issue.

When I ran the example dengue sequences with this dataset, I noticed that most of the DENV2 samples got flagged as "mediocre" QC because of one "unexpected frameshift" and one of the DENV2 samples got a "bad" QC because of two unexpected frameshifts. I would bet that this frameshift is related to the use of an ancestral sequence as the reference and does not reflect an actual issue with the sequences, especially since it occurs in all of the sequences for a specific serotype.

I know @rneher recommended enabling the frameshifts QC in the original PR with genotype-level classification, but for coarser serotype-level classification where the amino acid sequence is less important than the nucleotide sequence I wonder whether we could disable this frameshift QC? Another option would be to add these few frameshifts from the example data to a list of known frameshifts (NS5:632-640, NS5:633-638, NS4B:14-20, NS4B:17-23).

Fixing this QC reporting issue could prevent people from accidentally throwing out valid classification of an entire serotype, if they have a QC filter in their workflows that only accepts "good" sequences.

huddlej · 2024-06-12T18:24:20Z

I just looked into this a bit more by aligning the sequences.fasta in this dataset to the reference.fasta with augur align and inspecting in AliView. The most common frameshift I mentioned above (NS5:632-640) appears in the screenshot below where there is a 5 nucleotide deletion in all of the DENV2 sequences relative to the reference and a 3 nucleotide deletion in the DENV3 sequences (coloring of triplets here is by amino acid codon).

When I aligned just the subset of DENV2 sequences, they don't have any frameshift at that position. There are some still some lower quality example sequences like OR389325_DENV2 (3168 Ns) and MZ312931_DENV2 (333 Ns), but it does seem like the frameshift you get from Nextclade alignment to this PR's dataset is a reference-specific issue.

j23414 · 2024-06-13T00:20:44Z

Dropped the frameShift QC as suggested above, thanks all! I ran a quick check to make sure the dataset is classifying as expected:

I believe this is a reasonable dataset for the below:

main goal is to coarsely classify the input sequences as DENV1, 2, 3, 4,

And would provide the basis for future improvements if any. This is ready for review.

huddlej

This looks great, @j23414!

rneher · 2024-06-13T17:49:08Z

The DENV-2 sequences have a compensated frameshift. The 5 base deletion is followed by a two base insertion. It could be that with an even stronger gap penalty, these gaps would disappear.

One could 'rescue' the frameshift QC by adding this as a known frameshift.

j23414 · 2024-06-14T16:41:52Z

Hello all, thanks for all the feedback and fixes! I may need more explicit guidance regarding this PR.

@rneher, could you please clarify if you are requesting the addition of known frameshifts somewhere in the pathogen.json file within this pull request, or if this is intended as potential future work to be addressed after the current pull request?

If the addition of known frameshifts is required, I have conducted a code search for "frameshift". Should I follow the same template as the one provided in the following link?

https://github.com/search?q=repo%3Anextstrain%2Fnextclade_data%20frameshift&type=code

Alternatively, is there any documentation available that outlines the proper procedure for identifying and adding known frameshifts?

Guidance greatly appreciated on this matter to ensure I proceed correctly.

ivan-aksamentov · 2024-06-14T17:14:19Z

@j23414

Alternatively, is there any documentation available that outlines the proper procedure for identifying and adding known frameshifts?

Sorry, no clear docs, other than this small and outdated example, but, to clarify, you can add a frame shift to the list of ignored in qc.frameShifts.ignoredFrameShifts in this format:

nextclade_data/data/nextstrain/sars-cov-2/wuhan-hu-1/orfs/pathogen.json

Lines 57 to 63 in 03703e0

    
           { 
        
             "codonRange": { 
        
               "begin": 141, 
        
               "end": 143 
        
             }, 
        
             "cdsName": "S" 
        
           },

and then it will be disregarded by the QC score. Note that the ranges here are 0-based half-open (right boundary is not included into the range), like in most programming languages, but for example in Web UI and TSV you will see 1-based closed ranges - the general convention in bioinformatics.

Not sure about identification though, and whether this is required or can be added later.

P.S. In this repo it is the most useful to search in the data/ directory where the sources are (the data_output/ is auto-generated), and SC2 datasets are one of the most fully-featured: repo:nextstrain/nextclade_data path:data/**/sars*/** frameshift

Flagged by #208 (comment) With help from #208 (comment)

j23414 · 2024-06-14T23:02:15Z

I added one frameshift to test the instructions (2e21708). Thanks @ivan-aksamentov!

But dropped it as I quickly realized that identifying known frameshifts would be a complex task... without documentation on the process.

j23414 · 2024-06-14T23:13:21Z

Is there anything blocking this from being merged?

Also, who has merge-power in this PR? Based on commit history, mostly people on the Nextclade side?

Fix typo in the title of the README file. Match the dataset name to pathogen.json.

smof grep -v "?" sequences.fasta > temp.fasta mv temp.fasta sequences.fasta

ivan-aksamentov · 2024-06-15T05:25:10Z

@j23414

Is there anything blocking this from being merged?

I don't have any technical reasons to hold it. If scientists are happy, then we are ready to go.

Also, who has merge-power in this PR?

I think it is configured equally for core, so basically anyone. I answered on Slack with some more details about different branches and deploymens: https://bedfordlab.slack.com/archives/C01LCTT7JNN/p1718427802913589?thread_ts=1716943598.732809&cid=C01LCTT7JNN

Based on commit history, mostly people on the Nextclade side?

That's because it's rarely that someone else contributes a dataset 😆

You are one of the early explorers of this process, so it might feel a little frustrating at times, sorry.

j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 06:18 — with GitHub Actions Inactive

ivan-aksamentov reviewed Jun 7, 2024

View reviewed changes

data/nextstrain/dengue/all/README.md Outdated Show resolved Hide resolved

ivan-aksamentov reviewed Jun 7, 2024

View reviewed changes

data/nextstrain/dengue/all/README.md Outdated Show resolved Hide resolved

ivan-aksamentov reviewed Jun 7, 2024

View reviewed changes

data/nextstrain/dengue/all/pathogen.json Show resolved Hide resolved

j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 16:09 — with GitHub Actions Inactive

j23414 changed the title ~~Add serotype-level (all) dengue dataset~~ Add Dengue virus All serotypes dataset Jun 7, 2024

This was referenced Jun 7, 2024

hot fix: fill in READMEs for the Nextclade dataset nextstrain/dengue#64

Merged

fixup: Standardize nextclade dataset names nextstrain/dengue#65

Merged

j23414 temporarily deployed to refs/pull/208/merge June 7, 2024 21:19 — with GitHub Actions Inactive

j23414 mentioned this pull request Jun 7, 2024

Add Dengue virus DENVx genotypes dataset #203

Open

j23414 temporarily deployed to refs/pull/208/merge June 12, 2024 22:57 — with GitHub Actions Inactive

huddlej approved these changes Jun 13, 2024

View reviewed changes

j23414 added a commit that referenced this pull request Jun 14, 2024

Add known frameshift NS5:632-640

2e21708

Flagged by #208 (comment) With help from #208 (comment)

j23414 deployed to refs/pull/208/merge June 14, 2024 17:56 — with GitHub Actions Active

j23414 and others added 4 commits June 14, 2024 17:06

Add serotype level dengue dataset

56db033

chore: rebuild [skip ci]

8014ace

fixup: README

eee0c29

Fix typo in the title of the README file. Match the dataset name to pathogen.json.

chore: rebuild [skip ci]

ff12f9d

j23414 and others added 4 commits June 14, 2024 17:06

Drop unalignable example sequences

952da70

smof grep -v "?" sequences.fasta > temp.fasta mv temp.fasta sequences.fasta

chore: rebuild [skip ci]

7ebce4d

fixup: turn off frameShift QC

51c44f9

chore: rebuild [skip ci]

714be98

j23414 force-pushed the add-serotype-level-dengue-dataset branch from 911a5cd to 714be98 Compare June 15, 2024 00:07

ivan-aksamentov approved these changes Jun 15, 2024

View reviewed changes

rneher merged commit cb3f8de into master Jun 16, 2024

rneher deleted the add-serotype-level-dengue-dataset branch June 16, 2024 16:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Dengue virus All serotypes dataset #208

Add Dengue virus All serotypes dataset #208

j23414 commented Jun 7, 2024 •

edited by ivan-aksamentov

Loading

j23414 commented Jun 7, 2024 •

edited

Loading

ivan-aksamentov commented Jun 10, 2024 •

edited

Loading

j23414 commented Jun 11, 2024 •

edited

Loading

j23414 commented Jun 11, 2024

huddlej commented Jun 12, 2024

huddlej commented Jun 12, 2024

j23414 commented Jun 13, 2024

huddlej left a comment

rneher commented Jun 13, 2024

j23414 commented Jun 14, 2024

ivan-aksamentov commented Jun 14, 2024 •

edited

Loading

j23414 commented Jun 14, 2024 •

edited

Loading

j23414 commented Jun 14, 2024

ivan-aksamentov commented Jun 15, 2024

Add Dengue virus All serotypes dataset #208

Add Dengue virus All serotypes dataset #208

Conversation

j23414 commented Jun 7, 2024 • edited by ivan-aksamentov Loading

j23414 commented Jun 7, 2024 • edited Loading

ivan-aksamentov commented Jun 10, 2024 • edited Loading

j23414 commented Jun 11, 2024 • edited Loading

j23414 commented Jun 11, 2024

huddlej commented Jun 12, 2024

huddlej commented Jun 12, 2024

j23414 commented Jun 13, 2024

huddlej left a comment

Choose a reason for hiding this comment

rneher commented Jun 13, 2024

j23414 commented Jun 14, 2024

ivan-aksamentov commented Jun 14, 2024 • edited Loading

j23414 commented Jun 14, 2024 • edited Loading

j23414 commented Jun 14, 2024

ivan-aksamentov commented Jun 15, 2024

j23414 commented Jun 7, 2024 •

edited by ivan-aksamentov

Loading

j23414 commented Jun 7, 2024 •

edited

Loading

ivan-aksamentov commented Jun 10, 2024 •

edited

Loading

j23414 commented Jun 11, 2024 •

edited

Loading

ivan-aksamentov commented Jun 14, 2024 •

edited

Loading

j23414 commented Jun 14, 2024 •

edited

Loading