Add simple influenza A H5N1 dataset with all segments #217
Hi! Oh cool, new datasets! :) Sorry, I am not in the loop on these new developments. I'll let Richard and Cornelius review. But I just want to bring up a few technical/bureaucratic issues: Is it different from Is anyone from Nextstrain involved in development, so that it belongs in the "nextstrain" collection and not in "community"?
Hi Ivan! The main difference is that this contains references for all 8 segments. About people from Nextstrain being involved... I guess not really. Should I move this into community under genspectrum?
You could potentially add example sequences to make it possible to test run the datasets, just a handful, like 5 genomes, are enough.
The segments are usually called PB2, PB1, PA, NP, HA, NA, M, and NS, not seg1, seg2. Possibly better to use the common names.
Thanks for the quick review - I will update with the requested changes tomorrow!
As far as I understood, PB2, PB1, etc. are the names of genes but not really the names of the segments, and some segments have multiple genes. Also, I chose them because, for GenSpectrum/LAPIS, it is better to avoid using the same names for nucleotide and amino acid sequences. Having the same names would make filtering for mutations more difficult because it would be unclear whether HA:123G refers to a nucleotide or amino acid mutation. This being said, this is independent of the Nextclade datasets, so we can, of course, rename them here (and just name them differently when importing into LAPIS).
@corneliusroemer would it be ok to keep the segments as
Community still shows in Nextclade by default, so we should make sure the Readme etc. are meaningful. I'll review properly. You can use the dataset with Nextclade even without it being merged; you just need to point it at the right repo/branch/path. I still think paths should be as obvious as possible, and using segment is inconsistent with usage in the flu community and also with other Nextclade datasets for flu. Why does the path matter so much? For GenSpectrum, if you want to avoid a clash of CDS names with segment names, you could just prefix segment names in queries with Also, Nextclade paths are just paths: you could decide to call the segments whatever you want and just map from the path to the segment name, if you want those to be different.
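To illustrate the "just map from the path to the segment name" idea, here is a minimal sketch. It assumes lowercase path components derived from the common names listed earlier in the thread and the numbered `segN` labels discussed for GenSpectrum; the function name and the example paths are hypothetical and not part of Nextclade or LAPIS.

```python
# Illustrative mapping only: the lowercase path names and the "segN" labels
# are assumptions for this sketch, not names defined by Nextclade or LAPIS.
# The numbering follows the conventional influenza A segment order.
PATH_TO_SEGMENT = {
    "pb2": "seg1",
    "pb1": "seg2",
    "pa": "seg3",
    "ha": "seg4",
    "np": "seg5",
    "na": "seg6",
    "m": "seg7",
    "ns": "seg8",
}


def segment_label(dataset_path: str) -> str:
    """Return the numbered segment label for a dataset path such as 'flu/h5n1/ha'."""
    # Take the last path component and normalise its case.
    name = dataset_path.rstrip("/").rsplit("/", 1)[-1].lower()
    try:
        return PATH_TO_SEGMENT[name]
    except KeyError:
        raise ValueError(f"unknown segment path component: {name!r}") from None
```

With a lookup like this, the dataset paths can follow the flu-community names while a downstream consumer keeps its own nucleotide-sequence names.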
But are HA, NA really the correct and firmly established names in the community for the segments? NCBI Virus shows the numbers in the segment column: If we look at the sequence names, it's often a mix. For the NCBI RefSeq of H5N1, this sequence only contains "HA", this sequence only contains "segment 7" (not "M"), and this sequence contains both "segment 1" and "PB 2". As said, for Nextclade, I am happy (and agree that it makes sense) to follow the Nextstrain conventions. For GenSpectrum, the evidence that I found so far indicates that segments 1-8 are accepted (and actually correct) names for the nucleotide sequences and that we should use them. (But I can be convinced otherwise if an influenza expert (e.g. @rneher) believes that this doesn't make sense.)
I do think that The Moncla lab datasets for all of H5Nx use the same reference for HA (Goose/Guangdong). For the other segments they use more recent sequences. One thing that might make things a little more difficult for you is that there are also quite a few strains that have the H5 HA but have reassorted and use sequences very dissimilar from the Goose/Guangdong sequence in other segments.
That said, using the GG/1996 sequence is probably still useful. But thinking about the namespace might be important.
Thanks so much for the comments!
One additional comment: all other flu datasets on Nextclade use lowercase segments:
Copy of the Nextclade dataset created by @chaoran-chen in https://github.com/GenSpectrum/nextclade-datasets/tree/main/data/flu/h5n1.
This dataset differs from https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha in that it includes all segments but does not do clade assignment.
Alignment parameter tuning
The default minSeedCover is 33%; with this setting, over 10% (6349) of the H5N1 sequences from NCBI do not align.
I set minSeedCover to 0.1 (10%), the value used for other flu datasets (see nextclade_data/data/nextstrain/flu/h3n2/pb1/pathogen.json, line 8 at commit c2d90b0).
This reduces the total number of sequences that do not align to 2731. The majority of these are in the NS segment (NS sequences were also the majority of those that did not align at 33%).
I then reduce the NS minSeedCover to 5%, which leaves only 780 sequences that do not align. These are still primarily NS, but the errors are no longer linked to minSeedCover, e.g.:
I additionally reduce the k-mer length to 7 and increase the number of allowed mismatches from 8 to 12. This leaves only 354 sequences that do not align, of which at least 280 are still from segment 8.
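As a rough sketch of how the tuning above could be expressed per segment, the snippet below builds an `alignmentParams` fragment for each segment's pathogen.json. Several things here are assumptions rather than facts from this PR: the key names `kmerLength` and `allowedMismatches`, the lowercase segment names, the reading of the 10% setting as 0.1, and the choice to apply the k-mer and mismatch changes to every segment (the description does not say whether they apply to all segments or only NS).

```python
import json

# Hypothetical lowercase segment names; the real dataset paths may differ.
SEGMENTS = ["pb2", "pb1", "pa", "ha", "np", "na", "m", "ns"]


def alignment_params(segment: str) -> dict:
    """Build an alignmentParams fragment for one segment.

    Values follow the tuning described above; key names other than
    minSeedCover are assumed, not confirmed by this PR.
    """
    params = {
        "minSeedCover": 0.1,      # 10%, down from the 33% default
        "kmerLength": 7,          # assumed key name
        "allowedMismatches": 12,  # assumed key name; raised from 8
    }
    if segment == "ns":
        # NS needed a lower seed cover before most sequences aligned.
        params["minSeedCover"] = 0.05
    return params


if __name__ == "__main__":
    for seg in SEGMENTS:
        print(seg, json.dumps({"alignmentParams": alignment_params(seg)}))
```

Keeping the shared values in one place makes the NS-specific override easy to spot when comparing segments.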
Examples
For each segment I include 20 randomly sampled sequences from NCBI that aligned to the corresponding reference.
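The sampling step could look roughly like the sketch below. It is not the script used for this PR: the function names, the fixed seed, and the assumption that the aligned sequences for a segment are available as FASTA text are all illustrative.

```python
import random


def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n=20, seed=0):
    """Return n records sampled without replacement (all of them if fewer than n)."""
    if len(records) <= n:
        return list(records)
    # A seeded generator keeps the example set reproducible.
    return random.Random(seed).sample(records, n)
```

Running `sample_records` once per segment on the sequences that aligned would give one small example file per segment.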