Add simple influenza A H5N1 dataset with all segments #217

Open: anna-parker wants to merge 12 commits into base: master

Conversation

Contributor

@anna-parker anna-parker commented Jul 17, 2024

Copy of the nextclade dataset created by @chaoran-chen in https://github.com/GenSpectrum/nextclade-datasets/tree/main/data/flu/h5n1.

This dataset differs from https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha in that it includes all segments but does not do clade assignment.

Alignment parameter tuning

The default minSeedCover is 33%. With this setting, over 10% (6349) of the H5N1 sequences from NCBI do not align.

I additionally set minSeedCover to 0.1 (10%), the value used for other flu datasets.

This reduces the total number of sequences that do not align to 2731. The majority of these are in the NS segment, which also accounted for the majority of unaligned sequences at 33%.

I then additionally reduced the NS minSeedCover to 5%. This leaves only 780 sequences that do not align. These are still primarily NS, but the errors are no longer linked to minSeedCover, e.g.:

When processing sequence #23 'OP950309.1': When calculating seed matches: Unable to align: seed alignment was unable to find any matches that are long enough. Only matches of at least 40 nucleotides long are considered (configurable using 'min match length' CLI flag or dataset property). This is likely due to low quality of the provided sequence, or due to using incorrect reference sequence.
22	OP950305.1

I additionally reduced the k-mer length to 7 and increased the number of allowed mismatches from 8 to 12. This leaves only 354 sequences that do not align, of which at least 280 are still from segment 8.
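
The tuning steps above can be summarised as a per-segment parameter table. The sketch below is an illustration, not the dataset's actual configuration. It assumes the camelCase key names of the `alignmentParams` block in a Nextclade v3 `pathogen.json` (`minSeedCover`, `kmerLength`, `allowedMismatches`), takes the 10% reading (0.1) of the second step, and assumes the k-mer and mismatch changes apply to every segment, which the description does not state. The helper name `alignment_params` is hypothetical.

```python
import json

# Segment labels as used in this PR; NS is segment 8.
SEGMENTS = [f"seg{i}" for i in range(1, 9)]
NS_SEGMENT = "seg8"


def alignment_params(segment: str) -> dict:
    """Return the tuned alignment parameters for one segment (assumed key names)."""
    params = {
        "minSeedCover": 0.1,      # lowered from the 0.33 default
        "kmerLength": 7,          # reduced k-mer length
        "allowedMismatches": 12,  # raised from 8
    }
    if segment == NS_SEGMENT:
        params["minSeedCover"] = 0.05  # NS needed a lower threshold
    return params


if __name__ == "__main__":
    for segment in SEGMENTS:
        print(segment, json.dumps({"alignmentParams": alignment_params(segment)}))
```

Keeping the NS override in one place makes it easy to see which segment deviates from the shared settings.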

Examples

For each segment I include 20 randomly sampled sequences from NCBI that aligned to the corresponding reference.

@ivan-aksamentov
Member

Hi! Oh cool, new datasets! :)

Sorry, I am not in the loop on these new developments. I'll let Richard and Cornelius review.

But just want to bring up a few technical/bureaucratic issues:

Is it different from https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha?

Is anyone from Nextstrain involved in development to place it into the "nextstrain" collection and not into "community"?

@anna-parker
Contributor Author

Hi Ivan! The main difference is this contains references for all 8 segments.

About people from nextstrain being involved... I guess not really - should I move this into community under genspectrum?

Member

@corneliusroemer corneliusroemer left a comment

You could potentially add example sequences to make it possible to test run the datasets, just a handful, like 5 genomes, are enough.

The segments are usually called PB2, PB1, PA, NP, HA, NA, M, and NS, not seg1, seg2.

Possibly better to use the common names.

@anna-parker
Contributor Author

Thanks for the quick review - I will update with the requested changes tomorrow!

@chaoran-chen
Contributor

chaoran-chen commented Jul 17, 2024

As far as I understood, PB2, PB1, etc. are the names of genes but not really the names of the segments, and some segments have multiple genes. Also, I chose them because, for GenSpectrum/LAPIS, it is better to avoid using the same names for nucleotide and amino acid sequences. Having the same names would make filtering for mutations more difficult because it would be unclear whether HA:123G refers to a nucleotide or amino acid mutation.

This being said, this is independent of the Nextclade datasets, so we can, of course, rename them here (and just name them differently when importing into LAPIS)
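
The ambiguity described above can be made concrete with a small sketch. It is an illustration only and not LAPIS code: the query syntax `NAME:123G`, the function `classify`, and the flag `segments_use_gene_names` are assumptions made for this example. The point is that `G` is both a valid nucleotide and a valid amino acid (glycine), so the name before the colon is the only thing that can distinguish the two.

```python
import re

# Names that serve both as gene names and as common segment names.
SHARED_NAMES = {"PB2", "PB1", "PA", "HA", "NP", "NA", "M", "NS"}

MUTATION_RE = re.compile(r"^(?P<name>[A-Za-z0-9]+):(?P<pos>\d+)(?P<to>[A-Za-z*-])$")


def classify(query: str, segments_use_gene_names: bool) -> str:
    """Classify a mutation query as 'nucleotide', 'amino acid' or 'ambiguous'."""
    match = MUTATION_RE.match(query)
    if match is None:
        raise ValueError(f"not a mutation query: {query!r}")
    name = match["name"]
    if re.fullmatch(r"seg[1-8]", name):
        # Numbered segment labels can never be confused with a gene name.
        return "nucleotide"
    if name in SHARED_NAMES:
        # If segments share the gene names, the query could mean either.
        return "ambiguous" if segments_use_gene_names else "amino acid"
    raise ValueError(f"unknown sequence name: {name!r}")


if __name__ == "__main__":
    print(classify("HA:123G", segments_use_gene_names=True))
    print(classify("HA:123G", segments_use_gene_names=False))
    print(classify("seg4:123G", segments_use_gene_names=True))
```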

@anna-parker
Contributor Author

@corneliusroemer would it be ok to keep the segments as seg1 as we have now moved to a community folder?

@corneliusroemer
Member

Community still shows in Nextclade by default so we should make sure Readme etc are meaningful. I'll review properly.

You can use the dataset with Nextclade even without it being merged - just need to point it at the right repo/branch/path.

I still think paths should be as obvious as possible. Using segment numbers is inconsistent with usage in the flu community and with other Nextclade datasets for flu. Why does the path matter so much? For GenSpectrum, if you want to avoid a clash of CDS names with segment names, you could just prefix segment names in queries with seg or nuc. You already do that, just with the numbers 1-8 rather than the more commonly used gene-based names.

Also, Nextclade paths are just paths, you could decide to call the segments whatever you want and just map from the path to the segment name, if you want those to be different.
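
A mapping from path to segment name, as suggested above, could look like the following sketch. It relies on the standard influenza A segment numbering (segment 1 = PB2 through segment 8 = NS) and assumes that the last path component is the gene-based segment name. The function name and the example path are hypothetical and do not refer to an existing dataset.

```python
# Standard influenza A segment numbering, keyed by lower-case segment name.
SEGMENT_NUMBER = {
    "pb2": 1, "pb1": 2, "pa": 3, "ha": 4,
    "np": 5, "na": 6, "m": 7, "ns": 8,
}


def segment_label_from_path(dataset_path: str) -> str:
    """Map a dataset path ending in a gene-based segment name to a 'segN' label."""
    name = dataset_path.rstrip("/").split("/")[-1].lower()
    if name not in SEGMENT_NUMBER:
        raise ValueError(f"path does not end in a known segment name: {dataset_path!r}")
    return f"seg{SEGMENT_NUMBER[name]}"


if __name__ == "__main__":
    # Hypothetical path, used only to exercise the function.
    print(segment_label_from_path("example/h5n1/ha"))
```

With such a lookup, the dataset paths can follow the gene-based convention while an importer keeps using numbered labels internally.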

@chaoran-chen
Contributor

But are HA, NA really the correct and firmly-established names in the community for the segments? NCBI virus shows the numbers in the segment column:

[Screenshot: NCBI Virus segment column showing segment numbers]

If we look at the sequence names, it's often a mix. For the NCBI RefSeq of H5N1, this sequence only contains "HA", this sequence only contains "segment 7" (not "M"), and this sequence contains both "segment 1" and "PB 2".

As said, for Nextclade, I am happy (and agree that it makes sense) to follow the Nextstrain conventions. For GenSpectrum, the evidence that I found so far indicates that segments 1-8 are accepted (and actually correct) names for the nucleotide sequences and that we should use them. (But I can be convinced otherwise if an influenza expert (e.g. @rneher) believes that this doesn't make sense.)

@rneher
Member

rneher commented Jul 21, 2024

I do think that PB2, PB1, PA etc are more common segment names than the numbers and we currently use the names rather than numbers across nextclade and nextstrain. So clearly both nomenclatures exist.

The Moncla lab datasets for all of H5Nx use the same HA reference (Goose/Guangdong). For others they use more recent sequences.

One thing that might make things a little more difficult for you is that there are also quite a few strains that have the H5 HA but that reassorted and use sequences very dissimilar from the Goose/Guangdong sequence in other segments.

@rneher
Member

rneher commented Jul 21, 2024

That said, using the GG/1996 sequence is probably still useful. But thinking about the name space might be important.

iav for Influenza A virus could be useful (sort of different for humans, since we have a few defined lineages of A and B circulating, but in animals it is a diverse mix of A). If you want to restrict yourself to viruses from the GG/1996 lineages, then maybe a name iav/h5n1/GG1996/pb1 etc could work.

@anna-parker
Contributor Author

anna-parker commented Jul 22, 2024

Thanks so much for the comments!

@anna-parker anna-parker changed the title from "Add influenza A H5N1 dataset" to "Add simple influenza A H5N1 dataset with all segments" on Nov 4, 2024

@rneher
Member

rneher commented Nov 4, 2024

one additional comment: all other flu data sets on nextclade use lower case segments:

community/moncla-lab/iav-h5/ha/2.3.4.4

nextstrain/flu/h3n2/ha/CY163680

5 participants