Add simple influenza A H5N1 dataset with all segments #217

anna-parker · 2024-07-17T15:26:52Z

Copy of the nextclade dataset created by @chaoran-chen in https://github.com/GenSpectrum/nextclade-datasets/tree/main/data/flu/h5n1.

This dataset differs from https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha in that it includes all segments but does not do clade assignment.

Alignment parameter tuning

The default minSeedCover is 0.33 (33%). With this value, over 10% (6349) of the H5N1 sequences from NCBI do not align.

I therefore set minSeedCover to 0.1 (10%), the value used for other flu datasets:

nextclade_data/data/nextstrain/flu/h3n2/pb1/pathogen.json

Line 8 in c2d90b0

"minSeedCover": 0.1

This reduces the total number of sequences that do not align to 2731. The majority of these are in the NS segment (NS sequences were also the majority of those that did not align at 33%).

I then additionally reduce the NS minSeedCover to 0.05 (5%). This leaves only 780 sequences that do not align. These are still primarily NS, but the errors are no longer linked to minSeedCover, e.g.:

When processing sequence #23 'OP950309.1': When calculating seed matches: Unable to align: seed alignment was unable to find any matches that are long enough. Only matches of at least 40 nucleotides long are considered (configurable using 'min match length' CLI flag or dataset property). This is likely due to low quality of the provided sequence, or due to using incorrect reference sequence.
22	OP950305.1
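To see which failure modes remain after each parameter change, the error messages can be tallied by cause. The following is a minimal sketch, assuming the errors are available as plain text lines with the wording quoted above; the function names and bucket labels are hypothetical, and the message format may differ between Nextclade versions.

```python
import re
from collections import Counter

# Matches the error prefix quoted above. The wording is copied from that
# message and is an assumption about other Nextclade versions.
ERROR_RE = re.compile(
    r"When processing sequence #(?P<index>\d+) '(?P<name>[^']+)': (?P<message>.*)"
)


def classify(message: str) -> str:
    """Bucket an error message by its likely cause (hypothetical labels)."""
    if "long enough" in message:
        return "min_match_length"
    if "seed" in message.lower():
        return "seed_other"
    return "other"


def tally_errors(lines):
    """Return (counts per bucket, list of failed sequence names)."""
    counts = Counter()
    failed = []
    for line in lines:
        match = ERROR_RE.search(line)
        if match is None:
            continue  # not an alignment error line
        counts[classify(match["message"])] += 1
        failed.append(match["name"])
    return counts, failed
```

Comparing the tallies before and after a change shows whether the remaining failures still come from the parameter that was just adjusted.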

I additionally reduce the k-mer length to 7 and increase the number of allowed mismatches from 8 to 12. This leaves only 354 sequences that do not align, of which at least 280 are still from segment 8 (NS).
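The tuning steps above can be summarised as a per-segment `alignmentParams` fragment. This is a sketch under stated assumptions: only the `minSeedCover` key appears in this PR description, the `kmerLength` and `allowedMismatches` key names are assumed, and it is also an assumption that the k-mer and mismatch settings apply to every segment rather than to NS only.

```python
import json

# Values taken from the description above. Key names other than
# "minSeedCover" are assumptions, not confirmed by this PR.
BASE_PARAMS = {
    "minSeedCover": 0.1,      # 10%, as in the other flu datasets
    "kmerLength": 7,          # reduced k-mer length
    "allowedMismatches": 12,  # raised from 8
}

# NS gets a lower seed cover because most failures were in that segment.
NS_OVERRIDES = {"minSeedCover": 0.05}


def alignment_params(segment: str) -> dict:
    """Return the alignment parameters for one segment (hypothetical helper)."""
    params = dict(BASE_PARAMS)
    if segment.lower() == "ns":
        params.update(NS_OVERRIDES)
    return params


# Print the fragment that would go into the NS dataset's pathogen.json.
print(json.dumps({"alignmentParams": alignment_params("ns")}, indent=2))
```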

Examples

For each segment I include 20 randomly sampled sequences from NCBI that aligned to the corresponding reference.
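A sampling step like this could look roughly as follows. It is a minimal sketch rather than the script actually used for this PR: the function names are hypothetical, and it assumes the aligned sequences for one segment are available as FASTA text.

```python
import random


def parse_fasta(text: str):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def sample_records(records, n=20, seed=0):
    """Pick n records without replacement; a fixed seed keeps it reproducible."""
    if len(records) <= n:
        return list(records)
    return random.Random(seed).sample(records, n)
```

Fixing the seed means the example files can be regenerated identically if the dataset is rebuilt.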

ivan-aksamentov · 2024-07-17T15:55:36Z

Hi! Oh cool, new datasets! :)

Sorry, I am not in the loop on these new developments. I'll let Richard and Cornelius review.

But just want to bring up a few technical/bureaucratic issues:

Is it different from
https://github.com/nextstrain/nextclade_data/tree/master/data/community/moncla-lab/iav-h5/ha
?

Is anyone from Nextstrain involved in the development, so that it belongs in the "nextstrain" collection rather than in "community"?

anna-parker · 2024-07-17T18:37:56Z

Hi Ivan! The main difference is this contains references for all 8 segments.

About people from nextstrain being involved... I guess not really - should I move this into community under genspectrum?

.gitignore

corneliusroemer

You could potentially add example sequences to make it possible to test run the datasets. Just a handful, like 5 genomes, is enough.

The segments are usually called PB2, PB1, PA, NP, HA, NA, M, and NS, not seg1, seg2.

Possibly better to use the common names.

anna-parker · 2024-07-17T19:59:28Z

Thanks for the quick review - I will update with the requested changes tomorrow!

chaoran-chen · 2024-07-17T20:21:08Z

As far as I understood, PB2, PB1, etc. are the names of genes but not really the names of the segments, and some segments have multiple genes. Also, I chose them because, for GenSpectrum/LAPIS, it is better to avoid using the same names for nucleotide and amino acid sequences. Having the same names would make filtering for mutations more difficult because it would be unclear whether HA:123G refers to a nucleotide or amino acid mutation.

This being said, this is independent of the Nextclade datasets, so we can, of course, rename them here (and just name them differently when importing into LAPIS)

anna-parker · 2024-07-18T15:52:40Z

@corneliusroemer would it be ok to keep the segments as seg1 as we have now moved to a community folder?

corneliusroemer · 2024-07-18T16:32:44Z

Community still shows in Nextclade by default so we should make sure Readme etc are meaningful. I'll review properly.

You can use the dataset with Nextclade even without it being merged - just need to point it at the right repo/branch/path.

I still think paths should be as obvious as possible, and using segment numbers is inconsistent with usage in the flu community and with other Nextclade datasets for flu. Why does the path matter so much? For GenSpectrum, if you want to avoid a clash of CDS names with segment names, you could just prefix segment names in queries with seg or nuc. You already do that, just with numbers 1-8 rather than the more commonly used gene-based names.

Also, Nextclade paths are just paths, you could decide to call the segments whatever you want and just map from the path to the segment name, if you want those to be different.
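A mapping of the kind suggested here could be sketched as below. The numbering follows the standard influenza A segment order (consistent with segment 1 = PB2, segment 7 = M and segment 8 = NS as mentioned in this thread). The `seg` prefix convention, the function names, and the query format are illustrative assumptions, not the actual LAPIS or Nextclade behaviour.

```python
import re

# Standard influenza A segment numbering mapped to gene-based names.
SEGMENT_NAMES = {1: "PB2", 2: "PB1", 3: "PA", 4: "HA", 5: "NP", 6: "NA", 7: "M", 8: "NS"}

# Hypothetical convention: nucleotide sequences are addressed as seg1..seg8.
NUC_PATTERN = re.compile(r"seg([1-8])")


def path_component(segment_number: int) -> str:
    """Lower-case gene-based name for use in a dataset path."""
    return SEGMENT_NAMES[segment_number].lower()


def parse_mutation(query: str) -> dict:
    """Decide whether 'NAME:MUTATION' refers to a nucleotide or amino acid change."""
    name, sep, mutation = query.partition(":")
    if not sep:
        raise ValueError(f"expected NAME:MUTATION, got {query!r}")
    match = NUC_PATTERN.fullmatch(name)
    if match:
        number = int(match.group(1))
        return {"kind": "nucleotide", "segment": SEGMENT_NAMES[number], "mutation": mutation}
    return {"kind": "amino_acid", "gene": name, "mutation": mutation}
```

With this split, `seg4:123G` is unambiguously a nucleotide mutation in the HA segment and `HA:123G` an amino acid mutation, while the dataset path can still use `ha`.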

chaoran-chen · 2024-07-18T17:14:32Z

But are HA, NA really the correct and firmly-established names in the community for the segments? NCBI virus shows the numbers in the segment column:

If we look at the sequence names, it's often a mix. For the NCBI RefSeq of H5N1, this sequence only contains "HA", this sequence only contains "segment 7" (not "M"), and this sequence contains both "segment 1" and "PB 2".

As said, for Nextclade, I am happy (and agree that it makes sense) to follow the Nextstrain conventions. For GenSpectrum, the evidence that I found so far indicates that segments 1-8 are accepted (and actually correct) names for the nucleotide sequences and that we should use them. (But I can be convinced otherwise if an influenza expert (e.g. @rneher) believes that this doesn't make sense.)

rneher · 2024-07-21T16:59:27Z

I do think that PB2, PB1, PA etc are more common segment names than the numbers and we currently use the names rather than numbers across nextclade and nextstrain. So clearly both nomenclatures exist.

The Moncla lab datasets for all of H5Nx use the same HA reference (Goose/Guangdong). For others they use more recent sequences.

One thing that might make things a little more difficult for you is that there are also quite a few strains that have the H5 HA but that reassorted and use sequences very dissimilar from the Goose/Guangdong sequence in other segments.

rneher · 2024-07-21T17:03:29Z

That said, using the GG/1996 sequence is probably still useful. But thinking about the name space might be important.

iav for Influenza A virus could be useful (sort of different for humans, since we have a few defined lineages of A and B circulating, but in animals it is a diverse mix of A). If you want to restrict yourself to viruses from the GG/1996 lineages, then maybe a name iav/h5n1/GG1996/pb1 etc could work.

anna-parker · 2024-07-22T09:59:36Z

Thanks so much for the comments!

rneher · 2024-11-04T20:21:25Z

one additional comment: all other flu data sets on nextclade use lower case segments:

community/moncla-lab/iav-h5/ha/2.3.4.4

nextstrain/flu/h3n2/ha/CY163680

anna-parker mentioned this pull request Jul 18, 2024

Do not merge: preview instance loculus-project/loculus#2307

Closed

anna-parker force-pushed the h5n1 branch from 1a681cd to f042646 Compare November 4, 2024 08:52

anna-parker and others added 6 commits November 4, 2024 09:52

Add h5n1 dataset

f5dc0e0

Add a changelog.

5451194

Remove gene sequences as not needed for nextclade datasets.

601a433

Update .gitignore

f1ce01b

Co-authored-by: Cornelius Roemer <[email protected]>

Move h5n1 dataset into community

6c0ee76

To revert: Run ./scripts/rebuild

2dc2f10

anna-parker force-pushed the h5n1 branch from f042646 to 2dc2f10 Compare November 4, 2024 08:53

anna-parker added 5 commits November 4, 2024 09:58

Revert "To revert: Run ./scripts/rebuild"

684000e

This reverts commit 2dc2f10.

Rename folders, update alignmentParams

c3aaeda

Rename folders to be more precise

3f97a7a

Update README.md with more info

fc0be5c

Add examples: 20 random samples for each segment

5b22bba

anna-parker requested a review from corneliusroemer November 4, 2024 12:17

anna-parker changed the title ~~Add influenza A H5N1 dataset~~ Add simple influenza A H5N1 dataset with all segments Nov 4, 2024

Update dataset name in readme

46fdd30
