Ingest VIDRL human sera references #160

Merged
merged 7 commits into from
Aug 29, 2024

@joverlee521 joverlee521 commented Aug 26, 2024

Description of proposed changes

First pass for ingesting VIDRL human sera references, focused on 2024 data.
I'm hoping that ingesting older data should be as simple as updating the VACCINE_MAPPING, but we'll see...

Related issue(s)

Related to #158

Checklist

  • Checks pass

@joverlee521 joverlee521 linked an issue Aug 26, 2024 that may be closed by this pull request

```diff
@@ -24,6 +24,20 @@
 # This is based on the vaccine composition for the Southern Hemisphere
 # because all human pooled sera should be from Australia
 VACCINE_MAPPING = {
     "2023": {
```

Contributor Author

I ended up needing to add the 2023 vaccine mapping because some of the 2024 files include human sera references from 2023.

Contributor

Based on the date of the XLS file and other context, do you feel confident that the date in the spreadsheet(s) was intended to be "2023" and not "2024"? Do you have an example spreadsheet with 2023 human pool data that I could check out?

Contributor Author

Good point! I'll dig up some examples.

Contributor Author

All files dated 2024-01-04 through 2024-07-12 label the human pool as "2023", and all files from 2024-07-18 onward label it "2024". This consistent pattern makes me pretty confident the "2023" labels are intentional.
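
A minimal sketch of the check described above, assuming the workbook date and the human pool year label have already been extracted for each file. The function name and the tuple layout are hypothetical; only the three dates and labels come from the comment.

```python
from datetime import date


def label_switch_is_clean(files):
    """Return True if year labels never go backwards as workbook dates advance.

    `files` is a list of (workbook_date, human_pool_year) tuples.
    """
    ordered = sorted(files)  # sort by workbook date
    years = [year for _, year in ordered]
    return years == sorted(years)


# Boundary dates quoted in the comment above.
observed = [
    (date(2024, 1, 4), "2023"),
    (date(2024, 7, 12), "2023"),
    (date(2024, 7, 18), "2024"),
]
```

A stray "2023" label after the switch to "2024" would make the function return False, which is the kind of typo the review question was probing for.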

Contributor

Agreed, that sounds right. Thank you for checking!

@joverlee521
Contributor Author

Used this Snakefile to upload 2024 human sera data to test_tdb with changes up to f34c826

```shell
pyenv activate fauna
nextstrain build --cpus 2 --ambient --envdir ../env.d/seasonal-flu/ . --snakefile data/Snakefile --config year='2024' preview=False
```

This ran through 73 Excel workbooks without raising any errors 🎉
I will dig more into the uploaded data tomorrow to make sure nothing looks too out of place...

In the parsing of human pooled sera, the subtype will be required, so
just error early if the subtype is not provided.

Also adds an additional check that the subtype is one of the expected
values.
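
A rough sketch of the early subtype check this commit describes. The function name, the error messages, and the set of expected values are assumptions for illustration (the three subtypes are the ones that appear in the download command later in this PR); the actual code in `tdb/vidrl_upload.py` may differ.

```python
# Assumed set of expected subtypes, not copied from fauna.
EXPECTED_SUBTYPES = {"h1n1pdm", "h3n2", "vic"}


def require_subtype(subtype):
    """Fail fast when the subtype is missing or unexpected."""
    if not subtype:
        raise ValueError("Subtype is required to parse human pooled sera.")
    normalized = subtype.lower()
    if normalized not in EXPECTED_SUBTYPES:
        raise ValueError(
            f"Unexpected subtype {subtype!r}; expected one of {sorted(EXPECTED_SUBTYPES)}."
        )
    return normalized
```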

Defining VIDRL-specific human serum patterns to make it easier to
track which patterns are being used to match the human serum references.

Doing this in preparation for parsing human serum references for VIDRL.
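
One way to keep track of which pattern matched is to give each pattern a name. The sketch below is illustrative only: the pattern names and regular expressions are hypothetical and are not the patterns defined in this PR.

```python
import re

# Hypothetical named patterns; the real VIDRL patterns are not shown in this PR page.
HUMAN_SERUM_PATTERNS = {
    "pool_then_year": re.compile(r"human\s+pool\s+(?P<year>\d{4})", re.IGNORECASE),
    "year_then_pool": re.compile(r"(?P<year>\d{4})\s+human\s+pool", re.IGNORECASE),
}


def match_human_serum(label):
    """Return (pattern_name, year) for the first matching pattern, else None."""
    for name, pattern in HUMAN_SERUM_PATTERNS.items():
        match = pattern.search(label)
        if match:
            return name, match.group("year")
    return None
```

Returning the pattern name alongside the parsed year makes it straightforward to log which pattern was responsible for each match.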

Using a hard-coded VACCINE_MAPPING to keep track of Southern Hemisphere
vaccine strains, which is used to map human pooled sera references to
specific serum strains. This commit only includes 2024 vaccines from
<https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2024-southern-hemisphere-influenza-season>
Will update the vaccine mapping as I backfill more data.

I had originally tried to use a4c4336 +
5e72c59 to parse the "clade" row for
the extra egg/cell info, but that pattern matching fails because of
excess clade rows in the Excel sheet.¹ Instead, I've opted to just
force include an extra row of info for the human serum data in
`find_serum_rows`.

¹ <#160 (comment)>
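
The lookup from a human pool reference to a serum strain could be sketched as below. The nesting (year, then subtype, then egg/cell) is an assumption based on the description in this commit, and the strain names are placeholders rather than the real vaccine strains from the WHO recommendation.

```python
# Placeholder entries only; the real mapping lives in tdb/vidrl_upload.py.
VACCINE_MAPPING = {
    "2024": {
        "h3n2": {
            "egg": "A/Placeholder-Egg/1/2022",
            "cell": "A/Placeholder-Cell/2/2022",
        },
    },
}


def serum_strain_for(year, subtype, passage):
    """Look up the vaccine strain used as the serum strain for a human pool."""
    try:
        return VACCINE_MAPPING[year][subtype][passage]
    except KeyError as error:
        raise KeyError(
            f"No vaccine mapping for year={year!r}, subtype={subtype!r}, passage={passage!r}"
        ) from error
```

With this shape, backfilling an older season would mean adding one more top-level year entry.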

Because the `serum_host` field is unreliable in fauna, seasonal-flu
uses substring matches on the `serum_id` field to separate ferret,
human, and mouse sera.¹

Updating the `serum_id` to be `Human pool <year>` so that it can be
matched in seasonal-flu. This also has the side effect of setting the
`serum_host` field to "human" within fauna because of the `serum_id`
matching in tdb/upload.py.²

¹ <https://github.com/nextstrain/seasonal-flu/blob/89f6cfd11481b2c51c50d68822c18d46ed56db51/workflow/snakemake_rules/download_from_fauna.smk#L93>
² <https://github.com/nextstrain/fauna/blob/88a607db53d36fc91482cae2009eefddf9477f97/tdb/upload.py#L382-L383>
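
A simplified sketch of how a `Human pool <year>` `serum_id` lets downstream code recover the host by substring matching. Both function names are hypothetical, and treating every other ID as ferret is an assumption made for illustration, not the exact logic in fauna or seasonal-flu.

```python
def human_pool_serum_id(year):
    """Build the serum_id format described in this commit."""
    return f"Human pool {year}"


def serum_host_from_id(serum_id):
    """Simplified stand-in for the substring matching on serum_id."""
    lowered = serum_id.lower()
    if "human" in lowered:
        return "human"
    if "mouse" in lowered:
        return "mouse"
    return "ferret"  # assumed default for this sketch
```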

Allows us to only ingest the human sera references as we are backfilling
the data to avoid accidentally duplicating the ferret titer data.

This flag can be removed once we've ingested all of the human sera
references that have been previously skipped.
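
A sketch of how such a temporary flag might be wired up with `argparse`. The flag name `--human-ref-only`, the helper names, and the record layout are assumptions for illustration; this PR page does not show the flag's actual name.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser(description="Upload titer measurements (sketch).")
    # Hypothetical flag name for the backfill-only mode.
    parser.add_argument(
        "--human-ref-only",
        action="store_true",
        help="Only ingest human sera references, skipping ferret titer data.",
    )
    return parser


def select_measurements(measurements, human_ref_only):
    """Keep only human sera measurements when the flag is set."""
    if not human_ref_only:
        return list(measurements)
    return [m for m in measurements if m["serum_host"] == "human"]
```

Filtering at upload time avoids writing the ferret measurements a second time while the previously skipped human sera references are backfilled.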

Updating with 2023 vaccines from
<https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2023-southern-hemisphere-influenza-season>

I ended up needing to add 2023 vaccine mapping because some of the 2024
files included human sera references from 2023.
@joverlee521
Contributor Author

Of the 73 processed workbooks, 1 did not have any human sera references.
From the other 72 workbooks, the upload workflow added 5722 measurements to test_tdb/flu.

I tested download of the titer measurements using the seasonal-flu workflow with a small patch to use the `test_tdb` database.
```diff
diff --git a/workflow/snakemake_rules/download_from_fauna.smk b/workflow/snakemake_rules/download_from_fauna.smk
index 64e8350..04544df 100644
--- a/workflow/snakemake_rules/download_from_fauna.smk
+++ b/workflow/snakemake_rules/download_from_fauna.smk
@@ -68,7 +68,7 @@ rule download_titers:
     output:
         titers = "data/{lineage}/{center}_{passage}_{assay}_titers.tsv"
     params:
-        dbs = _get_tdb_databases,
+        dbs = 'test_tdb',
         assays = _get_tdb_assays,
         virus_passage_category=_get_virus_passage_category,
     conda: "../envs/nextstrain.yaml"
```

```shell
nextstrain build --envdir ../env.d/seasonal-flu/ . \
    data/h3n2/who_human_cell_hi_titers.tsv \
    data/h3n2/who_human_egg_hi_titers.tsv \
    data/h3n2/who_human_cell_fra_titers.tsv \
    data/h3n2/who_human_egg_fra_titers.tsv \
    data/h1n1pdm/who_human_cell_hi_titers.tsv \
    data/h1n1pdm/who_human_egg_hi_titers.tsv \
    data/vic/who_human_cell_hi_titers.tsv \
    data/vic/who_human_egg_hi_titers.tsv \
    --configfile profiles/upload.yaml
```

This downloaded 5694 measurements, all appropriately selected for the human host files. The remaining 28 measurements were excluded because the `virus_passage_category` was egg while the `serum_passage_category` was cell: the seasonal-flu workflow explicitly excludes egg-passaged test viruses from cell-passaged titer data.
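
The exclusion and the resulting counts can be illustrated with a small sketch. The predicate is a simplification of the rule described above, not the seasonal-flu implementation, and the function name is hypothetical; the counts are the ones reported in this comment.

```python
def is_kept(measurement):
    """Drop egg-passaged test viruses measured against cell-passaged sera."""
    return not (
        measurement["virus_passage_category"] == "egg"
        and measurement["serum_passage_category"] == "cell"
    )


# Counts reported in this comment: uploaded minus excluded equals downloaded.
uploaded, excluded = 5722, 28
downloaded = uploaded - excluded
```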

I manually spot-checked 3 workbooks per subtype to verify that all of the human sera reference measurements were included.
At this point, I'm pretty confident that this at least works for the 2024 files.

@joverlee521 joverlee521 marked this pull request as ready for review August 27, 2024 23:00
Contributor

@huddlej huddlej left a comment

This looks excellent, @joverlee521! Thank you for the detailed checks with test_tdb, too. My only review questions below are data-specific. The only blocking comment is a quick check of the "2023" data, just to be sure that your algorithm didn't catch a typo in the spreadsheet.

Based on feedback from @huddlej¹

We should _always_ have year info, so raising a loud error when it
cannot be parsed from the human sera references. I'm choosing _not_
to update the similar check for egg/cell distinction since I've already
seen examples of it missing in Excel sheets from 2023.

¹ <#160 (comment)>
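
A minimal sketch of the asymmetry described in this commit: a missing year raises, while a missing egg/cell marker is tolerated. The function name, the regular expressions, and the returned dictionary are assumptions for illustration, not the code in `tdb/vidrl_upload.py`.

```python
import re


def parse_human_serum(label):
    """Parse year (required) and egg/cell passage (optional) from a serum label."""
    year_match = re.search(r"\b(20\d{2})\b", label)
    if year_match is None:
        # Year info should always be present, so fail loudly.
        raise ValueError(f"Could not parse a year from human serum reference {label!r}")
    passage_match = re.search(r"\b(egg|cell)\b", label, re.IGNORECASE)
    # Passage is known to be missing in some 2023 sheets, so None is allowed.
    passage = passage_match.group(1).lower() if passage_match else None
    return {"year": year_match.group(1), "passage": passage}
```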
@joverlee521
Contributor Author

I'll plan to merge and upload the 2024 data as part of tomorrow's ingest.

@joverlee521 joverlee521 merged commit a25749a into master Aug 29, 2024
@joverlee521 joverlee521 deleted the vidrl-human-pools branch August 29, 2024 16:26
Successfully merging this pull request may close these issues.

Parse VIDRL human sera pool measurements