Ingest VIDRL human sera references #160

joverlee521 · 2024-08-26T22:45:42Z

Description of proposed changes

First pass for ingesting VIDRL human sera references, focused on 2024 data.
I'm hoping that ingesting older data should be as simple as updating the VACCINE_MAPPING, but we'll see...

Related issue(s)

Related to #158

Checklist

Checks pass

tdb/vidrl_upload.py

joverlee521 · 2024-08-27T00:28:09Z

@@ -24,6 +24,20 @@
 # This is based on the vaccine composition for the Southern Hemisphere
 # because all human pooled sera should be from Australia
 VACCINE_MAPPING = {
+    "2023": {


Ended up needing to add 2023 vaccine mapping because some of the 2024 files included human sera references from 2023.

huddlej · Aug 28, 2024

Based on the date of the XLS file and other context, do you feel confident that the date in the spreadsheet(s) was intended to be "2023" and not "2024"? Do you have an example spreadsheet with 2023 human pool data that I could check out?

joverlee521 · Aug 28, 2024

Good point! I'll dig up some examples.

joverlee521 · Aug 28, 2024

All files from 2024-01-04 to 2024-07-12 are "2023", but from 2024-07-18 on they are all "2024". This general pattern makes me pretty confident they are intended to be "2023".

huddlej · Aug 28, 2024

Agreed, that sounds right. Thank you for checking!

joverlee521 · 2024-08-27T00:30:32Z

Used this Snakefile to upload 2024 human sera data to test_tdb with changes up to f34c826

pyenv activate fauna
nextstrain build --cpus 2 --ambient --envdir ../env.d/seasonal-flu/ . --snakefile data/Snakefile --config year='2024' preview=False

This ran through 73 Excel workbooks without raising any errors 🎉
I will dig more into the uploaded data tomorrow to make sure nothing looks too out of place...

In the parsing of human pooled sera, the subtype will be required, so just error early if the subtype is not provided. Also adds an additional check that the subtype is one of the expected values.
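A minimal sketch of what such an early check could look like. The set of accepted subtypes and the helper names below are assumptions for illustration, not the actual code in tdb/vidrl_upload.py.

```python
import argparse

# Hypothetical set of accepted subtypes; the real list lives in tdb/vidrl_upload.py.
EXPECTED_SUBTYPES = {"h1n1pdm", "h3n2", "vic", "yam"}


def validate_subtype(subtype):
    """Return the normalized subtype, or raise before any workbook is parsed."""
    if not subtype:
        raise ValueError("--subtype is required to parse human pooled sera")
    normalized = subtype.lower()
    if normalized not in EXPECTED_SUBTYPES:
        raise ValueError(
            f"Unexpected subtype {subtype!r}; expected one of {sorted(EXPECTED_SUBTYPES)}"
        )
    return normalized


def parse_args(argv):
    """Parse CLI arguments and validate --subtype immediately."""
    parser = argparse.ArgumentParser(description="Sketch of early subtype validation")
    parser.add_argument("--subtype", default=None)
    args = parser.parse_args(argv)
    args.subtype = validate_subtype(args.subtype)
    return args
```

Validating right after argument parsing means a missing or misspelled subtype fails before any Excel workbook is opened.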

Defining VIDRL specific human serum patterns to make it easier to track which patterns are being used to match the human serum references. Doing this in preparation for parsing human serum references for VIDRL.

Using a hard-coded VACCINE_MAPPING to keep track of Southern Hemisphere vaccine strains, which is used to map human pooled sera references to specific serum strains. This commit only includes 2024 vaccines from <https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2024-southern-hemisphere-influenza-season>. Will update the vaccine mapping as I backfill more data.

I had originally tried to use a4c4336 + 5e72c59 to parse the "clade" row for the extra egg/cell info, but that pattern matching fails because of excess clade rows in the Excel sheet.¹ Instead, I've opted to just force include an extra row of info for the human serum data in `find_serum_rows`.

¹ <#160 (comment)>
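For illustration, a lookup against a hard-coded mapping might be sketched as follows. The nesting (year, then passage, then subtype), the helper name, and the strain names are all placeholders assumed for this example; they are not the real vaccine composition or the real structure of VACCINE_MAPPING.

```python
# Placeholder strain names, NOT the real vaccine composition.
# The nesting (year -> passage -> subtype) is an assumption for illustration.
EXAMPLE_VACCINE_MAPPING = {
    "2024": {
        "egg": {"h3n2": "A/ExampleEgg/1/2022"},
        "cell": {"h3n2": "A/ExampleCell/1/2022"},
    },
}


def lookup_serum_strain(year, passage, subtype, mapping=EXAMPLE_VACCINE_MAPPING):
    """Map a human pooled serum reference to a vaccine strain, failing loudly on gaps."""
    try:
        return mapping[year][passage][subtype]
    except KeyError as err:
        raise KeyError(
            f"No vaccine strain mapped for year={year!r}, "
            f"passage={passage!r}, subtype={subtype!r}"
        ) from err
```

With a structure like this, backfilling an older season only requires adding another top-level year entry.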

Because the `serum_host` field is unreliable in fauna, seasonal-flu uses substring matches on the `serum_id` field to separate ferret, human, and mouse sera.¹ Updating the `serum_id` to be `Human pool <year>` so that it can be matched in seasonal-flu. This also has the side-effect of setting the `serum_host` field to "human" within fauna because of the `serum_id` matching in tdb/upload.py.²

¹ <https://github.com/nextstrain/seasonal-flu/blob/89f6cfd11481b2c51c50d68822c18d46ed56db51/workflow/snakemake_rules/download_from_fauna.smk#L93>
² <https://github.com/nextstrain/fauna/blob/88a607db53d36fc91482cae2009eefddf9477f97/tdb/upload.py#L382-L383>
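The idea of substring matching on `serum_id` can be sketched roughly as below. The function names are hypothetical, and treating every non-human, non-mouse ID as ferret is an assumption made for this example; the linked seasonal-flu and fauna code is the authority on the real rules.

```python
def build_human_pool_serum_id(year):
    """Build the serum_id in the `Human pool <year>` form described above."""
    return f"Human pool {year}"


def classify_serum_host(serum_id):
    """Guess the serum host from substrings of serum_id.

    Assumption: anything that is not tagged as human or mouse is treated as ferret.
    """
    lowered = serum_id.lower()
    if "human" in lowered:
        return "human"
    if "mouse" in lowered:
        return "mouse"
    return "ferret"
```

Because the ID always contains the word "Human", a case-insensitive substring match is enough to route these measurements to the human sera files.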

Allows us to only ingest the human sera references as we are backfilling the data to avoid accidentally duplicating the ferret titer data. This flag can be removed once we've ingested all of the human sera references that have been previously skipped.

Updating with 2023 vaccines from <https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2023-southern-hemisphere-influenza-season>.

I ended up needing to add 2023 vaccine mapping because some of the 2024 files included human sera references from 2023.

joverlee521 · 2024-08-27T23:00:51Z

Of the 73 processed workbooks, 1 did not have any human sera references.
From the other 72 workbooks, the upload workflow added 5722 measurements to test_tdb/flu.

I tested download of the titer measurements using the seasonal-flu workflow with a small patch to use the `test_tdb` database.

diff --git a/workflow/snakemake_rules/download_from_fauna.smk b/workflow/snakemake_rules/download_from_fauna.smk
index 64e8350..04544df 100644
--- a/workflow/snakemake_rules/download_from_fauna.smk
+++ b/workflow/snakemake_rules/download_from_fauna.smk
@@ -68,7 +68,7 @@ rule download_titers:
     output:
         titers = "data/{lineage}/{center}_{passage}_{assay}_titers.tsv"
     params:
-        dbs = _get_tdb_databases,
+        dbs = 'test_tdb',
         assays = _get_tdb_assays,
         virus_passage_category=_get_virus_passage_category,
     conda: "../envs/nextstrain.yaml"

nextstrain build --envdir ../env.d/seasonal-flu/ . data/h3n2/who_human_cell_hi_titers.tsv data/h3n2/who_human_egg_hi_titers.tsv data/h3n2/who_human_cell_fra_titers.tsv data/h3n2/who_human_egg_fra_titers.tsv data/h1n1pdm/who_human_cell_hi_titers.tsv data/h1n1pdm/who_human_egg_hi_titers.tsv data/vic/who_human_cell_hi_titers.tsv data/vic/who_human_egg_hi_titers.tsv --configfile profiles/upload.yaml

This downloaded 5694 measurements that were all appropriately selected for the human host files. There were 28 measurements excluded because the virus_passage_category was egg while the serum_passage_category was cell. The seasonal-flu workflow explicitly excludes egg passaged test viruses in cell passaged titer data.
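The exclusion described above can be sketched as a simple filter. The record layout and function names are assumptions for illustration; only the field names `virus_passage_category` and `serum_passage_category` and the counts come from this comment.

```python
def keep_measurement(record):
    """Drop egg-passaged test viruses that appear in cell-passaged serum data."""
    return not (
        record["virus_passage_category"] == "egg"
        and record["serum_passage_category"] == "cell"
    )


def filter_measurements(records):
    """Return only the measurements that pass the passage filter."""
    return [record for record in records if keep_measurement(record)]


# Counts reported above: 5722 uploaded, 28 excluded by the passage filter.
UPLOADED = 5722
EXCLUDED = 28
EXPECTED_DOWNLOADED = UPLOADED - EXCLUDED
```

The uploaded and downloaded counts reconcile: 5722 - 28 = 5694.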

I manually spot checked 3 workbooks per subtype to verify all of the human sera reference measurements were included.
At this point, I'm pretty confident that this at least works for the 2024 files.

huddlej

This looks excellent, @joverlee521! Thank you for the detailed checks with test_tdb, too. My only review questions below are data-specific. The only blocking comment is a quick check of the "2023" data, just to be sure that your algorithm didn't catch a typo in the spreadsheet.

huddlej · 2024-08-28T18:15:41Z

Based on the date of the XLS file and other context, do you feel confident that the date in the spreadsheet(s) was intended to be "2023" and not "2024"? Do you have an example spreadsheet with 2023 human pool data that I could check out?

Based on feedback from @huddlej,¹ we should _always_ have year info, so raise a loud error when it cannot be parsed from the human sera references. I'm choosing _not_ to update the similar check for the egg/cell distinction since I've already seen examples of it missing in Excel sheets from 2023.

¹ <#160 (comment)>
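A rough sketch of this asymmetry, with a strict year check and a lenient egg/cell check, might look like the following. The regular expression and function names are hypothetical; the real VIDRL human serum patterns are defined in tdb/vidrl_upload.py.

```python
import re

# Hypothetical pattern: any standalone four-digit year starting with 20.
HUMAN_POOL_YEAR = re.compile(r"\b(20\d{2})\b")


def parse_human_pool_year(label):
    """Return the year from a human sera reference label, or fail loudly."""
    match = HUMAN_POOL_YEAR.search(label)
    if match is None:
        raise ValueError(f"Could not parse a year from human sera reference {label!r}")
    return match.group(1)


def parse_passage(label):
    """Return 'egg' or 'cell' if the label mentions one, else None.

    A missing passage is tolerated because some sheets omit it.
    """
    lowered = label.lower()
    for passage in ("egg", "cell"):
        if passage in lowered:
            return passage
    return None
```

Raising on a missing year stops the upload before a mislabeled measurement reaches the database, while a missing passage simply yields `None` for the caller to handle.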

joverlee521 · 2024-08-28T21:21:02Z

I'll plan to merge and upload the 2024 data as part of tomorrow's ingest.

joverlee521 linked an issue Aug 26, 2024 that may be closed by this pull request

Parse VIDRL human sera pool measurements #158

joverlee521 force-pushed the vidrl-human-pools branch from f34c826 to e32d1f9 Compare August 27, 2024 20:24

joverlee521 added 6 commits August 27, 2024 14:53

vidrl_upload: Error early if --subtype is not provided

0c56e79

vidrl_upload: Define VIDRL specific human_serum_pattern

211e325

joverlee521 force-pushed the vidrl-human-pools branch from e32d1f9 to b7bade4 Compare August 27, 2024 22:08

joverlee521 marked this pull request as ready for review August 27, 2024 23:00

joverlee521 requested review from j23414 and huddlej August 27, 2024 23:05

huddlej reviewed Aug 28, 2024

huddlej approved these changes Aug 28, 2024

joverlee521 merged commit a25749a into master Aug 29, 2024

joverlee521 deleted the vidrl-human-pools branch August 29, 2024 16:26

joverlee521 mentioned this pull request Aug 29, 2024

Parse VIDRL human sera pool measurements #158
