Ingest VIDRL human sera references #160
@@ -24,6 +24,20 @@
 # This is based on the vaccine composition for the Southern Hemisphere
 # because all human pooled sera should be from Australia
 VACCINE_MAPPING = {
+    "2023": {
Ended up needing to add 2023 vaccine mapping because some of the 2024 files included human sera references from 2023.
Based on the date of the XLS file and other context, do you feel confident that the date in the spreadsheet(s) was intended to be "2023" and not "2024"? Do you have an example spreadsheet with 2023 human pool data that I could check out?
Good point! I'll dig up some examples.
All files from 2024-01-04 to 2024-07-12 are "2023", but from 2024-07-18 on they are all "2024". This general pattern makes me pretty confident they are intended to be "2023".
Agreed, that sounds right. Thank you for checking!
Used this Snakefile to upload 2024 human sera data to
This ran through 73 Excel workbooks without raising any errors 🎉
f34c826 to e32d1f9
In the parsing of human pooled sera, the subtype will be required, so just error early if the subtype is not provided. Also adds an additional check that the subtype is one of the expected values.
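A minimal sketch of this early check, assuming a hypothetical `validate_subtype` helper and an assumed set of expected subtype values (neither name nor values are taken from the actual parser):

```python
# Hypothetical set of expected values; the real parser may use different names.
EXPECTED_SUBTYPES = {"h1n1pdm", "h3n2", "vic", "yam"}


def validate_subtype(subtype):
    """Return the normalized subtype, or raise before any parsing starts."""
    if not subtype:
        raise ValueError("Subtype is required to parse human pooled sera.")
    normalized = subtype.strip().lower()
    if normalized not in EXPECTED_SUBTYPES:
        raise ValueError(
            f"Unexpected subtype {subtype!r}; "
            f"expected one of {sorted(EXPECTED_SUBTYPES)}."
        )
    return normalized
```

Failing at this point keeps a missing or mistyped subtype from surfacing later as a confusing lookup error.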
Defining VIDRL-specific human serum patterns to make it easier to track which patterns are being used to match the human serum references. Doing this in preparation for parsing human serum references for VIDRL.
Using a hard-coded VACCINE_MAPPING to keep track of Southern Hemisphere vaccine strains, which is used to map human pooled sera references to specific serum strains. This commit only includes 2024 vaccines from <https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2024-southern-hemisphere-influenza-season>. Will update the vaccine mapping as I backfill more data.
I had originally tried to use a4c4336 + 5e72c59 to parse the "clade" row for the extra egg/cell info, but that pattern matching fails because of excess clade rows in the Excel sheet.¹ Instead, I've opted to just force include an extra row of info for the human serum data in `find_serum_rows`.
¹ <#160 (comment)>
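As a rough illustration of how such a mapping could be queried, here is a sketch that assumes a year → egg/cell → subtype nesting. Only the string year keys are visible in the diff; the nesting, the `lookup_serum_strain` helper, and the strain names are placeholders, not the real WHO recommendations or the actual code:

```python
# Placeholder strain names for illustration only; NOT the WHO recommendations.
# The year -> passage -> subtype nesting is an assumption.
VACCINE_MAPPING = {
    "2024": {
        "egg": {"h3n2": "A/ExampleEgg/1/2022"},
        "cell": {"h3n2": "A/ExampleCell/2/2022"},
    },
}


def lookup_serum_strain(year, passage, subtype, mapping=VACCINE_MAPPING):
    """Map a human pooled serum reference to a vaccine strain name."""
    try:
        return mapping[year][passage][subtype]
    except KeyError as err:
        # A missing key most likely means the mapping needs a new entry.
        raise KeyError(
            f"No vaccine strain for year={year!r}, passage={passage!r}, "
            f"subtype={subtype!r}"
        ) from err
```

With a structure like this, backfilling an older season would mostly mean adding another top-level year entry.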
Because the `serum_host` field is unreliable in fauna, seasonal-flu uses substring matches on the `serum_id` field to separate ferret, human, and mouse sera.¹ Updating the `serum_id` to be `Human pool <year>` so that it can be matched in seasonal-flu. This also has the side effect of setting the `serum_host` field to "human" within fauna because of the `serum_id` matching in tdb/upload.py.²
¹ <https://github.com/nextstrain/seasonal-flu/blob/89f6cfd11481b2c51c50d68822c18d46ed56db51/workflow/snakemake_rules/download_from_fauna.smk#L93>
² <https://github.com/nextstrain/fauna/blob/88a607db53d36fc91482cae2009eefddf9477f97/tdb/upload.py#L382-L383>
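The idea can be sketched roughly as follows. Both helpers are hypothetical, and the fallback to "ferret" is an assumption for illustration; the real matching lives in the seasonal-flu rule and tdb/upload.py linked above:

```python
def make_human_serum_id(year):
    """Build a serum_id in the `Human pool <year>` form."""
    return f"Human pool {year}"


def guess_serum_host(serum_id):
    """Classify the serum host by substring match on the serum_id.

    Falling back to "ferret" is an assumption made for this sketch.
    """
    lowered = serum_id.lower()
    if "human" in lowered:
        return "human"
    if "mouse" in lowered:
        return "mouse"
    return "ferret"
```

Because the host is derived from the ID, keeping the `Human pool` prefix stable matters more than the exact year format.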
Allows us to only ingest the human sera references as we are backfilling the data to avoid accidentally duplicating the ferret titer data. This flag can be removed once we've ingested all of the human sera references that have been previously skipped.
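A sketch of how such a temporary flag might be wired up. The flag name `--human-ref-only`, the `build_parser` and `select_measurements` helpers, and the dict-shaped measurements are all assumptions for illustration; only the `Human pool` prefix comes from the commits in this PR:

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical temporary backfill flag."""
    parser = argparse.ArgumentParser(description="Upload titer measurements.")
    parser.add_argument(
        "--human-ref-only",
        action="store_true",
        help="Only ingest human sera references (temporary, for backfilling).",
    )
    return parser


def select_measurements(measurements, human_ref_only):
    """Keep every measurement, or only the human pooled sera ones."""
    if not human_ref_only:
        return list(measurements)
    return [m for m in measurements if m["serum_id"].startswith("Human pool")]
```

Defaulting the flag to off means that removing it later restores the normal behavior without touching callers.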
Updating with 2023 vaccines from <https://www.who.int/publications/m/item/recommended-composition-of-influenza-virus-vaccines-for-use-in-the-2023-southern-hemisphere-influenza-season>. I ended up needing to add 2023 vaccine mapping because some of the 2024 files included human sera references from 2023.
e32d1f9 to b7bade4
Of the 73 processed workbooks, 1 did not have any human sera references. I tested download of the titer measurements using the seasonal-flu workflow with a small patch to use the `test_tdb` database.
diff --git a/workflow/snakemake_rules/download_from_fauna.smk b/workflow/snakemake_rules/download_from_fauna.smk
index 64e8350..04544df 100644
--- a/workflow/snakemake_rules/download_from_fauna.smk
+++ b/workflow/snakemake_rules/download_from_fauna.smk
@@ -68,7 +68,7 @@ rule download_titers:
     output:
         titers = "data/{lineage}/{center}_{passage}_{assay}_titers.tsv"
     params:
-        dbs = _get_tdb_databases,
+        dbs = 'test_tdb',
         assays = _get_tdb_assays,
         virus_passage_category=_get_virus_passage_category,
     conda: "../envs/nextstrain.yaml"
This downloaded 5694 measurements that were all appropriately selected for the human host files. There were 28 measurements excluded because the …
I manually spot-checked 3 workbooks per subtype to verify that all of the human sera reference measurements were included.
This looks excellent, @joverlee521! Thank you for the detailed checks with test_tdb, too. My only review questions below are data-specific. The only blocking comment is a quick check of the "2023" data, just to be sure that your algorithm didn't catch a typo in the spreadsheet.
Based on feedback from @huddlej,¹ we should _always_ have year info, so raising a loud error when it cannot be parsed from the human sera references. I'm choosing _not_ to update the similar check for egg/cell distinction since I've already seen examples of it missing in Excel sheets from 2023.
¹ <#160 (comment)>
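A minimal sketch of a loud failure on a missing year, assuming a hypothetical `parse_serum_year` helper and a simple four-digit year pattern (the actual VIDRL patterns are defined elsewhere in the PR and may differ):

```python
import re

# Assumed pattern: a standalone four-digit year starting with "20".
YEAR_PATTERN = re.compile(r"\b(20\d{2})\b")


def parse_serum_year(reference):
    """Return the year found in a human serum reference, or raise."""
    match = YEAR_PATTERN.search(reference)
    if match is None:
        raise ValueError(
            f"Could not parse a year from human serum reference {reference!r}."
        )
    return match.group(1)
```

Returning the year as a string keeps it usable as a key into a mapping with string years such as `"2023"`.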
I'll plan to merge and upload the 2024 data as part of tomorrow's ingest.
Description of proposed changes
First pass for ingesting VIDRL human sera references, focused on 2024 data.
I'm hoping that ingesting older data should be as simple as updating the VACCINE_MAPPING, but we'll see...
Related issue(s)
Related to #158
Checklist