Update ingest for VIDRL flat files #164
base: master
Commits on Nov 5, 2024
-
read_flat_vidrl: add column map to script
The column map will be more complicated with the need to ingest two slightly different flat files (_flat_file.csv and _reference_panel.csv) as discussed in #161 (comment). I also found myself constantly toggling back and forth between the separate column_map.tsv and the upload script to figure out how the columns are being used, so it makes more sense to just hard-code the column map in the script.
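A minimal sketch of what a hard-coded column map covering both file types could look like. The column names on the left-hand side are hypothetical, since the actual flat-file headers are not shown in this commit message, and `COLUMN_MAPS` and `apply_column_map` are illustrative names rather than the script's own.

```python
# Hypothetical source column names; the real flat-file headers may differ.
COLUMN_MAPS = {
    "flat_file": {
        "virus_name": "virus_strain",
        "serum_name": "serum_strain",
        "titre": "titer",
    },
    "reference_panel": {
        "reference_virus": "virus_strain",
        "serum_name": "serum_strain",
        "titre": "titer",
    },
}


def apply_column_map(record, file_type):
    """Return a new record holding only the mapped columns, renamed."""
    column_map = COLUMN_MAPS[file_type]
    return {new: record[old] for old, new in column_map.items() if old in record}
```

Keeping both maps next to the code that consumes them avoids switching between a separate TSV and the script to see how a column is used.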
Commit 27757d8
-
read_flat_vidrl: Update column map
Update column map based on `0906.xlsx_H1_flat_file.csv` in comparison to the matching Excel file `20240906\ H1N1.xlsx` available on VIDRL's OneDrive.
Commit 03c00d9
-
read_flat_vidrl: replace pandas with csv module
Avoid pandas typing issues by just using the Python csv module to read and write the flat files. Mimics `augur curate` with independent functions for reading, curating, and writing records.
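The read/curate/write split described above can be sketched with the standard `csv` module. This is an illustration only: the function names, the whitespace-stripping curation step, and the tab-delimited output are assumptions, not the script's actual behavior.

```python
import csv
import io


def read_records(handle):
    """Yield each CSV row as a dict of strings (no type inference)."""
    yield from csv.DictReader(handle)


def curate_records(records):
    """Yield records with surrounding whitespace stripped from every value."""
    for record in records:
        yield {key: (value or "").strip() for key, value in record.items()}


def write_records(records, handle, fieldnames):
    """Write records as tab-delimited text with a header row."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)


# Usage with in-memory files; the column names are hypothetical.
source = io.StringIO("virus_strain,titer\n A/Test/1/2024 ,160\n")
output = io.StringIO()
write_records(curate_records(read_records(source)), output, ["virus_strain", "titer"])
```

Because every value stays a string from input to output, nothing is silently converted to a float or a missing-value marker along the way.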
Commit 0a773d1
-
vidrl_upload: refactor human serum id to separate function
Doing this in preparation for processing the flat files that include human sera measurements. The human serum ids will be parsed the same way for the flat files to ensure that we use the same standardized id.
Commit 8b71644
-
read_flat_vidrl: standardize human sera strain/id/type
Strip the "pool" suffix from the serum strain name, standardize the egg or cell type, and standardize the serum id. While looking into this change, I discovered that the strain name used for the human sera references in H1 and H3 is the egg vaccine strain regardless of passage annotation. Currently unclear if this is an error in the flat files or if we've misunderstood the passage annotations for human sera data. Once we clear this up, we should add some type of vaccine strain verification so that we can flag mismatches like this automatically.
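A rough sketch of the two string clean-ups mentioned above. It assumes the "pool" suffix is separated from the strain name by whitespace, an underscore, or a hyphen, and that the raw passage annotation is one of a few spellings. Both are guesses about the flat files, and the helper names are hypothetical.

```python
import re

# Hypothetical raw spellings; the values in the real flat files may differ.
PASSAGE_TYPES = {"egg": "egg", "e": "egg", "cell": "cell", "c": "cell"}

# Assumes a separator precedes "pool", so names merely ending in those
# letters are left untouched.
POOL_SUFFIX = re.compile(r"[\s_-]+pool\s*$", flags=re.IGNORECASE)


def strip_pool_suffix(strain):
    """Remove a trailing "pool" and the separator in front of it."""
    return POOL_SUFFIX.sub("", strain)


def standardize_passage(value):
    """Map a raw passage annotation to "egg" or "cell", else None."""
    return PASSAGE_TYPES.get(value.strip().lower())
```

Returning `None` for an unrecognized annotation leaves the caller free to decide whether to skip the record or raise an error.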
Commit 04ab483
Commits on Nov 6, 2024
-
read_flat_vidrl: Include "assay_date"
To include the "assay_date" in the uploaded data, the VIDRL column needs to be named "date" so that it can be parsed within `elife_upload` as "assay_date". This is an ugly workaround, but it is similar to how cdc_upload handles the field.¹
¹ <https://github.com/nextstrain/fauna/blob/b133974275ee1ed4e91816c76db6b7616247b6dc/tdb/cdc_upload.py#L58>
Commit 63a3fa1
-
read_flat_vidrl: Validate records
Validate records in a single flat file. Ensure that each serum abbreviation maps to a single serum strain and that all records have the same test date. As a side effect, the validated `serum_abbr_map` and `test_date` are returned for use in processing the reference panel records in the following commits.
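The two checks can be sketched as below. The record keys `serum_abbr`, `serum_strain`, and `test_date` are assumed field names, and `validate_records` is a hypothetical function name. Only the returned `serum_abbr_map` and `test_date` come from the commit message.

```python
def validate_records(records):
    """Check abbreviation and date consistency; return (serum_abbr_map, test_date)."""
    serum_abbr_map = {}
    test_dates = set()
    for record in records:
        abbr = record["serum_abbr"]
        strain = record["serum_strain"]
        # The first strain seen for an abbreviation is the expected one.
        expected = serum_abbr_map.setdefault(abbr, strain)
        if expected != strain:
            raise ValueError(
                f"Serum abbreviation {abbr!r} maps to both {expected!r} and {strain!r}"
            )
        test_dates.add(record["test_date"])
    if len(test_dates) != 1:
        raise ValueError(f"Expected exactly one test date, found {sorted(test_dates)}")
    return serum_abbr_map, test_dates.pop()
```

An empty input also fails the date check, which seems preferable to silently returning no test date.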
Commit 62e85b0
-
vidrl_upload: refactor curate_flat_records
Pull out curation into individual functions that can be shared with the curation of the _reference_panel.csv file.
Commit 476b262
-
read_flat_vidrl: ingest reference_panel.csv file
Ingests the matching "*_reference_panel.csv" for a provided "*_flat_file" fstem if the reference panel file exists. The records parsed from the reference panel file are appended to the same tmp file that is then passed to elife_upload.py. This currently includes "extra" records, absent from the Excel files, in which the human sera pool strain is the "test virus" against the other references. If I strip the `pool` suffix from the human sera pool strain, the measurements are exact duplicates of the measurements for the matching reference strain. We will need to decide whether these records should be dropped.
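One way to locate the matching reference panel file, sketched under the assumption that the two files sit in the same directory and differ only in the `_flat_file.csv` versus `_reference_panel.csv` suffix. The function names are hypothetical.

```python
from pathlib import Path

FLAT_SUFFIX = "_flat_file.csv"
PANEL_SUFFIX = "_reference_panel.csv"


def reference_panel_name(flat_file_name):
    """Derive the reference panel file name from a flat file name."""
    if not flat_file_name.endswith(FLAT_SUFFIX):
        raise ValueError(f"{flat_file_name!r} does not end with {FLAT_SUFFIX!r}")
    return flat_file_name[: -len(FLAT_SUFFIX)] + PANEL_SUFFIX


def find_reference_panel(flat_file_path):
    """Return the sibling reference panel path, or None if it does not exist."""
    flat_file_path = Path(flat_file_path)
    candidate = flat_file_path.with_name(reference_panel_name(flat_file_path.name))
    return candidate if candidate.is_file() else None
```

Returning `None` when the file is absent lets the caller skip the reference panel step without treating it as an error.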
Commit b3d0dc8
-
read_flat_vidrl: Parse human serum passage from specified field
Based on a comment in Slack¹ that the "e" or "c" suffix in the serum ID is not a reliable indicator of human serum passage.
¹ <https://bedfordlab.slack.com/archives/C03KWDET9/p1728430958054989?thread_ts=1699914235.686809&cid=C03KWDET9>
Commit 3abf014
-
Commit 467cc58
-
read_flat_vidrl: Clean up `virus_strain` that includes "pool" suffix
Based on a meeting with VIDRL, we should keep only the homologous titers for a `virus_strain` that includes the "pool" suffix. These act as proxy homologous titers for the human serum references. All other records whose virus strain includes the "pool" suffix are ignored because they are duplicate data.
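A sketch of the filtering rule, assuming a record counts as homologous when its `virus_strain` and `serum_strain` carry the same name (compared case-insensitively) and that the "pool" suffix sits at the end of the strain name. The `serum_strain` key and both function names are assumptions for illustration.

```python
def is_pool(strain):
    """True when the strain name ends with "pool" (any case)."""
    return strain.strip().lower().endswith("pool")


def keep_record(record):
    """Keep non-pool viruses; keep pool viruses only when homologous."""
    virus = record["virus_strain"].strip().lower()
    serum = record["serum_strain"].strip().lower()
    if not is_pool(virus):
        return True
    return virus == serum
```

Applied as `filter(keep_record, records)`, this drops the duplicate pool measurements while retaining the proxy homologous titers.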
Commit 21cbcd8
-
read_flat_vidrl: Check for potential duplicate _reference_panel files
Based on a meeting with VIDRL, a/b and _1/_2 reference panel files are created from the same Excel file, so they are duplicates, while capital A/B files are separate assays. This change checks for the a/b and _1/_2 patterns and ignores the reference panel file if it is a duplicate. This means we always ingest the a or _1 file but ignore the b and _2 files.
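The duplicate check could be sketched with a case-sensitive pattern. Where the a/b or _1/_2 marker sits in the real file names is not stated here, so this assumes it is the final part of a label such as `0906b` or `0906_2`. The label format and the function name are hypothetical.

```python
import re

# Assumption: the marker ends the label, e.g. "0906a"/"0906b" or "0906_1"/"0906_2".
# The pattern is case-sensitive on purpose: capital "A"/"B" mark separate assays.
DUPLICATE_MARKER = re.compile(r"(?:(?<=\d)b|_2)$")


def is_duplicate_reference_panel(label):
    """True for the "b" or "_2" copy, which repeats the "a" or "_1" file."""
    return DUPLICATE_MARKER.search(label) is not None
```

Under this rule the `a` and `_1` files, along with capital `A`/`B` files, are always ingested.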
Commit f85a2e2