Update ingest for VIDRL flat files #164

Draft: wants to merge 13 commits into master

Commits on Nov 5, 2024

  1. read_flat_vidrl: add column map to script

    The column map will be more complicated with the need to ingest two
    slightly different flat files (_flat_file.csv and _reference_panel.csv)
    as discussed in #161 (comment).
    
    I also found myself constantly toggling back and forth between the
    separate column_map.tsv and the upload script to figure out how the
    columns are being used, so it makes more sense to just hard-code the
    column map in the script.
    joverlee521 committed Nov 5, 2024
    27757d8
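A minimal sketch of what a hard-coded column map could look like inside the script. The source column names here are hypothetical placeholders, since the actual flat file headers are not listed in this commit; `rename_columns` is likewise an illustrative helper, not a function from the PR.

```python
# Hypothetical column names: the real flat-file headers are not shown in the commit.
COLUMN_MAP = {
    "Test Virus": "virus_strain",
    "Reference Virus": "serum_strain",
    "Titre": "titer",
}


def rename_columns(record, column_map=COLUMN_MAP):
    """Return a new record with only the mapped columns, under their new names."""
    return {new: record[old] for old, new in column_map.items() if old in record}
```

Keeping the map next to the code that uses it means a reader can see in one place which source column feeds which output field.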
  2. read_flat_vidrl: Update column map

    Update column map based on `0906.xlsx_H1_flat_file.csv` in comparison
    to the matching Excel file `20240906\ H1N1.xlsx` available on
    VIDRL's OneDrive.
    joverlee521 committed Nov 5, 2024
    03c00d9
  3. read_flat_vidrl: replace pandas with csv module

    Avoid pandas typing issues by just using the Python csv module
    to read and write the flat files. Mimics `augur curate` with independent
    functions for reading, curating, and writing records.
    joverlee521 committed Nov 5, 2024
    0a773d1
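The read/curate/write split described in this commit can be sketched with the standard `csv` module as below. The function names, the whitespace-stripping curation step, and the tab-separated output are assumptions for illustration; the PR's actual functions may differ.

```python
import csv


def read_records(handle):
    """Yield each row as a dict of strings; csv performs no type inference."""
    yield from csv.DictReader(handle)


def curate_records(records):
    """Placeholder curation step: strip surrounding whitespace from string values."""
    for record in records:
        yield {
            key: value.strip() if isinstance(value, str) else value
            for key, value in record.items()
        }


def write_records(records, handle, fieldnames, delimiter="\t"):
    """Write curated records with a header row (delimiter is an assumption)."""
    writer = csv.DictWriter(
        handle, fieldnames=fieldnames, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
```

Because every value stays a string from input to output, none of the dtype guessing that motivated dropping pandas comes into play.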
  4. vidrl_upload: refactor human serum id to separate function

    Doing this in preparation for processing the flat files that include
    human sera measurements. The human serum ids will be parsed the same way
    for the flat files to ensure that we use the same standardized id.
    joverlee521 committed Nov 5, 2024
    8b71644
  5. read_flat_vidrl: standardize human sera strain/id/type

    Strip the "pool" suffix from the serum strain name, standardize the
    egg or cell type, and standardize the serum id.
    
    While looking into this change, I discovered that the strain name used
    for the human sera references in H1 and H3 is the egg vaccine strain
    regardless of passage annotation. Currently unclear if this is an error
    in the flat files or if we've misunderstood the passage annotations for
    human sera data. Once we clear this up, we should add some type of
    vaccine strain verification so that we can flag mismatches like this
    automatically.
    joverlee521 committed Nov 5, 2024
    04ab483
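As a rough illustration of the human sera standardization in the last commit above, the sketch below strips a trailing "pool" from a strain name and normalizes an egg/cell annotation. The helper names, the accepted separators before "pool", and the accepted passage spellings are all assumptions; the strain names used with it are made up.

```python
import re

# Assumed spellings; the actual values in the flat files are not shown in the commit.
PASSAGE_TYPES = {"egg", "cell"}


def strip_pool_suffix(strain):
    """Remove a trailing 'pool' (any case) plus any separators before it."""
    return re.sub(r"[\s_-]*pool\s*$", "", strain, flags=re.IGNORECASE)


def standardize_passage(value):
    """Normalize an egg/cell annotation to lowercase; raise on anything else."""
    key = value.strip().lower()
    if key not in PASSAGE_TYPES:
        raise ValueError(f"Unrecognized passage type: {value!r}")
    return key
```

Raising on unknown passage values, rather than passing them through, is one way to surface the kind of annotation mismatch the commit message mentions.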

Commits on Nov 6, 2024

  1. read_flat_vidrl: Include "assay_date"

    In order to include the "assay_date" in the uploaded data, the VIDRL
    column needs to be "date" so that it can be parsed within `elife_upload`
    as "assay_date".
    
    This is an ugly workaround, but it's similar to how cdc_upload handles
    the field.¹
    
    ¹ <https://github.com/nextstrain/fauna/blob/b133974275ee1ed4e91816c76db6b7616247b6dc/tdb/cdc_upload.py#L58>
    joverlee521 committed Nov 6, 2024
    63a3fa1
  2. read_flat_vidrl: Validate records

    Validate records in a single flat file. Ensure that the serum
    abbreviations map to a single serum strain and all records have the
    same test date.
    
    As a side effect, the validated `serum_abbr_map` and `test_date` are
    returned to be used for processing the reference panel records in
    following commits.
    joverlee521 committed Nov 6, 2024
    62e85b0
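The two checks in this commit (each serum abbreviation maps to one serum strain, and all records share one test date) could be sketched roughly as follows. The record field names `serum_abbr`, `serum_strain`, and `test_date` are assumed for illustration, and the error handling is a guess at reasonable behavior rather than the PR's actual code.

```python
def validate_records(records):
    """Check abbreviation -> strain consistency and a single shared test date.

    Returns (serum_abbr_map, test_date). Field names are hypothetical.
    """
    serum_abbr_map = {}
    test_dates = set()
    for record in records:
        abbr, strain = record["serum_abbr"], record["serum_strain"]
        # setdefault stores the first strain seen for abbr and returns the stored one.
        if serum_abbr_map.setdefault(abbr, strain) != strain:
            raise ValueError(
                f"Serum abbreviation {abbr!r} maps to both "
                f"{serum_abbr_map[abbr]!r} and {strain!r}"
            )
        test_dates.add(record["test_date"])
    if len(test_dates) != 1:
        raise ValueError(f"Expected one test date, found {sorted(test_dates)}")
    return serum_abbr_map, test_dates.pop()
```

Returning the validated map and date lets a later step reuse them for the reference panel records without re-deriving them.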
  3. vidrl_upload: refactor curate_flat_records

    Pull out curation into individual functions that can be shared with the
    curation of the _reference_panel.csv file.
    joverlee521 committed Nov 6, 2024
    476b262
  4. read_flat_vidrl: ingest reference_panel.csv file

    Ingests the matching "*_reference_panel.csv" for a provided
    "*_flat_file" fstem if the reference panel file exists. The records
    parsed from the reference panel file are appended to the same tmp
    file that is then passed to elife_upload.py.
    
    This currently includes "extra" records in comparison to the Excel files,
    where the human sera pool strain is the "test virus" against the other
    references. If I strip the `pool` suffix from human sera pool strain,
    the measurements are exact duplicates of the measurements for the
    matching reference strain. We will need to decide whether or not these
    records should be dropped.
    joverlee521 committed Nov 6, 2024
    b3d0dc8
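One way to locate the matching reference panel file for a given fstem is sketched below. It assumes the fstem ends in `_flat_file` and that both files sit in the same directory; the helper name `reference_panel_path` and its `directory` parameter are hypothetical.

```python
from pathlib import Path

FLAT_SUFFIX = "_flat_file"
PANEL_SUFFIX = "_reference_panel.csv"


def reference_panel_path(flat_fstem, directory="."):
    """Return the matching reference panel path, or None if there is none."""
    if not flat_fstem.endswith(FLAT_SUFFIX):
        return None
    # Swap the "_flat_file" suffix for "_reference_panel.csv".
    candidate = Path(directory) / (flat_fstem[: -len(FLAT_SUFFIX)] + PANEL_SUFFIX)
    return candidate if candidate.is_file() else None
```

Returning `None` when the file is absent keeps the reference panel step optional, matching the "if the reference panel file exists" wording in the commit message.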
  5. read_flat_vidrl: Parse human serum passage from specified field

    Based on a comment in Slack¹ that the "e" or "c" suffix in the serum ID
    is not a reliable indicator of human serum passage.
    
    ¹ <https://bedfordlab.slack.com/archives/C03KWDET9/p1728430958054989?thread_ts=1699914235.686809&cid=C03KWDET9>
    joverlee521 committed Nov 6, 2024
    3abf014
  6. 467cc58
  7. read_flat_vidrl: Clean up virus_strain that includes "pool" suffix

    Based on a meeting with VIDRL, we should only keep homologous titers
    for `virus_strain` that includes "pool" suffix. This will act as a proxy
    homologous titer for the human serum references. All other virus strains
    that include the "pool" suffix are ignored because they are duplicate
    data.
    joverlee521 committed Nov 6, 2024
    21cbcd8
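The filtering rule in this commit might be sketched as below. It assumes "homologous" means the pool-stripped `virus_strain` equals `serum_strain` (with the serum strain already stripped of its own "pool" suffix), and that the kept record has the suffix removed; both points, along with the field and function names, are assumptions rather than details confirmed by the commit.

```python
POOL_SUFFIX = "pool"


def clean_pool_records(records):
    """Pass through ordinary records; keep 'pool' viruses only when homologous."""
    for record in records:
        virus = record["virus_strain"].strip()
        if not virus.lower().endswith(POOL_SUFFIX):
            yield record
            continue
        # Drop the suffix and any separator that preceded it.
        base = virus[: -len(POOL_SUFFIX)].rstrip(" _-")
        if base == record["serum_strain"]:
            # Homologous titer: keep it under the cleaned strain name.
            yield {**record, "virus_strain": base}
        # Any other "pool" virus strain is duplicate data and is skipped.
```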
  8. read_flat_vidrl: Check for potential duplicate _reference_panel files

    Based on a meeting with VIDRL, a/b and _1/_2 reference panel files are
    created from the same Excel file so they are duplicates while capital
    A/B files are separate assays.
    
    So, this change allows us to check for the a/b and _1/_2 patterns and
    ignore the reference panel file if it's a duplicate. This means we
    always ingest the a or _1 file but ignore the b and _2 files.
    joverlee521 committed Nov 6, 2024
    f85a2e2
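A possible sketch of the duplicate check from the last commit above, using a case-sensitive regular expression so that capital A/B files are never flagged. Where the marker sits in the file name is an assumption: this version expects it immediately before `.xlsx`, as in a hypothetical `0906b.xlsx_H1_reference_panel.csv`.

```python
import re

# Assumption: the a/b or _1/_2 marker sits right before ".xlsx" in the name.
# Lowercase "b" must follow a digit; the match is case-sensitive on purpose.
DUPLICATE_MARKER = re.compile(r"(?:(?<=\d)b|_2)(?=\.xlsx)")


def is_duplicate_reference_panel(filename):
    """Return True for the b / _2 file of an a/b or _1/_2 pair."""
    return DUPLICATE_MARKER.search(filename) is not None
```

With this check, the a and _1 files (and any file without a marker) are ingested, while their b and _2 counterparts are skipped.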