Check strain names in titer data #286

@joverlee521

Context

@jameshadfield recently compared titer strain names against the strain names of sequences from the new ingest workflow, and the match rate is about as good (or bad) as the current match rate against fauna sequences (Slack).

We should add a check to verify that titer strain names match an existing sequence strain name. This would help us identify misspelled strain names such as those in #60.

Description

Even before we move titer ingest out of fauna, we can add this check into the existing workflow for titers.

We currently download titers from fauna, separated by lineage/center/passage/assay, and the who center is a superset of all other centers. So after download_titers, we can check the data/<lineage>/who_<passage>_<assay>_titers.tsv files against <lineage>/metadata.tsv. This would verify that each titer's virus_strain and serum_strain values match a strain value in the metadata. Every value that does not match would be written to a log file.
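A rough sketch of what this check could look like, using only the Python standard library. It assumes both TSVs have a header row and that the titer file carries virus_strain and serum_strain columns while the metadata carries strain. The function names and the two-column log layout (column, titer_strain) are hypothetical and not part of the existing workflow.

```python
import csv


def load_strains(metadata_path, strain_column="strain"):
    """Return the set of strain names found in a metadata TSV (header row assumed)."""
    with open(metadata_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[strain_column] for row in reader}


def find_unmatched_titer_strains(
    titers_path, known_strains, columns=("virus_strain", "serum_strain")
):
    """Return sorted, de-duplicated (column, value) pairs missing from known_strains."""
    unmatched = set()
    with open(titers_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        for row in reader:
            for column in columns:
                value = row[column]
                if value not in known_strains:
                    unmatched.add((column, value))
    return sorted(unmatched)


def write_log(unmatched, log_path):
    """Write the unmatched names to a TSV log (hypothetical layout)."""
    with open(log_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(["column", "titer_strain"])
        writer.writerows(unmatched)
```

In the workflow this would presumably run as a rule that takes the output of download_titers and the lineage metadata as inputs and produces the log file. Exact string matching is used here on purpose, so case or whitespace differences also show up in the log.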

The log file could then be used to generate a mapping file from bad titer strain names to existing sequence strain names. This mapping can be applied in a curation step before the titer data is uploaded to S3.
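As an illustration of the curation step, here is a minimal sketch of applying such a mapping. The mapping file format (a TSV with titer_strain and sequence_strain columns) and the function names are assumptions made for this example, and the strain names in the usage snippet are made up.

```python
import csv


def load_strain_map(map_path):
    """Read a two-column TSV (hypothetical columns: titer_strain, sequence_strain)."""
    with open(map_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row["titer_strain"]: row["sequence_strain"] for row in reader}


def apply_strain_map(rows, strain_map, columns=("virus_strain", "serum_strain")):
    """Yield copies of titer rows with known-bad strain names replaced."""
    for row in rows:
        fixed = dict(row)  # copy so the caller's rows are left untouched
        for column in columns:
            # Names without a mapping entry pass through unchanged.
            fixed[column] = strain_map.get(fixed[column], fixed[column])
        yield fixed
```

For example, with a single made-up correction:

```python
strain_map = {"A/Exmaple/1/2020": "A/Example/1/2020"}
rows = [{"virus_strain": "A/Exmaple/1/2020", "serum_strain": "A/Example/2/2020", "titer": "80"}]
curated = list(apply_strain_map(rows, strain_map))
```

Names that are neither in the metadata nor in the mapping would still appear in the log on the next run, so the mapping file can be grown incrementally.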

Labels: enhancement
