Description
Context
@jameshadfield recently looked into matches of titer strain names against the strain names of sequences from the new ingest workflow and it's as good/bad as current strain name matches against fauna sequences (Slack).
We should add a check to verify that titer strain names match an existing sequence strain name. This would help us identify misspelled strain names such as those in #60.
Description
Even before we move titer ingest out of fauna, we can add this check into the existing workflow for titers.
We currently download titers from fauna, separated by lineage/center/passage/assay, but the who center is a superset of all other centers. So after download_titers, we can check the data/<lineage>/who_<passage>_<assay>_titers.tsv files against the <lineage>/metadata.tsv. This would verify that each titer's virus_strain and serum_strain values match a strain value in the metadata. Every value that does not match would be written to a log file.
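A rough sketch of what such a check could look like, not a proposal for the final implementation. It assumes both TSV files have a header row, with virus_strain and serum_strain columns in the titers file and a strain column in the metadata; if the titers files have no header, the column handling would need to change. The function names and the log format (one unmatched name per line) are hypothetical.

```python
import csv


def read_column_values(tsv_path, columns):
    """Return the set of non-empty values found in the given columns of a TSV.

    Assumes the file has a header row naming its columns.
    """
    values = set()
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            for column in columns:
                value = (row.get(column) or "").strip()
                if value:
                    values.add(value)
    return values


def find_unmatched_titer_strains(titers_path, metadata_path):
    """Return sorted titer strain names that have no matching sequence strain."""
    sequence_strains = read_column_values(metadata_path, ["strain"])
    titer_strains = read_column_values(
        titers_path, ["virus_strain", "serum_strain"]
    )
    return sorted(titer_strains - sequence_strains)


def write_unmatched_log(unmatched, log_path):
    """Write one unmatched strain name per line (hypothetical log format)."""
    with open(log_path, "w", encoding="utf-8") as handle:
        for strain in unmatched:
            handle.write(strain + "\n")
```

The comparison here is an exact string match after trimming whitespace, so differences in case or punctuation would be reported as mismatches, which is probably what we want for spotting misspellings.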
The log file could then be used to generate a mapping file from bad titer strain names to existing sequence strain names. This mapping can be applied in a curation step before the titer data is uploaded to S3.
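For illustration only, the curation step might look something like the following. The mapping file format is an assumption (a two-column TSV without a header: bad name, then corrected name), and the function names are hypothetical.

```python
import csv


def load_strain_map(map_path):
    """Load a bad-name -> corrected-name mapping from a two-column TSV.

    Assumed format: no header, bad name in column 1, corrected name in column 2.
    """
    mapping = {}
    with open(map_path, newline="", encoding="utf-8") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            if len(row) >= 2 and row[0].strip():
                mapping[row[0].strip()] = row[1].strip()
    return mapping


def correct_titer_row(row, mapping, columns=("virus_strain", "serum_strain")):
    """Return a copy of a titer row (dict) with known bad strain names replaced."""
    corrected = dict(row)
    for column in columns:
        name = corrected.get(column)
        if name in mapping:
            corrected[column] = mapping[name]
    return corrected
```

Names that are not in the mapping pass through unchanged, so they would still show up in the log on the next run.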