After updating the upload scripts to use the hierarchical location curation, we should also audit the metadata in fauna to check for any erroneous location assignments.
I believe these can be fixed if we re-upload them with the --overwrite flag to update the location info.
Thank you @huddlej for pointing out the repeat_location in flu_upload. This looks like how these specific duplicate location names are being handled in the current upload.
Current Behavior
Locations defined in the source-data/geo_synonyms.tsv are ingested with the label column as the key in 3 dicts for location, division, and country. This means if there are multiple locations defined in the TSV file with the same label, the last one in the file is used as it overwrites all previous entries.
Specifically in vdb/flu_upload, the ingested location is based on the single location label in the strain name, which makes it impossible to identify the specific location.
Possible solution
Use a hierarchical location curation process similar to what we have in ncov-ingest or monkeypox ingest.
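To make the overwrite concrete, here is a minimal sketch rather than the actual fauna code. The column names (label, country, division, location), the sample rows, and the function names load_flat and load_hierarchical are all assumptions made up for illustration. It contrasts label-only keys with a composite key that keeps same-label rows apart.

```python
import csv
import io

# Hypothetical rows: two places share the label "springfield".
# Column names and values are illustrative assumptions, not the real TSV.
SAMPLE_TSV = (
    "label\tcountry\tdivision\tlocation\n"
    "springfield\tusa\tillinois\tspringfield\n"
    "springfield\tusa\tmassachusetts\tspringfield\n"
)


def load_flat(tsv_text):
    """Build three dicts keyed by label alone; later rows overwrite earlier ones."""
    label_to_location, label_to_division, label_to_country = {}, {}, {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        label = row["label"]
        label_to_location[label] = row["location"]
        label_to_division[label] = row["division"]
        label_to_country[label] = row["country"]
    return label_to_location, label_to_division, label_to_country


def load_hierarchical(tsv_text):
    """Build one dict keyed by (country, division, label) so duplicates stay distinct."""
    lookup = {}
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        key = (row["country"], row["division"], row["label"])
        lookup[key] = row["location"]
    return lookup


if __name__ == "__main__":
    _, divisions, _ = load_flat(SAMPLE_TSV)
    # Only the last "springfield" row survives in the flat lookup.
    print(divisions["springfield"])
    # Both rows survive when the key includes country and division.
    print(sorted(load_hierarchical(SAMPLE_TSV)))
```

A composite key like this only helps if the upload step has country or division context in addition to the label taken from the strain name, so the lookup alone would not resolve the ambiguity described for vdb/flu_upload.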