This repository has been archived by the owner on Jun 1, 2022. It is now read-only.

Identify locations that existed in a previous scraper run but do not exist any more #712

Open
simonw opened this issue Jun 30, 2021 · 6 comments
Labels
research Research needed

Comments


simonw commented Jun 30, 2021

For #704 it's likely that one of the strongest signals that a location has shut down is if it no longer appears in our scraped source location data.

Can we use SQL to notice locations that appear to no longer be picked up by our scrapers?

simonw added the research (Research needed) label on Jun 30, 2021

simonw commented Jun 30, 2021

Here's an interesting query: https://vial.calltheshots.us/dashboard/?sql=with+most_recent_import+as+(%0D%0A++select+source_name%2C+max(last_imported_at)+as+most_recent_for_source_location%0D%0A++from+source_location%0D%0A++group+by+source_name%0D%0A)%0D%0Aselect%0D%0A++source_location.source_name%2C+source_uid%2C+last_imported_at%2C+most_recent_for_source_location%0D%0Afrom%0D%0A++source_location+join+most_recent_import+on+source_location.source_name+%3D+most_recent_import.source_name%0D%0Awhere+--+it%27s+more+than+24+hours+away+from+the+most+recent+import%0D%0A++most_recent_for_source_location+-+INTERVAL+%272+DAYS%27+%3E+last_imported_at%3AGExHMWMOpNFEHojcnv-TDZOmruJEIOAcUFgvBye2yHw

with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  source_location.source_name, source_uid, last_imported_at, most_recent_for_source_location
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- it's more than 2 days away from the most recent import
  most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at

For each of our source_name values (each of which should correspond to a scraper) we find the most recent last_imported_at date.

Then we look for source locations with that source name whose last_imported_at is more than 2 days prior to that most recent date - so if a source's most recent import ran on June 30th, this catches any of its locations last imported before June 28th.


simonw commented Jun 30, 2021

A count(*) against that query returns 206,254 - that's out of a total of 241,631 records in source_location. That seems suspect - do we really have the vast majority of our source location records that far out of date?
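
The exact query behind that count isn't shown here, but it was presumably the query above with the select list swapped for count(*) - a sketch:

with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  count(*)
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- it's more than 2 days away from the most recent import
  most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at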


simonw commented Jun 30, 2021

Version of that query which only looks at source locations for sources that have been scraped within the past 7 days (based on their most_recent_for_source_location value):

with most_recent_import as (
  select source_name, max(last_imported_at) as most_recent_for_source_location
  from source_location
  group by source_name
)
select
  count(*)
from
  source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- it's more than 2 days away from the most recent import
  most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at
  and now() - INTERVAL '7 DAYS' < most_recent_for_source_location

https://vial.calltheshots.us/dashboard/?sql=with+most_recent_import+as+(%0D%0A++select+source_name%2C+max(last_imported_at)+as+most_recent_for_source_location%0D%0A++from+source_location%0D%0A++group+by+source_name%0D%0A)%0D%0Aselect%0D%0A++source_location.source_name%2C+source_uid%2C+last_imported_at%2C+most_recent_for_source_location%0D%0Afrom%0D%0A++source_location+join+most_recent_import+on+source_location.source_name+%3D+most_recent_import.source_name%0D%0Awhere+--+it%27s+more+than+24+hours+away+from+the+most+recent+import%0D%0A++most_recent_for_source_location+-+INTERVAL+%272+DAYS%27+%3E+last_imported_at%0D%0A++and+now()+-+INTERVAL+%277+DAYS%27+%3C+most_recent_for_source_location%3AdZ49ezANY9TzSMYoQ6CZMrwtnz1OMkLpQsoBG-qi-Oo

Still returns 116,395 records.

I'm pausing this research to work on other things. I'm not at all convinced I've been running the right queries, so don't take anything in this issue thread up to this point as factual - it may be the result of one or more mistakes.


simonw commented Jun 30, 2021

Here's why the results look odd: the scrapers have an optimization where, if a record hasn't changed, they don't send it to VIAL at all.

Discussed here: https://discord.com/channels/799147121357881364/813861006718926848/859865627191410700

One possible solution: teach the scrapers to send a special minimal document that tells VIAL "this location is still in the feed but has not changed since last time"

Maybe a special shape of document sent to /api/importSourceLocations - {"source_name": "blah", "source_uid": "blah:123", "unchanged": true} - could work.

This could be used to populate a last_seen_at column.
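
A rough sketch of the VIAL side, assuming a new last_seen_at column on source_location (the column doesn't exist yet, so the name and mechanics here are both assumptions):

-- Hypothetical: add a column tracking when a location last appeared in a feed
alter table source_location
  add column last_seen_at timestamp with time zone;

-- On receiving an "unchanged" document for a location, bump its
-- last_seen_at instead of re-running the full import logic:
update source_location
set last_seen_at = now()
where source_name = 'blah'
  and source_uid = 'blah:123';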


simonw commented Jun 30, 2021

I feel pretty good about this as a way for us to start reliably removing locations that are no longer active - we can back it up with a dashboard showing "recently deactivated locations", plus some kind of manual override for when we want to keep a location live even if a feed has stopped returning it.

Maybe we have an allow-list of scrapers that we trust to automatically deactivate locations - vaccinefinder_org for example.

And a human review queue for locations that go missing from other, less trusted scrapers.
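
The "recently deactivated locations" dashboard could then be driven by a query along these lines - again assuming the hypothetical last_seen_at column from the previous comment:

select source_name, source_uid, last_seen_at
from source_location
where
  -- allow-list of scrapers trusted to drive automatic deactivation
  source_name in ('vaccinefinder_org')
  and last_seen_at < now() - INTERVAL '2 DAYS'
order by last_seen_at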
