Identify locations that existed in a previous scraper run but do not exist any more #712
For #704 it's likely that one of the strongest signals we can get for whether a location has shut down is if it no longer appears in our source location scraped data.

Can we use SQL to notice locations that appear to no longer be picked up by our scrapers?
with most_recent_import as (
select source_name, max(last_imported_at) as most_recent_for_source_location
from source_location
group by source_name
)
select
source_location.source_name, source_uid, last_imported_at, most_recent_for_source_location
from
source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- more than 2 days older than the most recent import for that source
most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at

For each of our source names we find the timestamp of the most recent import. Then we look for source locations with that source name whose last_imported_at is more than two days older than that most recent import.
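Not part of the original thread, but as a sketch using only the tables and columns above: the same query can be grouped to count apparently-stale locations per source, which makes it easier to spot a single scraper that has stopped reporting a batch of locations.

with most_recent_import as (
select source_name, max(last_imported_at) as most_recent_for_source_location
from source_location
group by source_name
)
select
source_location.source_name,
count(*) as stale_locations -- locations missing from this source's latest run
from
source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where
most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at
group by source_location.source_name
order by stale_locations desc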
A version of that query which only looks at source locations for sources that have been scraped within the past 7 days (based on their most recent last_imported_at):

with most_recent_import as (
select source_name, max(last_imported_at) as most_recent_for_source_location
from source_location
group by source_name
)
select
count(*)
from
source_location join most_recent_import on source_location.source_name = most_recent_import.source_name
where -- more than 2 days older than the most recent import for that source
most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at
and now() - INTERVAL '7 DAYS' < most_recent_for_source_location

This still returns 116,395 records. I'm pausing this research to work on other things. I'm not at all convinced I've been running the right queries, so don't take anything in this issue thread up to this point as factual; it may be the result of one or more mistakes.
Here's a parameterized query that takes a number of days and a
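That comment is cut off, so the actual query wasn't captured. A minimal sketch of what a parameterized version could look like, assuming a Postgres prepared statement whose single parameter is the staleness threshold in days (the statement name and parameter choice are illustrative, not from the original thread):

-- $1 is the staleness threshold in days (an assumption; the original
-- comment is truncated before showing its parameters).
prepare stale_location_count(int) as
with most_recent_import as (
select source_name, max(last_imported_at) as most_recent_for_source_location
from source_location
group by source_name
)
select count(*)
from source_location
join most_recent_import on source_location.source_name = most_recent_import.source_name
where most_recent_for_source_location - ($1 * INTERVAL '1 DAY') > last_imported_at;

-- e.g. the two-day threshold used in the earlier queries:
execute stale_location_count(2);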
Here's why the results look odd: the scrapers have an optimization where, if a record hasn't changed, they don't send it to VIAL at all. Discussed here: https://discord.com/channels/799147121357881364/813861006718926848/859865627191410700

One possible solution: teach the scrapers to send a special minimal document that tells VIAL "this location is still in the feed but has not changed since last time". Maybe a special shape of document that gets sent to
This could be used to populate a
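The end of that comment is truncated, but if the idea is for VIAL to record a "still present, unchanged" signal, one rough sketch is a dedicated timestamp column that the import side bumps whenever a scraper reports a location as unchanged. The column name last_seen_at and the example values are assumptions, not the actual VIAL schema or API:

-- Hypothetical column for "still in the feed but unchanged" pings:
alter table source_location add column if not exists last_seen_at timestamptz;

-- When a scraper reports a location as present-but-unchanged, bump the
-- timestamp without re-importing the full record (values are examples):
update source_location
set last_seen_at = now()
where source_name = 'example_source'
and source_uid = 'example_uid';

Staleness checks could then compare against greatest(last_imported_at, last_seen_at) rather than last_imported_at alone.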
I feel pretty good about this as a way for us to start reliably removing locations that are no longer active. We can back it up with a dashboard showing "recently deactivated locations", plus some kind of manual override for when we want to keep a location live even if a feed has stopped returning it. Maybe we have an allow-list of scrapers that we trust to cause locations to be automatically deactivated, and a human review queue for locations that go missing from other, less trusted scrapers.
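A rough sketch of what the trusted-scraper auto-deactivation could look like in SQL. The trusted_sources table, the location.is_active flag, and the matched_location_id foreign key are all hypothetical names introduced for illustration, not the real schema:

-- Deactivate locations that a trusted source has stopped returning.
-- trusted_sources, location.is_active and matched_location_id are
-- assumptions for the sake of the sketch.
with most_recent_import as (
select source_name, max(last_imported_at) as most_recent_for_source_location
from source_location
group by source_name
)
update location
set is_active = false
where id in (
select source_location.matched_location_id
from source_location
join most_recent_import on source_location.source_name = most_recent_import.source_name
where source_location.source_name in (select source_name from trusted_sources)
and most_recent_for_source_location - INTERVAL '2 DAYS' > last_imported_at
);

Locations missing from non-trusted sources would instead be routed to the human review queue rather than updated directly.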