Populate Location fields: vaccines_offered, accepts_appointments and accepts_walkins #650
The fields in question were added in #494 - see lines 241 to 252 in 14c9098.
These were editable in the VIAL interface for a while - they are no longer editable (or at least they are hidden by default). The number of locations with these fields populated is:
This is from the short period of time when these were editable. I'm going to export that data and otherwise pretend it didn't exist.
```sql
select
  id, public_id, name, full_address,
  vaccines_offered, accepts_walkins, accepts_appointments, public_notes
from
  location
where (
  vaccines_offered is not null
  or accepts_appointments is not null
  or accepts_walkins is not null
  or (public_notes is not null and public_notes != '')
)
```

Exported data is here: https://gist.github.com/simonw/d7644d1f444bc4221b3b284f73468360
There are two steps here: backfill the existing data, and ensure that when new reports or source locations are ingested the data is updated to reflect our best available versions.
This query is really useful:

```sql
with source_location_info as (
  select
    id,
    source_uid,
    json_extract_path(import_json::json, 'availability') as availability,
    json_extract_path(import_json::json, 'inventory') as inventory
  from
    source_location
)
select
  *
from
  source_location_info
where
  availability is not null
  or inventory is not null
limit
  100
```
```sql
with source_location_info as (
  select
    id,
    source_uid,
    json_extract_path(import_json::json, 'availability') as availability,
    json_extract_path(import_json::json, 'inventory') as inventory
  from
    source_location
)
select
  count(*)
from
  source_location_info
where
  availability is not null
  or inventory is not null
```

Too long to run through the dashboard, so I used an unlimited local connection - returned 155,304.
```sql
with source_location_info as (
  select
    id,
    source_uid,
    matched_location_id,
    json_extract_path(import_json::json, 'availability') as availability,
    json_extract_path(import_json::json, 'inventory') as inventory
  from
    source_location
)
select
  count(distinct matched_location_id)
from
  source_location_info
where
  availability is not null
  or inventory is not null
```

Takes 17.8s and returns 63,730.
I changed that where clause to:

```sql
where
  availability is not null
  and inventory is not null
```

And it returned 50,228 locations that have a matched source location with BOTH of those fields.
As for reports...

```sql
with skip_reports as (
  select report_id from call_report_availability_tag where availabilitytag_id = (
    select id from availability_tag where "group" = 'skip'
  )
)
select count(distinct location_id)
from report
where id not in (select report_id from skip_reports)
  and vaccines_offered is not null
```

Returns 15,760 - there are 15,760 locations for which we have at least one non-skip report with vaccines_offered populated.
One way to look at this is that we have a sequence of opinions about which vaccines are offered - from imported source locations and from reports. The slight hitch is that, since we overwrite source locations when we import them, we don't have the full history of those opinions stored in our PostgreSQL database - though we likely have them in a git history somewhere.
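To make that concrete, here is a minimal sketch of a "latest opinion wins" resolution, assuming a hypothetical `Opinion` record carrying a timestamp, the claimed `vaccines_offered` value and where the claim came from (none of these names are from the actual codebase):

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Opinion:
    # One "opinion" about a location, from either a report or a source location
    observed_at: datetime
    vaccines_offered: Optional[list]  # None means "no opinion on this"
    source: str  # e.g. "report:123" or "source_location:getmyvax_org:..."


def resolve_vaccines_offered(opinions: list[Opinion]) -> Optional[Opinion]:
    """Return the most recent opinion that actually says something about
    vaccines_offered, or None if nothing has an opinion."""
    with_opinion = [o for o in opinions if o.vaccines_offered is not None]
    if not with_opinion:
        return None
    return max(with_opinion, key=lambda o: o.observed_at)
```

Whichever opinion wins would be written to the location, with its origin recorded in the provenance fields introduced later in this thread.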
Also interesting: which of our locations have the most matched source locations?
https://vial-staging.calltheshots.us/location/lpptz is the top one - that's Walgreens Co. #19134 in CT - because of the CT scrapers: https://vial.calltheshots.us/dashboard/?sql=select+source_uid%2C+source_name%2C+name+from+source_location+where+matched_location_id+%3D+35007%3AncVcHDs5Axbk6K_g7YTGp8BW-m7RT3HWkoABzBgzpgE Actually that looks bad - I think a bunch of different CT Walgreens may have been incorrectly matched.
I'm going to exclude
A sample of the values of that field shows those don't always have supply levels.
Here's the scraper code that sets it.
I ran this (took 20s, so not through Django SQL Dashboard) to see which sources have both drop_in and appointment records:

```sql
select "source_name", count(*) as n from (with source_location_info as (
  select
    json_extract_path(import_json::json, 'availability', 'drop_in') as drop_in,
    json_extract_path(import_json::json, 'availability', 'appointments') as appointments,
    import_json,
    id, source_uid, source_name, name, created_at, matched_location_id, last_imported_at
  from
    source_location
)
select
  *
from
  source_location_info
where
  drop_in is not null and appointments is not null) as results group by "source_name" order by n desc
```
Looking at the distinct availability values for vaccinefinder_org:

```sql
select json_extract_path(import_json::json, 'availability')::text as availability, count(*)
from source_location where source_name = 'vaccinefinder_org'
group by json_extract_path(import_json::json, 'availability')::text
```
I'm adding the following fields:

```diff
+    vaccines_offered_provenance_report = models.ForeignKey(
+        "Report",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The report that last populated vaccines_offered",
+        on_delete=models.PROTECT,
+    )
+    vaccines_offered_provenance_source_location = models.ForeignKey(
+        "SourceLocation",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The source location that last populated vaccines_offered",
+        on_delete=models.PROTECT,
+    )
+    vaccines_offered_last_updated_at = models.DateTimeField(
+        help_text="When vaccines_offered was last updated",
+        blank=True,
+        null=True,
+    )
+
+    appointments_walkins_provenance_report = models.ForeignKey(
+        "Report",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The report that last populated accepts_walkins and accepts_appointments",
+        on_delete=models.PROTECT,
+    )
+    appointments_walkins_provenance_source_location = models.ForeignKey(
+        "SourceLocation",
+        null=True,
+        blank=True,
+        related_name="+",
+        help_text="The source location that last populated accepts_walkins and accepts_appointments",
+        on_delete=models.PROTECT,
+    )
+    appointments_walkins_last_updated_at = models.DateTimeField(
+        help_text="When accepts_walkins and accepts_appointments were last updated",
+        blank=True,
+        null=True,
+    )
```
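For illustration, a hedged sketch of how a derivation step might write one of these values together with its provenance columns - the `apply_vaccines_offered_from_report` helper is hypothetical, not part of the actual codebase:

```python
from django.utils import timezone


def apply_vaccines_offered_from_report(location, report):
    """Hypothetical helper: record a report's vaccines_offered opinion on a
    location, along with which report supplied it and when."""
    location.vaccines_offered = report.vaccines_offered
    location.vaccines_offered_provenance_report = report
    location.vaccines_offered_provenance_source_location = None
    location.vaccines_offered_last_updated_at = timezone.now()
    location.save(
        update_fields=[
            "vaccines_offered",
            "vaccines_offered_provenance_report",
            "vaccines_offered_provenance_source_location",
            "vaccines_offered_last_updated_at",
        ]
    )
```

A companion helper for source locations would set vaccines_offered_provenance_source_location instead and clear the report pointer.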
My code doesn't (yet) populate those new columns - that comes next. I'm pushing this live to staging and then I'll track down a bunch of interesting examples - locations with multiple reports and source locations - that I can use to demonstrate what the derived data will look like.
Once I'm comfortable with the behaviour of that method, I'll do the following:
I want to find good example locations for this - locations that have both source locations AND reports against them which cover vaccines offered and availability. Problem: we don't seem to have any on staging. Here's a query:

```sql
with last_1000_vaccine_source_locations as (
  select * from source_location where json_extract_path(import_json::json, 'inventory') is not null
  order by id desc limit 1000
),
all_reports_with_vaccines as (
  select * from report where vaccines_offered is not null
)
select * from location where id in (
  select matched_location_id from last_1000_vaccine_source_locations
) and id in (
  select location_id from all_reports_with_vaccines
) limit 100
```

Since the code I've written so far is completely safe - it shows things on a debug page but doesn't update any database records - I'm going to ship it to production in order to see more examples.
My biggest question here is around the trustworthiness of our scrapers. We don't want to discard information from a high-quality source just because a less trusted scraper reported something more recently. I think the fix for that is going to be allow-listing the scrapers - maybe even starting with just one trusted source. Relevant code: lines 527 to 532 and lines 594 to 599 in 344153e.
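A minimal sketch of what that allow-listing could look like - which scrapers belong on the list is exactly the open question, so the names and the helper below are placeholders rather than a real decision:

```python
# Placeholder allow-list: which scrapers are trusted enough to overwrite the
# derived fields is still an open question.
TRUSTED_SOURCE_NAMES = {
    "getmyvax_org",
    "vaccinefinder_org",
}


def source_location_is_trusted(source_location) -> bool:
    """Only allow-listed scrapers get to update vaccines_offered,
    accepts_appointments and accepts_walkins on a matched location."""
    return source_location.source_name in TRUSTED_SOURCE_NAMES
```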
I can use the new debug information to dig into this.
```sql
select location.public_id, count(*), array_agg(source_location.source_name), count(distinct source_location.source_name) as source_name_count
from source_location join location on source_location.matched_location_id = location.id
group by location.public_id
having count(*) > 1
order by source_name_count desc
```

Here's a top result from that: https://vial.calltheshots.us/location/lqwzd
This variant of that query returns only locations that also have at least one non-skip report:

```sql
select
  location.public_id,
  count(*) as num_source_locations,
  array_agg(distinct source_location.source_name),
  count(distinct source_location.source_name) as num_distinct_source_names
from
  source_location
  join location on source_location.matched_location_id = location.id
where
  -- Only locations that have at least one non-skip report
  location.id in (
    select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
  )
group by
  location.public_id
having count(*) > 1
order by num_distinct_source_names desc
```
https://vial.calltheshots.us/location/lykhz is an interesting example.
https://vial.calltheshots.us/dashboard/?sql=select%20%22json_extract_path%22%2C%20count%28%2A%29%20as%20n%20from%20%28select%20json_extract_path%28import_json%3A%3Ajson%2C%20%27availability%27%29%3A%3Atext%20from%20source_location%20where%20source_name%20%3D%20%27getmyvax_org%27%29%20as%20results%20group%20by%20%22json_extract_path%22%20order%20by%20n%20desc%3APE6S3PfuMROXLq15I658NslW4tlc2PMasbC3HeyDqBI shows the distribution of availability values for getmyvax_org. Running this against the whole DB:

```sql
select "json_extract_path", count(*) as n from (select json_extract_path(import_json::json, 'availability')::text from source_location) as results group by "json_extract_path" order by n desc
```

Returns this:
Maybe if the most recent source location has no explicit opinion on drop-ins we should fall back to the most recent report, if one exists?
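A rough sketch of that fallback, assuming we have the parsed availability JSON from the most recent source location and a simplified dict view of the most recent non-skip report (both shapes and the function name are illustrative, not from the codebase):

```python
from typing import Optional


def resolve_walkins_and_appointments(
    latest_source_availability: Optional[dict],
    latest_report: Optional[dict],
) -> tuple[Optional[bool], Optional[bool]]:
    """Prefer the most recent source location's availability block, but where
    it has no explicit opinion, fall back to the most recent report (if any).

    latest_source_availability is the parsed 'availability' JSON from a source
    location, e.g. {"drop_in": True, "appointments": True}; latest_report is a
    dict like {"accepts_walkins": ..., "accepts_appointments": ...}.
    """
    source = latest_source_availability or {}
    report = latest_report or {}

    accepts_walkins = source.get("drop_in")
    if accepts_walkins is None:
        accepts_walkins = report.get("accepts_walkins")

    accepts_appointments = source.get("appointments")
    if accepts_appointments is None:
        accepts_appointments = report.get("accepts_appointments")

    return accepts_walkins, accepts_appointments
```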
I'm going to upgrade the display of that derived data on the debug page.
Examples I need to find:
And some report examples, using the query from #650 (comment):

```sql
select
  location.public_id,
  count(*) as num_source_locations,
  array_agg(distinct source_location.source_name),
  count(distinct source_location.source_name) as num_distinct_source_names
from
  source_location
  join location on source_location.matched_location_id = location.id
where
  -- Only locations that have at least one non-skip report
  location.id in (
    select location_id from report where report.id not in (select report_id from call_report_availability_tag where availabilitytag_id = 20)
  )
group by
  location.public_id
having count(*) > 1
  and 'vaccinefinder_org' = any(array_agg(distinct source_location.source_name))
order by num_distinct_source_names desc
```
I'm going to wrap this work up by:
It looks like
Both
The fact that Lines 541 to 546 in 9b7cc30
Just realized I need to exclude reports with
I need to do another in-depth review of places that might add/remove/edit source locations and reports to make sure they all update the derived data correctly.
Here's a progress report on how population is going based on imported source locations and reports:
That's out of 77,067 not-soft-deleted locations.
I still need to run a back-fill mechanism for locations that haven't had a report or a source location import in the past week.
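A sketch of what that backfill pass might look like; the model, field names and the resolution helper are passed in or assumed rather than taken from the real codebase:

```python
from datetime import timedelta

from django.utils import timezone


def backfill_stale_locations(Location, update_derived_availability, batch_size=500):
    """Hypothetical backfill pass: re-derive availability data for locations
    whose derived fields haven't been touched in the past week.

    Location (the model class) and update_derived_availability (a function
    applying the resolution logic sketched earlier) are passed in because
    their real names and locations in the codebase are assumptions here.
    """
    one_week_ago = timezone.now() - timedelta(days=7)
    # Locations whose derived vaccines_offered data is missing or stale
    stale = Location.objects.filter(soft_deleted=False).exclude(
        vaccines_offered_last_updated_at__gte=one_week_ago
    )
    for location in stale.iterator(chunk_size=batch_size):
        update_derived_availability(location)
```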
On staging https://vial-staging.calltheshots.us/dashboard/?sql=select+count%28%2A%29+from+location+where+vaccines_offered_last_updated_at+is+not+null%3Ax05v1QiDVM1fIntikQuIJSI7Umqolhu3YXvrdJsdPiU&sql=select+count%28%2A%29+from+location+where+appointments_walkins_last_updated_at+is+not+null%3AZPXVjIp8wjdVr8xjePGpvlDD4GyWUDSgnB8rbZbuPoQ both return around 9,000 records, presumably due to test source location imports run against staging.
Replaces #504 - needed by #649.
From https://docs.google.com/document/d/17svyCVXcloArj1wbUgu7QwEb6xeIDa-p7C3U0_yLgKg/edit