Context
We have some pieces coming together, but it's time to focus efforts around a few central goals. Our timeline is through June 2025.
Overall Goals
- Grow the database with quality sources according to our priorities
  - common topics: analyzing response effectiveness (calls for service, dispatch); documenting interactions & outcomes (stops, arrests, use of force); understanding basics of systems (agency completion & metadata, personnel)
  - geographic: go deep on one county at a time, starting with the most populous
  - followed areas: add sources to searches followed by our users
- Improve labeling models
  - eventually, we should be able to get fairly accurate relevancy, record type, and agency labels from trained language models (see the sketch after this list). This gets easier the more quality sources there are in the database.
- Quickly investigate requests
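
As a concrete illustration of the labeling goal above, here is a minimal sketch of inferring relevancy and record type for a candidate source. It uses an off-the-shelf zero-shot classifier as a stand-in for the project's trained models; the model name, label set, and scoring heuristic are assumptions, not the production pipeline.

```python
# Sketch: label a candidate source's page text with record type and relevancy.
# The real system would use models fine-tuned on our labeled sources; the
# labels below are placeholders drawn from the "common topics" priorities.
from transformers import pipeline

RECORD_TYPE_LABELS = [
    "calls for service",
    "dispatch logs",
    "stops",
    "arrest records",
    "use of force",
    "personnel records",
]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def label_source(page_text: str) -> dict:
    """Return a best-guess record type and a crude relevancy score for a page."""
    # Record type: pick the highest-scoring label from the candidate set.
    record_type = classifier(page_text, candidate_labels=RECORD_TYPE_LABELS)

    # Relevancy: frame it as a binary choice and keep the positive-label score.
    relevancy = classifier(
        page_text,
        candidate_labels=["police data source", "not a police data source"],
    )

    return {
        "record_type": record_type["labels"][0],
        "record_type_score": record_type["scores"][0],
        "relevancy_score": dict(zip(relevancy["labels"], relevancy["scores"]))[
            "police data source"
        ],
    }

if __name__ == "__main__":
    sample = "Monthly use of force reports published by the Example County Sheriff's Office."
    print(label_source(sample))
```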
System overview
Check out the diagram in the README!
Q3–4
Close the loop
Show off
New Source Collectors
- New Source Collector: "research" ML model to generate batches #299
- Scraper templates / strategies for common data portals scrapers#216 (see the sketch after this list)
- "citation follower": track down sources used in published analysis
Encourage participation, add Big Value
- Data Collection Status map pdap.io#342
- Automatically create batches periodically based on followed locations
- One-step source submission (see the sketch after this list)
  - all we need is a URL; we'll look up the rest of the metadata for you.
  - Gradually (visibly!) pre-fill high-confidence annotations for human approval.
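
Here is a minimal sketch of the one-step submission flow: the submitter provides only a URL, and we draft the rest of the metadata for human approval. The field names and the "proposal" shape are placeholders, not the production schema.

```python
# Sketch: pre-fill draft metadata from a submitted URL for the approval queue.
import requests
from bs4 import BeautifulSoup

def prefill_from_url(url: str) -> dict:
    """Fetch a submitted URL and draft metadata for human review."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else None
    description_tag = soup.find("meta", attrs={"name": "description"})
    description = (
        description_tag["content"]
        if description_tag and description_tag.get("content")
        else None
    )

    return {
        "url": url,
        "proposed_name": title,
        "proposed_description": description,
        # Anything pre-filled stays a proposal until a human approves it.
        "needs_human_approval": True,
    }

# Usage: prefill_from_url("https://example-police-dept.gov/use-of-force-reports")
```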
Maintain the dataset
- when sources break (a change in URL status), we should try to find a replacement source (see the sketch after this list)
- agency crawler automation: check known websites for sources that were moved, updated, or added
- Log usefulness of sources
  - how often are people clicking?
  - capture sentiment (did you find what you needed with this search? is this source useful?)
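
A minimal sketch of the URL health check behind "when sources break": poll each known source URL and flag status changes so a replacement can be sought. The record shape is a placeholder.

```python
# Sketch: batch-check source URLs and surface the ones that need a replacement.
from dataclasses import dataclass
import requests

@dataclass
class SourceCheck:
    url: str
    ok: bool
    status: int | None

def check_source(url: str) -> SourceCheck:
    """Return whether a source URL still resolves, preferring a cheap HEAD request."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=15)
        if resp.status_code == 405:  # some servers reject HEAD; fall back to GET
            resp = requests.get(url, allow_redirects=True, timeout=15, stream=True)
        return SourceCheck(url=url, ok=resp.status_code < 400, status=resp.status_code)
    except requests.RequestException:
        return SourceCheck(url=url, ok=False, status=None)

def find_broken(urls: list[str]) -> list[SourceCheck]:
    """Check a batch of source URLs and return the ones that appear broken."""
    return [result for result in map(check_source, urls) if not result.ok]
```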
Improve the models
- Try Out New Machine Learning Models #142
- make training happen on DigitalOcean #41
- Replace the tag collector with an enhanced version, or an image snapshot.
Q1–2 (we got a lot done!)
Build the engine
Refinements (feature creep!)
- Source Collector duplicate checking #198
- Source Collector "agency homepage" labeling #199
- Relevancy Labeling: add options #223
Use metrics to measure success
- Automate data cleaning #117
- "Collector priorities" tab of the source collector #167
- Keep track of the cost of running the collector and generating batches in terms of compute time. Ideally, we'll come up with a cost for different types of crawls and use those numbers to direct donations (see the sketch after this list).
- Data Source ID: Simple reporting dashboard #125
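
As a rough sketch of the cost-tracking idea: accumulate compute time by crawl type so different kinds of crawls can be priced later. The crawl type names and the flat hourly rate are placeholder assumptions.

```python
# Sketch: record wall-clock compute time per crawl type and estimate cost.
import time
from collections import defaultdict
from contextlib import contextmanager

COMPUTE_SECONDS: dict[str, float] = defaultdict(float)
ASSUMED_DOLLARS_PER_HOUR = 0.05  # placeholder rate, not a real quote

@contextmanager
def track(crawl_type: str):
    """Accumulate wall-clock time spent on a crawl of the given type."""
    start = time.monotonic()
    try:
        yield
    finally:
        COMPUTE_SECONDS[crawl_type] += time.monotonic() - start

def cost_report() -> dict[str, float]:
    """Rough dollar cost per crawl type from accumulated compute time."""
    return {
        crawl_type: round(seconds / 3600 * ASSUMED_DOLLARS_PER_HOUR, 4)
        for crawl_type, seconds in COMPUTE_SECONDS.items()
    }

# Usage:
# with track("county_deep_crawl"):
#     run_crawl(...)   # hypothetical crawl entry point
# print(cost_report())
```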
Run the engine