Context
We have some pieces coming together, but it's time to focus efforts around a few central goals. Our timeline is through June 2025.
Overall Goals
- Grow the database with quality sources according to our priorities
  - common topics: analyzing response effectiveness (calls for service, dispatch); documenting interactions & outcomes (stops, arrests, use of force); understanding basics of systems (agency completion & metadata, personnel)
  - geographic: go deep on one county at a time, starting with the most populous
  - followed areas: add sources to searches followed by our users
- Improve labeling models
  - eventually, we should be able to get fairly accurate relevancy, record type, and agency labels from trained language models (see the sketch after this list). This gets easier the more quality sources there are in the database.
- Quickly investigate requests
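
As a concrete illustration of the labeling goal above, here is a minimal sketch of inferring relevancy and record type for a candidate source. It uses an off-the-shelf zero-shot classifier as a stand-in for the project's trained models; the model name, label set, and scoring heuristic are assumptions, not the production pipeline.

```python
# Sketch: label a candidate source's page text with record type and relevancy.
# The real system would use models fine-tuned on our labeled sources; the
# labels below are placeholders drawn from the "common topics" priorities.
from transformers import pipeline

RECORD_TYPE_LABELS = [
    "calls for service",
    "dispatch logs",
    "stops",
    "arrest records",
    "use of force",
    "personnel records",
]

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def label_source(page_text: str) -> dict:
    """Return a best-guess record type and a crude relevancy score for a page."""
    # Record type: pick the highest-scoring label from the candidate set.
    record_type = classifier(page_text, candidate_labels=RECORD_TYPE_LABELS)

    # Relevancy: frame it as a binary choice and keep the positive-label score.
    relevancy = classifier(
        page_text,
        candidate_labels=["police data source", "not a police data source"],
    )

    return {
        "record_type": record_type["labels"][0],
        "record_type_score": record_type["scores"][0],
        "relevancy_score": dict(zip(relevancy["labels"], relevancy["scores"]))[
            "police data source"
        ],
    }

if __name__ == "__main__":
    sample = "Monthly use of force reports published by the Example County Sheriff's Office."
    print(label_source(sample))
```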
System overview
Check out the diagram in the README!
Q3–4
Close the loop
Show off
New Source Collectors
- New Source Collector: "research" ML model to generate batches #299
- Scraper templates / strategies for common data portals scrapers#216 (see the sketch after this list)
- "citation follower": track down sources used in published analysis
Encourage participation, add Big Value
- Data Collection Status map pdap.io#342
- Automatically create batches periodically based on followed locations
- One-step source submission (see the sketch after this list)
  - all we need is a URL; we'll look up the rest of the metadata for you.
  - Gradually (visibly!) pre-fill high-confidence annotations for human approval.
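
Here is a minimal sketch of the one-step submission flow: the submitter provides only a URL, and we draft the rest of the metadata for human approval. The field names and the "proposal" shape are placeholders, not the production schema.

```python
# Sketch: pre-fill draft metadata from a submitted URL for the approval queue.
import requests
from bs4 import BeautifulSoup

def prefill_from_url(url: str) -> dict:
    """Fetch a submitted URL and draft metadata for human review."""
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    title = soup.title.get_text(strip=True) if soup.title else None
    description_tag = soup.find("meta", attrs={"name": "description"})
    description = (
        description_tag["content"]
        if description_tag and description_tag.get("content")
        else None
    )

    return {
        "url": url,
        "proposed_name": title,
        "proposed_description": description,
        # Anything pre-filled stays a proposal until a human approves it.
        "needs_human_approval": True,
    }

# Usage: prefill_from_url("https://example-police-dept.gov/use-of-force-reports")
```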
Maintain the dataset
- when sources break (a change in URL status), we should try to find a replacement source (see the sketch after this list)
- agency crawler automation: check known websites for sources that were moved, updated, or added
- Log usefulness of sources
  - how often are people clicking?
  - capture sentiment (did you find what you needed with this search? is this source useful?)
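
A minimal sketch of the URL health check behind "when sources break": poll each known source URL and flag status changes so a replacement can be sought. The record shape is a placeholder.

```python
# Sketch: batch-check source URLs and surface the ones that need a replacement.
from dataclasses import dataclass
import requests

@dataclass
class SourceCheck:
    url: str
    ok: bool
    status: int | None

def check_source(url: str) -> SourceCheck:
    """Return whether a source URL still resolves, preferring a cheap HEAD request."""
    try:
        resp = requests.head(url, allow_redirects=True, timeout=15)
        if resp.status_code == 405:  # some servers reject HEAD; fall back to GET
            resp = requests.get(url, allow_redirects=True, timeout=15, stream=True)
        return SourceCheck(url=url, ok=resp.status_code < 400, status=resp.status_code)
    except requests.RequestException:
        return SourceCheck(url=url, ok=False, status=None)

def find_broken(urls: list[str]) -> list[SourceCheck]:
    """Check a batch of source URLs and return the ones that appear broken."""
    return [result for result in map(check_source, urls) if not result.ok]
```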
Improve the models
- Try Out New Machine Learning Models #142
- make training happen on DigitalOcean #41
- Replace the tag collector with an enhanced version, or an image snapshot.
Q1–2 (we got a lot done!)
Build the engine
Refinements (feature creep!)
- Source Collector duplicate checking #198
- Source Collector "agency homepage" labeling #199
- Relevancy Labeling: add options #223
Use metrics to measure success
- Automate data cleaning #117
- "Collector priorities" tab of the source collector #167
- Keep track of the cost of running the collector and generating batches in terms of compute time. Ideally, we'll come up with a cost for different types of crawls and use those numbers to direct donations (see the sketch after this list).
- Data Source ID: Simple reporting dashboard #125
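
As a rough sketch of the cost-tracking idea: accumulate compute time by crawl type so different kinds of crawls can be priced later. The crawl type names and the flat hourly rate are placeholder assumptions.

```python
# Sketch: record wall-clock compute time per crawl type and estimate cost.
import time
from collections import defaultdict
from contextlib import contextmanager

COMPUTE_SECONDS: dict[str, float] = defaultdict(float)
ASSUMED_DOLLARS_PER_HOUR = 0.05  # placeholder rate, not a real quote

@contextmanager
def track(crawl_type: str):
    """Accumulate wall-clock time spent on a crawl of the given type."""
    start = time.monotonic()
    try:
        yield
    finally:
        COMPUTE_SECONDS[crawl_type] += time.monotonic() - start

def cost_report() -> dict[str, float]:
    """Rough dollar cost per crawl type from accumulated compute time."""
    return {
        crawl_type: round(seconds / 3600 * ASSUMED_DOLLARS_PER_HOUR, 4)
        for crawl_type, seconds in COMPUTE_SECONDS.items()
    }

# Usage:
# with track("county_deep_crawl"):
#     run_crawl(...)   # hypothetical crawl entry point
# print(cost_report())
```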
Run the engine