Properties of Data Sources to identify #11

Open
@josh-chamberlain

Context

We want to add metadata to URLs, filter for relevancy, and expand our database of valid data sources.

Flowchart

The overall plan for data source identification is now in the readme of this repo.

Properties

These properties are all explained in the data dictionary.

S tier

A tier

  • description: subjective, but it fills in the gaps left by name, record type, and agency. Can be used to disambiguate similar sources. Difficult to automate.
  • aggregation_type
  • access_type
  • record_download_option_provided
  • record_format
  • Is it agency_supplied and agency_originated? If not, who are the supplier and originator?
  • coverage_start
  • coverage_end
  • portal_type
  • scraper_url
  • readme_url

Still A tier, but rarely published:

  • retention_schedule
  • update_frequency
  • source_last_updated

B tier

  • size
  • update_method
  • sort_method
  • access_restrictions
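
A rough sketch of how the tiers above could be used to check which properties a data source record still lacks. The `PROPERTY_TIERS` mapping and the `missing_properties` helper are hypothetical names, not part of this repo, and the record is assumed to be a plain dict keyed by property name. The S tier is left out because no S-tier properties are listed here.

```python
from typing import Any

# Tier lists copied from this issue. PROPERTY_TIERS is a hypothetical name,
# and the S tier is omitted because the issue lists no S-tier properties.
PROPERTY_TIERS: dict[str, list[str]] = {
    "A": [
        "description", "aggregation_type", "access_type",
        "record_download_option_provided", "record_format",
        "agency_supplied", "agency_originated",
        "coverage_start", "coverage_end", "portal_type",
        "scraper_url", "readme_url",
    ],
    "A (rarely published)": [
        "retention_schedule", "update_frequency", "source_last_updated",
    ],
    "B": ["size", "update_method", "sort_method", "access_restrictions"],
}


def missing_properties(record: dict[str, Any]) -> dict[str, list[str]]:
    """Return, per tier, the properties that are absent or empty in record.

    Only None and the empty string count as missing, so a value of False
    (for example agency_supplied = False) is treated as identified.
    """
    report: dict[str, list[str]] = {}
    for tier, names in PROPERTY_TIERS.items():
        missing = [name for name in names if record.get(name) in (None, "")]
        if missing:
            report[tier] = missing
    return report
```

Calling `missing_properties(record)` on a partially identified source returns only the tiers that still have gaps, which could help decide what to look for next.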

Related reading

https://github.com/palewire/storysniffer/
http://blog.apps.npr.org/2016/06/17/scraping-tips.html
