City of Denton Datasets

Issue tracker for City of Denton Datasets

Purpose

This repo provides a collaborative way to assess and discuss the new datasets published by the City of Denton. Using the built-in CKAN API on the city's platform (data.cityofdenton.com), we autogenerated an issue for each available dataset. This lets us sort, discuss, and report findings on these datasets back to the city's technical staff, not just to improve them for Open Data Day, but for all future open data purposes as well.

Inside each issue you will see some basic data about the dataset, a link to the dataset on the website, and a summary of the resources for each dataset.
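As a rough illustration of how an issue could be derived from CKAN metadata, here is a minimal Python sketch. It is not the script in this repo. It assumes the portal exposes the standard CKAN Action API (`package_show`) under `/api/3/action/` and is reachable over plain HTTP; the function names and the issue layout are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: standard CKAN Action API path on the city's portal.
SITE = "http://data.cityofdenton.com"
CKAN_BASE = SITE + "/api/3/action"


def fetch_package(name, base=CKAN_BASE):
    """Fetch one dataset's metadata with CKAN's package_show action."""
    url = f"{base}/package_show?id={urllib.parse.quote(name)}"
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["result"]


def build_issue(package, site=SITE):
    """Turn a CKAN package dict into an issue title and body (hypothetical layout)."""
    title = package.get("title") or package["name"]
    lines = [
        package.get("notes") or "(no description)",
        "",
        f"Dataset page: {site}/dataset/{package['name']}",
        "",
        "Resources:",
    ]
    # One summary line per resource: name, format, and download URL.
    for resource in package.get("resources", []):
        label = resource.get("name") or "(unnamed)"
        fmt = resource.get("format") or "unknown format"
        lines.append(f"- {label} ({fmt}): {resource.get('url', '')}")
    return title, "\n".join(lines)
```

`build_issue` works on any dict shaped like a CKAN package, so it can be tried offline before calling `fetch_package` against the live portal.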

Expected of Each Data Set

Each data set should be usable in a database and should include variables on which it can be merged with other data sets. Barring that, the data should be geocoded (tagged with geographic positioning data, such as latitude and longitude, or state plane coordinates).

To ensure these data are useful without additional munging, the following should be true:

  • Can be imported into any of the usual RDBMSes (MySQL/MariaDB, PostgreSQL, FileMaker Pro, MS Access). Practically, this means:

    • the columns are delimited consistently or (less desirably) laid out as correctly aligned fixed-width columns
    • variable types aren't mixed within columns (no text in integer columns, etc.)
    • text strings are quoted (this isn't strictly required, provided the text itself contains no commas)
    • ideally, there is exactly one header row, but it's reasonable for some datasets to have two-line records or to lack a header row, provided there is a correct codebook
  • Has a unique identifier or a combination of columns that can be assembled into a unique identifier (e.g., day/date/month)

  • Ideally, there is an import file or files for commonly available databases and/or stats packages like R, Stata, PostgreSQL, and MySQL/MariaDB

  • Has a codebook describing each data variable (field), variable type, and, where needed, column width

  • The user/committer has actually imported the data and confirmed that it worked without errors

  • If the data are geocoded, they should be easy to import into QGIS and/or ArcGIS, either directly or by merging with an existing GIS coverage
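Several of the checks above can be automated before attempting a real database import. The following Python sketch is one possible approach, not an official validator: it assumes a comma-delimited file with exactly one header row, and the names `classify` and `check_csv` are hypothetical. It reports rows with an inconsistent column count, duplicate identifiers, and columns that mix numeric and text values.

```python
import csv


def classify(value):
    """Classify a cell as 'empty', 'int', 'float', or 'text'."""
    text = value.strip()
    if text == "":
        return "empty"
    for kind, cast in (("int", int), ("float", float)):
        try:
            cast(text)
            return kind
        except ValueError:
            continue
    return "text"


def check_csv(handle, key_columns):
    """Return a list of human-readable problems found in a delimited file."""
    reader = csv.reader(handle)
    header = next(reader)
    problems = []
    kinds = [set() for _ in header]
    seen_keys = set()
    key_indexes = [header.index(name) for name in key_columns]

    # The header is row 1, so data rows are numbered from 2.
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: expected {len(header)} columns, found {len(row)}"
            )
            continue
        for index, cell in enumerate(row):
            kinds[index].add(classify(cell))
        key = tuple(row[i] for i in key_indexes)
        if key in seen_keys:
            problems.append(f"row {row_no}: duplicate identifier {key}")
        seen_keys.add(key)

    for name, found in zip(header, kinds):
        found.discard("empty")
        if found == {"int", "float"}:
            continue  # integers in a float column are still numeric
        if len(found) > 1:
            problems.append(f"column {name!r}: mixed types {sorted(found)}")
    return problems
```

Pass an open file (or an `io.StringIO`) plus the columns that should form the unique identifier, for example `check_csv(open("data.csv", newline=""), ["id"])`, where `data.csv` and `id` are placeholders. An empty list means none of these particular problems were found; it does not replace actually importing the data.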

What's Missing? Is It Useful?

It's possible that the data are usable but less useful than they could be, if, for instance, the data were in a different format, or if they contained an additional variable or level of analysis. A "wish list" for reasonable changes or upgrades is also potentially useful for the data catalog, moving forward.

Additionally, a measure of usability is valuable. Please tag each data set with "usable", "unusable", "needs repair", etc.

Issue / Data fetching script

If you want to play around, or even use this data script for something else, you can clone the repo and use Composer to install the dependencies.

$ git clone git@github.com:OpenDenton/City-of-Denton-Datasets.git

If you don't have Composer installed, install it first.

$ brew install composer

or

$ curl -sS https://getcomposer.org/installer | php

Then install the dependencies from the root of the repository.

$ composer install

From here, you'll need to generate a new GitHub token and replace the demo token in the script.

$token = 'NEW TOKEN HERE';
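For readers curious about what the token is used for, the Python sketch below builds (without sending) a GitHub REST API request that creates an issue. The repo's own script is PHP and may work differently; this is only an illustration, and the function name `build_issue_request` is hypothetical.

```python
import json
import urllib.request


def build_issue_request(repo, title, body, token, labels=()):
    """Build (but do not send) a GitHub REST API request that creates an issue."""
    payload = {"title": title, "body": body, "labels": list(labels)}
    return urllib.request.Request(
        f"https://api.github.com/repos/{repo}/issues",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            # The personal access token authenticates the request.
            "Authorization": f"token {token}",
            "Accept": "application/vnd.github+json",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(request)` would create the issue. In a script like this, reading the token from an environment variable (for example a hypothetical `GITHUB_TOKEN`) is one way to keep it out of version control.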

Tags

  • unusable: any of the following conditions prevents the data from being used:
    • the data set cannot be read by a computer program
    • there are no definitions for variables
    • text data are not represented by either ASCII or Unicode
  • complete: includes a codebook and machine-usable data, has no missing data, and the columns match the data and expected data types
  • incomplete: missing some element, which makes the data imperfect or difficult to use; potentially usable but, obviously, not totally finished
  • gis: data include a geographic component (whether or not they are geocoded)
  • nocodebook: it is not possible to know what the values in a given column mean, for certain, because there is no supporting documentation for the data set
  • pdf: data are in PDF format, either machine-readable or scanned images
  • xls: data are in some Excel version format
  • text: data are provided in an ASCII text format
  • needsupdate: time series data are missing years (especially the most recent years)
  • needscleaning: data may be usable, but values within columns may be inconsistent or would benefit from being broken out into additional variables (a text field with "Males, 21-34, Caucasian - Non-Hispanic" should really be at least four fields)
  • personalinfo: data contain Personally Identifiable Information (PII) such as names and addresses
  • unclearvariables: applies when column headers are imprecise or ambiguous, even if they are in English
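To make the needscleaning example concrete, here is a small Python sketch that breaks the combined text field from that tag's description into separate variables. It assumes every value follows exactly the "sex, low-high, race - ethnicity" pattern shown there; the function and field names are hypothetical, and real data would need handling for values that do not fit the pattern.

```python
def split_demographics(value):
    """Split 'Males, 21-34, Caucasian - Non-Hispanic' into separate fields."""
    # Three comma-separated parts: sex, age range, race and ethnicity.
    sex, age_range, race_ethnicity = [part.strip() for part in value.split(",")]
    # The age range is 'low-high'.
    age_min, age_max = (int(number) for number in age_range.split("-"))
    # Split on ' - ' (with spaces) so the hyphen in 'Non-Hispanic' is preserved.
    race, ethnicity = [part.strip() for part in race_ethnicity.split(" - ")]
    return {
        "sex": sex,
        "age_min": age_min,
        "age_max": age_max,
        "race": race,
        "ethnicity": ethnicity,
    }
```

Storing the age bounds as two integer columns rather than one text column also keeps variable types from mixing, which is one of the import expectations listed earlier.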