
Data Central: How to Contribute, Sources, and Data Dictionaries

As our work continues to expand, this will be a central repository to document summaries, sources, and field names for all our data sets. Data is housed in our repo on data.world.

How Do I Contribute Data?


We're glad you asked!

If you have a data source that would help with our objectives, we'd be grateful to have it. Here's an overview of how to most effectively contribute. Please join the discussion on our Slack channel - our group would love to work with you. (If you're not already in the Data for Democracy Slack team, you'll need an invitation - more info here.)

  1. Tidy the data, using lower_snake_case for variable and file names and ISO format (YYYY-MM-DD) for dates. Also keep in mind these best practices from data.world. We prefer CSV format; feather format is also very useful (feel free to add both).
  2. Fork this repo and request to be a contributor to our dataset on data.world, if you haven't already.
  3. Submit a pull request to this repo including the following:
    • In either python/datawrangling or R/datawrangling, as appropriate, add any script(s) you used to scrape, tidy, etc. (If you have multiple scripts, feel free to create a subdirectory.) Be specific when you name the scripts and directories: for example, scrape_druglist_from_genomejp.py is better than drugscraping.py.
    • In /datadictionaries, add a data dictionary for your data source named [datasource].md. We have a data dictionary template; for more specifics, check out the other dictionaries available in this folder.
    • Edit this README with a short overview of your dataset.
  4. Once the PR is reviewed by our maintainers and merged, upload your final dataset to data.world and label it "clean data" (click on Edit). Add a link to the data dictionary in the Description field. (If you'd rather not join data.world, a maintainer can do this as well. It's a fun place, though!)
    • If you'd like to add the raw data as well (e.g., XLSX files), feel free; make sure to label it "raw data."
    • Bonus points: Edit the info for each field in your data.world dataset with a detailed description.
  5. Submit a PR to update this overview file (this can be done by you or maintainers).
  6. Receive our grateful thanks, likely including emoji.
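
As a rough illustration of the naming and date conventions in step 1, here is a minimal Python sketch using only the standard library. The function names and the default input date format (`%m/%d/%Y`) are assumptions made for this example, not part of the project's tooling; adjust them to match your source data.

```python
import re
from datetime import datetime


def to_lower_snake_case(name: str) -> str:
    """Convert a column or file name such as 'Brand Name' to 'brand_name'."""
    # Insert an underscore at camelCase boundaries (e.g. totalSpending).
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def to_iso_date(value: str, input_format: str = "%m/%d/%Y") -> str:
    """Reformat a date string as ISO YYYY-MM-DD.

    input_format is an assumption about the raw data; pass the format
    your source actually uses.
    """
    return datetime.strptime(value.strip(), input_format).date().isoformat()


if __name__ == "__main__":
    print(to_lower_snake_case("Total Spending (USD)"))
    print(to_iso_date("03/07/2015"))
```

Applying `to_lower_snake_case` to every header in a CSV before saving it keeps field names consistent with the data dictionaries.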

Overview of Currently Available Datasets

All datasets are available in our repo on data.world. If individual datasets can be queried, direct links are included.


Formats: XLSX (original); CSV, feather (tidied)
Original Source: US Centers for Medicare and Medicaid Services (CMS.gov)

This is the data that initially inspired our project.

The Excel file contains aggregate data for total and average spending by Medicare and by consumers, as well as total and average number of claims, for each brand name drug by year. Generic names are also included.

In our data.world repo, the original file has been tidied and split into one dataset per year, available in both .csv and .feather format; these are titled, for example, spending-2011.feather. We also have a feather file containing solely the unique brand names + generic names included in all five years of data (drugnames.feather).

Links to full data dictionaries: 2011 2012 2013 2014 2015
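
The per-year split described above could be sketched roughly as follows. This is not the script used to produce the files on data.world; the field names `year`, `brand_name`, and `generic_name` are assumptions, and the sketch reads "unique names" as the union of name pairs across all years.

```python
from collections import defaultdict


def split_by_year(rows, year_field="year"):
    """Group row dictionaries by year; year_field is a hypothetical column name."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[str(row[year_field])].append(row)
    return dict(by_year)


def output_name(year, extension="csv"):
    """Build a file name following the spending-2011.feather pattern."""
    return f"spending-{year}.{extension}"


def unique_drug_names(rows, brand_field="brand_name", generic_field="generic_name"):
    """Return the sorted, de-duplicated (brand, generic) pairs across all rows."""
    return sorted({(row[brand_field], row[generic_field]) for row in rows})


if __name__ == "__main__":
    # Made-up sample rows, for illustration only.
    sample = [
        {"year": 2011, "brand_name": "Brand A", "generic_name": "generic a"},
        {"year": 2012, "brand_name": "Brand A", "generic_name": "generic a"},
        {"year": 2012, "brand_name": "Brand B", "generic_name": "generic b"},
    ]
    for year, year_rows in split_by_year(sample).items():
        print(output_name(year), len(year_rows))
    print(unique_drug_names(sample))
```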


Formats: KEG (original); CSV (tidied)
Original Source: www.genome.jp

The Anatomical Therapeutic Chemical Classification System, maintained by the WHO, is used to classify drugs based on both the organ or system on which they act and their therapeutic, pharmacological and chemical properties. Procuring the codes from WHO is prohibitively expensive; our dataset is scraped from www.genome.jp.

Link to full data dictionary [in progress]
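
To make the classification levels concrete: a full ATC code has seven characters that nest five levels, from the anatomical main group down to the chemical substance. The sketch below splits a full code into those levels. The function and key names are chosen for this example and are not field names from our dataset; it handles only complete seven-character codes.

```python
import re

# Full ATC code layout: letter, two digits, letter, letter, two digits.
ATC_PATTERN = re.compile(r"^([A-Z])(\d{2})([A-Z])([A-Z])(\d{2})$")


def atc_levels(code: str) -> dict:
    """Split a full seven-character ATC code into its five nested levels."""
    match = ATC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a full seven-character ATC code: {code!r}")
    anatomical, therapeutic, pharmacological, chemical, substance = match.groups()
    level2 = anatomical + therapeutic
    level3 = level2 + pharmacological
    level4 = level3 + chemical
    return {
        "anatomical_main_group": anatomical,
        "therapeutic_subgroup": level2,
        "pharmacological_subgroup": level3,
        "chemical_subgroup": level4,
        "chemical_substance": level4 + substance,
    }


if __name__ == "__main__":
    # A10BA02 is the ATC code for metformin.
    print(atc_levels("A10BA02"))
```

Grouping spending by a shorter prefix (for example, the therapeutic subgroup) is one way to aggregate drugs that act on the same system.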


3. FDA-Approved Drugs

Formats: JSON
Original Source: Center Watch

This dataset contains a list of FDA-approved drugs, their approval date, manufacturer, and specific purpose.


Formats: CSV, feather
Original Source: n/a

This is a first pass at a crosswalk between the ATC codes and Medicare Part D spending data. Work to finalize this is welcome!

Link to full data dictionary [in progress]
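
For anyone picking up the crosswalk work, one possible starting point is an exact match on normalized drug names. This is only a sketch of that idea, not the method used for the existing first pass; the field names `generic_name`, `drug_name`, and `atc_code` are assumptions and may not match the actual files.

```python
def normalize_name(name: str) -> str:
    """Lowercase a drug name and collapse whitespace so near-identical names match."""
    return " ".join(name.lower().split())


def build_crosswalk(spending_rows, atc_rows,
                    spending_field="generic_name",
                    atc_name_field="drug_name",
                    atc_code_field="atc_code"):
    """Match spending rows to ATC codes by normalized name.

    Returns (matched, unmatched): matched is a list of dictionaries pairing a
    generic name with one ATC code; unmatched lists names with no ATC entry.
    """
    codes_by_name = {}
    for row in atc_rows:
        key = normalize_name(row[atc_name_field])
        codes_by_name.setdefault(key, []).append(row[atc_code_field])

    matched, unmatched = [], []
    for row in spending_rows:
        name = row[spending_field]
        codes = codes_by_name.get(normalize_name(name))
        if codes:
            # A drug can carry more than one ATC code; keep every pairing.
            matched.extend({"generic_name": name, "atc_code": c} for c in codes)
        else:
            unmatched.append(name)
    return matched, unmatched
```

The `unmatched` list is the useful output for follow-up work: those names need manual review or a fuzzier matching rule.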


Formats: CSV
Original Source: CMS.gov

This dataset contains the information you'd need to link specific drugs and their dosages to the manufacturer - helpful for creating a path from Medicaid spending to lobbying efforts. Brand name and generic or descriptive names are both offered, as well as dosage and package size. Further, there are identifying codes for each drug (HCPCS and NDC).


6. Medical Expenditure Panel Survey (too large for direct query link)

Formats: zip, CSV, feather
Original Source: meps.ahrq.gov

Description in progress.

Link to full data dictionary [in progress]


Formats: CSV
Original Source: OpenSecrets

OpenSecrets has data on lobbying transactions from pharmaceutical companies and their subsidiaries, totaled by year.

Link to full data dictionary [in progress]


Formats: text, CSV
Original Source: KEGG ("USP drug classification" in the drop-down menu)

The US Pharmacopeial Convention Drug Classification (USP DC) system. Contains category and class information on outpatient drugs available in the US market. It is still to be determined whether the data is limited to Part D eligible drugs, though that seems unlikely: "The USP DC is intended to be complementary to the USP MMG and is developed with similar guiding principles, taxonomy, and structure of the USP Categories and Classes."

Link to full data dictionary: usp_drug_classification.md