Hilmar Lapp edited this page Oct 7, 2013 · 3 revisions

Goals

  1. Scanning: For a subset of the data, scan the original pages at high resolution so that the pages can be archived.
  2. Transcription: Transcribe the tables in the scanned images into tabular data.
  3. Educational materials: Research possibilities at NESCent or in the Triangle region for development of educational resources using the data.

Software Requirements

  • data needs to be in a reusable format (CSV or spreadsheet files), not simply scanned images
  • need to have appropriate metadata for re-use
  • data should be available publicly

DryadLab pilot project

We scanned the initial datasets, labelled "Park 1948, 29%-70%, T. castaneum Controls, 7 replicates" and "Park 1948, 29%-70%, T. confusum Controls, 15 replicates". In total, there were 66 pages:

  • 21 pages for T. castaneum (7 half pages and 14 full pages)
  • 45 pages for T. confusum (15 half pages and 30 full pages)

Mechanical Turk

After scanning, we used Amazon Mechanical Turk to transcribe the data from the scanned images into Google spreadsheets. There is a separate page with details about Mechanical Turk.

The digitization involved the following steps:

  1. create a spreadsheet with only the header row (see below) as a template. Set permissions to "Anyone with the link can edit".
  2. put all of the jpg files (scans) in a publicly accessible location
  3. create a new spreadsheet for each input file (scanned page) by copying the template using the Google Docs API
  4. create an MTurk human intelligence task (HIT) for each combination of jpg image and Google Spreadsheet
  5. download the spreadsheets in CSV format using the Docs API
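
Step 4 relies on a strict one-to-one pairing between scanned pages and spreadsheets. The following is a minimal sketch of that pairing only, assuming the image URLs and spreadsheet URLs are already known. The function names and the field names `image_url` and `spreadsheet_url` are hypothetical, and the sketch does not call the Google Docs API or the MTurk API. Writing the pairs to a CSV file is one possible way to hand them to a HIT template; the notes above do not say how the HITs were actually created.

```python
import csv
import io


def build_hit_records(image_urls, spreadsheet_urls):
    """Pair each scanned page with its own spreadsheet, one record per HIT."""
    if len(image_urls) != len(spreadsheet_urls):
        raise ValueError("need exactly one spreadsheet per scanned page")
    return [
        {"image_url": image, "spreadsheet_url": sheet}
        for image, sheet in zip(image_urls, spreadsheet_urls)
    ]


def records_to_csv(records):
    """Serialize the HIT records as CSV text with a header row."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["image_url", "spreadsheet_url"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    images = ["https://example.org/scans/page01.jpg"]
    sheets = ["https://example.org/sheets/page01"]
    print(records_to_csv(build_hit_records(images, sheets)))
```

Raising an error on a length mismatch is a deliberate choice here: a page without its own spreadsheet (or the reverse) would otherwise be dropped silently by `zip`.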

Quality Control

Each transcribed table has 9 or 10 columns, depending on whether the small (column 4) and medium (column 5) counts have been combined into one column. Within a given page, the combination may apply to some rows (by merging the cells) or to all rows (by reducing the number of columns).

Date | Age | Obsr. | small | medium | large | Sum | Pupae | Imago | Sum Total


The inclusion of sums in the original data gives us a means of checking the data entry without doing double entry:

  • Sum (column 7) = small + medium + large.
  • Sum Total = Sum + Pupae + Imago.
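
The two checks above can be sketched as a small script that runs over a downloaded CSV file. This is a minimal sketch, not the project's actual quality-control code. The function names and lower-case field names are assumptions derived from the header row, a 9-column row is assumed to hold the combined small and medium count in column 4, and a blank small or medium cell in a 10-column row is assumed to come from a merged cell and is counted as 0.

```python
import csv
import io

# Field names are assumptions based on the header row of the template.
COLUMNS_10 = ["date", "age", "obsr", "small", "medium", "large",
              "sum", "pupae", "imago", "sum_total"]
COLUMNS_9 = ["date", "age", "obsr", "small_medium", "large",
             "sum", "pupae", "imago", "sum_total"]


def to_count(cell, blank_ok=False):
    """Parse a count; a blank cell is 0 only where merging is plausible."""
    cell = cell.strip()
    if cell == "" and blank_ok:
        return 0
    return int(cell)  # raises ValueError for blank or non-numeric cells


def check_row(cells):
    """Return a list of problems found in one transcribed data row."""
    if len(cells) == 10:
        row = dict(zip(COLUMNS_10, cells))
        larval_fields = ["small", "medium", "large"]
    elif len(cells) == 9:
        row = dict(zip(COLUMNS_9, cells))
        larval_fields = ["small_medium", "large"]
    else:
        return [f"expected 9 or 10 columns, got {len(cells)}"]
    try:
        larvae = [to_count(row[f], blank_ok=f in ("small", "medium"))
                  for f in larval_fields]
        total_larvae = to_count(row["sum"])
        pupae = to_count(row["pupae"])
        imago = to_count(row["imago"])
        sum_total = to_count(row["sum_total"])
    except ValueError:
        return ["blank or non-integer count"]
    problems = []
    if sum(larvae) != total_larvae:
        problems.append("Sum does not equal small + medium + large")
    if total_larvae + pupae + imago != sum_total:
        problems.append("Sum Total does not equal Sum + Pupae + Imago")
    return problems


def check_csv(text):
    """Check every data row of a CSV string, skipping the header row."""
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # header row
    flagged = {}
    for line_no, cells in enumerate(reader, start=2):
        problems = check_row(cells)
        if problems:
            flagged[line_no] = problems
    return flagged
```

Rows flagged by `check_csv` would still need to be compared against the scanned page by hand, since a failed check can reflect either a transcription error or an arithmetic error in the original record.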

Data from pilot project




Experiments 1 and 2

Links & References