Technical overview


Goals

Scanning: For a subset of the data, scan the original pages at high resolution so that the pages can be archived.
Transcription: Transcribe the tables in the scanned images into tabular data.
Educational materials: Investigate opportunities at NESCent or in the Triangle region to develop educational resources using the data.

Software Requirements

Data needs to be in a reusable format (CSV or spreadsheet files), not simply scanned images.
Data needs appropriate metadata for reuse.
Data should be publicly available.

DryadLab pilot project

We scanned the initial datasets, labelled "Park 1948, 29%-70%, T. castaneum Controls, 7 replicates" and "Park 1948, 29%-70%, T. confusum Controls, 15 replicates". In total, there were 66 pages:

21 pages for T. castaneum (7 half pages and 14 full pages)
45 pages for T. confusum (15 half pages and 30 full pages)

Mechanical Turk

After scanning, we used Amazon Mechanical Turk to transcribe the data from the scanned images into Google spreadsheets. Details about Mechanical Turk are on a separate page.

The digitization involved the following steps:

Create a spreadsheet containing only the header row (see below) as a template. Set its permissions to "Anyone with the link can edit".
Put all of the JPG files (scans) in a publicly accessible location.
Create a new spreadsheet for each input file (scanned page) by copying the template using the Google Docs API.
Create an MTurk human intelligence task (HIT) for each pairing of a JPG image with its Google spreadsheet.
Download the completed spreadsheets as CSV files using the Docs API.
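
The pairing of images and spreadsheets in the HIT step can be prepared as a simple two-column CSV file, one row per HIT. The sketch below is only an illustration of that pairing, not the script used in the pilot: the column names image_url and spreadsheet_url, the function names, and the example.org URLs are all hypothetical, and it assumes the image and spreadsheet lists are already in matching order.

```python
import csv
import io


def build_hit_rows(image_urls, spreadsheet_urls):
    """Pair each scanned image with its own spreadsheet copy, one pair per HIT."""
    if len(image_urls) != len(spreadsheet_urls):
        raise ValueError("each image needs exactly one spreadsheet")
    return [
        {"image_url": image, "spreadsheet_url": sheet}
        for image, sheet in zip(image_urls, spreadsheet_urls)
    ]


def write_hit_csv(rows, out):
    """Write the pairs as CSV with a header row (column names are hypothetical)."""
    writer = csv.DictWriter(out, fieldnames=["image_url", "spreadsheet_url"])
    writer.writeheader()
    writer.writerows(rows)


# Placeholder URLs for illustration only.
images = ["http://example.org/scans/page_001.jpg",
          "http://example.org/scans/page_002.jpg"]
sheets = ["http://example.org/sheets/page_001",
          "http://example.org/sheets/page_002"]

buffer = io.StringIO()
write_hit_csv(build_hit_rows(images, sheets), buffer)
```

Keeping one row per HIT makes it easy to confirm that the number of HITs matches the number of scanned pages before any tasks are published.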

Quality Control

The tabular data has 9 or 10 columns, depending on whether small (column 4) and medium (column 5) have been combined into one column. On a given page, the two may be combined for only some rows (by merging the cells) or for all rows (by reducing the number of columns).

Date	Age	Obsr.	small	medium	large	Sum	Pupae	Imago	Sum Total
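
As a rough sketch of how a transcribed row could be labelled in either layout, the function below maps the cells of one row to column names. The 10-column names come from the header above; the name "small+medium" for the combined column in the 9-column layout is an assumption made for illustration, as is the function name label_row.

```python
HEADER_10 = ["Date", "Age", "Obsr.", "small", "medium", "large",
             "Sum", "Pupae", "Imago", "Sum Total"]

# Assumed layout when small and medium share one column (hypothetical name).
HEADER_9 = ["Date", "Age", "Obsr.", "small+medium", "large",
            "Sum", "Pupae", "Imago", "Sum Total"]


def label_row(cells):
    """Return a dict mapping column names to cell text for a 9- or 10-cell row."""
    if len(cells) == 10:
        header = HEADER_10
    elif len(cells) == 9:
        header = HEADER_9
    else:
        raise ValueError(f"expected 9 or 10 cells, got {len(cells)}")
    # Strip stray whitespace that transcribers may have left in a cell.
    return dict(zip(header, (cell.strip() for cell in cells)))
```

Rejecting any other row length early flags pages where a column was accidentally added or dropped during transcription.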

The inclusion of sums in the original data gives us a means of checking the data entry without doing double entry:

Sum (column 7) = small + medium + large.
Sum Total = Sum + Pupae + Imago.
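
A minimal sketch of this check for a single transcribed row is shown below. It is an illustration, not the pilot's actual script: the function names are hypothetical, and it assumes the first three cells are Date, Age and Obsr., that all remaining cells are whole-number counts, and that a blank cell (for example, one half of a merged small/medium pair) counts as 0.

```python
def to_int(cell):
    """Treat a blank cell as 0; otherwise parse an integer count."""
    cell = cell.strip()
    return int(cell) if cell else 0


def check_row(cells):
    """Return a list of problems found in one row (empty if the sums agree)."""
    if len(cells) not in (9, 10):
        return [f"expected 9 or 10 cells, got {len(cells)}"]
    try:
        # Skip Date, Age and Obsr.; every remaining cell is a count.
        counts = [to_int(cell) for cell in cells[3:]]
    except ValueError as exc:
        return [f"non-numeric count: {exc}"]

    # The last four counts are Sum, Pupae, Imago and Sum Total; whatever
    # precedes them is the larval size classes (two or three columns).
    *larvae, larval_sum, pupae, imago, sum_total = counts

    problems = []
    if sum(larvae) != larval_sum:
        problems.append(
            f"Sum is {larval_sum}, but the larval counts add up to {sum(larvae)}")
    expected_total = larval_sum + pupae + imago
    if expected_total != sum_total:
        problems.append(
            f"Sum Total is {sum_total}, but Sum + Pupae + Imago is {expected_total}")
    return problems
```

Rows that return a non-empty list can be sent back for review against the scanned image. A row can pass both checks and still contain errors that cancel out, so this reduces the need for double entry rather than replacing it entirely.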

Data from pilot project

scanned images, e.g. sample image
csv files, e.g.: http://www.nescent.org/Park_notebooks/Duke_Libraries/csv/01/ftbms010010010_1.csv

Experiments 1 and 2

See Process for experiments 1 and 2

Links & References

Thomas Park (1948). Interspecies Competition in Populations of Tribolium confusum Duval and Tribolium castaneum Herbst. Ecological Monographs, Vol. 18(2), pp. 265-307.
Google Docs Python developers guide
