Technical overview
- Scanning: For a subset of the data, scan the original pages at high resolution in order to be able to archive the pages.
- Transcription: Transcribe the tables in the scanned images into tabular data.
- Educational materials: Research possibilities at NESCent or in the Triangle region for development of educational resources using the data.
- data needs to be in a re-usable format (csv or spreadsheet files), not simply scanned images
- need to have appropriate metadata for re-use
- data should be available publicly
We scanned the initial datasets, labelled "Park 1948, 29%-70%, T. castaneum Controls, 7 replicates" and "Park 1948, 29%-70%, T. confusum Controls, 15 replicates". In total, there were 66 pages:
- 21 pages for T. castaneum (7 half pages and 14 full pages)
- 45 pages for T. confusum (15 half pages and 30 full pages)
After scanning, we used Amazon Mechanical Turk (MTurk) to transcribe the data from the scanned images into Google spreadsheets. Details about Mechanical Turk are on a separate page.
The digitization involved the following steps:
- create a spreadsheet with only the header row (see below) as a template. Set permissions to "Anyone with the link can edit".
- put all of the jpg files (scans) in a publicly accessible location
- create a new spreadsheet for each input file (scanned page) by copying the template using the Google Docs API
- create an MTurk human intelligence task (HIT) for each pairing of a jpg image with its Google spreadsheet
- download the spreadsheets into csv format using the Docs API
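As a rough illustration of the pairing step above, the following sketch builds one row per HIT, matching each scanned image to its own spreadsheet, and writes the rows as CSV. The column names `image_url` and `spreadsheet_url`, the helper names, and the `example.org` URLs are all hypothetical; the copying and downloading of spreadsheets through the Docs API is not shown.

```python
import csv
import io

# Hypothetical column names for a per-HIT input file.
FIELDNAMES = ["image_url", "spreadsheet_url"]


def build_hit_rows(image_urls, spreadsheet_urls):
    """Pair each scanned image with its own spreadsheet, one row per HIT."""
    if len(image_urls) != len(spreadsheet_urls):
        raise ValueError("need exactly one spreadsheet per scanned image")
    return [
        {"image_url": image, "spreadsheet_url": sheet}
        for image, sheet in zip(image_urls, spreadsheet_urls)
    ]


def write_hit_csv(rows, handle):
    """Write the HIT rows, with a header line, to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder URLs for illustration only.
images = [
    "https://example.org/scans/page_001.jpg",
    "https://example.org/scans/page_002.jpg",
]
sheets = [
    "https://example.org/sheets/page_001",
    "https://example.org/sheets/page_002",
]

buffer = io.StringIO()
write_hit_csv(build_hit_rows(images, sheets), buffer)
print(buffer.getvalue())
```

Keeping the pairing in a plain list of rows makes it easy to confirm that every scan has exactly one spreadsheet before any tasks are created.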
The tabular data has 9 or 10 columns, depending on whether small (column 4) and medium (column 5) have been combined into one column. In a given page, the two may be combined for only some rows (by merging the cells) or for all rows (by reducing the number of columns).
| Date | Age | Obsr. | small | medium | large | Sum | Pupae | Imago | Sum Total |
|---|---|---|---|---|---|---|---|---|---|
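A minimal sketch of how a transcribed row could be labelled under either layout follows. The 10-column names come from the header row; the `small+medium` name for the combined column and the `label_row` helper are assumptions made for illustration, and rows where cells were merged inside a 10-column sheet are not handled here.

```python
# Column names taken from the template header row.
COLUMNS_10 = [
    "Date", "Age", "Obsr.", "small", "medium", "large",
    "Sum", "Pupae", "Imago", "Sum Total",
]

# Hypothetical name for the column that combines small and medium.
COLUMNS_9 = [
    "Date", "Age", "Obsr.", "small+medium", "large",
    "Sum", "Pupae", "Imago", "Sum Total",
]


def label_row(cells):
    """Map a list of 9 or 10 cell values to a dict keyed by column name."""
    if len(cells) == len(COLUMNS_10):
        return dict(zip(COLUMNS_10, cells))
    if len(cells) == len(COLUMNS_9):
        return dict(zip(COLUMNS_9, cells))
    raise ValueError(f"expected 9 or 10 columns, got {len(cells)}")
```

Labelling rows by name rather than by position means later code does not need to know which of the two layouts a page used.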
The inclusion of sums in the original data gives us a means of checking the data entry without doing double entry:
- Sum (column 7) = small + medium + large.
- Sum Total = Sum + Pupae + Imago.
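The two checks above can be sketched as a small validator that takes one transcribed row as a list of strings. This is an illustrative sketch, not the script used in the project: the `check_row` and `to_int` names are hypothetical, and treating a blank cell as 0 is an assumption that may not match how the notebooks record missing counts.

```python
def to_int(cell):
    """Convert a cell to an integer count.

    Assumption: a blank cell means a count of 0.
    """
    cell = cell.strip()
    return int(cell) if cell else 0


def check_row(cells):
    """Return a list of problems found in one row (empty if the row is consistent)."""
    if len(cells) == 10:
        # Date, Age, Obsr., small, medium, large, Sum, Pupae, Imago, Sum Total
        size_cells, rest = cells[3:6], cells[6:]
    elif len(cells) == 9:
        # small and medium combined into a single column
        size_cells, rest = cells[3:5], cells[5:]
    else:
        return [f"expected 9 or 10 columns, got {len(cells)}"]

    try:
        size_counts = [to_int(c) for c in size_cells]
        row_sum, pupae, imago, sum_total = (to_int(c) for c in rest)
    except ValueError as exc:
        return [f"non-numeric count: {exc}"]

    problems = []
    if sum(size_counts) != row_sum:
        problems.append(
            f"Sum is {row_sum}, but the size classes add up to {sum(size_counts)}"
        )
    if row_sum + pupae + imago != sum_total:
        problems.append(
            f"Sum Total is {sum_total}, but Sum + Pupae + Imago = "
            f"{row_sum + pupae + imago}"
        )
    return problems
```

A row that fails either check is a candidate for re-transcription. A row that passes both could still contain compensating errors, so the checks reduce the need for double entry rather than guarantee correctness.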
- scanned images, e.g. sample image
- csv files, e.g.: http://www.nescent.org/Park_notebooks/Duke_Libraries/csv/01/ftbms010010010_1.csv
- Thomas Park (1948). Interspecies Competition in Populations of Trilobium confusum Duval and Trilobium castaneum Herbst. Ecological Monographs, Vol. 18(2), pp. 265-307.
- Google Docs python developers guide