This repository was archived by the owner on Jan 25, 2024. It is now read-only.

The Evolution of Darwin's Notes

Mike Caprio edited this page Mar 27, 2017 · 35 revisions

Use Computer Vision to Explore Darwin's Own "Excised" Notes on Evolution and Piece Them Together

Hackathon Findings

The idea of using computer vision and machine learning to detect and match edges of manuscript fragments has been fully proven! The DarWIN team continues to work on finding matches and has formed a Special Interest Group working directly with the Darwin Manuscripts Project at the museum.

Hackathon Projects

Background

The digital-images repository of the Darwin Manuscripts Project at AMNH currently holds 25,540 high-resolution digitized images of Charles Darwin's handwritten scientific papers. These include the loose notes that constitute the raw material out of which Darwin wrote On the Origin of Species as well as his 15 other books devoted to solving the riddle of evolution. Many of these digitized images are fragments of no more than one or two pages.

[Images: Charles Darwin's study with notebooks in foreground, and his Metaphysical notebook]

Darwin was in the habit of frequently reorganizing his notes according to the topics that particularly interested him at a given time. This meant, for example, that a page of observations made in 1840 on bees visiting flowers might, in 1850, have been torn apart or cut up into two pieces and each piece put into a separate topical portfolio. One portfolio could be devoted to the behavior of bees and the other to the anatomy of flowers. In an age of only literal cutting and pasting, this was an important way for Darwin to shuffle and organize his thoughts and his observations.

Darwin had two characteristic ways of breaking up a note into separate pieces: cutting with scissors or tearing along a straight edge. These two methods yielded four basic edge types: (1) Cutting with scissors would sometimes produce two saw-toothed, and hence matchable, edges. (2) Scissors could also yield fairly smooth rounded edges with matching concave and convex pieces. (3) The tearing method usually yielded edges which, at least to the naked eye, are fairly indistinguishable from one another, unless he slipped and left an irregular tear. (4) Finally, tearing sometimes seems to have produced rather fuzzy edges. These four edge types are all illustrated in the accompanying PowerPoint & PDF documents in the repository.
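As a rough illustration of how the first three edge types might be told apart automatically, the sketch below assumes the paper boundary has already been traced into a one-dimensional profile (one pixel offset per column along the edge). The function names and every threshold are hypothetical and uncalibrated; this is not the method used by any hackathon team, and the fuzzy edge type is not handled.

```python
from statistics import mean, pstdev


def detrend(profile):
    """Subtract the least-squares straight line from a 1-D edge profile.

    Assumes the profile has at least two samples.
    """
    n = len(profile)
    x_mean = (n - 1) / 2
    y_mean = mean(profile)
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(profile))
    slope = sxy / sxx
    return [y - (y_mean + slope * (x - x_mean)) for x, y in enumerate(profile)]


def count_turns(values, min_step=0.5):
    """Count direction reversals, ignoring steps smaller than min_step."""
    turns = 0
    last_sign = 0
    for a, b in zip(values, values[1:]):
        step = b - a
        if abs(step) < min_step:
            continue
        sign = 1 if step > 0 else -1
        if last_sign and sign != last_sign:
            turns += 1
        last_sign = sign
    return turns


def classify_edge(profile, flat_tol=1.0, turns_per_100px=8):
    """Label a profile 'straight', 'saw-tooth' or 'curved'.

    flat_tol and turns_per_100px are illustrative guesses, not values
    calibrated on the Darwin images.
    """
    residual = detrend(profile)
    if pstdev(residual) < flat_tol:
        return "straight"
    turns = count_turns(residual)
    if turns * 100 / len(profile) >= turns_per_100px:
        return "saw-tooth"
    return "curved"
```

The idea is simply that a straight tear leaves almost nothing once the best-fit line is removed, a saw-toothed cut reverses direction many times per unit length, and a rounded cut deviates from the line but reverses only once or twice.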

The challenge is to focus on the cut edges and to devise a computerized tool to (1) detect all the types of cut pages in the Darwin digital repository and (2) match up pieces of Darwin manuscript that have long been separated. The number of torn or cut fragments is unknown, but our estimate is that, out of the roughly 25,000 Darwin manuscript images, there are anywhere from 250 to 2,500 visually distinctive scissor-cut edges (i.e. edge types 1 & 2) and far more type 3 & 4 fragments. Hence the challenge can be as focused or as broad as you prefer and may still produce useful results for the Darwin Project.

The ultimate goal is to digitally reassemble Darwin's notes to understand the chronology and order of his thought process, perhaps to discover how he moved from Lamarck to natural selection. Matching two long-separated pieces of paper can help historians of science understand the fine structure and development of Darwin’s thinking. In the past, the small number of such matches that have been made relied on the visual memory of the scholars working with the originals and patiently leafing through page upon page of manuscripts. Nevertheless, the matches that have been made in this tedious manner have, over the years, frequently thrown new light on the patterns and pathways of Darwin’s research. Of course, a tool of this kind could also serve as a general-purpose utility for other manuscripts that have been cut or torn over time, such as effaced and cut palimpsests like the Archimedes Palimpsest.


Solutions

At a minimum, we think the simplest deliverable would be two lists of image file names: (1) all saw-toothed images found in the repository and (2) all matches of saw-toothed fragments. Overall, the potential solutions we are looking for are:

  • Identify Matching Pieces (Create Possible Match Data). Using machine learning / computer vision, try to classify the particular types of edges. Search across the entire collection of 25,540 images (available on site at the hackathon via an external hard drive) and group the images into five folders according to the five edge types, which are illustrated as the selected image types in the challenge repository: Saw tooth, Curved cut, Fuzzy, Very fuzzy, Torn / Other. Find any matches between images in each of these folders. Report your results in a spreadsheet, where the file names for any two matched images appear in the same row: e.g.

  • Digital Reassembly - Visualize Strips Together. Create a tool that will allow an archivist or researcher to drag and drop the "puzzle pieces" around and see if they match; this would save the time and effort of manually leafing through pages. This tool would likely include some kind of "zoom" feature to carefully check how edges match.

  • Any other useful tool or utility you can imagine!
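One possible shape for the "possible match data" deliverable is sketched below. It assumes each fragment's cut edge has already been reduced to a one-dimensional profile in a common orientation and scale, and that two pieces cut from the same sheet trace the same curve up to a constant offset. The function names, the `max_rms` threshold, and the column headers are hypothetical and not taken from the challenge repository.

```python
import csv
import io
from itertools import combinations
from math import sqrt


def mismatch(a, b):
    """RMS difference between two profiles after removing each one's mean.

    Only the overlapping length is compared; no shifting or flipping is tried.
    """
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return sqrt(sum(((x - mean_a) - (y - mean_b)) ** 2 for x, y in zip(a, b)) / n)


def find_matches(profiles, max_rms=1.0):
    """profiles maps file name -> edge profile.

    Returns (rms, name_a, name_b) tuples for pairs within max_rms, best first.
    max_rms is an illustrative threshold in pixels, not a calibrated value.
    """
    hits = []
    for (name_a, prof_a), (name_b, prof_b) in combinations(sorted(profiles.items()), 2):
        score = mismatch(prof_a, prof_b)
        if score <= max_rms:
            hits.append((score, name_a, name_b))
    return sorted(hits)


def matches_to_csv(hits):
    """Render matches as CSV text, one matched pair of file names per row."""
    out = io.StringIO(newline="")
    writer = csv.writer(out)
    writer.writerow(["image_a", "image_b", "rms_mismatch_px"])
    for score, name_a, name_b in hits:
        writer.writerow([name_a, name_b, f"{score:.2f}"])
    return out.getvalue()
```

Comparing every pair is quadratic in the number of fragments, which is tolerable for a few thousand scissor-cut edges but would need pre-filtering (for example by edge type) to run over the whole collection.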


Resources

The Darwin Manuscripts Project (DMP) website is located at http://www.amnh.org/our-research/darwin-manuscripts-project. The original manuscripts are mainly located at Cambridge University Library, with which DMP has been collaborating over the past decade. Each scanned image has a unique file name, which is coordinated with a database record in Darbase, the project's MySQL database. The Darbase records include XML transcriptions of the manuscripts. The image files are JPEGs of about 2MB each; each has a 2" ruler in the image for scale. The images and XML text data will be available at the hackathon via an external hard drive. If you have a Mac, you will need the NTFS for Mac software (in the data directory for this challenge in this repository) to access the hard drive!
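Because every image carries a 2" ruler, measurements taken from different photographs can in principle be brought to a common physical scale before edges are compared. The sketch below assumes the ruler's length in pixels has already been measured by some other means (not shown) and that an edge is represented as a one-dimensional profile of pixel offsets; the function names and the 100 pixels-per-inch target are hypothetical.

```python
RULER_INCHES = 2.0  # from the text above: each image includes a 2" ruler


def pixels_per_inch(ruler_length_px):
    """Image resolution implied by the measured pixel length of the ruler."""
    return ruler_length_px / RULER_INCHES


def resample_to_ppi(profile, source_ppi, target_ppi=100.0):
    """Linearly resample an edge profile to target_ppi.

    Both the sample spacing and the offset values are rescaled, so profiles
    from images of different resolution become directly comparable.
    Assumes the profile has at least two samples.
    """
    scale = target_ppi / source_ppi
    last = len(profile) - 1
    n_out = max(2, round(last * scale) + 1)
    out = []
    for i in range(n_out):
        pos = min(i * source_ppi / target_ppi, last)  # position in source samples
        lo = min(int(pos), last - 1)
        frac = pos - lo
        value = profile[lo] * (1 - frac) + profile[lo + 1] * frac
        out.append(value * scale)
    return out
```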

  • The images and data are divided into folders that correspond to physical volumes filled with Darwin manuscripts & mounted on acid-free paper.

  • Each folder corresponds to one DAR volume. Our database is organized on the same basis: http://www.amnh.org/our-research/darwin-manuscripts-project/catalogues/union-catalogue/cambridge-university-library

  • The image file names give you the DAR volume number as well as the image numbers in the sequence in which the manuscripts were photographed. For example, MS-DAR-00205-00001-000-00047 corresponds to image 47 in volume DAR 205.1.

  • Darwin Project staff will help translate the image file names into database page numbers.
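The file-name convention in the list above can be unpacked with a small parser. This is a minimal sketch inferred from the single example given here: the field widths, the optional `.jpg` extension, and the names `DarImage` and `parse_image_name` are assumptions, and the meaning of the three-digit field is not stated in this page, so it is kept as an uninterpreted string.

```python
import re
from typing import NamedTuple


class DarImage(NamedTuple):
    volume: int    # e.g. 205
    part: int      # e.g. 1, as in "DAR 205.1"
    unknown: str   # three-digit field; meaning not documented in this page
    image: int     # position in the photographing sequence

    @property
    def dar_reference(self):
        """Volume reference in the 'DAR 205.1' style used in the example."""
        return f"DAR {self.volume}.{self.part}"


# Field widths are inferred from one example and may not hold for every file.
_NAME = re.compile(r"^MS-DAR-(\d{5})-(\d{5})-(\d{3})-(\d{5})(?:\.jpe?g)?$", re.IGNORECASE)


def parse_image_name(name):
    """Split an image file name into its numeric fields."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised image file name: {name!r}")
    volume, part, unknown, image = match.groups()
    return DarImage(int(volume), int(part), unknown, int(image))
```

For the example above, `parse_image_name("MS-DAR-00205-00001-000-00047")` yields volume 205, part 1 and image 47, and its `dar_reference` is `"DAR 205.1"`. Mapping these to database page numbers still requires help from Darwin Project staff.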

See the presentation PDF in this repository within this challenge's documents directory for more details!
