
LapDevelopment_Annotations

StephanOepen edited this page Jan 20, 2015 · 15 revisions

General Background

This page documents the structure and form of annotations in LAP, i.e. the adaptation of the Linguistic Annotation Framework (LAF) to LAP and its serialization in JSON (in the LAP Store).

Types of Annotations

Database Terminology

The MongoDB storage distinguishes three types of entities that nest hierarchically: databases, collections, and documents. As of January 2015, all LAP annotations are stored in the same database (called lapstore), but we envision moving to a set-up where each Galaxy user has their own MongoDB database (which will make it easier to track ownership of annotation records). A collection is an unordered set of JSON documents, for example a group of records representing individual tokens.

Running Example

The following annotations are recorded in LAP Store when processing the default English Dependency Parsing workflow, which in January 2015 comprised tokenizer, REPP, HunPos, and MaltParser. We assume the following input file (toy.txt):

  The cat chased the dog.
  Fido barked.

In general, each of the four components in the workflow takes one input file and produces one output file. Output files are receipts for annotation records added to LAP Store, and the output receipt of one component typically serves as the input specification for the following component in a pipeline workflow.

Segmentation of toy.txt generates the following receipt:

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }

Tokenization of these segments, in turn, generates the following receipt:

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "token": {
        "repp": "7508ae86-a080-11e4-baab-00259074dbc0"
      },
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }

As is evident from this example, layers of annotation accumulate in the receipts. After completion of the remaining two components in the workflow (HunPos and MaltParser), the receipt comprises:

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "pos_tag": {
        "hunpos": "88b980d6-a080-11e4-a9d8-00259074dbc0"
      },
      "token": {
        "repp": "7508ae86-a080-11e4-baab-00259074dbc0"
      },
      "dep_parser": {
        "maltparser": "aeaaa5a4-a080-11e4-83b5-00259074a29c"
      },
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }
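
The way layers accumulate across these three receipts can be sketched in a few lines of Python. This is only an illustration of the receipt structure, not LAP code: the helper name add_layer is hypothetical, and the assumption that each component copies its input receipt and adds exactly one entry is inferred from the examples above.

```python
import copy
import json


def add_layer(receipt, annotation_type, tool, layer_uuid):
    """Return a copy of `receipt` with one more annotation layer recorded.

    Hypothetical helper: LAP's actual receipt handling may differ.
    """
    updated = copy.deepcopy(receipt)
    layers = updated.setdefault("annotations", {})
    layers.setdefault(annotation_type, {})[tool] = layer_uuid
    return updated


# Receipt produced by segmentation (copied from the running example).
segmented = {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
        "sentence": {"tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"}
    },
}

# Each later component adds its own layer to the receipt it was given.
tokenized = add_layer(segmented, "token", "repp",
                      "7508ae86-a080-11e4-baab-00259074dbc0")
tagged = add_layer(tokenized, "pos_tag", "hunpos",
                   "88b980d6-a080-11e4-a9d8-00259074dbc0")
parsed = add_layer(tagged, "dep_parser", "maltparser",
                   "aeaaa5a4-a080-11e4-83b5-00259074a29c")

print(json.dumps(parsed, indent=2, sort_keys=True))
```

Copying the receipt rather than modifying it in place mirrors the fact that each component writes a new output file while its input receipt stays valid.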

Annotation Records

Each layer of annotations, as named by its ‘universally’ unique identifier (UUID), corresponds to a collection in the database. Using the mongo shell, collections can be inspected interactively.
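
As a rough sketch of how a receipt relates to collections, the following Python maps each layer in a receipt to a collection name. The ".graph" and ".media" suffixes are taken from the shell sessions on this page; that every layer follows this naming pattern is an assumption, and the function names are hypothetical.

```python
def layer_collections(receipt, suffix=".graph"):
    """Map (annotation type, tool) pairs to assumed collection names."""
    names = {}
    for annotation_type, tools in receipt["annotations"].items():
        for tool, layer_uuid in tools.items():
            names[(annotation_type, tool)] = layer_uuid + suffix
    return names


def media_collection(receipt, suffix=".media"):
    """Assumed name of the collection holding the original input text."""
    return receipt["id"] + suffix


# Receipt after tokenization (copied from the running example).
receipt = {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
        "token": {"repp": "7508ae86-a080-11e4-baab-00259074dbc0"},
        "sentence": {"tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"},
    },
}

for key, name in sorted(layer_collections(receipt).items()):
    print(key, "->", name)
print("media ->", media_collection(receipt))
```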

According to the LAF philosophy, there are two distinct types of collections, media records and graph records:

  > db["641b309e-a080-11e4-9fa2-00259074dbc0.media"].find().pretty()
  {
    "_id" : ObjectId("54be152a3ac8eb0487b71729"),
    "data" : "The cat chased the dog.\nFido barked.\n",
    "type" : "main"
  }

  > db["641b33be-a080-11e4-9fa2-00259074dbc0.graph"].find().pretty()
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172a"),
    "origin" : "tokenizer",
    "index" : 0,
    "anchors" : [ 0, 23 ],
    "type" : "region",
    "id" : "tokenizer-r1"
  }
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172b"),
    "origin" : "tokenizer",
    "index" : 0,
    "out_edges" : [ ],
    "links" : [ "tokenizer-r1" ],
    "annotation_spaces" : [ "tokenizer" ],
    "type" : "node",
    "id" : "tokenizer-n1",
    "annotations" : {
      "tokenizer" : {
        "class" : "sentence",
        "label" : "The cat chased the dog."
      }
    }
  }
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172c"),
    "origin" : "tokenizer",
    "index" : 1,
    "anchors" : [ 24, 36 ],
    "type" : "region",
    "id" : "tokenizer-r2"
  }
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172d"),
    "origin" : "tokenizer",
    "index" : 1,
    "out_edges" : [ ],
    "links" : [ "tokenizer-r2" ],
    "annotation_spaces" : [ "tokenizer" ],
    "type" : "node",
    "id" : "tokenizer-n2",
    "in_edges" : [ ],
    "annotations" : {
      "tokenizer" : {
        "class" : "sentence",
        "label" : "Fido barked."
      }
    }
  }
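
The relation between the three record types can be checked with a short Python sketch: nodes link to regions, and regions anchor into the media data. The records below are abridged copies of the ones listed above. Reading the anchors as a start offset and an exclusive end offset in characters is an assumption that happens to fit this example, and the function name node_texts is hypothetical.

```python
media = {"data": "The cat chased the dog.\nFido barked.\n", "type": "main"}

# Abridged graph records from the sentence layer shown above.
graph = [
    {"type": "region", "id": "tokenizer-r1", "anchors": [0, 23]},
    {"type": "node", "id": "tokenizer-n1", "links": ["tokenizer-r1"],
     "annotations": {"tokenizer": {"class": "sentence",
                                   "label": "The cat chased the dog."}}},
    {"type": "region", "id": "tokenizer-r2", "anchors": [24, 36]},
    {"type": "node", "id": "tokenizer-n2", "links": ["tokenizer-r2"],
     "annotations": {"tokenizer": {"class": "sentence",
                                   "label": "Fido barked."}}},
]


def node_texts(media, graph):
    """Return, for each node id, the text spans of the regions it links to."""
    regions = {r["id"]: r["anchors"] for r in graph if r["type"] == "region"}
    texts = {}
    for record in graph:
        if record["type"] != "node":
            continue
        spans = []
        for link in record["links"]:
            start, end = regions[link]
            # Assumed: anchors are character offsets, end exclusive.
            spans.append(media["data"][start:end])
        texts[record["id"]] = spans
    return texts


for node_id, spans in node_texts(media, graph).items():
    print(node_id, spans)
```

For both sentence nodes, the span recovered from the anchors matches the label stored in the node's annotations.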