
LapDevelopment_Annotations

ArneSkjærholt edited this page Feb 20, 2015 · 15 revisions

General Background

This page documents the structure and form of annotations in LAP, i.e. the adaptation of the Linguistic Annotation Framework (LAF) to LAP and its serialization as JSON (in the LAP Store).

Types of Annotations

Database Terminology

The MongoDB storage distinguishes three types of entities that nest hierarchically: databases, collections, and documents. As of January 2015, all LAP annotations are stored in the same database (called lapstore), but we envision moving to a set-up where each Galaxy user has their own MongoDB database (which will make it easier to track ownership of annotation records). A collection is an unordered set of JSON documents, e.g. a group of records representing individual tokens.

Running Example

The following are the annotation records created in the LAP Store when processing the default English Dependency Parsing workflow, which as of January 2015 comprised the tokenizer, REPP, HunPos, and MaltParser components. We assume the following input file (toy.txt):

  The cat chased the dog.
  Fido barked.

In general, each of the four components in the workflow takes one input file and produces one output file. Output files are receipts for annotation records added to the LAP Store, and the output receipt of one component typically serves as the input specification for the next component in a pipeline workflow.

Segmentation of toy.txt generates the following receipt:

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }

Tokenization of these segments, in turn, generates the following receipt:

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "token": {
        "repp": "7508ae86-a080-11e4-baab-00259074dbc0"
      },
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }

As is evident from this example, layers of annotation accumulate in the receipts. After completion of the remaining two components in the workflow (HunPos and MaltParser), the receipt comprises

  {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
      "pos_tag": {
        "hunpos": "88b980d6-a080-11e4-a9d8-00259074dbc0"
      },
      "token": {
        "repp": "7508ae86-a080-11e4-baab-00259074dbc0"
      },
      "dep_parser": {
        "maltparser": "aeaaa5a4-a080-11e4-83b5-00259074a29c"
      },
      "sentence": {
        "tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"
      }
    }
  }
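The way a component extends a receipt can be sketched in plain Python (this is an illustrative sketch, not LAP's actual implementation; the helper name extend_receipt is invented, while the UUIDs and layer names are taken from the running example above):

```python
import copy

def extend_receipt(receipt, layer, annotator, collection_uuid):
    # Return a copy of the incoming receipt with one more annotation
    # layer recorded under "annotations"; earlier layers are preserved.
    out = copy.deepcopy(receipt)
    out["annotations"].setdefault(layer, {})[annotator] = collection_uuid
    return out

segmented = {
    "id": "641b309e-a080-11e4-9fa2-00259074dbc0",
    "annotations": {
        "sentence": {"tokenizer": "641b33be-a080-11e4-9fa2-00259074dbc0"},
    },
}
tokenized = extend_receipt(
    segmented, "token", "repp", "7508ae86-a080-11e4-baab-00259074dbc0")
print(sorted(tokenized["annotations"]))  # -> ['sentence', 'token']
```

This mirrors how the tokenization receipt above contains both the new token layer and the inherited sentence layer.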

Annotation Records

Each layer of annotations, as named by its universally unique identifier (UUID), corresponds to a collection in the database. Using the mongo shell, collections can be inspected interactively.

According to the LAF philosophy, there are two distinct types of collections, holding media records and graph records, respectively:

  > db["641b309e-a080-11e4-9fa2-00259074dbc0.media"].find().pretty()
  {
    "_id" : ObjectId("54be152a3ac8eb0487b71729"),
    "data" : "The cat chased the dog.\nFido barked.\n",
    "type" : "main"
  }

Graph collections contain records of different types, viz. regions, nodes, and edges. The tokenizer is a segmenting tool and produces both regions and nodes, e.g. the following regions:

  > db["641b33be-a080-11e4-9fa2-00259074dbc0"]['graph'].find({'type': 'region'}).sort({"index": 1}).pretty()
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172a"),
    "origin" : "tokenizer",
    "index" : 0,
    "anchors" : [ 0, 23 ],
    "type" : "region",
    "id" : "tokenizer-r1"
  }
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172c"),
    "origin" : "tokenizer",
    "index" : 1,
    "anchors" : [ 24, 36 ],
    "type" : "region",
    "id" : "tokenizer-r2"
  }

And, corresponding to each region (in this case), the following two nodes:

  > db["641b33be-a080-11e4-9fa2-00259074dbc0"]['graph'].find({'type': 'node'}).sort({"index": 1}).pretty()
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172b"),
    "origin" : "tokenizer",
    "index" : 0,
    "out_edges" : [ ],
    "links" : [ "tokenizer-r1" ],
    "annotation_spaces" : [ "tokenizer" ],
    "type" : "node",
    "id" : "tokenizer-n1",
    "in_edges" : [ ],
    "annotations" : {
      "tokenizer" : {
        "class" : "sentence",
        "label" : "The cat chased the dog."
      }
    }
  }
  {
    "_id" : ObjectId("54be152b3ac8eb0487b7172d"),
    "origin" : "tokenizer",
    "index" : 1,
    "out_edges" : [ ],
    "links" : [ "tokenizer-r2" ],
    "annotation_spaces" : [ "tokenizer" ],
    "type" : "node",
    "id" : "tokenizer-n2",
    "in_edges" : [ ],
    "annotations" : {
      "tokenizer" : {
        "class" : "sentence",
        "label" : "Fido barked."
      }
    }
  }
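The relation between nodes, regions, and the media record can be traced in plain Python (a sketch; the record contents are copied from the examples above, and only the helper function node_text is invented):

```python
# Media record data and region anchors from the running example.
media = "The cat chased the dog.\nFido barked.\n"
regions = {
    "tokenizer-r1": {"anchors": [0, 23]},
    "tokenizer-r2": {"anchors": [24, 36]},
}
node = {
    "id": "tokenizer-n1",
    "links": ["tokenizer-r1"],
    "annotations": {"tokenizer": {"class": "sentence",
                                  "label": "The cat chased the dog."}},
}

def node_text(node, regions, media):
    # Resolve each linked region to its [start, end] anchors and
    # slice the corresponding span out of the media string.
    parts = []
    for rid in node["links"]:
        start, end = regions[rid]["anchors"]
        parts.append(media[start:end])
    return " ".join(parts)

# The recovered span matches the label stored on the node.
assert node_text(node, regions, media) == "The cat chased the dog."
```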

Record Structure

Some fields are common to all record types:

  • _id:ObjectId - the id assigned internally by MongoDB

  • id:String - the LAF record id, which follows the strategy in the LAF literature: <annotator name>-<r|n|e><index>, e.g. tokenizer-r1

  • index:int - the index used for sorting records

  • origin:string - an identifier for the LAP tool that produced the annotation
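The id naming scheme can be decomposed mechanically; the following sketch (the helper parse_laf_id is invented for illustration) splits a record id into annotator, record kind, and index:

```python
import re

def parse_laf_id(record_id):
    # Match <annotator name>-<r|n|e><index>, e.g. "tokenizer-r1".
    m = re.match(r"^(.+)-([rne])(\d+)$", record_id)
    if m is None:
        raise ValueError("not a LAF record id: %r" % record_id)
    kinds = {"r": "region", "n": "node", "e": "edge"}
    return m.group(1), kinds[m.group(2)], int(m.group(3))

print(parse_laf_id("tokenizer-r1"))  # -> ('tokenizer', 'region', 1)
```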

Region-specific:

  • anchors:[int, int] - string offsets into the media record, i.e. start and end offset

Node-specific:

  • in_edges:[] - list of incoming edges

  • out_edges:[] - list of outgoing edges

  • annotation_spaces:[] - list of annotation spaces associated with this node (not yet used internally, since we operate with one node collection per annotator)

  • annotations:{} - dictionary of feature structures containing the actual annotation and its class (note that currently there is always exactly one annotation per node)

  • links:[] - a list of links to the regions described by the annotations on this node (where applicable, i.e. for ‘segmenting’ tools)

Edge-specific:

  • from:id - the id of the parent node

  • to:id - the id of the child node

  • Note that edges also have annotation_spaces and annotations fields, though LAF maintains that these should not be used in the same way as their counterparts on nodes.
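Traversal over edges uses the from/to fields; the sketch below walks from a parent node to its children (the edge records and node ids here are invented for illustration and do not come from the running example):

```python
# Hypothetical edge records in the style of a MaltParser graph collection.
edges = [
    {"type": "edge", "id": "maltparser-e1",
     "from": "maltparser-n3", "to": "maltparser-n2"},
    {"type": "edge", "id": "maltparser-e2",
     "from": "maltparser-n3", "to": "maltparser-n5"},
]

def children(node_id, edges):
    # Collect the "to" end of every edge whose "from" end is node_id.
    return [e["to"] for e in edges if e["from"] == node_id]

print(children("maltparser-n3", edges))  # -> ['maltparser-n2', 'maltparser-n5']
```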
