GoldAlign

Synopsis

GoldAlign provides a browser-based interface for creating gold standard annotations of sentence alignment. Output is generated automatically from a graphical grid-based workspace.

Sentence pairs to be aligned are stored in tab-separated files called batches, which can contain any number of such pairs (batch sizes under 10 are recommended but not required). Datasets are groups (directories) of related batches for the sake of organization and convenience. Once users select a dataset to work with, they can view their completion status for each batch within the set and open any batch to begin annotation.

After any two users tasked with primary annotation have completed a given batch, users tasked with arbitration can look at both users' annotations simultaneously on a color-coded grid, solve any discrepancies, and output the finalized alignment.

GoldAlign is introduced in the paper A Corpus of Word-Aligned Asked and Anticipated Questions in a Virtual Patient Dialogue System by Ajda Gokcen, Evan Jaffe, Johnsey Erdmann, Michael White, Douglas Danforth. This paper is included in the proceedings of the 10th edition of the Language Resources and Evaluation Conference (LREC 2016).

Any future updates to GoldAlign can be tracked at the GitHub repository.

Installation and Running

GoldAlign is run through Node.js, which can be downloaded for installation here or via command line as outlined here.

Once Node.js is installed on the machine on which you want to run GoldAlign, you can simply execute the bash script RUN in the project's root directory. By default, this will run GoldAlign at the local port numbered 2001, although you can specify a different port number by instead running app.js {port # of choice}. The tool can then be accessed at localhost:{PORT #} within any browser, or at {URL OF CHOICE}:{PORT #} if you are running it on a properly set-up server.

You can run CHECK (again, within the project's root directory) to ensure that everything is in fact running, or KILL to end the process.

Administration

Creating new batch files. Batch files are flat, tab-delimited text files of variable length, stored in dataset directories as explained in the next paragraph. Example batch files can be found at data/datasets/demo-corpus/. They are of the following 10-column structure:

PAIRWISE-ID (string; unique ID for the two sentences being aligned)
SOURCE-STRING (string; the source sentence to be aligned)
SOURCE-LINK (string; a relative link to the context file containing the source sentence; left blank if there is no such file)
TARGET-STRING (string; the target sentence to be aligned)
TARGET-LINK (string; a relative link to the context file containing the target sentence; left blank if there is no such file)
CORPUS-PARAPHRASE-JUDGMENT (boolean 0 or 1; the paraphrase judgment for the two sentences according to the original corpus, 0 for non-paraphrases and 1 for paraphrases)
USER-PARAPHRASE-JUDGMENT (boolean 0 or 1; the paraphrase judgment for the two sentences according to the user, 0 for non-paraphrases and 1 for paraphrases)
SURE-ALIGNMENT (string; space-delimited list of index pairs of the form {SOURCE-INDEX}-{TARGET-INDEX} representing certain or strong word-wise alignments between the source and target sentences)
POSS-ALIGNMENT (string; space-delimited list of index pairs of the form {SOURCE-INDEX}-{TARGET-INDEX} representing possible or weak word-wise alignments between the source and target sentences)
COMMENTS (string; freeform user-created notes)

Creating new datasets. New datasets/batches cannot be added through the tool's interface; they can only be added directly into the file structure of your instance of GoldAlign. Datasets must be added as subdirectories of data/datasets/. For example, a dataset called "new-dataset" would be represented as a directory with the path data/datasets/new-dataset/ relative to the root directory. The initial batch files must be put into this directory in order to be viewable by the tool (e.g., at data/datasets/new-dataset/batch-1.tsv). Once datasets/batches are initialized as such, the tool automatically handles the creation of modified and arbitration-related files.

Creating context files. New context files cannot be added through the tool's interface; they can only be added directly into the file structure of your instance of GoldAlign. Context files must parallel a dataset, being stored in subdirectories of data/context/. You must create a directory parallel to the one in data/datasets/, so a dataset stored at data/datasets/new-dataset/ would have a parallel context directory created at data/context/new-dataset/. Within this directory, context files may look like anything you'd like; there are some examples at data/context/demo-corpus/.

Within the tool's workspace there is a fixed relative link to a file at data/context/{DATASET-NAME}/all_labels.html. This file, should it be included, is for context that is relevant to the entire dataset, and thus should be available for users to access regardless of which sentence pair or batch they are working on at the time.

Note that context links in the batch files must be relative to the context path specific to the dataset, so if the root-relative path to a context file is data/context/new-dataset/context-1.html, the source or target links in the batch files would simply be "context-1.html".

Creating new users. New users cannot be added through the tool's interface; they can only be added directly into the file structure of your instance of GoldAlign. Users must be added as subdirectories of data/users/. For example, a user called "new-user" would be represented as a directory with the path data/users/new-user/ relative to the root directory. Once users are initialized as such, the tool automatically handles the creation of modified and arbitration-related files.

Accessing finalized data. Finalized datasets for a given user can be found at data/users/{USERNAME}/complete/. Finalized gold (arbitrated) datasets for a given two users can be found at data/users/arbit/{USERNAME-1}_{USERNAME-2}/complete/.

Credits

A significant portion of the code for the alignment grid (as in index.js) utilizes elements of Chris Callison-Burch’s word alignment tool:

Zaidan, Omar, and Chris Callison-Burch. "Fast. Cheap and Creative: Evaluating Translation Quality Using Amazon’s Mechanical Turk." Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing. 2009. https://www.aclweb.org/anthology/D/D09/D09-1030.pdf.
- (see locally: pdf/callison-burch_2009.pdf)
Callison-Burch, Chris, David Talbot, and Miles Osborne. "Statistical machine translation with word-and sentence-aligned parallel corpora." Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics. 2004. https://aclweb.org/anthology/P/P04/P04-1023.pdf.
- (see locally: pdf/callison-burch_2004.pdf)

License

GoldAlign is licensed under GPL 3.0, shown in the file LICENSE.txt.

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
css		css
data		data
js		js
node_modules		node_modules
pdf		pdf
.PID		.PID
CHECK		CHECK
KILL		KILL
LICENSE.txt		LICENSE.txt
README.md		README.md
RUN		RUN
app.js		app.js
favicon.ico		favicon.ico
index.html		index.html
index.js		index.js
nohup.out		nohup.out

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

GoldAlign

Synopsis

Installation and Running

Administration

Credits

License

About

Uh oh!

Releases

Packages

Languages

License

ajdagokcen/goldalign-repo

Folders and files

Latest commit

History

Repository files navigation

GoldAlign

Synopsis

Installation and Running

Administration

Credits

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages