Skip to content

Latest commit

 

History

History
99 lines (57 loc) · 4.04 KB

File metadata and controls

99 lines (57 loc) · 4.04 KB

ACE-2005 Translation and Annotation Alignment Pipeline

This pipeline translates the ACE 2005 corpus into Portuguese and aligns the translated annotations with the corresponding text. Additionally, it can be easily adapted for translating the ACE-2005 corpus into other languages.

Overview

This repository contains a Python translation and annotation alignment pipeline that is designed to translate the ACE 2005 dataset into Portuguese using machine translation and then align the translated annotations with the corresponding text. The pipeline was developed for automatic translation of ACE-2005 into Portuguese but can also be adapted for other languages with little effort.

It is composed of two main components:

  • Text translation: We used DeepL Translator and Google Translator to translate ACE-2005 texts and annotations into European and Brazilian Portuguese.
  • Annotations alignments: We developed an annotation alignment pipeline that aligns the translated annotations within the translated text.

Prerequisites

  1. Prepare ACE 2005 dataset

    Download: (https://catalog.ldc.upenn.edu/LDC2006T06). Note that ACE 2005 dataset is not freely accesible.)

  2. ACE-2005 Pre-processing

    We adopted a commonly used ACE-2005 pre-processing approach that can be found in this repository.

  3. Install the packages. Create a python Env (Optional):

    python3 -m venv myenv
    myenv\Scripts\activate
    source myenv/bin/activate

    Intall python requirements:

    pip install -r ./src/requirements.txt

Annotation Alignment Modules

Our pipeline is composed of a total of five annotation alignment components:

- Lemmatization
- Multiple word translation
- Synonyms
- BERT-based word aligner
- Fuzzy Match (Gestalt Patter Matching and Levenstein distance)

The pipeline operates sequentially, meaning that annotations aligned by earlier methods are not addressed by subsequent pipeline elements. According to our experiments, the list above corresponds to the best order sequence.

Usage

  1. Translate ACE-2005 to Portuguese

    By default we use Google Translate for the translation process. An API key is need in order to use DeepL Translator.

    Usage: python3 translation.py <input_file> <output_dir>
    Example: python src/translate.py data/sample_en.json data/sample_pt.json
  2. Run the Annotation Alignment Pipeline

    To align the translated annotations, the alignment pipeline can be executed with the following command:

    Usage: python3 pipeline.py <input_file> <output_dir>
    Example: python src/translate.py data/sample_pt.json data/sample_pt_aligned.json

    The pipeline can be configured in the config.yaml file. Users can select the aligners they intend to use and must indicate the path for the alignment resources for each alignment component, such as multiple translations of annotations, previously calculated lemmas, synonyms, etc. All of these resources are already pre-calculated for the Portuguese language in the resources folder. Additionally, the input and output files can be configured in the config.yaml file as well.

Evaluation

To measure the effectiveness of the alignment pipeline, manual alignments were conducted on the entire ACE-2005-PT test set, which includes 1,310 annotations (triggers and arguments). These alignments were performed by a linguist expert to ensure high-quality annotations, following the same annotation guidelines of the original ACE-2005 corpus.

The evaluation results are presented in Table 1:

Results
Table 1: Evaluation Results by pipeline component

Contributing

Contributions are welcome! Feel free to open issues or submit pull requests.

License

This project is licensed under the MIT License.