This repository contains the code to use the Microsoft Azure and Google Cloud translation services to translate the full TICO-19 dataset from each language into English, and to calculate the BLEU score for each translation. My collaborator ran the same translations from English into the target languages. As the TICO-19 dataset is large, running these scripts on the full dataset takes significant time. Therefore, toy files containing the first 10 sentences of data for each language are provided here to conveniently demonstrate the code's functionality. The full dataset with all sentences can be downloaded here.
This project was completed as part of a class titled 'Language Technologies for Crisis Response' at the University of Washington.
Are currently available machine translation (MT) systems ready for pandemic response? This project aims to answer this question by evaluating the performance of two existing systems against the TICO-19 dataset, a corpus of content related to the COVID-19 pandemic translated into 38 languages. Two MT systems (Google and Microsoft) were used for translation, both from English to the target languages and from the target languages to English. Three automatic evaluation metrics were used and analyzed (BLEU, BERTScore, and COMET), and a limited subset of translations was scored using human evaluation. In addition, we provide an analysis based on language status and region.
Running each script requires an account with access to each respective API. To run azure_translate.py, you will need to add your account key and endpoint to the code in lines 31 and 32, and your location to line 37. To run google_translate.py, save the JSON file containing your Google key in the same directory as the script, and update the code on line 19. If using alternative methods of access, simply comment out line 19 altogether.
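For reference, the kind of values those lines expect looks roughly like this (the variable names and values below are illustrative placeholders, not the exact code in the scripts):

```python
import os

# azure_translate.py, lines 31-32 and 37 (illustrative placeholder values)
key = "YOUR_AZURE_SUBSCRIPTION_KEY"
endpoint = "https://api.cognitive.microsofttranslator.com"
location = "YOUR_RESOURCE_REGION"  # e.g. "westus2"

# google_translate.py, line 19: point the client library at your key file.
# Comment this line out if you authenticate some other way (e.g. the gcloud CLI).
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = "your_google_key.json"
```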
Each script iterates through a directory containing the TICO-19 files to access and translate the data. The output translations are then saved in the folders microsoft_output and google_output, respectively. Each script produces one TSV file per translated language, plus one file containing the BLEU scores for every language (bleu_scores.txt). Given Python's directory iteration requirements, empty folders for the output files must be created beforehand, and the folder containing the original dataset must be named correctly. All folders must be in the same directory as the script, so empty, correctly named output folders have been included in this repository. Overall, ensure your directory structure directly mimics the one here to successfully run the code.
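For orientation, the overall loop each script follows looks roughly like the sketch below. The `translate_sentence()` helper is a stand-in for the actual Azure/Google API call, the TSV column layout and file naming scheme are assumed, and sacrebleu is used here purely for illustration; the real scripts may compute BLEU differently.

```python
import os
import csv

import sacrebleu  # illustration only; the scripts may compute BLEU differently


def translate_sentence(text: str) -> str:
    """Stand-in for the actual Azure/Google translation API call."""
    raise NotImplementedError


for filename in os.listdir("tico_files"):        # folder name must match exactly
    lang = filename.split(".")[0]                # language label from the file name (naming scheme assumed)
    sources, references, hypotheses = [], [], []

    with open(os.path.join("tico_files", filename), encoding="utf-8") as f:
        reader = csv.reader(f, delimiter="\t")
        next(reader)                             # skip the header row
        for row in reader:
            src, ref = row[0], row[1]            # column layout assumed
            sources.append(src)
            references.append(ref)
            hypotheses.append(translate_sentence(src))

    # One TSV of translations per language, overwritten on each run
    out_path = os.path.join("microsoft_output", f"{lang}.tsv")
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        csv.writer(out, delimiter="\t").writerows(zip(sources, hypotheses))

    # BLEU for each language appended to one shared file (hence Note 2 below)
    bleu = sacrebleu.corpus_bleu(hypotheses, [references])
    with open("bleu_scores.txt", "a", encoding="utf-8") as scores:
        scores.write(f"{lang}\t{bleu.score:.2f}\n")
```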
Note: The files contained in the folder tico_files are toy files, not the full dataset.
Note 2: If you run the code more than once, delete the bleu_scores.txt file beforehand. Otherwise, results from subsequent runs will be appended to the bottom of the existing file rather than overwriting it. This is not an issue for the output TSV files, which are simply overwritten.
- azure_translate.py --> Script to translate TICO-19 sentences using Microsoft Azure
- google_translate.py --> Script to translate TICO-19 sentences using Google Cloud Translation
- tico_files --> Toy TICO-19 files that can be used to demo the code
- microsoft_output --> Empty folder where output files from azure_translate.py will be saved upon running the code
- google_output --> Empty folder where output files from google_translate.py will be saved upon running the code
- results --> Results obtained from translating each language into English using the full TICO-19 dataset, along with the corresponding BLEU scores, which are presented in the final report. Results from translating English into the target languages (obtained by my collaborator) are not included in this repository but are presented in the final report. Note: All BERTScore and COMET scores were run by my collaborator due to compatibility issues.
- final_report --> The final paper outlining results and findings for this project.