The MRQA 2019 Shared Task focuses on generalization in question answering. An effective question answering system should do more than merely interpolate from the training set to answer test examples drawn from the same distribution: it should also be able to extrapolate to out-of-distribution examples — a significantly harder challenge.
The format of the task is extractive question answering. Given a question and context passage, systems must find the word or phrase in the document that best answers the question. While this format is somewhat restrictive, it allows us to leverage many existing datasets, and its simplicity helps us focus on out-of-domain generalization, instead of other important but orthogonal challenges.
We release an official training dataset containing examples from existing extractive QA datasets, and evaluate submitted models on ten hidden test datasets. Both train and test datasets have the same format described above, but may differ in some of the following ways:
- Passage distribution: Test examples may involve passages from different sources (e.g., science, news, novels, medical abstracts, etc.) with pronounced syntactic and lexical differences.
- Question distribution: Test examples may emphasize different styles of questions (e.g., entity-centric, relational, other tasks reformulated as QA, etc.), which may come from different sources (e.g., crowdworkers, domain experts, exam writers, etc.).
- Joint distribution: Test examples may vary according to the relationship of the question to the passage (e.g., questions collected independently of the evidence vs. dependent on it, multi-hop, etc.).
Each participant will submit a single QA system trained on the provided training data. We will then privately evaluate each system on the hidden test data.
This repository contains resources for accessing the official training and development data. If you are interested in participating, please fill out this form! We will e-mail participants who sign up about any important announcements regarding the shared task.
Updated 7/12/2019 to correct for minor exact-match discrepancies (See #11 for details.)
Updated 6/13/2019 to correct for duplicate context in HotpotQA (See #7 for details.)
Updated 5/29/2019 to correct for truncated detected_answers field (See #5 for details.)
We have adapted several existing datasets from their original formats and settings to conform to our unified extractive setting. Most notably:
- We provide only a single, length-limited context.
- There are no unanswerable or non-span answer questions.
- All questions have at least one accepted answer that is found exactly in the context.
A span is judged to be an exact match if it matches the answer string after performing normalization consistent with the SQuAD dataset. Specifically:
- The text is uncased.
- All punctuation is stripped.
- All articles {a, an, the} are removed.
- All consecutive whitespace markers are compressed to just a single normal space ' '.
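The normalization steps above can be sketched in Python as follows. This is a minimal illustration modeled on the common SQuAD-style normalization, not a copy of the official evaluation script; the function names `normalize_answer` and `exact_match` are chosen here for illustration only.

```python
import re
import string

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, strip punctuation, drop articles, and collapse whitespace."""
    text = text.lower()
    # Remove every ASCII punctuation character.
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    # Replace the articles a/an/the (as whole words) with a space.
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    # Collapse any run of whitespace into a single space and trim the ends.
    return " ".join(text.split())


def exact_match(prediction, answer):
    """Return True if the two strings are identical after normalization."""
    return normalize_answer(prediction) == normalize_answer(answer)
```

Under this sketch, a prediction such as "The  Eiffel Tower!" and an answer "eiffel tower" would count as an exact match.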
Dataset | Download | MD5SUM | Examples |
---|---|---|---|
SQuAD | Link | efd6a551d2697c20a694e933210489f8 | 86,588 |
NewsQA | Link | 182f4e977b849cb1dbfb796030b91444 | 74,160 |
TriviaQA | Link | e18f586152612a9358c22f5536bfd32a | 61,688 |
SearchQA | Link | 612245315e6e7c4d8446e5fcc3dc1086 | 117,384 |
HotpotQA | Link | d212c7b3fc949bd0dc47d124e8c34907 | 72,928 |
NaturalQuestions | Link | e27d27bf7c49eb5ead43cef3f41de6be | 104,071 |
Dataset | Download | MD5SUM | Examples |
---|---|---|---|
SQuAD | Link | 05f3f16c5c31ba8e46ff5fa80647ac46 | 10,507 |
NewsQA | Link | 5c188c92a84ddffe2ab590ac7598bde2 | 4,212 |
TriviaQA | Link | 5c9fdc633dfe196f1b428c81205fd82f | 7,785 |
SearchQA | Link | 9217ad3f6925c384702f2a4e6d520c38 | 16,980 |
HotpotQA | Link | 125a96846c830381a8acff110ff6bd84 | 5,904 |
NaturalQuestions | Link | c0347eebbca02d10d1b07b9a64efe61d | 12,836 |
Note: This in-domain data may be used to help develop models. The final testing, however, will only contain out-of-domain data.
Dataset | Download | MD5SUM | Examples |
---|---|---|---|
BioASQ | Link | 70752a39beb826a022ab21353cb66e54 | 1,504 |
DROP | Link | 070eb2ac92d2b2fc1b99abeda97ac37a | 1,503 |
DuoRC | Link | b325c0ad2fa10e699136561ee70c5ddd | 1,501 |
RACE | Link | ba8063647955bbb3ba63e9b17d82e815 | 674 |
RelationExtraction | Link | 266be75954fcb31b9dbfa9be7a61f088 | 2,948 |
TextbookQA | Link | 8b52d21381d841f8985839ec41a6c7f7 | 1,503 |
Note: As previously mentioned, the out-of-domain datasets have been modified from their original settings to fit the unified MRQA Shared Task paradigm (see MRQA Format). Once again, at a high level, the following two major modifications have been made:
- All QA-context pairs are extractive. That is, the answer is selected from the context and not via, e.g., multiple-choice.
- All contexts are capped at a maximum of 800 tokens. As a result, for longer contexts like Wikipedia articles, we only consider examples where the answer appears in the first 800 tokens.
As a result, some splits are harder than the original datasets (e.g., removal of multiple-choice in RACE), while some are easier (e.g., restricted context length in NaturalQuestions --- we use the short answer selection). Thus one should expect different performance ranges if comparing to previous work on these datasets.
For additional sources of training data, we are whitelisting some non-QA datasets that may be helpful for multi-task learning or pretraining. If you have any other dataset in mind, please raise an issue or send us an email at [email protected].
Whitelist:
- SNLI
- MultiNLI
We have provided convenience scripts to download all of the released training and development data.
To download the training data, run:
./download_train.sh path/to/store/downloaded/directory
To download the development data of the training datasets (in-domain), run:
./download_in_domain_dev.sh path/to/store/downloaded/directory
To download the out-of-domain development data, run:
./download_out_of_domain_dev.sh path/to/store/downloaded/directory
All of the datasets for this task have been adapted to follow a unified format. They are stored as compressed JSONL files (with file extension .jsonl.gz).
The general format is:
{
  "header": {
    "dataset": <dataset name>,
    "split": <train|dev|test>,
  }
}
...
{
  "context": <context text>,
  "context_tokens": [(token_1, offset_1), ..., (token_l, offset_l)],
  "qas": [
    {
      "qid": <uuid>,
      "question": <question text>,
      "question_tokens": [(token_1, offset_1), ..., (token_q, offset_q)],
      "detected_answers": [
        {
          "text": <answer text>,
          "char_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]],
          "token_spans": [[<start_1, end_1>], ..., [<start_n, end_n>]],
        },
        ...
      ],
      "answers": [<answer_text_1>, ..., <answer_text_m>]
    },
    ...
  ]
}
Note that it is permissible to download the original datasets and use them as you wish. However, this is the format that the test data will be presented in.
- context: This is the raw text of the supporting passage. Three special token types have been inserted: [TLE] precedes document titles, [DOC] denotes document breaks, and [PAR] denotes paragraph breaks. The maximum length of the context is 800 tokens.
- context_tokens: A tokenized version of the supporting passage, using spaCy. Each token is a tuple of the token string and token character offset. The maximum number of tokens is 800.
- qas: A list of questions for the given context.
  - qid: A unique identifier for the question. The qid is unique across all datasets.
  - question: The raw text of the question.
  - question_tokens: A tokenized version of the question. The tokenizer and token format are the same as for the context.
  - detected_answers: A list of answer spans for the given question that index into the context. For some datasets these spans have been automatically detected using searching heuristics. The same answer may appear multiple times in the text; each of these occurrences is recorded. For example, if 42 is the answer, the context "The answer is 42. 42 is the answer." has two occurrences marked.
    - text: The raw text of the detected answer.
    - char_spans: Inclusive [start, end] character spans (indexing into the raw context).
    - token_spans: Inclusive [start, end] token spans (indexing into the tokenized context).
  - answers: All accepted answers to the question, whether or not there is an exact match in the given context.
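As an illustration of the format and field descriptions above, here is a minimal sketch for reading a file and recovering answer text from the inclusive character spans. It assumes a local .jsonl.gz file whose first line is the header object; the function names `read_mrqa` and `answer_strings` are hypothetical and are not part of this repository.

```python
import gzip
import json


def read_mrqa(path):
    """Return (header, examples) from a local MRQA-format .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        # The first line holds the header; every later line is one context.
        header = json.loads(next(handle))["header"]
        examples = [json.loads(line) for line in handle if line.strip()]
    return header, examples


def answer_strings(example):
    """Yield (qid, detected answer text, text sliced from each char span)."""
    context = example["context"]
    for qa in example["qas"]:
        for detected in qa["detected_answers"]:
            for start, end in detected["char_spans"]:
                # Spans are inclusive, so the slice must extend to end + 1.
                yield qa["qid"], detected["text"], context[start:end + 1]
```

Because both ends of a span are inclusive, slicing with `end + 1` is the detail most likely to be missed; the same applies to token_spans when indexing into context_tokens.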
To view examples in the terminal, please install the packages listed in requirements.txt (pip install -r requirements.txt) and then run:
python visualize.py path/or/url
The script argument may be either a URL or a local file path. For example:
python visualize.py https://s3.us-east-2.amazonaws.com/mrqa/release/train/SQuAD.jsonl.gz
Answers are evaluated using exact match and token-level F1 metrics. The mrqa_official_eval.py script is used to evaluate predictions on a given dataset:
python mrqa_official_eval.py <url_or_filename> <predictions_file>
The predictions file must be a valid JSON file of qid, answer pairs:
{
"qid_1": "answer span text 1",
...
"qid_n": "answer span text N"
}
The final score for the MRQA shared task will be the macro-average across all test datasets.
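To make the metrics concrete, the following is a rough sketch of exact match, token-level F1, and a macro-average over datasets. It is not the mrqa_official_eval.py script. Taking the best score over all accepted answers, scoring missing predictions as empty strings, and averaging each metric separately are assumptions made here, and all function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """SQuAD-style normalization: lowercase, no punctuation, no articles."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def f1_score(prediction, answer):
    """Token-level F1 between a normalized prediction and one answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(answer).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def score_dataset(predictions, gold_answers):
    """predictions: qid -> str; gold_answers: qid -> non-empty list of str."""
    em_total = f1_total = 0.0
    for qid, answers in gold_answers.items():
        prediction = predictions.get(qid, "")  # assumption: missing = empty
        em_total += max(float(normalize(prediction) == normalize(a)) for a in answers)
        f1_total += max(f1_score(prediction, a) for a in answers)
    n = max(len(gold_answers), 1)
    return {"exact_match": 100.0 * em_total / n, "f1": 100.0 * f1_total / n}


def macro_average(per_dataset_scores):
    """Unweighted mean of each metric over a list of per-dataset score dicts."""
    count = len(per_dataset_scores)
    return {
        metric: sum(scores[metric] for scores in per_dataset_scores) / count
        for metric in ("exact_match", "f1")
    }
```

A macro-average weights every dataset equally regardless of its size, so a small dataset such as RACE counts as much as a large one.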
An implementation of a simple multi-task BERT-based baseline model is available in the baseline directory.
Below are our baseline results, reported as EM / F1 (I = in-domain, O = out-of-domain):
Dataset | Multi-Task BERT-Base | Multi-Task BERT-Large |
---|---|---|
(I) SQuAD | 78.5 / 86.7 | 80.3 / 88.4 |
(I) HotpotQA | 59.8 / 76.6 | 62.4 / 79.0 |
(I) TriviaQA Web | 65.6 / 71.6 | 68.2 / 74.7 |
(I) NewsQA | 50.8 / 66.8 | 49.6 / 66.3 |
(I) SearchQA | 69.5 / 76.7 | 71.8 / 79.0 |
(I) NaturalQuestions | 65.4 / 77.4 | 67.9 / 79.8 |
(O) DROP | 25.7 / 34.5 | 34.6 / 43.8 |
(O) RACE | 30.4 / 41.4 | 31.3 / 42.5 |
(O) BioASQ | 47.1 / 62.7 | 51.9 / 66.8 |
(O) TextbookQA | 44.9 / 53.9 | 47.4 / 55.7 |
(O) RelationExtraction | 72.6 / 83.8 | 72.7 / 85.2 |
(O) DuoRC | 44.8 / 54.6 | 46.8 / 58.0 |
Submission will be handled through the Codalab platform: see these instructions.
Note that submissions should start a local server that accepts POST requests of single JSON objects in our standard format, and returns a JSON prediction object.
The official predict_server.py script (in this directory) will query this server to get predictions.
The baseline directory includes an example implementation in serve.py.
We have chosen this format so that we can create interactive demos for all submitted models.
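For orientation only, here is a minimal sketch of such a server using the Python standard library. The port number, the placeholder `predict` function, and the assumption that the response is a qid-to-answer mapping like the predictions file are all illustrative; the serve.py example in the baseline directory and the Codalab instructions remain the reference for what predict_server.py actually expects.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(example):
    """Placeholder model: answer every question with the first context token."""
    tokens = example.get("context_tokens", [])
    first_token = tokens[0][0] if tokens else ""
    return {qa["qid"]: first_token for qa in example["qas"]}


class PredictionHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read exactly the number of bytes announced by the client.
        length = int(self.headers.get("Content-Length", 0))
        example = json.loads(self.rfile.read(length).decode("utf-8"))
        body = json.dumps(predict(example)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8888):
    """Block forever, answering POST requests on the given (assumed) port."""
    HTTPServer(("", port), PredictionHandler).serve_forever()
```

Calling `serve()` starts the server; a real submission would replace `predict` with a call into the trained model.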
Codalab results for all models submitted to the shared task are available in the results directory.
These files include the dev and test EM and F1 scores for every model and every dataset.
@inproceedings{fisch2019mrqa,
title={{MRQA} 2019 Shared Task: Evaluating Generalization in Reading Comprehension},
author={Adam Fisch and Alon Talmor and Robin Jia and Minjoon Seo and Eunsol Choi and Danqi Chen},
booktitle={Proceedings of 2nd Machine Reading for Question Answering (MRQA) Workshop at EMNLP},
year={2019},
}