Hannes von Essen, Daniel Hesslow
Datasets for PaM2020 "Building a Swedish Question-Answering Model" [Paper link]
This repository contains Swedish and Spanish question-answering datasets, generated from the English SQuAD dataset with the cross-lingual projection method introduced in our PaM2020 paper.
They can be used to train a model for extractive question answering, for example by fine-tuning Multilingual BERT. See the Transformers library for more details.
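Before training, it can help to flatten a dataset into individual (question, context, answer) examples and check that every answer span still lines up with its context. The sketch below assumes the files follow the SQuAD v1.1 JSON layout (`data` → `paragraphs` → `qas` → `answers`), which this README does not confirm. The function names and the sample record are made up for illustration and are not taken from the datasets.

```python
import json


def flatten_squad(squad):
    """Yield one dict per (question, answer) pair from a SQuAD-style dict.

    Assumes the SQuAD v1.1 layout; adjust the keys if the files differ.
    """
    for article in squad["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa["answers"]:
                    start = answer["answer_start"]
                    text = answer["text"]
                    yield {
                        "id": qa["id"],
                        "question": qa["question"],
                        "context": context,
                        "answer_text": text,
                        "answer_start": start,
                        # True when the character offset really points at the answer
                        "aligned": context[start:start + len(text)] == text,
                    }


def load_squad_file(path):
    """Read a SQuAD-style JSON file (path is supplied by the caller)."""
    with open(path, encoding="utf-8") as f:
        return list(flatten_squad(json.load(f)))


# Hypothetical one-record sample in the assumed layout.
sample = {
    "version": "1.1",
    "data": [{
        "title": "Stockholm",
        "paragraphs": [{
            "context": "Stockholm är Sveriges huvudstad.",
            "qas": [{
                "id": "example-1",
                "question": "Vad är Sveriges huvudstad?",
                "answers": [{"text": "Stockholm", "answer_start": 0}],
            }],
        }],
    }],
}

examples = list(flatten_squad(sample))
misaligned = [ex["id"] for ex in examples if not ex["aligned"]]
print(len(examples), "examples,", len(misaligned), "misaligned")
```

Examples flagged as misaligned are worth inspecting before fine-tuning, since an extractive model is trained on the character offsets rather than on the answer text alone.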
A Multilingual BERT model trained on the Spanish dataset achieves a new state of the art on the XQuAD and MLQA question-answering benchmarks.