Dakshina Dataset
Processing South Asian Languages Written in the Latin Script: the Dakshina Dataset. Roark, B., Wolf-Sonkin, L., Kirov, C., Mielke, S. J., Johny, C., Demirsahin, I., & Hall, K. (2020, May). In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 2413-2423).
https://github.com/google-research-datasets/dakshina
AI4Bharat-IndicNLP Dataset
IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages. Kakwani, D., Kunchukuttan, A., Golla, S., Bhattacharyya, A., Khapra, M. M., & Kumar, P. (2020). Findings of EMNLP.
https://github.com/AI4Bharat/indicnlp_corpus
OSCAR Corpus
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages. Ortiz Suárez, P., Romary, L., & Sagot, B. (2020). arXiv, arXiv-2006.
https://oscar-corpus.com/
