A collection of isolated OCR errors from the cleaned ACL Anthology Reference Corpus.
The benchmark consists of random paragraphs from the ACL corpus (downloaded from Universität Heidelberg because it was not available from the original source), each paired with a manually corrected ground truth. It is presented in our paper about Tokenization Repair (under review). Here, it is filtered to keep only OCR errors; whitespace errors and hyphenation errors are excluded.
- For the correction of hyphenation errors, see Improved Dehyphenation of Line Breaks for PDF Text Extraction (Bachelor's thesis by Mari Hernaes, 2019).
- For the correction of whitespace errors, see Tokenization Repair in the Presence of Spelling Errors (arXiv 2020) and our upcoming paper (under review).
- For the correction of spelling errors, see Neural Language Models for Spelling Correction (Master's thesis by Matthias Hertel, 2019).
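
The filtering described above (keeping OCR errors while dropping whitespace and hyphenation errors) could be approximated along the following lines. This is only a minimal sketch, not the procedure actually used to build the collection: the function names `classify_error` and `filter_ocr_errors`, the representation of the data as `(corrupt, correct)` string pairs, and the rules for recognising each error type are all assumptions made for illustration.

```python
def classify_error(corrupt: str, correct: str) -> str:
    """Label a (corrupt, correct) pair by the kind of difference (hypothetical rules)."""
    if corrupt == correct:
        return "none"

    # Assumed rule: a whitespace error disappears once all whitespace is removed.
    def no_space(s: str) -> str:
        return "".join(s.split())

    if no_space(corrupt) == no_space(correct):
        return "whitespace"

    # Assumed rule: a hyphenation error disappears once hyphens are removed as well.
    def no_space_or_hyphen(s: str) -> str:
        return no_space(s).replace("-", "")

    if no_space_or_hyphen(corrupt) == no_space_or_hyphen(correct):
        return "hyphenation"

    # Anything else is treated as an OCR error in this sketch.
    return "ocr"


def filter_ocr_errors(pairs):
    """Keep only the pairs whose difference is classified as an OCR error."""
    return [(c, t) for c, t in pairs if classify_error(c, t) == "ocr"]
```

For example, under these assumed rules the pair `("tok en", "token")` would be dropped as a whitespace error and `("hyphen- ation", "hyphenation")` as a hyphenation error, while `("rnodel", "model")` would be kept.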