OCR postcorrection

Resources

ocr-errors.txt

A collection of isolated OCR errors extracted from the cleaned ACL Anthology Reference Corpus.

ACL-benchmark

A benchmark of randomly selected paragraphs from the ACL corpus, with manually corrected ground truth. We downloaded the corpus from Universität Heidelberg because it was not available from the original source. The benchmark is introduced in our paper on tokenization repair (under review). The version provided here is filtered to contain only OCR errors; whitespace errors and hyphenation errors are excluded.

Related work by our group

For the correction of hyphenation errors, see Improved Dehyphenation of Line Breaks for PDF Text Extraction (Bachelor's thesis by Mari Hernaes, 2019).
For the correction of whitespace errors, see Tokenization Repair in the Presence of Spelling Errors (arXiv 2020) and our upcoming paper (under review).
For the correction of spelling errors, see Neural Language Models for Spelling Correction (Master's thesis by Matthias Hertel, 2019).
