OCR postcorrection

Resources

ocr-errors.txt

A collection of isolated OCR errors from the cleaned ACL Anthology Reference Corpus.

ACL-benchmark

A benchmark of randomly selected paragraphs from the ACL corpus, with a manually corrected ground truth. The corpus was downloaded from Universität Heidelberg because it was not available through the original source. The benchmark is presented in our paper on Tokenization Repair (under review). The version provided here is filtered to contain only OCR errors; whitespace errors and hyphenation errors are excluded.
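
To illustrate what separating OCR errors from whitespace and hyphenation errors could look like, here is a minimal Python sketch. It is not the script used to build the benchmark. It assumes the data is available as (corrupt, correct) string pairs, and the function names `error_kind` and `keep_ocr_errors` are hypothetical.

```python
def strip_spaces(text: str) -> str:
    """Remove all whitespace so pairs can be compared ignoring spacing."""
    return "".join(text.split())


def strip_spaces_and_hyphens(text: str) -> str:
    """Remove whitespace and hyphens so pairs can be compared ignoring both."""
    return strip_spaces(text).replace("-", "")


def error_kind(corrupt: str, correct: str) -> str:
    """Classify a (corrupt, correct) pair with a simple heuristic (assumption)."""
    if corrupt == correct:
        return "none"
    if strip_spaces(corrupt) == strip_spaces(correct):
        return "whitespace"
    if strip_spaces_and_hyphens(corrupt) == strip_spaces_and_hyphens(correct):
        return "hyphenation"
    # Anything that still differs is treated as an OCR (character) error.
    return "ocr"


def keep_ocr_errors(pairs):
    """Keep only the pairs whose difference is not purely spacing or hyphens."""
    return [(c, g) for c, g in pairs if error_kind(c, g) == "ocr"]


if __name__ == "__main__":
    examples = [
        ("the re sult", "the result"),
        ("implemen- tation", "implementation"),
        ("rnodel", "model"),
    ]
    for corrupt, correct in examples:
        print(corrupt, "->", error_kind(corrupt, correct))
```

Under this heuristic, a pair that contains both a character error and a spacing error is still labeled `ocr`, since the strings differ even after spaces and hyphens are removed.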

Related work by our group