In their current form, the OCR-D transcription guidelines are often of little use to annotators looking for answers or guidance. They are written as top-down intellectual accounts, but they are not formal (i.e. runnable/verifiable), not searchable and, well, quite incomplete. Although many examples are given already, they are not nearly enough for the diverse set of materials and peculiarities which annotators face (esp. those without a bibliological / humanities background).
How can we improve that?
I propose attacking this on multiple levels:
- first fixing GT guidelines: fix formatting #225 and toc sidebar for GT guidelines #207 (and perhaps google custom search for documentation #102)
- finally starting a software implementation (which can normalize arbitrary text input at each GT level or canonicalize to the next lower level)
- opening up the repository for comments and amendments by users/practitioners (perhaps in the same way that the workflow guide was mirrored to the wiki and gets synchronized back every now and then)
- supplementing https://ocr-d.de/en/gt-guidelines/trans/ocr_d_koordinationsgremium_codierung.html and https://ocr-d.de/en/gt-guidelines/trans/trFremdsprache.html with data columns for all GT levels (for quick lookup)
- starting a public glyph repository by aggregating diverse textual GT, enriching it with glyph coordinates via OCR-based forced alignment (e.g. using the Tesseract 3 segmenter), and extracting glyph image-text file pairs
- tying the website to the glyph repo with a dedicated search interface: text→image search and image→text search (via image similarity like in Newspaper Navigator)
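
To make the software implementation item more concrete, here is a minimal sketch of what "canonicalize to the next lower level" could look like. Everything in it is an assumption for illustration: the names `LEVEL_RULES`, `canonicalize` and `to_level` are hypothetical, and the substitution tables contain only a few made-up sample rules, not the actual rules from the guidelines. A real implementation would have to derive its tables from the guideline pages themselves.

```python
import unicodedata

# Hypothetical sample rules, keyed by the TARGET level.
# These are placeholders, not the actual OCR-D guideline rules.
LEVEL_RULES = {
    # applied when going from level 3 to level 2
    2: {
        "\ufb00": "ff",        # ligature code point ff -> two letters
        "\ufb05": "\u017ft",   # ligature long-s + t -> long s, t
    },
    # applied when going from level 2 to level 1
    1: {
        "\u017f": "s",         # long s -> round s
        "a\u0364": "\u00e4",   # a + combining small e -> a-umlaut
        "o\u0364": "\u00f6",
        "u\u0364": "\u00fc",
    },
}


def canonicalize(text: str, from_level: int) -> str:
    """Map text at `from_level` to the next lower level (3 -> 2 or 2 -> 1)."""
    if from_level not in (2, 3):
        raise ValueError("from_level must be 2 or 3")
    target = from_level - 1
    # NFC keeps long s and ligature code points intact (unlike NFKC),
    # so the level-specific rules stay in control of those mappings.
    result = unicodedata.normalize("NFC", text)
    for source, replacement in LEVEL_RULES[target].items():
        result = result.replace(source, replacement)
    return unicodedata.normalize("NFC", result)


def to_level(text: str, from_level: int, target_level: int) -> str:
    """Apply `canonicalize` repeatedly until `target_level` is reached."""
    if target_level > from_level:
        raise ValueError("can only canonicalize downwards")
    level = from_level
    while level > target_level:
        text = canonicalize(text, level)
        level -= 1
    return text


if __name__ == "__main__":
    sample = "\u017fcha\ufb00en"       # long s + "cha" + ff ligature + "en"
    print(to_level(sample, 3, 2))
    print(to_level(sample, 3, 1))
```

Keeping the rules as plain data tables per level would also allow the same tables to be rendered as the lookup columns proposed for the guideline pages, so the documentation and the code could not drift apart.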