XML element tree for export_as_xml is nested incorrectly? #1869
Comments
Yes, by inspection, the change would solve the nesting issue. At present, our code ignores the specific element tags in the HOCR and just pays attention to the nesting itself. For what it's worth, the code as written creates a complete HOCR document for each page. This is understandable if the common use case is OCR for single images. But if you supply a multi-page PDF (or perhaps a multi-image TIFF; I haven't tried that), you'll end up with multiple XML documents. I had been working on a slight alteration to the ocr endpoint in api/app/routes/ocr.py. If the request contains an Accept header with the value text/xml or another plausible MIME type for HOCR, the endpoint tries to return the OCR result in HOCR format. For multi-page documents, I would have needed to figure out how to encode multiple XML files in the response, and I didn't really want to deal with multipart/form-data as a response type (or something like Zip, for that matter). Since Tesseract can return HOCR files with multiple pages, I wanted to do the same when given multi-image document formats such as PDF and TIFF. However, the export_as_xml() code in io.elements.Page isn't the right grain size for this. I ended up refactoring export_as_xml into two methods: one responsible for creating the document-level HOCR ElementTree structure and one for creating the page/carea/line/word elements. For now I have to duplicate the document-level ElementTree code in the ocr endpoint, but I can then call Page.export_page_as_xml() on each page of the OCR'd document. I was considering making a PR for this.
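To make the two-method split described above concrete, here is a minimal sketch using only the standard library. It is not doctr's code: `make_hocr_document`, `append_page`, and the list-of-word-lists page representation are hypothetical names chosen for illustration, and the hierarchy is simplified to page/carea/word.

```python
import xml.etree.ElementTree as ET


def make_hocr_document(title="hOCR output"):
    """Document-level part (hypothetical helper): html/head/body created once."""
    html = ET.Element("html")
    head = ET.SubElement(html, "head")
    ET.SubElement(head, "title").text = title
    ET.SubElement(head, "meta", attrib={"name": "ocr-system", "content": "example"})
    body = ET.SubElement(html, "body")
    return html, body


def append_page(body, page_number, blocks):
    """Page-level part (hypothetical helper): one ocr_page with nested children.

    `blocks` is an illustrative stand-in for real OCR output: a list of
    blocks, each block being a list of word strings.
    """
    page = ET.SubElement(
        body, "div", attrib={"class": "ocr_page", "id": f"page_{page_number}"}
    )
    for i, words in enumerate(blocks, start=1):
        area = ET.SubElement(
            page, "div",
            attrib={"class": "ocr_carea", "id": f"block_{page_number}_{i}"},
        )
        for word in words:
            ET.SubElement(area, "span", attrib={"class": "ocrx_word"}).text = word
    return page


# One document, several pages: the head elements appear only once.
html, body = make_hocr_document()
append_page(body, 1, [["first", "page"]])
append_page(body, 2, [["second", "page"]])
hocr_bytes = ET.tostring(html, encoding="utf-8")
```

With this shape, a caller that handles a multi-page input builds the document once and appends one page element per page, rather than concatenating several complete documents.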
I agree that in some cases it could be convenient to provide the "complete" multipage hOCR output. And I think that, starting from the current output, it would be easier to drop the duplicated head elements (html, title, meta) than to do the opposite. Wdyt?
I did consider post-processing the result from Document.export_as_xml(), combining the pages into a single document tree, and then running .tostring on the result. Refactoring Page.export_as_xml() was an expedient way to avoid ElementTree surgery. I'll go think about it some more. I have used ocrmypdf in the past, but only at the application level, where it's able to deal with scanned-image PDFs and multi-page TIFFs. I haven't used its Python library directly, so I hadn't noticed that it works a page at a time there. This isn't surprising, since ocrmypdf was written for Tesseract, which doesn't natively read PDF and requires some external machinery like pdftoppm to burst and render each page separately.
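The post-processing alternative mentioned here (keeping one set of head elements and collecting every page under a single body) might look roughly like the sketch below. It assumes each per-page document is available as an un-namespaced `html` root element with `head` and `body` children; `merge_hocr_pages` and `single_page` are hypothetical helpers, and the real return type of Document.export_as_xml() is not modelled.

```python
import copy
import xml.etree.ElementTree as ET


def merge_hocr_pages(roots):
    """Combine per-page hOCR trees into one tree (hypothetical helper).

    The first document's head is kept; the body children of every later
    document are copied into the first document's body.
    """
    if not roots:
        raise ValueError("need at least one page document")
    merged = copy.deepcopy(roots[0])  # leave the inputs untouched
    merged_body = merged.find("body")
    for root in roots[1:]:
        for child in root.find("body"):
            merged_body.append(copy.deepcopy(child))
    return merged


def single_page(n):
    """Tiny stand-in for one complete per-page hOCR document."""
    return ET.fromstring(
        f"<html><head><title>p{n}</title></head>"
        f"<body><div class='ocr_page' id='page_{n}'/></body></html>"
    )


merged = merge_hocr_pages([single_page(1), single_page(2), single_page(3)])
page_ids = [div.get("id") for div in merged.find("body")]
merged_bytes = ET.tostring(merged, encoding="utf-8")
```

Documents parsed from namespaced XHTML would need namespace-qualified tag names in the `find` calls, which this sketch leaves out.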
For now, I'd say it's fine to go ahead with the PR I mentioned, but if you have something smarter in mind, a PR is always welcome 👍
Bug description
HOCR output creates a sequence of ocr_carea elements that are immediate children of the body, rather than immediate children of the page on which they appear. The page element itself has no children. I've read the HOCR spec, and it is arguably unclear on this point: I can't find a normative specification of the typesetting element hierarchy. However, there is an example here from the version 1.2 spec that suggests that ocr_carea should be contained within ocr_page:
The hierarchy above is consistent with Tesseract's output, for example. Any code written to interpret Tesseract output is going to loop over pages, and within each page it will loop over ocr_carea elements. However, doctr produces this hierarchy:
Code written to interpret Tesseract output will fail and return nothing, because the ocr_page elements have no children.
The relevant code is in io/elements.py. I think the SubElement call that creates ocr_carea elements should use the SubElement that created the ocr_page element, but which today is not captured in a variable binding.
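As a minimal illustration of the difference (not the actual code from io/elements.py), the sketch below builds a simplified body both ways with the standard library and runs a Tesseract-style consumer loop over it. `build_body` and `words_per_page` are hypothetical names, and the hierarchy is reduced to page/carea/word.

```python
import xml.etree.ElementTree as ET


def build_body(nest_under_page):
    """Build a one-page body; attach ocr_carea to the page or to the body."""
    body = ET.Element("body")
    # Capturing the page element in a variable is what makes nesting possible.
    page = ET.SubElement(body, "div", attrib={"class": "ocr_page", "id": "page_1"})
    parent = page if nest_under_page else body
    block = ET.SubElement(parent, "div", attrib={"class": "ocr_carea", "id": "block_1"})
    ET.SubElement(block, "span", attrib={"class": "ocrx_word", "id": "word_1"}).text = "hello"
    return body


def words_per_page(body):
    """Consumer loop in the style described above: pages, then areas, then words."""
    words = []
    for page in body.findall("div[@class='ocr_page']"):
        for area in page.findall("div[@class='ocr_carea']"):
            words.extend(span.text for span in area.iter("span"))
    return words


flat_words = words_per_page(build_body(nest_under_page=False))    # sibling layout
nested_words = words_per_page(build_body(nest_under_page=True))   # nested layout
```

In the sibling layout the inner loop never runs, so the consumer silently returns an empty list; with the ocr_carea attached to the captured page element, the same loop finds the word.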
Code snippet to reproduce the bug
Error traceback
None
Environment
This is current main branch code.
Deep Learning backend