XML element tree for export_as_xml is nested incorrectly? #1869

ejschoen · 2025-02-10T15:37:57Z

Bug description

HOCR output creates a sequence of ocr_carea elements that are immediate children of the body, rather than immediate children of the page on which they appear. The page element itself has no children. I've read the HOCR spec, and it is arguably unclear on this point: I can't find a normative specification of the typesetting element hierarchy. However, there is an example here from the version 1.2 spec that suggests that ocr_carea should be contained within ocr_page:

<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
    <div class="ocr_par" id="par_7"> ... </div>
    <div class="ocr_par" id="par_19"> ... </div>
  </div>
</div>

The hierarchy above is consistent with Tesseract's output, for example. Any code written to interpret Tesseract output is going to loop over pages, within which it will loop over ocr_carea elements. However, doctr produces this hierarchy:

<body>
  <div class="ocr_page" id="page_1" title="image; bbox 0 0 1617 1579; ppageno 0" />
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579">
     <p class="ocr_par" id="par_1" title="bbox 0 44 1533 1579">
	 <span class="ocr_line" id="line_1" title="bbox 1429 44  1451 76;  baseline 0 0; x_size 0; x_descenders 0; x_ascenders 0">
	    <span class="ocrx_word" id="word_1" title="bbox 1429 44 1451 76;  x_wconf 61">-</span>
	</span>
     </p>
  </div>
</body>

Code written to interpret Tesseract output will fail and return nothing, because there are no children of the ocr_page elements.
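
To make the failure mode concrete, here is a minimal sketch of the kind of consumer loop described above. The function name `careas_per_page` and the trimmed XML strings are illustrative assumptions, not code taken from doctr, Tesseract, or any particular hOCR reader.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the flat structure: ocr_page is an empty sibling of ocr_carea.
FLAT = """<body>
  <div class="ocr_page" id="page_1" title="image; bbox 0 0 1617 1579; ppageno 0" />
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579">
    <p class="ocr_par" id="par_1" title="bbox 0 44 1533 1579" />
  </div>
</body>"""

# Same content with ocr_carea nested inside ocr_page.
NESTED = """<body>
  <div class="ocr_page" id="page_1" title="image; bbox 0 0 1617 1579; ppageno 0">
    <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579">
      <p class="ocr_par" id="par_1" title="bbox 0 44 1533 1579" />
    </div>
  </div>
</body>"""


def careas_per_page(hocr_body: str) -> dict:
    """Map each ocr_page id to the ids of its direct ocr_carea children."""
    root = ET.fromstring(hocr_body)
    result = {}
    for page in root.iter("div"):
        if page.get("class") != "ocr_page":
            continue
        # Only direct children count: this is the page -> carea loop.
        result[page.get("id")] = [
            child.get("id") for child in page if child.get("class") == "ocr_carea"
        ]
    return result


print(careas_per_page(FLAT))
print(careas_per_page(NESTED))
```

With the flat input the page is found but its list of areas is empty; with the nested input the same loop finds `block_1`.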

The relevant code is in io/elements.py. I think the SubElement call that creates ocr_carea elements should use the SubElement that created the ocr_page element, but which today is not captured in a variable binding.

        # Create the body
        body = SubElement(page_hocr, "body")
        SubElement(
            body,
            "div",
            attrib={
                "class": "ocr_page",
                "id": f"page_{p_idx + 1}",
                "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
            },
        )
        # iterate over the blocks / lines / words and create the XML elements in body line by line with the attributes
        for class_name, predictions in self.predictions.items():
            for prediction in predictions:
                if len(prediction.geometry) != 2:
                    raise TypeError("XML export is only available for straight bounding boxes for now.")
                (xmin, ymin), (xmax, ymax) = prediction.geometry
                prediction_div = SubElement(
                    body,
                    "div",
                    attrib={
                        "class": "ocr_carea",
                        "id": f"{class_name}_prediction_{prediction_count}",
                        "title": f"bbox {int(round(xmin * width))} {int(round(ymin * height))} \
                        {int(round(xmax * width))} {int(round(ymax * height))}",
                    },
                )
                prediction_div.text = prediction.value
                prediction_count += 1

        return ET.tostring(page_hocr, encoding="utf-8", method="xml"), ET.ElementTree(page_hocr)
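
A minimal sketch of the change suggested above, assuming the only difference is that the ocr_page element is bound to a variable and used as the parent. `build_page_tree` and its `boxes` parameter are hypothetical, simplified stand-ins for the method quoted above; this is not the actual doctr implementation or the eventual fix.

```python
from xml.etree.ElementTree import Element, SubElement


def build_page_tree(width, height, boxes, p_idx=0):
    """Build a reduced hOCR tree.

    boxes: list of (text, ((xmin, ymin), (xmax, ymax))) in relative coordinates.
    """
    page_hocr = Element("html")
    body = SubElement(page_hocr, "body")
    # Keep a reference to the ocr_page element instead of discarding it.
    page_div = SubElement(
        body,
        "div",
        attrib={
            "class": "ocr_page",
            "id": f"page_{p_idx + 1}",
            "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
        },
    )
    for idx, (text, ((xmin, ymin), (xmax, ymax))) in enumerate(boxes, start=1):
        bbox = (
            f"bbox {int(round(xmin * width))} {int(round(ymin * height))} "
            f"{int(round(xmax * width))} {int(round(ymax * height))}"
        )
        carea = SubElement(
            page_div,  # parent is the page, not the body
            "div",
            attrib={"class": "ocr_carea", "id": f"block_{idx}", "title": bbox},
        )
        carea.text = text
    return page_hocr


tree = build_page_tree(1617, 1579, [("-", ((0.0, 0.0), (1.0, 1.0)))])
page = tree.find("body/div")
print(page.get("class"), [child.get("class") for child in page])
```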

Code snippet to reproduce the bug

print(prediction.export_as_xml()[0][0])

Error traceback

None

Environment

This is current main branch code.

(.venv) i2kdevops@btc2:~/doctr/api$ python collect_env.py
Collecting environment information...

DocTR version: 0.11.1a0
TensorFlow version: N/A
PyTorch version: 2.6.0+cu124 (torchvision 0.21.0+cu124)
OpenCV version: 4.11.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7

Deep Learning backend

>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True
>>>


felixdittrich92 · 2025-02-11T09:40:00Z

Hi @ejschoen 👋 ,

Thanks for reporting this! I’ve opened a quick PR to fix it: #1870
Could you confirm whether these changes fix your issue?

One thing I noticed is that we need to keep the <p> tag for the ocr_par element; otherwise, ocrmypdf fails to transform the XML or create the PDF/A correctly.

ejschoen · 2025-02-11T13:50:52Z

Yes, by inspection, the change would solve the nesting issue. At present, our code ignores the specific element tags in the HOCR and just pays attention to the nesting itself.

For what it's worth, the code as written creates a complete HOCR document for each page. This is understandable if the common use case is OCR for single images. But if you supply a multi-page PDF (or perhaps a multi-image TIFF; I haven't tried that), you'll end up with multiple XML documents.

I had been working on a slight alteration to the ocr endpoint in api/app/routes/ocr.py. If the request contains an Accept header with value text/xml or other plausible MIME type for HOCR, then it tries to return the OCR result in HOCR format. For multi-page documents, I would have needed to figure out how to encode multiple XML files in the return, and didn't really want to deal with using multipart/form-data as a response type (or something like Zip for that matter.)

Since Tesseract can return HOCR files with multiple pages, I wanted to do this as well when given multi-image document formats such as PDF and TIFF. However, the export_as_xml() code in io.elements.Page isn't the right grain size for this. I ended up refactoring export_as_xml into two methods--one responsible for creating the HOCR XML document-level ElementTree structure and one for creating the page/carea/line/word elements. For now I have to duplicate the document-level ElementTree code in the ocr endpoint, but then can call Page.export_page_as_xml() on each page of the OCR'd document. I was considering making a PR for this.

https://github.com/ejschoen/doctr.git
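
The split described above could look roughly like the following. This is only a sketch of the idea: `make_hocr_document` and `append_page` are hypothetical names chosen for illustration, and the code is not what is in the fork or in doctr itself.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def make_hocr_document(title="hOCR output"):
    """Document-level part: build the html/head/body skeleton once."""
    html = Element("html")
    head = SubElement(html, "head")
    SubElement(head, "title").text = title
    body = SubElement(html, "body")
    return html, body


def append_page(body, page_no, width, height):
    """Page-level part: add one ocr_page under body and return it.

    The caller would then attach carea/par/line/word elements to the result.
    """
    return SubElement(
        body,
        "div",
        attrib={
            "class": "ocr_page",
            "id": f"page_{page_no + 1}",
            "title": f"image; bbox 0 0 {width} {height}; ppageno {page_no}",
        },
    )


# One document, several pages (page sizes here are made up).
html, body = make_hocr_document()
for page_no, (w, h) in enumerate([(1617, 1579), (1200, 1600)]):
    append_page(body, page_no, w, h)
print(tostring(html, encoding="unicode"))
```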

felixdittrich92 · 2025-02-11T15:13:07Z

I agree that in some cases it could be convenient to provide the "complete" multi-page hOCR output.
On the other hand, to my knowledge our preferred library, ocrmypdf, does not provide multi-page support (we also have a tutorial).

And I think from the current output it would be easier to drop the duplicated head elements (html, title, meta) instead of doing the opposite.

Wdyt ?

ejschoen · 2025-02-11T15:24:00Z

I did consider post-processing the result from Document.export_as_xml() and combining the pages into a single document tree, and then running .tostring on the result. It was an expedient to refactor Page.export_as_xml() as a way to avoid having to deal with ElementTree surgery. I'll go think about it some more.

I have used ocrmypdf in the past, but only at the application level, where it's able to deal with scanned-image PDFs and multi-page TIFFs. I haven't used its Python library directly, so I hadn't noticed that it works a page at a time there. This isn't surprising, since ocrmypdf was written for Tesseract, which doesn't natively read PDF and requires some external machinery like pdftoppm to burst and render each page separately.

felixdittrich92 · 2025-02-11T15:30:46Z

For now, I'd say it's fine to go ahead with the PR I mentioned, but if you have something smarter in mind, a PR is always welcome 👍

ejschoen added the type: bug Something isn't working label Feb 10, 2025

felixdittrich92 self-assigned this Feb 11, 2025

felixdittrich92 added the module: io Related to doctr.io label Feb 11, 2025

felixdittrich92 added this to the 0.12.0 milestone Feb 11, 2025

felixdittrich92 linked a pull request Feb 11, 2025 that will close this issue

[Fix] Fix invalid hOCR format & PDF/A compatiblity with the kie preditor #1870

Open
