XML element tree for export_as_xml is nested incorrectly? #1869

Open
ejschoen opened this issue Feb 10, 2025 · 5 comments · May be fixed by #1870
Labels: module: io (Related to doctr.io), type: bug (Something isn't working)

Comments

@ejschoen

Bug description

HOCR output creates a sequence of ocr_carea elements that are immediate children of the body, rather than immediate children of the page on which they appear. The page element itself has no children. I've read the HOCR spec, and it is arguably unclear on this point: I can't find a normative specification of the typesetting element hierarchy. However, there is an example here from the version 1.2 spec that suggests that ocr_carea should be contained within ocr_page:

<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
    <div class="ocr_par" id="par_7"> ... </div>
    <div class="ocr_par" id="par_19"> ... </div>
  </div>
</div>

The hierarchy above is consistent with Tesseract's output, for example. Any code written to interpret Tesseract output is going to loop over pages, within which it will loop over ocr_carea elements. However, doctr produces this hierarchy:

<body>
  <div class="ocr_page" id="page_1" title="image; bbox 0 0 1617 1579; ppageno 0" />
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579">
     <p class="ocr_par" id="par_1" title="bbox 0 44 1533 1579">
	 <span class="ocr_line" id="line_1" title="bbox 1429 44  1451 76;  baseline 0 0; x_size 0; x_descenders 0; x_ascenders 0">
	    <span class="ocrx_word" id="word_1" title="bbox 1429 44 1451 76;  x_wconf 61">-</span>
	</span>
     </p>
  </div>
</body>

Code written to interpret Tesseract output will fail and return nothing, because the ocr_page elements have no children.
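To make the failure mode concrete, here is a minimal sketch of a Tesseract-style consumer that loops over pages and then over each page's direct ocr_carea children. The helper name `careas_by_page` and the two trimmed-down XML snippets are illustrative assumptions, not code from doctr or from any real consumer.

```python
import xml.etree.ElementTree as ET

# Tesseract-style layout: the ocr_carea is a child of the ocr_page.
NESTED = """<body>
  <div class="ocr_page" id="page_1">
    <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579"/>
  </div>
</body>"""

# Layout described in this issue: the ocr_carea is a sibling of the ocr_page.
FLAT = """<body>
  <div class="ocr_page" id="page_1"/>
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579"/>
</body>"""


def careas_by_page(xml_text: str) -> dict[str, list[str]]:
    """Map each ocr_page id to the ids of its direct ocr_carea children."""
    root = ET.fromstring(xml_text)
    result = {}
    for page in root.iter("div"):
        if page.get("class") != "ocr_page":
            continue
        result[page.get("id")] = [
            child.get("id") for child in page if child.get("class") == "ocr_carea"
        ]
    return result


print(careas_by_page(NESTED))  # {'page_1': ['block_1']}
print(careas_by_page(FLAT))    # {'page_1': []}
```

With the flat layout the page is found but yields no areas, which is the "returns nothing" behaviour described above.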

The relevant code is in io/elements.py. I think the SubElement call that creates ocr_carea elements should use the SubElement that created the ocr_page element, but which today is not captured in a variable binding.

        # Create the body
        body = SubElement(page_hocr, "body")
        SubElement(
            body,
            "div",
            attrib={
                "class": "ocr_page",
                "id": f"page_{p_idx + 1}",
                "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
            },
        )
        # iterate over the blocks / lines / words and create the XML elements in body line by line with the attributes
        for class_name, predictions in self.predictions.items():
            for prediction in predictions:
                if len(prediction.geometry) != 2:
                    raise TypeError("XML export is only available for straight bounding boxes for now.")
                (xmin, ymin), (xmax, ymax) = prediction.geometry
                prediction_div = SubElement(
                    body,
                    "div",
                    attrib={
                        "class": "ocr_carea",
                        "id": f"{class_name}_prediction_{prediction_count}",
                        "title": f"bbox {int(round(xmin * width))} {int(round(ymin * height))} \
                        {int(round(xmax * width))} {int(round(ymax * height))}",
                    },
                )
                prediction_div.text = prediction.value
                prediction_count += 1

        return ET.tostring(page_hocr, encoding="utf-8", method="xml"), ET.ElementTree(page_hocr)
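A minimal sketch of the change suggested above, assuming the only thing needed is to keep the ocr_page element in a variable and pass it as the parent of each ocr_carea. The function name `build_page_body`, its parameters, and the `block_{idx}` ids are hypothetical; this is a standalone illustration, not the patch from #1870.

```python
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import SubElement


def build_page_body(
    width: int, height: int, boxes: list[tuple[float, float, float, float]]
) -> ET.Element:
    """Build a <body> whose ocr_carea divs are children of the ocr_page div.

    `boxes` holds relative (xmin, ymin, xmax, ymax) coordinates in [0, 1].
    """
    body = ET.Element("body")
    # Keep a reference to the page element so later elements can attach to it.
    page_div = SubElement(
        body,
        "div",
        attrib={
            "class": "ocr_page",
            "id": "page_1",
            "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
        },
    )
    for idx, (xmin, ymin, xmax, ymax) in enumerate(boxes, start=1):
        SubElement(
            page_div,  # parent is the page, not the body
            "div",
            attrib={
                "class": "ocr_carea",
                "id": f"block_{idx}",
                "title": f"bbox {int(round(xmin * width))} {int(round(ymin * height))} "
                f"{int(round(xmax * width))} {int(round(ymax * height))}",
            },
        )
    return body


body = build_page_body(100, 200, [(0.1, 0.2, 0.5, 0.6)])
print(ET.tostring(body, encoding="unicode"))
```

The body then has a single child (the page), and the page contains the areas.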

Code snippet to reproduce the bug

print(prediction.export_as_xml()[0][0])

Error traceback

None

Environment

This is current main branch code.

(.venv) i2kdevops@btc2:~/doctr/api$ python collect_env.py
Collecting environment information...

DocTR version: 0.11.1a0
TensorFlow version: N/A
PyTorch version: 2.6.0+cu124 (torchvision 0.21.0+cu124)
OpenCV version: 4.11.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7

Deep Learning backend

>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True
>>>
@ejschoen ejschoen added the type: bug Something isn't working label Feb 10, 2025
@felixdittrich92
Contributor

Hi @ejschoen 👋 ,

Thanks for reporting this! I’ve opened a quick PR to fix it: #1870
Could you confirm whether these changes fix your issue?

One thing I noticed is that we need to keep the <p> tag for the ocr_par element; otherwise, ocrmypdf fails to transform the XML or create the PDF/A correctly.

@felixdittrich92 felixdittrich92 self-assigned this Feb 11, 2025
@felixdittrich92 felixdittrich92 added the module: io Related to doctr.io label Feb 11, 2025
@felixdittrich92 felixdittrich92 added this to the 0.12.0 milestone Feb 11, 2025
@ejschoen
Author

Yes, by inspection, the change would solve the nesting issue. At present, our code ignores the specific element tags in the HOCR and just pays attention to the nesting itself.

For what it's worth, the code as written creates a complete HOCR document for each page. This is understandable if the common use case is OCR for single images. But if you supply a multi-page PDF (or perhaps a multi-image TIFF; I haven't tried that), you'll end up with multiple XML documents.

I had been working on a slight alteration to the ocr endpoint in api/app/routes/ocr.py. If the request contains an Accept header with value text/xml or other plausible MIME type for HOCR, then it tries to return the OCR result in HOCR format. For multi-page documents, I would have needed to figure out how to encode multiple XML files in the return, and didn't really want to deal with using multipart/form-data as a response type (or something like Zip for that matter.)

Since Tesseract can return HOCR files with multiple pages, I wanted to do this as well when given multi-image document formats such as PDF and TIFF. However, the export_as_xml() code in io.elements.Page isn't the right grain size for this. I ended up refactoring export_as_xml into two methods--one responsible for creating the HOCR XML document-level ElementTree structure and one for creating the page/carea/line/word elements. For now I have to duplicate the document-level ElementTree code in the ocr endpoint, but then can call Page.export_page_as_xml() on each page of the OCR'd document. I was considering making a PR for this.
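The split described above could look roughly like the following sketch: one function creates the document-level skeleton once, and a second appends one page's elements to the shared body. The names `new_hocr_document` and `append_page`, their parameters, and the flat word-only page content are assumptions made for illustration; they are not the refactored code in the fork linked below, and a real page would carry the full carea/par/line/word hierarchy.

```python
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import Element, SubElement


def new_hocr_document(title: str) -> tuple[Element, Element]:
    """Create the document-level skeleton once and return (root, body)."""
    root = Element("html")
    head = SubElement(root, "head")
    SubElement(head, "title").text = title
    body = SubElement(root, "body")
    return root, body


def append_page(
    body: Element, page_number: int, width: int, height: int, words: list[str]
) -> Element:
    """Append one ocr_page (with simplified word spans) to a shared body."""
    page_div = SubElement(
        body,
        "div",
        attrib={
            "class": "ocr_page",
            "id": f"page_{page_number}",
            "title": f"image; bbox 0 0 {width} {height}; ppageno {page_number - 1}",
        },
    )
    for w_idx, word in enumerate(words, start=1):
        span = SubElement(
            page_div,
            "span",
            attrib={"class": "ocrx_word", "id": f"word_{page_number}_{w_idx}"},
        )
        span.text = word
    return page_div


# One document, many pages: the caller owns the skeleton and loops over pages.
root, body = new_hocr_document("multi-page example")
for number, words in enumerate([["Hello"], ["World", "!"]], start=1):
    append_page(body, number, 1617, 1579, words)
xml_bytes = ET.tostring(root, encoding="utf-8", method="xml")
```

Keeping the skeleton separate means an endpoint can decide whether to emit one document per page or one document for the whole input.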

https://github.com/ejschoen/doctr.git

@felixdittrich92
Contributor

I agree that in some cases it could be convenient to provide the "complete" multipage hOCR output.
On the other hand, as far as I know, our preferred library, ocrmypdf, does not provide multipage support (we also have a tutorial).

And I think from the current output it would be easier to drop the duplicated head elements (html, title, meta) instead of doing the opposite.

What do you think?

@ejschoen
Author

I did consider post-processing the result from Document.export_as_xml() and combining the pages into a single document tree, and then running .tostring on the result. It was an expedient to refactor Page.export_as_xml() as a way to avoid having to deal with ElementTree surgery. I'll go think about it some more.
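For comparison, the post-processing route might be sketched as below: keep the first page's document (including its head) and move the body children of every later page into it. The helper `merge_pages` is hypothetical, and the sketch assumes each per-page root is an un-namespaced html element with a body child; roots parsed from serialized output that declares an xmlns would need namespace-qualified lookups instead. Element ids are copied as-is and not renumbered.

```python
import copy
import xml.etree.ElementTree as ET


def merge_pages(page_roots: list[ET.Element]) -> ET.Element:
    """Combine per-page documents into one, keeping only the first head."""
    if not page_roots:
        raise ValueError("need at least one page document")
    # Work on copies so the per-page trees are left untouched.
    merged = copy.deepcopy(page_roots[0])
    merged_body = merged.find("body")
    if merged_body is None:
        raise ValueError("first document has no <body>")
    for root in page_roots[1:]:
        body = root.find("body")
        if body is None:
            continue
        for child in body:
            merged_body.append(copy.deepcopy(child))
    return merged


def make_doc(page_id: str) -> ET.Element:
    """Build a tiny stand-in for one per-page document."""
    return ET.fromstring(
        "<html><head><title>t</title></head>"
        f'<body><div class="ocr_page" id="{page_id}"/></body></html>'
    )


merged = merge_pages([make_doc("page_1"), make_doc("page_2")])
print(ET.tostring(merged, encoding="unicode"))
```

This avoids touching the export code at the cost of building and then discarding a head for every page after the first.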

I have used ocrmypdf in the past, but only at the application level, where it's able to deal with scanned-image PDFs and multi-page TIFFs. I haven't used its Python library directly, so I hadn't noticed that it works a page at a time there. This isn't surprising, since ocrmypdf was written for Tesseract, which doesn't natively read PDF and requires some external machinery like pdftoppm to burst and render each page separately.

@felixdittrich92
Contributor

For now, I'd say it's fine to go ahead with the PR I mentioned, but if you have something smarter in mind, a PR is always welcome 👍


2 participants