Description
Bug description
The hOCR output creates a sequence of ocr_carea elements that are immediate children of the body, rather than immediate children of the page on which they appear. The page element itself has no children. I've read the hOCR spec, and it is arguably unclear on this point: I can't find a normative specification of the typesetting element hierarchy. However, there is an example here from the version 1.2 spec that suggests that ocr_carea should be contained within ocr_page:
<div class="ocr_page" id="page_1">
  <div class="ocr_carea" id="column_2" title="bbox 313 324 733 1922">
    <div class="ocr_par" id="par_7"> ... </div>
    <div class="ocr_par" id="par_19"> ... </div>
  </div>
</div>
The hierarchy above is consistent with Tesseract's output, for example. Any code written to interpret Tesseract output is going to loop over pages, within which it will loop over ocr_carea elements. However, doctr produces this hierarchy:
<body>
  <div class="ocr_page" id="page_1" title="image; bbox 0 0 1617 1579; ppageno 0" />
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579">
    <p class="ocr_par" id="par_1" title="bbox 0 44 1533 1579">
      <span class="ocr_line" id="line_1" title="bbox 1429 44 1451 76; baseline 0 0; x_size 0; x_descenders 0; x_ascenders 0">
        <span class="ocrx_word" id="word_1" title="bbox 1429 44 1451 76; x_wconf 61">-</span>
      </span>
    </p>
  </div>
</body>
Code written to interpret Tesseract output will fail and return nothing, because the ocr_page elements have no children.
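To make the failure concrete, here is a minimal sketch of the kind of consumer described above, using only Python's standard xml.etree.ElementTree. The function name `careas_per_page` and the two trimmed hOCR strings are illustrative assumptions, not code taken from doctr or from any Tesseract tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed hOCR bodies: ocr_carea nested inside ocr_page (as in the spec example)
# versus ocr_carea as a sibling of an empty ocr_page (as in the doctr output).
NESTED = """<body>
  <div class="ocr_page" id="page_1">
    <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579"/>
  </div>
</body>"""

FLAT = """<body>
  <div class="ocr_page" id="page_1"/>
  <div class="ocr_carea" id="block_1" title="bbox 0 44 1533 1579"/>
</body>"""


def careas_per_page(hocr_body: str) -> dict[str, list[str]]:
    """Map each ocr_page id to the ids of its direct ocr_carea children."""
    root = ET.fromstring(hocr_body)
    result = {}
    for page in root.iter("div"):
        if page.get("class") != "ocr_page":
            continue
        # Only direct children of the page are inspected, as a page-first loop would do.
        result[page.get("id")] = [
            child.get("id") for child in page if child.get("class") == "ocr_carea"
        ]
    return result


print(careas_per_page(NESTED))  # {'page_1': ['block_1']}
print(careas_per_page(FLAT))    # {'page_1': []}
```

With the flat hierarchy the page is found but yields no content areas, which matches the empty result reported here.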
The relevant code is in io/elements.py. I think the SubElement call that creates the ocr_carea elements should use, as its parent, the element returned by the SubElement call that creates the ocr_page element. Today that return value is not bound to a variable.
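A minimal sketch of that change, assuming nothing beyond the standard library: the return value of the first SubElement call is bound to a name (`page_div`, a hypothetical name) and passed as the parent of each ocr_carea. The function `build_body` and its simplified `boxes` argument are illustrative stand-ins, not doctr's actual method or data structures.

```python
import xml.etree.ElementTree as ET
from xml.etree.ElementTree import SubElement


def build_body(width: int, height: int, boxes: list[tuple[float, float, float, float]]) -> ET.Element:
    """Build an hOCR <body> whose ocr_carea elements are children of ocr_page.

    `boxes` holds relative (xmin, ymin, xmax, ymax) coordinates; it is a
    simplified stand-in for the predictions that doctr iterates over.
    """
    body = ET.Element("body")
    # Keep a reference to the page element instead of discarding it.
    page_div = SubElement(
        body,
        "div",
        attrib={
            "class": "ocr_page",
            "id": "page_1",
            "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
        },
    )
    for idx, (xmin, ymin, xmax, ymax) in enumerate(boxes, start=1):
        bbox = (
            f"bbox {int(round(xmin * width))} {int(round(ymin * height))} "
            f"{int(round(xmax * width))} {int(round(ymax * height))}"
        )
        # The parent is page_div rather than body.
        SubElement(
            page_div,
            "div",
            attrib={"class": "ocr_carea", "id": f"block_{idx}", "title": bbox},
        )
    return body


body = build_body(1000, 2000, [(0.0, 0.25, 1.0, 0.5)])
print(ET.tostring(body, encoding="unicode"))
```

The current code, for comparison: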
# Create the body
body = SubElement(page_hocr, "body")
SubElement(
    body,
    "div",
    attrib={
        "class": "ocr_page",
        "id": f"page_{p_idx + 1}",
        "title": f"image; bbox 0 0 {width} {height}; ppageno 0",
    },
)
# iterate over the blocks / lines / words and create the XML elements in body line by line with the attributes
for class_name, predictions in self.predictions.items():
    for prediction in predictions:
        if len(prediction.geometry) != 2:
            raise TypeError("XML export is only available for straight bounding boxes for now.")
        (xmin, ymin), (xmax, ymax) = prediction.geometry
        prediction_div = SubElement(
            body,
            "div",
            attrib={
                "class": "ocr_carea",
                "id": f"{class_name}_prediction_{prediction_count}",
                "title": f"bbox {int(round(xmin * width))} {int(round(ymin * height))} \
                {int(round(xmax * width))} {int(round(ymax * height))}",
            },
        )
        prediction_div.text = prediction.value
        prediction_count += 1
return ET.tostring(page_hocr, encoding="utf-8", method="xml"), ET.ElementTree(page_hocr)
Code snippet to reproduce the bug
print(prediction.export_as_xml[0][0])
Error traceback
None
Environment
This is current main branch code.
(.venv) i2kdevops@btc2:~/doctr/api$ python collect_env.py
Collecting environment information...
DocTR version: 0.11.1a0
TensorFlow version: N/A
PyTorch version: 2.6.0+cu124 (torchvision 0.21.0+cu124)
OpenCV version: 4.11.0
OS: Ubuntu 22.04.3 LTS
Python version: 3.10.12
Is CUDA available (TensorFlow): N/A
Is CUDA available (PyTorch): Yes
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 4080
Nvidia driver version: 535.183.01
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
Deep Learning backend
>>> print(f"is_tf_available: {is_tf_available()}")
is_tf_available: False
>>> print(f"is_torch_available: {is_torch_available()}")
is_torch_available: True
>>>