Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied #185

imene-swaan · 2024-10-29T00:36:44Z

Description:

I'm using Docling to parse a PDF that contains text. The PDF appears to use a non-standard font or encoding, as copying text directly from it also yields garbled characters. Despite setting do_ocr=True and specifying Tesseract as the OCR engine, Docling's output remains unreadable. Testing with Docling v1 produces a different, but similarly unreadable, output containing placeholder glyphs.

Here’s an example of the output generated by the current Docling version:

'()* +,- .+..  /01 02034567638469:; 4<8:=> -                 '()* +,- .+..  /01 02034567638469:; 4<8:=> 4-                 '()* +,- .+..

When using Docling v1, the output looks like this instead:

GLYPH<38> GLYPH<39> GLYPH<40> GLYPH<41> GLYPH<i255> GLYPH<43> GLYPH<44> GLYPH<45> GLYPH<i255> GLYPH<46> GLYPH<43> GLYPH<46> GLYPH<46>
## GLYPH<47> GLYPH<48> GLYPH<49>GLYPH<i255> GLYPH<48> GLYPH<51> GLYPH<48> GLYPH<52> ...

Steps to Reproduce:

1. Initialize PdfPipelineOptions with do_ocr=True to enable OCR.
2. Set ocr_options to use TesseractCliOcrOptions.
3. Attempt to parse a PDF with non-standard text encoding.
4. Observe the output, which contains garbled symbols or glyph placeholders instead of the correct text content.

Expected Behavior:

Docling should apply OCR, yielding readable output.

Observed Behavior:

The output consists of unreadable characters or placeholder glyphs, suggesting that Docling is not applying OCR despite do_ocr=True.
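A quick way to flag this kind of output programmatically is sketched below. It is only a rough heuristic, not part of Docling: the function names and the 0.5 threshold are assumptions made for illustration, and text that is legitimately dominated by digits or symbols (for example, numeric tables) would be flagged as well.

```python
import re

# Matches the placeholder tokens seen in the Docling v1 output, e.g. GLYPH<38> or GLYPH<i255>.
GLYPH_RE = re.compile(r"GLYPH<[^>]+>")


def count_glyph_placeholders(text: str) -> int:
    """Count GLYPH<...> placeholder tokens in the extracted text."""
    return len(GLYPH_RE.findall(text))


def letter_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are alphabetic."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        return 0.0
    return sum(ch.isalpha() for ch in visible) / len(visible)


def looks_unreadable(text: str, min_letter_ratio: float = 0.5) -> bool:
    """Heuristic: placeholders present, or too few letters to be prose."""
    if not text.strip():
        return False  # nothing to judge
    if count_glyph_placeholders(text) > 0:
        return True
    return letter_ratio(text) < min_letter_ratio
```

Both sample outputs from the description trip this check, the first because it contains no letters at all and the second because of the GLYPH<...> tokens. The garbled text copied straight from the PDF is mostly letters and would likely slip through, so this cannot replace inspecting the fonts.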

Environment:

Docling Version: 2.2.1 (v1: 1.20.0)
Python Version: 3.11.10
Operating System: macOS 14.3.1
Tesseract 5.4.1

Troubleshooting Steps Taken:

Verified that Tesseract is correctly installed and functional by running it independently on the same PDF, which produced readable text output.

Additional Information:

When copying text directly from the PDF, it appears garbled, as follows:

WX?6469Y>ÿZ28:>
[ELAÿ'(OU-ÿPBAMÿAM*ÿ\]^ÿ_`aÿbbÿQEDcEC*-ÿAM*ÿV

When examining the PDF's font properties, I found that it uses Type3 fonts. The code for inspecting the font:

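import fitz  # PyMuPDF

pdf_path = "document.pdf"  # placeholder: path to the affected PDF
doc = fitz.open(pdf_path)
page = doc[0]  # first page
fonts = page.get_fonts(full=True)  # list of font tuples used on the page
print(fonts)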

The output:

[(821, 'n/a', 'Type3', 'T1', 'T1', '', 0)]
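The tuple above can be checked for Type3 fonts with a small helper. This is a sketch that assumes the font type sits at index 2 of each tuple, as it does in the output shown here; `type3_fonts` is a hypothetical name, not a PyMuPDF function.

```python
def type3_fonts(fonts):
    """Return the font tuples whose type field (index 2) is 'Type3'.

    `fonts` is expected to be the list returned by page.get_fonts(full=True).
    """
    return [font for font in fonts if len(font) > 2 and font[2] == "Type3"]


# Example using the output reported above.
reported = [(821, 'n/a', 'Type3', 'T1', 'T1', '', 0)]
if type3_fonts(reported):
    print("Page uses Type3 fonts; extracted text may not map to real characters.")
```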

The fact that Tesseract works independently implies that Docling might not be applying OCR correctly, even though do_ocr=True and Tesseract is specified as the engine. The differing outputs between Docling v1 and the current version may also indicate a change in how Docling handles such PDFs. Any insights or solutions for handling PDFs with embedded fonts would be greatly appreciated.


PeterStaar-IBM · 2024-10-29T07:50:03Z

@imene-swaan Thanks for bringing this up! Could you attach the pdf (even a single page is enough)? We will look into it asap!

cau-git · 2024-10-29T08:29:48Z

@imene-swaan thanks for the detailed report!

To clarify, OCR cannot help you in this case, because Docling does not run OCR unless an actual bitmap resource is detected in the PDF. Hence, OCR will never trigger on programmatic text, even if the font is unknown.

As a temporary workaround, you can choose a different PDF backend for this case, e.g. PyPdfiumDocumentBackend, to see if this helps. (Note that this workaround may come with other issues, such as merged table rows.)

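from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Use the pypdfium2-based backend instead of the default PDF backend.
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)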

imene-swaan · 2024-10-29T16:05:40Z

@cau-git It could be beneficial to save the pdf pages as images and then trigger the OCR in such cases.

Also, I tried PyPdfiumDocumentBackend and the results are the same.
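The rasterize-then-OCR idea suggested in this comment can be tried outside Docling. The sketch below assumes the `pdftoppm` (Poppler) and `tesseract` command-line tools are installed and on PATH; the helper names and the 300 DPI default are illustrative choices, not anything Docling provides.

```python
import subprocess
from pathlib import Path


def rasterize_cmd(pdf_path, page, out_prefix, dpi=300):
    """Build the pdftoppm command that renders one page to <out_prefix>.png."""
    return [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page to render
        "-r", str(dpi),
        "-png", "-singlefile",
        str(pdf_path), str(out_prefix),
    ]


def ocr_cmd(image_path, lang="eng"):
    """Build the tesseract command that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_page(pdf_path, page, workdir, lang="eng"):
    """Render one PDF page to an image and return the OCR text for it."""
    prefix = Path(workdir) / f"page-{page}"
    subprocess.run(rasterize_cmd(pdf_path, page, prefix), check=True)
    result = subprocess.run(
        ocr_cmd(f"{prefix}.png", lang),
        check=True, capture_output=True, text=True,
    )
    return result.stdout
```

Calling `ocr_page("document.pdf", 1, "/tmp")` would return the recognized text of the first page. This bypasses the PDF's embedded text entirely, which is why it is unaffected by the font encoding.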

PeterStaar-IBM · 2024-10-29T16:35:40Z

@imene-swaan I think it is a particular font we do not (yet) support. If you can provide us a simple pdf sample, I might fix it for you.

imene-swaan · 2024-11-01T09:06:32Z

@PeterStaar-IBM here's an example:
https://content.influencemap.org//site/data/000/982/Enel_corporate_website_energy_mix_June_2022_June_2022.pdf

PeterStaar-IBM · 2024-11-05T05:54:10Z

@imene-swaan I tracked it down. Bottom line: some web browsers have very poor PDF printers, meaning that they don't encode the text. You can test it yourself by copying the text and pasting it into a text file. What you see is mangled characters, because these printers only care about how the page prints.

This gives you two options:

Try using OCR: We have several OCR options (easyOCR and tesserocr).
Leverage our native HTML: I think this is the preferred option. If you are anyway printing a webpage, it might be much faster to parse the HTML directly.

I tested it directly with this command:

poetry run docling --from html --to md "https://www.enel.com/company/stories/articles/2022/06/projects-innovative-electrification-renewables" --output ./scratch/

I found some issues and have fixed them in this PR (#240). The output from the PR is pretty good.

I will review the PR with my colleagues and make sure it gets in asap!

Thanks for pointing this issue out!

imene-swaan · 2024-11-05T11:51:25Z

@PeterStaar-IBM As I mentioned in my issue description, the main issue seems to be that OCR is not applied even if I specify do_ocr=True. @cau-git mentioned that OCR is not triggered unless there is an actual bitmap resource element.

An ideal solution would be to force trigger OCR if the font is unknown and do_ocr=True.
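To make the request concrete, the decision rule could look roughly like the sketch below. It is not Docling's implementation: `PageInfo`, `should_run_ocr`, and `force_on_suspect_fonts` are hypothetical names, and treating Type3 as the marker of an "unknown" font is an assumption based on the font found in this PDF.

```python
from dataclasses import dataclass

# Assumption: font types whose extracted text is treated as untrustworthy.
SUSPECT_FONT_TYPES = {"Type3"}


@dataclass
class PageInfo:
    has_bitmap: bool       # page contains a bitmap resource
    font_types: list[str]  # font types found on the page, e.g. ["Type3"]


def should_run_ocr(page: PageInfo, do_ocr: bool,
                   force_on_suspect_fonts: bool = False) -> bool:
    """Decide whether OCR should run on a page."""
    if not do_ocr:
        return False
    if page.has_bitmap:
        return True  # behaviour described earlier in the thread
    # Proposed addition: also run OCR when the page's fonts are suspect.
    return force_on_suspect_fonts and any(
        font_type in SUSPECT_FONT_TYPES for font_type in page.font_types
    )
```

With `force_on_suspect_fonts=False` this mirrors the behaviour described earlier in the thread, where a page without bitmaps never reaches OCR. Turning the flag on would cover the PDF from this issue.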

PeterStaar-IBM · 2024-11-05T12:14:34Z

@imene-swaan Yes, we are indeed adding the forced OCR feature!
