Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied #185

Open
imene-swaan opened this issue Oct 29, 2024 · 8 comments

Comments

@imene-swaan

Description:

I'm using Docling to parse a PDF that contains text. The PDF appears to use a non-standard font or encoding, as copying text directly from it also yields garbled characters. Despite setting do_ocr=True and specifying Tesseract as the OCR engine, Docling's output remains unreadable. Testing with Docling v1 produces a different, but similarly unreadable, output containing placeholder glyphs.

Here’s an example of the output generated by the current Docling version:

'()* +,- .+..  /01 02034567638469:; 4<8:=> -                 '()* +,- .+..  /01 02034567638469:; 4<8:=> 4-                 '()* +,- .+..

When using Docling v1, the output looks like this instead:

GLYPH<38> GLYPH<39> GLYPH<40> GLYPH<41> GLYPH<i255> GLYPH<43> GLYPH<44> GLYPH<45> GLYPH<i255> GLYPH<46> GLYPH<43> GLYPH<46> GLYPH<46>
## GLYPH<47> GLYPH<48> GLYPH<49>GLYPH<i255> GLYPH<48> GLYPH<51> GLYPH<48> GLYPH<52> ...
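A quick way to quantify how much of such an output is placeholder text is to count the tokens that contain a `GLYPH<...>` marker. This is only a minimal sketch: the helper name `glyph_placeholder_ratio` is hypothetical and not part of Docling, and the pattern is inferred from the sample output above.

```python
import re

# Pattern inferred from the sample output above (e.g. GLYPH<38>, GLYPH<i255>).
GLYPH_RE = re.compile(r"GLYPH<[^>]*>")


def glyph_placeholder_ratio(text):
    """Return the fraction of whitespace-separated tokens containing a GLYPH<...> marker."""
    tokens = text.split()
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if GLYPH_RE.search(tok))
    return hits / len(tokens)


sample = "## GLYPH<47> GLYPH<48> GLYPH<49>GLYPH<i255>"
print(glyph_placeholder_ratio(sample))
```

A ratio close to 1.0 would suggest that the text layer carries almost no usable characters.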

Steps to Reproduce:

  • Initialize PdfPipelineOptions with do_ocr=True to enable OCR.
  • Set ocr_options to use TesseractCliOcrOptions.
  • Attempt to parse a PDF with non-standard text encoding.
  • Observe the output, which contains garbled symbols or glyph placeholders, instead of the correct text content.

Expected Behavior:

Docling should apply OCR, yielding readable output.

Observed Behavior:

The output consists of unreadable characters or placeholder glyphs, suggesting that Docling is not applying OCR despite do_ocr=True.

Environment:

  • Docling Version: 2.2.1 (v1: 1.20.0)
  • Python Version: 3.11.10
  • Operating System: MacOS 14.3.1
  • Tesseract 5.4.1

Troubleshooting Steps Taken:

  • Verified that Tesseract is correctly installed and functional by running it independently on the same PDF, which produced readable text output.

Additional Information:

When copying text directly from the PDF, it appears garbled, as follows:

WX?6469Y>ÿZ28:>
[ELAÿ'(OU-ÿPBAMÿAM*ÿ\]^ÿ_`aÿbbÿQEDcEC*-ÿAM*ÿV

When examining the PDF's font properties, I found that it uses Type3 fonts. The code for inspecting the font:

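import fitz  # PyMuPDF

pdf_path = "document.pdf"  # placeholder: path to the affected PDF
doc = fitz.open(pdf_path)
page = doc[0]  # first page
fonts = page.get_fonts(full=True)  # list of font tuples used on the page
print(fonts)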

The output:

[(821, 'n/a', 'Type3', 'T1', 'T1', '', 0)]
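To flag such pages automatically, the tuples returned by `page.get_fonts(full=True)` can be filtered on their type field (index 2 in the output above). A minimal sketch, assuming that tuple layout; `type3_fonts` is a hypothetical helper name, not a PyMuPDF or Docling function.

```python
def type3_fonts(fonts):
    """Return the entries whose font type (index 2 of each tuple) is 'Type3'."""
    return [entry for entry in fonts if len(entry) > 2 and entry[2] == "Type3"]


# The font list printed above, reused as sample input.
fonts = [(821, "n/a", "Type3", "T1", "T1", "", 0)]
flagged = type3_fonts(fonts)
print(f"{len(flagged)} of {len(fonts)} fonts are Type3")
```

A non-empty result does not prove that the text is unreadable, but for this PDF it coincides with the garbled extraction.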

The fact that Tesseract works independently implies that Docling might not be applying OCR correctly, even though do_ocr=True and Tesseract is specified as the engine. The differing outputs between Docling v1 and the current version may also indicate a change in how Docling handles such PDFs. Any insights or solutions for handling PDFs with embedded fonts would be greatly appreciated.

@PeterStaar-IBM
Contributor

@imene-swaan Thanks for bringing this up! Could you attach the pdf (even a single page is enough)? We will look into it asap!

@cau-git
Contributor

cau-git commented Oct 29, 2024

@imene-swaan thanks for the detailed report!

To clarify, OCR cannot help you in this case, because docling does not run OCR unless it detects an actual bitmap resource in the PDF. Hence, OCR will never trigger on programmatic text, even if the font is unknown.

As a temporary workaround, you can choose a different PDF backend for this case, e.g. PyPdfiumDocumentBackend, to see if this helps. Note that this workaround may come with other issues, such as merged table rows.

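from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Use the pypdfium2-based backend for PDF inputs instead of the default one.
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)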

@imene-swaan
Author

@cau-git It could be beneficial to save the pdf pages as images and then trigger the OCR in such cases.
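Outside of Docling, that idea can be sketched roughly as follows: render each page to an image with PyMuPDF, then pass the image to the Tesseract CLI. This is an illustration under assumptions, not how Docling implements OCR; the function names, the 300 dpi default, and the `eng` language default are all hypothetical choices.

```python
import subprocess
import tempfile
from pathlib import Path


def tesseract_command(image_path, lang="eng"):
    """Build the Tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_pdf_pages(pdf_path, dpi=300, lang="eng"):
    """Render every page to a PNG and OCR it; returns one text string per page."""
    import fitz  # PyMuPDF, imported lazily so tesseract_command works without it

    texts = []
    with tempfile.TemporaryDirectory() as tmp, fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            image_path = Path(tmp) / f"page_{index}.png"
            # Rasterize the page so OCR sees pixels instead of the broken text layer.
            page.get_pixmap(dpi=dpi).save(str(image_path))
            result = subprocess.run(
                tesseract_command(image_path, lang),
                capture_output=True,
                text=True,
                check=True,
            )
            texts.append(result.stdout)
    return texts
```

Calling `ocr_pdf_pages("document.pdf")` (a placeholder path) requires both PyMuPDF and a `tesseract` binary on the PATH.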

Also, I tried PyPdfiumDocumentBackend and the results are the same.

@PeterStaar-IBM
Contributor

@imene-swaan I think it is a particular font we do not (yet) support. If you can provide us a simple pdf sample, I might fix it for you.

@PeterStaar-IBM
Contributor

@imene-swaan I tracked it down. Bottom line: some web browsers have very poor PDF printers, meaning that they don't encode the text. You can test it yourself by copying the text and pasting it into a text file. What you see is mangled characters, because these printers only care about how the page prints.

This gives you two options:

  1. Try using OCR: We have several OCR options (easyOCR and tesserocr).
  2. Leverage our native HTML: I think this is the preferred option. If you are anyway printing a webpage, it might be much faster to parse the HTML directly.

I tested it directly with this command:

poetry run docling --from html --to md "https://www.enel.com/company/stories/articles/2022/06/projects-innovative-electrification-renewables" --output ./scratch/

I found some issues and have fixed them in this PR (#240). The output is pretty good (from the PR):

[Image: Screenshot 2024-11-05 at 06 53 27]

I will review the PR with my colleagues and make sure it gets in asap!

Thanks for pointing this issue out!

@imene-swaan
Author

@PeterStaar-IBM As I mentioned in my issue description, the main issue seems to be that OCR is not applied even if I specify do_ocr=True. @cau-git mentioned that OCR is not triggered unless there is an actual bitmap resource element.

An ideal solution would be to force trigger OCR if the font is unknown and do_ocr=True.
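As a rough illustration of such a trigger, the decision could also be based on the extracted text itself rather than on the font. Note that this swaps in a different signal from the one proposed above: it counts how many tokens look like words. Both function names and the 0.5 threshold are hypothetical, and nothing here reflects Docling's actual behavior.

```python
import string


def looks_garbled(text, min_word_ratio=0.5):
    """Return True when fewer than min_word_ratio of the tokens look like words.

    A token counts as word-like if it is purely alphabetic after stripping
    surrounding ASCII punctuation.
    """
    tokens = text.split()
    if not tokens:
        return False
    wordlike = sum(1 for tok in tokens if tok.strip(string.punctuation).isalpha())
    return wordlike / len(tokens) < min_word_ratio


def should_force_ocr(do_ocr, extracted_text):
    """Force OCR only when the user enabled OCR and the text layer looks unusable."""
    return do_ocr and looks_garbled(extracted_text)


garbled = "'()* +,- .+..  /01 02034567638469:; 4<8:=> -"
print(should_force_ocr(True, garbled))
```

The heuristic is tuned for English-like prose; documents that are mostly numbers or symbols would need a different threshold.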

@PeterStaar-IBM
Contributor

@imene-swaan Yes, we are indeed adding the forced OCR feature!
