Docling Produces Unreadable Text Output for PDF with non-standard Font Encoding, OCR Appears Not to be Applied #185
Comments
@imene-swaan Thanks for bringing this up! Could you attach the PDF (even a single page is enough)? We will look into it asap!
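@imene-swaan thanks for the detailed report! To clarify, OCR cannot help you in this case, because Docling does not run OCR unless an actual bitmap resource is detected in the PDF. Hence, OCR will never trigger on programmatic text, even if the font is unknown. As a temporary workaround, you can choose a different PDF backend for this case, e.g.
from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption

# Parse PDFs with the pypdfium2 backend instead of the default one
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(backend=PyPdfiumDocumentBackend)
    }
)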
@cau-git It could be beneficial to save the PDF pages as images and then trigger the OCR in such cases. Also, I tried
@imene-swaan I think it is a particular font we do not (yet) support. If you can provide us a simple PDF sample, I might fix it for you.
@imene-swaan I tracked it down. Bottom line: some web browsers have very bad PDF printers (meaning that they don't encode the text). You can test it yourself by copying the text and pasting it into a text file. What you see is mangled characters, because these printers only care about how the page prints. This gives you two options:
I tested it directly with this command,
I found some issues and have fixed them in this PR (#240). The output from the PR is pretty good. I will review the PR with my colleagues and make sure it gets in asap! Thanks for pointing this issue out!
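The copy-and-paste test described in this comment can be approximated in code. The following is a minimal sketch, assuming the extracted text is already available as a string. The names `garbled_ratio` and `looks_garbled` and the 0.3 threshold are illustrative assumptions and not part of Docling. The heuristic only flags placeholder-style output (private-use, unassigned, control, or replacement characters); text that was mapped to the wrong but otherwise valid letters will not be caught.

```python
import unicodedata

# Unicode general categories that rarely appear in readable text:
# Co = private use, Cn = unassigned, Cc = control, Cs = surrogate.
SUSPICIOUS_CATEGORIES = {"Co", "Cn", "Cc", "Cs"}


def garbled_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that look like placeholders."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    suspicious = sum(
        1
        for c in chars
        # U+FFFD is the Unicode replacement character.
        if c == "\ufffd" or unicodedata.category(c) in SUSPICIOUS_CATEGORIES
    )
    return suspicious / len(chars)


def looks_garbled(text: str, threshold: float = 0.3) -> bool:
    """Heuristic check; the threshold is an arbitrary illustrative value."""
    return garbled_ratio(text) > threshold
```

A check like this could be run on the text layer of a page before deciding whether to fall back to OCR.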
@PeterStaar-IBM As I've mentioned in my issue description, the main issue seems to be the OCR not being applied even if I specify An ideal solution would be to force trigger OCR if the font is unknown and
@imene-swaan Yes, we are indeed adding the forced OCR feature!
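For readers arriving later: the thread does not say how the forced OCR feature is exposed. The sketch below assumes a Docling release in which the OCR options accept a `force_full_page_ocr` flag; if your installed version predates that flag, the constructor call will fail. The function name `forced_ocr_pipeline_options` is hypothetical, and the imports are kept inside the function so the snippet can be loaded without Docling installed.

```python
def forced_ocr_pipeline_options():
    """Return PDF pipeline options that request OCR on every full page.

    Assumption: the installed Docling version provides the
    force_full_page_ocr flag on its OCR option classes.
    """
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True
    # Ignore the embedded text layer and OCR the rendered page instead.
    pipeline_options.ocr_options = TesseractCliOcrOptions(force_full_page_ocr=True)
    return pipeline_options
```

The returned options would be passed to the converter through `PdfFormatOption(pipeline_options=...)`.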
Description:
I'm using Docling to parse a PDF that contains text. The PDF appears to use a non-standard font or encoding, as copying text directly from it also yields garbled characters. Despite setting do_ocr=True and specifying Tesseract as the OCR engine, Docling's output remains unreadable. Testing with Docling v1 produces a different, but similarly unreadable, output containing placeholder glyphs.
Here's an example of the output generated by the current Docling version:
When using Docling v1, the output looks like this instead:
Steps to Reproduce:
Configure PdfPipelineOptions with do_ocr=True to enable OCR.
Set ocr_options to use TesseractCliOcrOptions.
Expected Behavior:
Docling should apply OCR, yielding readable output.
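The configuration described under Steps to Reproduce can be sketched roughly as follows. This is a reconstruction from the option names mentioned in the report, not the reporter's original script. The function name `build_repro_converter` is hypothetical, and the imports sit inside the function so the snippet loads even where Docling is not installed.

```python
def build_repro_converter():
    """Build a converter with OCR enabled and the Tesseract CLI engine selected."""
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        PdfPipelineOptions,
        TesseractCliOcrOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.do_ocr = True  # enable OCR
    pipeline_options.ocr_options = TesseractCliOcrOptions()  # use the Tesseract CLI

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
```

With a converter built this way, `build_repro_converter().convert(path).document.export_to_markdown()` would yield the text to inspect, where `path` stands for the affected PDF.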
Observed Behavior:
The output consists of unreadable characters or placeholder glyphs, suggesting that Docling is not applying OCR despite do_ocr=True.
Environment:
Troubleshooting Steps Taken:
Additional Information:
When copying text directly from the PDF, it appears garbled, as follows:
When examining the PDF's font properties, I found that it uses Type3 fonts. The code for inspecting the font:
The output:
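The reporter's inspection code and its output are not preserved in this copy of the thread. As a stand-in, here is a minimal sketch of one way to list font subtypes, assuming the pypdf package is installed. The function names `font_subtypes` and `uses_type3` are illustrative, and the sketch only looks at fonts declared directly in each page's own resources.

```python
def font_subtypes(pdf_path):
    """Collect (font name, subtype) pairs from the page resources of a PDF."""
    from pypdf import PdfReader  # assumption: pypdf is available

    found = set()
    for page in PdfReader(pdf_path).pages:
        resources = page.get("/Resources")
        if resources is None:
            continue
        fonts = resources.get_object().get("/Font")
        if fonts is None:
            continue
        for key, ref in fonts.get_object().items():
            font = ref.get_object()
            # Type3 fonts often lack /BaseFont, so fall back to the resource key.
            found.add((str(font.get("/BaseFont", key)), str(font.get("/Subtype"))))
    return found


def uses_type3(subtypes):
    """True if any (name, subtype) pair reports a Type3 font."""
    return any(subtype == "/Type3" for _, subtype in subtypes)
```

Calling `uses_type3(font_subtypes(path))` on the affected file would be expected to return True if the report's finding holds.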
The fact that Tesseract works independently implies that Docling might not be applying OCR correctly, even though do_ocr=True is set and Tesseract is specified as the engine. The differing outputs between Docling v1 and the current version may also indicate a change in how Docling handles such PDFs. Any insights or solutions for handling PDFs with embedded fonts would be greatly appreciated.
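The independent Tesseract check mentioned above, and the suggestion in the comments to save pages as images before running OCR, could look roughly like this. It is a sketch under stated assumptions: the pypdfium2 and pytesseract packages and the Tesseract binary are installed, and the names `scale_for_dpi` and `ocr_pdf_pages` are made up for illustration. It bypasses Docling entirely.

```python
def scale_for_dpi(dpi: float) -> float:
    """PDF user space is 72 units per inch, so a render scale of 1 equals 72 DPI."""
    return dpi / 72.0


def ocr_pdf_pages(pdf_path: str, dpi: float = 300, lang: str = "eng") -> list:
    """Render each page to an image and OCR it, ignoring the PDF text layer."""
    import pypdfium2 as pdfium
    import pytesseract

    pdf = pdfium.PdfDocument(pdf_path)
    texts = []
    try:
        for index in range(len(pdf)):
            # Rasterize the page, then hand the PIL image to Tesseract.
            image = pdf[index].render(scale=scale_for_dpi(dpi)).to_pil()
            texts.append(pytesseract.image_to_string(image, lang=lang))
    finally:
        pdf.close()
    return texts
```

Because the text layer is never read, the font encoding of the PDF does not affect the result; quality depends only on render resolution and the OCR engine.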