Encoding issue on default backend DoclingParseV2DocumentBackend for PDF #663

Open
Seigneurhol opened this issue Dec 30, 2024 · 6 comments
Labels: bug (Something isn't working), PDF parsing

Comments

@Seigneurhol

Bug

When I use the default parser (DoclingParseV2DocumentBackend) to parse a PDF, I get encoding issues: "ao\u00fbt, facturation \u00e0". It works fine with PyPdfiumDocumentBackend.
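As a quick sanity check on the string quoted above, the `\u00fb` and `\u00e0` sequences are the JSON-style escapes for "û" and "à". The sketch below only decodes the quoted text. It assumes the string was copied from a JSON view of the output, which the report does not state, so it is a way to rule out a serialization artifact rather than a diagnosis of the backend.

```python
import json

# The string exactly as quoted in the report, backslashes included.
reported = r'"ao\u00fbt, facturation \u00e0"'

# Decoding the JSON string literal turns the escapes back into characters.
decoded = json.loads(reported)
print(decoded)

# A JSON serializer can emit either form for the same text:
print(json.dumps(decoded))                      # ASCII-only, with \uXXXX escapes
print(json.dumps(decoded, ensure_ascii=False))  # literal accented characters
```

If the exported Markdown itself contains literal backslash sequences, the problem is upstream of any JSON encoding and the backend comparison in this thread is the more relevant lead.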

Steps to reproduce

Use the default DocumentConverter without specifying a backend.

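    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import (
        AcceleratorDevice,
        AcceleratorOptions,
        EasyOcrOptions,
        PdfPipelineOptions,
    )
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions()
    pipeline_options.do_table_structure = True
    pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["fr", "en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )

    # No backend is specified, so the default PDF backend is used.
    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=pipeline_options,
            )
        }
    )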

Then read a PDF and convert it to markdown.

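    import io

    from docling.datamodel.base_models import DocumentStream

    # `content` (presumably the raw PDF bytes) and `filename` come from the calling code.
    doc_stream = io.BytesIO(content)
    input_name = filename if filename else "document.pdf"

    # Convert the document and export it to Markdown
    result = converter.convert(
        DocumentStream(name=input_name, stream=doc_stream)
    )
    markdown = result.document.export_to_markdown()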
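Since the report says the same PDF converts correctly with PyPdfiumDocumentBackend, a side-by-side run can help isolate the difference. The following is a minimal sketch, not the reporter's code: `build_converter` and `compare_backends` are hypothetical helper names, and the `backend=` argument and the `docling.backend.pypdfium2_backend` import path reflect my understanding of the docling 2.x API, so please check them against the installed version.

```python
def build_converter(use_pypdfium: bool = False):
    """Return a DocumentConverter, optionally forcing the pypdfium2 PDF backend."""
    # Imported lazily so this module can be loaded without docling installed.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pdf_kwargs = {"pipeline_options": PdfPipelineOptions()}
    if use_pypdfium:
        # Assumption: passing backend= overrides the default PDF backend.
        pdf_kwargs["backend"] = PyPdfiumDocumentBackend
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(**pdf_kwargs)}
    )


def compare_backends(pdf_path: str) -> tuple[str, str]:
    """Convert one PDF with the default backend and with pypdfium2."""
    default_md = (
        build_converter().convert(source=pdf_path).document.export_to_markdown()
    )
    pdfium_md = (
        build_converter(use_pypdfium=True)
        .convert(source=pdf_path)
        .document.export_to_markdown()
    )
    return default_md, pdfium_md
```

Calling `compare_backends("some.pdf")` on an affected file and diffing the two strings should show whether the garbled characters appear only with the default backend.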

Docling version

Docling version: 2.14.0

Python version

Python 3.12.3

Seigneurhol added the bug label Dec 30, 2024
@trinanjan12

trinanjan12 commented Dec 30, 2024

@Seigneurhol I tested this code, and it seems to work for me with docling 2.14 and Python 3.11:

    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.datamodel.pipeline_options import EasyOcrOptions
    from docling.datamodel.pipeline_options import AcceleratorDevice, AcceleratorOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption
    from docling.datamodel.base_models import InputFormat

    pipeline_options = PdfPipelineOptions()
    pipeline_options.generate_picture_images = True
    pipeline_options.do_ocr = True
    pipeline_options.ocr_options = EasyOcrOptions()
    pipeline_options.do_table_structure = True
    # pipeline_options.table_structure_options.do_cell_matching = True
    pipeline_options.ocr_options.lang = ["fr", "en"]
    pipeline_options.accelerator_options = AcceleratorOptions(
        num_threads=4, device=AcceleratorDevice.AUTO
    )

    converter = DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )
    x = converter.convert(source="./tests/data/2305.03393v1-pg9.pdf")

    x.document

@Seigneurhol
Author

You don't have any problems with accents or special characters?

@trinanjan12

[Image: Screenshot from 2024-12-30 22-27-22]

@Seigneurhol
Author

Yes, you are right. On some documents it works fine, but on others there are encoding issues that don't happen with PyPdfiumDocumentBackend.
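Because only some documents are affected, it may help to triage files by comparing which non-ASCII characters each export contains. This is a small stdlib-only sketch; the function names are hypothetical, and the two sample strings are made up for illustration rather than taken from real converter output.

```python
from collections import Counter


def non_ascii_inventory(text: str) -> Counter:
    """Count every character outside the ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


def only_in_first(first: str, second: str) -> Counter:
    """Non-ASCII characters that occur more often in `first` than in `second`."""
    return non_ascii_inventory(first) - non_ascii_inventory(second)


# Illustrative inputs only: two exports of the same text, one without accents.
export_a = "août, facturation à"
export_b = "aout, facturation a"
for char, count in only_in_first(export_a, export_b).items():
    print(f"U+{ord(char):04X} {char!r}: {count} more in the first export")
```

Running this in both directions on two Markdown exports of the same PDF gives a short list of characters to inspect, which is easier to share in a bug report than full documents.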

@PeterStaar-IBM
Contributor

@Seigneurhol Can you provide the PDF that gives you problems? I am trying to fix all font-related issues.

@zvictor

zvictor commented Jan 11, 2025

I get encoding issues with a simple command:

    docling --from pdf --to md --image-export-mode placeholder \
        https://venda-imoveis.caixa.gov.br/editais/EL00820224CPARE.PDF

That produces artifacts such as Leilªo Pœblico and Alienaçªo FiduciÆria instead of Leilão Público
and Alienação Fiduciária.
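The two examples above are enough to list which characters were swapped. The sketch below aligns each garbled string with its expected form, character by character, and records the mismatches. It uses only the strings quoted in this comment, so the resulting table is an observation about these samples, not a general repair rule for the backend.

```python
import unicodedata

# (garbled, expected) pairs exactly as quoted in the comment above.
pairs = [
    ("Leilªo Pœblico", "Leilão Público"),
    ("Alienaçªo FiduciÆria", "Alienação Fiduciária"),
]


def substitutions(garbled: str, expected: str) -> dict[str, str]:
    """Map each wrong character to the character expected at that position."""
    # NFC keeps accented letters as single code points so positions line up.
    g = unicodedata.normalize("NFC", garbled)
    e = unicodedata.normalize("NFC", expected)
    if len(g) != len(e):
        raise ValueError("strings must align character by character")
    return {a: b for a, b in zip(g, e) if a != b}


mapping: dict[str, str] = {}
for garbled, expected in pairs:
    mapping.update(substitutions(garbled, expected))

for wrong, right in mapping.items():
    print(f"U+{ord(wrong):04X} {wrong!r} appears where U+{ord(right):04X} {right!r} was expected")
```

Note that "ç" comes through intact in "Alienaçªo", so in these samples only some accented glyphs are affected.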

[Image]

5 participants