ConversionError doesn't preserve underlying exception details for error classification (PDF Encrypted) #1920

Open
@SpencerReyka

Bug

When DocumentConverter.convert() fails, it raises a generic ConversionError with a vague message like "Input document file.pdf is not valid" regardless of the actual underlying cause. This makes it impossible for applications to properly classify and handle different types of errors (password protection, corruption, format issues, etc.).

Additionally, docling prints detailed error information directly to stderr instead of making it programmatically accessible, forcing applications to choose between suppressing all error output or parsing stderr.

Current Behavior:

try:
    result = doc_converter.convert(doc_stream)
except ConversionError as e:
    print(e)  # "Input document file.pdf is not valid."
    # No way to distinguish between password protection vs corruption vs other issues

Meanwhile, the actual useful error details are printed to stderr:

An unexpected error occurred while opening the document file.pdf
Traceback (most recent call last):
  ...
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).

Issues:

  1. Generic exceptions: All failures result in the same vague ConversionError message
  2. No exception chaining: Original exceptions (like PdfiumError) are not preserved via __cause__ or __context__
  3. Stderr pollution: Detailed error info goes to stderr instead of being programmatically accessible
  4. Binary choice: Applications must either suppress all docling output or deal with unstructured stderr text
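
To make issue 2 concrete, here is a minimal sketch of what exception chaining would give callers. It does not use docling: `BackendOpenError`, `open_document`, `convert`, and the local `ConversionError` class are all hypothetical stand-ins, so this only illustrates Python's `raise ... from ...` mechanism and is not a statement about how docling is implemented.

```python
class BackendOpenError(Exception):
    """Hypothetical stand-in for a low-level error such as PdfiumError."""


class ConversionError(Exception):
    """Local stand-in for docling.exceptions.ConversionError (not the real class)."""


def open_document(name: str) -> None:
    # Simulates a backend that fails on a password-protected file.
    raise BackendOpenError(
        "Failed to load document (PDFium: Incorrect password error)."
    )


def convert(name: str) -> None:
    try:
        open_document(name)
    except BackendOpenError as exc:
        # "from exc" stores the original exception on __cause__,
        # so the caller can still inspect it.
        raise ConversionError(f"Input document {name} is not valid.") from exc


try:
    convert("file.pdf")
except ConversionError as err:
    cause = err.__cause__
    print(type(cause).__name__, "-", cause)
```

With chaining in place, the generic message can stay as it is while the specific failure remains reachable from the caught exception.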

Expected Behavior:
Applications need to provide specific user feedback:

  • Password-protected PDFs → "Please remove password and try again"
  • Corrupted files → "File appears corrupted, please try a different file"
  • Unsupported formats → "File format not supported"

Any of the following would address this: preserve the exception chain, add an error classification to ConversionError, or make the error details programmatically accessible instead of printing them to stderr.
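
If the exception chain were preserved, an application could classify failures roughly as sketched below. This is an illustration under assumptions: `FailureKind`, `iter_chain`, and `classify` are hypothetical helpers, and only the "incorrect password" marker comes from the traceback in this issue. The other markers are placeholders that would need to be checked against real error messages.

```python
from enum import Enum
from typing import Iterator, Optional


class FailureKind(Enum):
    PASSWORD_PROTECTED = "Please remove password and try again"
    CORRUPTED = "File appears corrupted, please try a different file"
    UNSUPPORTED = "File format not supported"
    UNKNOWN = "The file could not be converted"


# Substring markers matched against exception messages (lowercased).
# Only the first one is taken from the issue; the rest are placeholders.
_MARKERS = (
    ("incorrect password", FailureKind.PASSWORD_PROTECTED),
    ("corrupt", FailureKind.CORRUPTED),
    ("unsupported", FailureKind.UNSUPPORTED),
)


def iter_chain(exc: BaseException) -> Iterator[BaseException]:
    """Yield exc, then follow __cause__ (or __context__) until the chain ends."""
    seen = set()
    current: Optional[BaseException] = exc
    while current is not None and id(current) not in seen:
        seen.add(id(current))  # guard against cyclic chains
        yield current
        if current.__cause__ is not None:
            current = current.__cause__
        else:
            current = current.__context__


def classify(exc: BaseException) -> FailureKind:
    """Return the first FailureKind whose marker appears anywhere in the chain."""
    for link in iter_chain(exc):
        text = str(link).lower()
        for marker, kind in _MARKERS:
            if marker in text:
                return kind
    return FailureKind.UNKNOWN


# Usage with a hand-built chain that mimics the failure in this issue.
try:
    try:
        raise RuntimeError(
            "Failed to load document (PDFium: Incorrect password error)."
        )
    except RuntimeError as inner:
        raise ValueError("Input document file.pdf is not valid.") from inner
except ValueError as outer:
    print(classify(outer).value)
```

Matching on message substrings is brittle. A typed error code on ConversionError would be more robust, but that requires a change in the library itself.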

Steps to reproduce

import io
from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.exceptions import ConversionError

# Setup docling
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = False

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

# Create a password-protected PDF or use invalid PDF content
invalid_content = b'not a valid pdf file'
file_input = io.BytesIO(invalid_content)
doc_stream = DocumentStream(name='file.pdf', stream=file_input)

try:
    result = doc_converter.convert(doc_stream)
except ConversionError as e:
    print(f'Exception message: {e}')        # Generic message
    print(f'Exception cause: {e.__cause__}')  # None
    print(f'Exception context: {e.__context__}')  # None
    # Detailed error information only goes to stderr
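
As a possible application-side stopgap, the sketch below collects logged exceptions in memory instead of parsing stderr. It rests on an assumption this issue does not confirm: that the stderr text is emitted through Python's standard `logging` module under a logger named "docling" (or one of its children). `RecordCollector` is a hypothetical helper, and the library's logging call is simulated here so the example runs on its own.

```python
import logging
from typing import List


class RecordCollector(logging.Handler):
    """Keeps ERROR-level log records in memory so callers can inspect them."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)

    def exceptions(self) -> List[BaseException]:
        # exc_info is a (type, value, traceback) tuple when an exception was logged.
        return [
            r.exc_info[1]
            for r in self.records
            if r.exc_info and r.exc_info[1] is not None
        ]


# ASSUMPTION: the logger name "docling" is a guess, not taken from the issue.
logger = logging.getLogger("docling")
collector = RecordCollector()
logger.addHandler(collector)

try:
    # Stand-in for the library logging a failure while opening a document.
    try:
        raise RuntimeError(
            "Failed to load document (PDFium: Incorrect password error)."
        )
    except RuntimeError:
        logger.exception(
            "An unexpected error occurred while opening the document file.pdf"
        )
finally:
    logger.removeHandler(collector)

for exc in collector.exceptions():
    print(type(exc).__name__, "-", exc)
```

In real use, the simulated block would be replaced by the `doc_converter.convert(doc_stream)` call, and the collected exceptions inspected inside the `except ConversionError` branch. This only works if the assumption about logging holds.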

Docling version

Docling Version: 2.31.0

Python version

Python Version: 3.10+

Labels: bug