Description
Bug
When DocumentConverter.convert() fails, it raises a generic ConversionError with a vague message like "Input document file.pdf is not valid" regardless of the actual underlying cause. This makes it impossible for applications to properly classify and handle different types of errors (password protection, corruption, format issues, etc.).
Additionally, docling prints detailed error information directly to stderr instead of making it programmatically accessible, forcing applications to choose between suppressing all error output or parsing stderr.
Current Behavior:
try:
    result = doc_converter.convert(doc_stream)
except ConversionError as e:
    print(e)  # "Input document file.pdf is not valid."
    # No way to distinguish between password protection vs corruption vs other issues
Meanwhile, the actual useful error details are printed to stderr:
An unexpected error occurred while opening the document file.pdf
Traceback (most recent call last):
...
pypdfium2._helpers.misc.PdfiumError: Failed to load document (PDFium: Incorrect password error).
Issues:
- Generic exceptions: All failures result in the same vague ConversionError message
- No exception chaining: Original exceptions (like PdfiumError) are not preserved via __cause__ or __context__
- Stderr pollution: Detailed error info goes to stderr instead of being programmatically accessible
- Binary choice: Applications must either suppress all docling output or deal with unstructured stderr text
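As a possible interim workaround for the stderr issue, the sketch below attaches a custom logging handler that collects the logged exception objects instead of printing them. It assumes that docling emits these messages through Python's standard logging module under a "docling" logger hierarchy, which is not confirmed here. The ExceptionCollector class and the "docling.demo" logger are hypothetical names used only for illustration, and a plain RuntimeError stands in for the real PdfiumError.

```python
import logging


class ExceptionCollector(logging.Handler):
    """Hypothetical handler: stores exceptions attached to log records."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.exceptions = []

    def emit(self, record: logging.LogRecord) -> None:
        # exc_info is populated when the caller used logger.exception(...)
        if record.exc_info and record.exc_info[1] is not None:
            self.exceptions.append(record.exc_info[1])


# Assumption: the library logs under the "docling" logger hierarchy.
docling_logger = logging.getLogger("docling")
collector = ExceptionCollector()
docling_logger.addHandler(collector)
docling_logger.propagate = False  # keep the traceback away from the root logger

# Stand-in for the library's internal logging call (not docling code).
demo_logger = logging.getLogger("docling.demo")
try:
    raise RuntimeError("Failed to load document (PDFium: Incorrect password error).")
except RuntimeError:
    demo_logger.exception(
        "An unexpected error occurred while opening the document file.pdf"
    )

# The original exception object is now available to the application.
captured = collector.exceptions
```

Even if the assumption holds, this couples the application to the library's logging layout, so it does not replace a proper fix in the exception itself.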
Expected Behavior:
Applications need to provide specific user feedback:
- Password-protected PDFs → "Please remove password and try again"
- Corrupted files → "File appears corrupted, please try a different file"
- Unsupported formats → "File format not supported"
Either preserve the exception chain, add error classification to the ConversionError, or make error details programmatically accessible instead of printing to stderr.
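To illustrate the first option, here is a minimal sketch of how a preserved exception chain would let an application classify failures. It does not use docling or pypdfium2: PdfiumError and ConversionError are local stand-in classes, and open_document, convert, and classify are hypothetical helpers written only for this example. The string match on "incorrect password" is based on the stderr message quoted above and is an assumption, not a stable API.

```python
class PdfiumError(RuntimeError):
    """Stand-in for pypdfium2's PdfiumError so the sketch runs on its own."""


class ConversionError(Exception):
    """Stand-in for docling.exceptions.ConversionError."""


def open_document(name: str) -> None:
    # Simulates the backend failure shown in the stderr output.
    raise PdfiumError("Failed to load document (PDFium: Incorrect password error).")


def convert(name: str) -> None:
    try:
        open_document(name)
    except PdfiumError as exc:
        # "raise ... from exc" stores the original exception on __cause__.
        raise ConversionError(f"Input document {name} is not valid.") from exc


def classify(error: BaseException) -> str:
    """Walk the exception chain and map known causes to user feedback."""
    current = error
    seen = set()
    while current is not None and id(current) not in seen:
        seen.add(id(current))
        if "incorrect password" in str(current).lower():
            return "Please remove password and try again"
        current = current.__cause__ or current.__context__
    return "File could not be converted"


caught = None
try:
    convert("file.pdf")
except ConversionError as e:
    caught = e  # keep a reference; "e" is unbound after the except block

message = classify(caught)
print(message)
```

With chaining in place, the generic ConversionError message can stay as it is while the application inspects the cause. A dedicated error code or exception subclass would be more robust than matching on message text.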
Steps to reproduce
import io
from docling.datamodel.base_models import DocumentStream, InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.exceptions import ConversionError
# Setup docling
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = False
pipeline_options.do_table_structure = False
doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
# Create a password-protected PDF or use invalid PDF content
invalid_content = b'not a valid pdf file'
file_input = io.BytesIO(invalid_content)
doc_stream = DocumentStream(name='file.pdf', stream=file_input)
try:
    result = doc_converter.convert(doc_stream)
except ConversionError as e:
    print(f'Exception message: {e}')  # Generic message
    print(f'Exception cause: {e.__cause__}')  # None
    print(f'Exception context: {e.__context__}')  # None
    # Detailed error information only goes to stderr
Docling version
Docling Version: 2.31.0
Python version
Python Version: 3.10+