Unable to parse PDFs: unknown type in init_ws #75

@wwwslinger

Description

With docling-parse 3.0.0, converting the attached PDF (and many others like it) raises the attached exception. I don't get this error with earlier versions of docling-parse. This means I can't use docling 2.10.0+. Is there a workaround?

Here is the code:

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode

def get_docling_converter(method='fast'):
    # Table structure recognition and picture image generation are always enabled.
    pipeline_options = PdfPipelineOptions(do_table_structure=True, generate_picture_images=True)
    if method == 'accurate':
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
    elif method == 'predicted':
        pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE
        # Use the text cells predicted by the table structure model
        pipeline_options.table_structure_options.do_cell_matching = False
    # Any other value of `method` keeps the default table structure mode.

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )

converter = get_docling_converter('predicted')
result = converter.convert('Prot_001.pdf')
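Since the error shows up on many PDFs, it may help to record which files fail and with what message. The following is only a sketch: `convert_all` is a hypothetical helper (not part of docling), and it takes the conversion function as an argument so that it makes no assumptions about the converter beyond "call it with a path".

```python
from pathlib import Path


def convert_all(convert, paths):
    """Call convert(path) for each path; return (results, failures).

    `convert` is any callable taking a path string, for example the
    `converter.convert` method from the snippet above.
    """
    results, failures = {}, {}
    for path in map(Path, paths):
        try:
            results[path.name] = convert(str(path))
        except Exception as exc:
            # Broad on purpose: the exact exception type is not known here.
            failures[path.name] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

With the converter above, this could be called as `convert_all(converter.convert, Path("pdfs").glob("*.pdf"))`, where `pdfs` is a placeholder directory name.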

Versions:
Python 3.11.10
docling==2.10.0
docling-core==2.9.0
docling-ibm-models==2.0.8
docling-parse==3.0.0
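
For anyone trying to reproduce this, the installed versions of the packages listed above can be collected with the standard library. This is a minimal sketch; `installed_versions` is a name made up for illustration, and packages that are not installed are reported as `None`.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the version list in this report.
PACKAGES = ["docling", "docling-core", "docling-ibm-models", "docling-parse"]


def installed_versions(packages=PACKAGES):
    """Return {package name: version string, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}=={ver}" if ver else f"{name}: not installed")
```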
