How to improve PDF conversion speed for large documents? #2699

@Blue-Ladder

Description

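from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

pipeline_options = PdfPipelineOptions()
pipeline_options.images_scale = 1.0
pipeline_options.do_ocr = False  # OCR is disabled
pipeline_options.generate_page_images = False
pipeline_options.generate_picture_images = True  # keep picture images for linking to chunks
pipeline_options.generate_table_images = True  # keep table images for linking to chunks
pipeline_options.do_table_structure = True

doc_converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)
# input_doc_path is the path to the source PDF (defined elsewhere).
# convert() returns a conversion result; .document is the converted document.
conv_res = doc_converter.convert(input_doc_path).document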

I'm extracting images and tables from PDFs, chunking the converted documents with HybridChunker, and then linking each chunk to its corresponding images and tables. With large PDFs (over 500 pages), document conversion takes too long. Which settings should I configure to improve conversion speed? My environment has a T4 GPU and 2 CPU cores.
