Reference issue: Document Parsing capabilities performance improvement #980

@iirorahkonen

Hi,

I'm using PdfPig through Kernel Memory (https://github.com/microsoft/kernel-memory) to ingest PDF files for text embeddings. Only the text in the documents is relevant to my use case; I don't need the graphics. As I understand it, it should be technically possible to read only the text from a PDF file, but PdfPig doesn't seem to support that.

The issue I'm facing is very high memory usage: about 4-6 GB for a 15 MB file that contains text and graphics. This is probably caused mostly by the huge list of operations (see the screenshot).

Image

Does anyone have any insight into how I could work around the massive memory usage or skip parsing the graphics? I'd also appreciate any information on whether my assumption that I don't need the graphics is correct.
