Reference issue: Document Parsing capabilities performance improvement

Hi,

I'm using PdfPig through Kernel Memory (https://github.com/microsoft/kernel-memory) to ingest PDF files for text embeddings. Only the text in the documents is relevant to my use case; I don't need the graphics. As I understand it, it should technically be possible to read only the text from a PDF file, but PdfPig doesn't seem to support that.

The issue I'm facing is huge memory usage: about 4-6 GB for a 15 MB file containing text and graphics. This is probably caused mostly by the huge list of operations (see the screenshot).

<img width="596" alt="Image" src="https://github.com/user-attachments/assets/996f8bc9-031d-4959-ad8b-de3e00cb9a1f" />

Does anyone have any insight into how I could work around the massive memory usage or skip parsing the graphics, or any information on whether my assumption that the graphics aren't needed holds?



Reference issue: Document Parsing capabilities performance improvement #980
