Open
Labels
document-reading (related to reading documents), performance (performance optimizations)
Description
Hi,
I'm using PdfPig through Kernel Memory (https://github.com/microsoft/kernel-memory) to ingest PDF files for text embeddings. Only the text in the documents is relevant to my use case; I don't need the graphics. As I understand it, it should technically be possible to read only the text from a PDF file, but PdfPig doesn't seem to support that.
The issue I'm facing is huge memory usage: about 4-6 GB for a 15 MB file containing text and graphics. This is probably caused mostly by the huge list of operations (see the screenshot).

Does anyone have any insight into how I could work around the massive memory usage or skip parsing the graphics, or any information on whether my assumption that I don't need the graphics holds?