Description
Describe the bug
I have a highly compressed PDF (~6500 pages, ~180 MB) that consists primarily of text. When I force OCR on this file with ocrmypdf, the output file grows to more than 3 GB.
I suspect the cause is that the original file stores most of its content as native text, which is a very compact representation. As a result, the average per-page size is only about 0.03 MB (180 MB / 6500 pages). After forced OCR, however, ocrmypdf stores each page as an image layer plus a text layer, which increases the average per-page size to ~0.3–0.8 MB. The output is therefore far larger than the original file.
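To make the per-page figures above easy to check, here is a small arithmetic sketch. It uses only the approximate numbers from this report (about 180 MB in, about 3 GB out, about 6500 pages); the function names are hypothetical, and 3 GB is assumed to mean 3 × 1024 MB.

```python
def per_page_mb(total_mb: float, pages: int) -> float:
    """Average size of one page in MB (hypothetical helper)."""
    return total_mb / pages


def inflation_factor(input_mb: float, output_mb: float) -> float:
    """How many times larger the output is than the input."""
    return output_mb / input_mb


# Approximate figures from this report; 3 GB taken as 3 * 1024 MB (assumption).
PAGES = 6500
INPUT_MB = 180
OUTPUT_MB = 3 * 1024

print(f"input:  {per_page_mb(INPUT_MB, PAGES):.3f} MB/page")
print(f"output: {per_page_mb(OUTPUT_MB, PAGES):.3f} MB/page")
print(f"growth: {inflation_factor(INPUT_MB, OUTPUT_MB):.1f}x")
```

The output-side average falls inside the ~0.3–0.8 MB per page range I observed, so the total size is consistent with every page being stored as an image.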
I am not certain whether my assumption is correct, but it seems ocrmypdf cannot preserve the compact text-only storage in such cases. I would like to know whether there are any options or workflows in ocrmypdf that could reduce the output file size when OCRing this kind of text-heavy, highly compressed PDF.
Steps to reproduce
1. Take a large, highly compressed text-based PDF (~6500 pages, ~180 MB).
2. Run ocrmypdf --force-ocr input.pdf output.pdf.
3. Observe that the output file grows to >3 GB.
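The steps above can be scripted roughly as follows, in case it helps anyone reproduce this with their own file. This is only a sketch: the helper names are hypothetical, it assumes ocrmypdf is on the PATH, and the only flag it passes is the --force-ocr flag from step 2.

```python
import os
import subprocess


def build_command(input_pdf: str, output_pdf: str) -> list[str]:
    """Build the same command line as step 2 (hypothetical helper)."""
    return ["ocrmypdf", "--force-ocr", input_pdf, output_pdf]


def growth_factor(input_bytes: int, output_bytes: int) -> float:
    """Ratio of output size to input size."""
    return output_bytes / input_bytes


def reproduce(input_pdf: str, output_pdf: str) -> float:
    """Run ocrmypdf and return how many times larger the output is.

    Assumes ocrmypdf is installed and on the PATH; raises
    CalledProcessError if the command exits with a non-zero status.
    """
    subprocess.run(build_command(input_pdf, output_pdf), check=True)
    return growth_factor(os.path.getsize(input_pdf), os.path.getsize(output_pdf))


# Example usage (not run here, since it needs a real PDF):
#   factor = reproduce("input.pdf", "output.pdf")
#   print(f"output is {factor:.1f}x the input size")
```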
Files
Unfortunately, I cannot provide the PDF file due to privacy concerns.
How did you download and install the software?
Homebrew
OCRmyPDF version
16.11.0
Relevant log output