
[Bug]: OCR of highly compressed text-based PDF results in extremely large output file size #1579

@YuchengLiKayson

Description


Describe the bug

I have a highly compressed, primarily text-based PDF (~6,500 pages, ~180 MB). When I force OCR on this file with ocrmypdf, the output grows to more than 3 GB.

I suspect the issue is that the original file stores most of its content using a specialized text structure (very compact representation). As a result, the average per-page size is only about 0.02 MB. After OCR, however, ocrmypdf stores each page as an image layer + text layer, which increases the average per-page size to ~0.3–0.8 MB. This leads to massive file size inflation compared to the original file.
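Rough arithmetic on the per-page figures above confirms the scale of the inflation. A quick shell sketch (all numbers taken from the report, converted to KB for integer math):

```shell
# Back-of-envelope check of the reported size inflation.
pages=6500
orig_kb_per_page=20    # ~0.02 MB/page in the original file
ocr_kb_low=300         # ~0.3 MB/page after forced OCR (low end)
ocr_kb_high=800        # ~0.8 MB/page after forced OCR (high end)

echo "original total:   $(( pages * orig_kb_per_page / 1024 )) MB"  # ~126 MB
echo "OCR total (low):  $(( pages * ocr_kb_low  / 1024 )) MB"       # ~1904 MB
echo "OCR total (high): $(( pages * ocr_kb_high / 1024 )) MB"       # ~5078 MB
```

So the observed >3 GB output sits squarely inside the 0.3–0.8 MB/page range once each page carries an image layer.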

I am not certain whether my assumption is correct, but it seems ocrmypdf cannot preserve the compact, text-only storage in such cases. Are there any options or workflows in ocrmypdf that could reduce the output file size when OCRing this kind of text-heavy, highly compressed PDF?
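For context, a sketch of ocrmypdf invocations that plausibly reduce output size. These flags all exist in ocrmypdf's CLI; whether they help on this particular file is an assumption, not a tested result:

```shell
# If the pages already contain a text layer, avoid rasterizing them at all:
# --redo-ocr replaces only existing OCR text, and --skip-text leaves
# text-bearing pages untouched, unlike --force-ocr, which converts every
# page into an image layer + text layer.
ocrmypdf --redo-ocr input.pdf output.pdf

# If forcing OCR is unavoidable, raise the optimization level and allow
# lossy JBIG2 compression of monochrome page images (requires jbig2enc).
ocrmypdf --force-ocr --optimize 3 --jbig2-lossy input.pdf output.pdf
```

The first variant is the one most likely to preserve the original file's compact text representation, since it does not replace page content with images.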

Steps to reproduce

1. Take a large, highly compressed, text-based PDF (~6,500 pages, ~180 MB).

2. Run `ocrmypdf --force-ocr input.pdf output.pdf`.

3. Observe that the output file grows to more than 3 GB.

Files

Unfortunately, I cannot provide the PDF file due to privacy concerns.

How did you download and install the software?

Homebrew

OCRmyPDF version

16.11.0

Relevant log output

