
[Bug]: OCR of highly compressed text-based PDF results in extremely large output file size #1579

@YuchengLiKayson

Description


Describe the bug

I have a highly compressed, primarily text-based PDF (~6,500 pages, ~180 MB). When I force OCR on this file with ocrmypdf, the output grows to more than 3 GB.

I suspect the issue is that the original file stores most of its content using a specialized text structure (very compact representation). As a result, the average per-page size is only about 0.02 MB. After OCR, however, ocrmypdf stores each page as an image layer + text layer, which increases the average per-page size to ~0.3–0.8 MB. This leads to massive file size inflation compared to the original file.
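Rough arithmetic on the per-page figures above confirms the scale of the inflation. A quick shell sketch (all numbers taken from the report, converted to KB for integer math):

```shell
# Back-of-envelope check of the reported size inflation.
pages=6500
orig_kb_per_page=20    # ~0.02 MB/page in the original file
ocr_kb_low=300         # ~0.3 MB/page after forced OCR (low end)
ocr_kb_high=800        # ~0.8 MB/page after forced OCR (high end)

echo "original total:   $(( pages * orig_kb_per_page / 1024 )) MB"  # ~126 MB
echo "OCR total (low):  $(( pages * ocr_kb_low  / 1024 )) MB"       # ~1904 MB
echo "OCR total (high): $(( pages * ocr_kb_high / 1024 )) MB"       # ~5078 MB
```

So the observed >3 GB output sits squarely inside the 0.3–0.8 MB/page range once each page carries an image layer.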

I am not certain whether my assumption is correct, but it seems ocrmypdf cannot preserve the compact, text-only storage in such cases. Are there any options or workflows in ocrmypdf that could reduce the output file size when OCRing this kind of text-heavy, highly compressed PDF?
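For context, a sketch of ocrmypdf invocations that plausibly reduce output size. These flags all exist in ocrmypdf's CLI; whether they help on this particular file is an assumption, not a tested result:

```shell
# If the pages already contain a text layer, avoid rasterizing them at all:
# --redo-ocr replaces only existing OCR text, and --skip-text leaves
# text-bearing pages untouched, unlike --force-ocr, which converts every
# page into an image layer + text layer.
ocrmypdf --redo-ocr input.pdf output.pdf

# If forcing OCR is unavoidable, raise the optimization level and allow
# lossy JBIG2 compression of monochrome page images (requires jbig2enc).
ocrmypdf --force-ocr --optimize 3 --jbig2-lossy input.pdf output.pdf
```

The first variant is the one most likely to preserve the original file's compact text representation, since it does not replace page content with images.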

Steps to reproduce

1. Take a large, highly compressed, text-based PDF (~6,500 pages, ~180 MB).

2. Run `ocrmypdf --force-ocr input.pdf output.pdf`.

3. Observe that the output file grows to more than 3 GB.

Files

Unfortunately, I cannot provide the PDF file due to privacy concerns.

How did you download and install the software?

Homebrew

OCRmyPDF version

16.11.0

Relevant log output

