
[Bug]: Conversion to PDF/A strips Acrobat-produced CJK text layer #1561

@nisbet-hubbard

Describe the bug

We rely on Adobe Acrobat to OCR CJK texts because of Tesseract's spacing issues (tesseract-ocr/tesseract#2702).

Currently, however, OCRmyPDF's PDF/A mode strips the OCR text layer from such Acrobat-produced PDFs.

Steps to reproduce

1. Run `ocrmypdf --skip-text input.pdf output.pdf`
2. `pdffonts input.pdf`
3. `pdffonts output.pdf`

The results of steps 2 and 3 differ with the default PDF/A output. They are the same only if `--output-type pdf` is added to the command in step 1.
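
For anyone who wants to script the comparison, here is a rough Python sketch of the steps above. The helper names (`parse_pdffonts`, `missing_fonts`, `reproduce`) are made up for illustration and are not part of OCRmyPDF. It assumes `ocrmypdf` and Poppler's `pdffonts` are on the PATH, and that the `pdffonts` report has its usual layout (a header line, a dashed line, then one font per line with the name in the first column). Font names can legitimately change when a file is rewritten, so treat the result as a hint rather than proof.

```python
import re
import subprocess


def parse_pdffonts(report: str) -> set[str]:
    """Return the font names listed in a pdffonts report.

    Assumption: the first two lines are the header and a dashed separator,
    and the font name is the first whitespace-delimited column.
    """
    names = set()
    for line in report.splitlines()[2:]:
        fields = line.split()
        if fields:
            # Drop a subset prefix such as "ABCDEF+", which can change
            # whenever fonts are re-embedded.
            names.add(re.sub(r"^[A-Z]{6}\+", "", fields[0]))
    return names


def missing_fonts(before: str, after: str) -> set[str]:
    """Fonts named in the `before` report but absent from `after`."""
    return parse_pdffonts(before) - parse_pdffonts(after)


def run_pdffonts(pdf_path: str) -> str:
    """Run pdffonts on a file and return its text report."""
    result = subprocess.run(
        ["pdffonts", pdf_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def reproduce(input_pdf: str, output_pdf: str, output_type: str = "pdfa") -> set[str]:
    """Run steps 1 to 3 and return the font names lost in the output."""
    before = run_pdffonts(input_pdf)
    subprocess.run(
        ["ocrmypdf", "--skip-text", "--output-type", output_type,
         input_pdf, output_pdf],
        check=True,
    )
    after = run_pdffonts(output_pdf)
    return missing_fonts(before, after)
```

Calling `reproduce("input.pdf", "output.pdf")` and then `reproduce("input.pdf", "output.pdf", output_type="pdf")` should show whether fonts disappear only in PDF/A mode.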

Files

No response

How did you download and install the software?

PyPI (pip, poetry, pipx, etc.)

OCRmyPDF version

16.10.1

Relevant log output

