Description
Describe the bug
OCRmyPDF crashes on a PDF file whose DocumentInfo dictionary has an entry key containing what looks to be Latin-1 (ISO/IEC 8859-1) encoded text. As far as I can see, the 0xe5 in position 5 from the error message refers to the Saksår entry, since 0xe5 is an å in Latin-1.
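The byte arithmetic can be checked in isolation. The sketch below assumes (this is a guess, not something confirmed by pikepdf or qpdf) that the string being decoded is the key name including its leading slash, which would put the å at index 5 rather than 4:

```python
# Assumption: the raw key is the PDF name "/Saksår" stored as Latin-1 bytes.
# 0xe5 is the Latin-1 encoding of "å".
raw_key = b"/Saks\xe5r"

# Decoding as Latin-1 succeeds and recovers the expected key.
as_latin1 = raw_key.decode("latin-1")
print(as_latin1)

# Decoding as UTF-8 fails: 0xe5 starts a multi-byte sequence,
# but the following byte ("r") is not a continuation byte.
try:
    raw_key.decode("utf-8")
except UnicodeDecodeError as exc:
    position = exc.start
    failing_byte = exc.object[exc.start]
    reason = exc.reason
    print(exc)
```

If that assumption holds, the position and reason reported here match the ones in the traceback at the end of this report.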
I'm uncertain whether the issue lies in OCRmyPDF's expectations of the output encoding from pikepdf or in pikepdf itself (or qpdf), and whether this is the expected output from those libraries for this kind of input. However, as far as I can tell, the expected behavior for repair_docinfo_nuls would either way be to continue/skip when encountering errors, so I thought I'd start by filing an issue here.
I'm currently working around this in my use case by patching ocrmypdf/_metadata.py to catch the UnicodeDecodeError:
def repair_docinfo_nuls(pdf):
    """If the DocumentInfo block contains NUL characters, remove them.

    If the DocumentInfo block is malformed, log an error and continue.
    """
    modified = False
    try:
        if not isinstance(pdf.docinfo, Dictionary):
            raise TypeError("DocumentInfo is not a dictionary")
        try:
            for k, v in pdf.docinfo.items():
                if isinstance(v, str) and b'\x00' in bytes(v):
                    pdf.docinfo[k] = bytes(v).replace(b'\x00', b'')
                    modified = True
        except UnicodeDecodeError:
            log.exception("Unable to decode a DocumentInfo field")
    except TypeError:
        # TypeError can also be raised if dictionary items are unexpected types
        log.error("File contains a malformed DocumentInfo block - continuing anyway.")
    return modified
The nested try/except isn't exactly pretty, but it works as a quick fix in case anybody else is encountering this issue.
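A flatter variant is possible by catching both exception types in a single clause. The following is only a sketch of that idea, not OCRmyPDF's code: strip_nuls_tolerantly and UndecodableDocinfo are hypothetical names, the docinfo argument is a plain stand-in (anything with items() and item assignment) so the example runs without pikepdf, and the isinstance check from the patch above is dropped for brevity.

```python
import logging

log = logging.getLogger(__name__)


def strip_nuls_tolerantly(docinfo):
    """Remove NUL bytes from values; log and continue if the mapping is unreadable.

    `docinfo` is a hypothetical stand-in for pdf.docinfo.
    """
    modified = False
    try:
        # list() forces the full iteration up front, which is where the
        # decode error surfaced in the traceback of this report.
        for key, value in list(docinfo.items()):
            raw = bytes(value)
            if b"\x00" in raw:
                docinfo[key] = raw.replace(b"\x00", b"")
                modified = True
    except (TypeError, UnicodeDecodeError):
        # One clause handles both malformed types and undecodable names.
        log.exception("Malformed or undecodable DocumentInfo - continuing anyway.")
    return modified


class UndecodableDocinfo:
    """Stand-in whose items() fails the same way as in this report."""

    def items(self):
        return b"/Saks\xe5r".decode("utf-8")  # raises UnicodeDecodeError


clean = {"/Title": b"Re\x00port"}
cleaned = strip_nuls_tolerantly(clean)
survived = strip_nuls_tolerantly(UndecodableDocinfo())
```

The trade-off is that both failure modes share one log message, whereas the nested version above logs them separately.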
Steps to reproduce
1. Run ocrmypdf test.pdf test-output.pdf
Files
test.pdf is a PDF file created using pikepdf, with one blank page and some docinfo copied over from another file using test_pdf.copy_foreign(input_pdf.docinfo).
How did you download and install the software?
Docker container
OCRmyPDF version
16.10.2
Relevant log output
root@04ab3d2fde5f:/usr/src/paperless/src# ocrmypdf -v1 /usr/src/paperless/media/documents/test-pdf/test.pdf /usr/src/paperless/media/documents/test-pdf/test-output.pdf
ocrmypdf 16.10.2 __main__.py:59
Running: ['tesseract', '--version'] __init__.py:133
Found tesseract 5.3.0 __init__.py:345
Running: ['tesseract', '--version'] __init__.py:133
Running: ['tesseract', '--version'] __init__.py:133
Running: ['gs', '--version'] __init__.py:133
Found gs 10.3.1 __init__.py:345
Running: ['gs', '--version'] __init__.py:133
Running: ['tesseract', '--list-langs'] __init__.py:133
stdout/stderr = List of available languages in __init__.py:73
"/usr/share/tesseract-ocr/5/tessdata/" (8):
deu
eng
fra
ita
nor
osd
spa
swe
No language specified; assuming --language eng _validation.py:60
pikepdf mmap enabled helpers.py:328
os.symlink(/usr/src/paperless/media/documents/test-pdf/test.pdf, helpers.py:179
/tmp/ocrmypdf.io.l0au_b4z/origin)
Gathering info with 1 thread workers info.py:816
pikepdf mmap enabled helpers.py:328
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3 tesseract_ocr.py:199
pikepdf mmap enabled helpers.py:328
1 Rasterize with pngmono, rotation 0 _pipeline.py:552
1 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE', __init__.py:133
'-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1',
'-dLastPage=1', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',
'/tmp/ocrmypdf.io.l0au_b4z/000001_rasterize.png', '-sstdout=%stderr',
'-dAutoRotatePages=/None', '-f',
'/tmp/ocrmypdf.io.l0au_b4z/origin.pdf']
1 stderr = GPL Ghostscript 10.03.1 (2024-05-02) __init__.py:75
Copyright (C) 2024 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO
WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
1 Rotating output by 0 ghostscript.py:161
1 resolution (399.9992, 399.9992) _pipeline.py:631
1 Running: ['tesseract', '-l', 'eng', __init__.py:133
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr.png',
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr_hocr', 'hocr', 'txt']
1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 792) _hocr.py:219
1 Text rotation: (text, autorotate, content) -> text misalignment = _graft.py:152
(0, 0, 0) -> 0
1 Grafting _graft.py:263
1 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0) _graft.py:306
1 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:177
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing... ocr.py:144
An exception occurred while executing the pipeline _common.py:296
Traceback (most recent call last):
File
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py
", line 261, in cli_exception_handler
return fn(options, plugin_manager)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",
line 181, in _run_pipeline
optimize_messages = exec_concurrent(context, executor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",
line 145, in exec_concurrent
pdf, messages = postprocess(pdf, context, executor)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py
", line 453, in postprocess
pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py",
line 907, in convert_to_pdfa
if repair_docinfo_nuls(pdf_file):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_metadata.py",
line 86, in repair_docinfo_nuls
for k, v in pdf.docinfo.items():
^^^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 5:
invalid continuation byte