[Bug]: Uncaught UnicodeDecodeError in repair_docinfo_nuls #1540

@AudunVN

Describe the bug

OCRmyPDF crashes on a PDF file whose DocumentInfo dictionary has an entry key containing what looks to be Latin-1 (ISO/IEC 8859-1) encoded text.

As far as I can tell, the 0xe5 in position 5 refers to the Saksår entry, as 0xe5 is å in Latin-1.
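
To illustrate why that byte trips the UTF-8 decoder, here is a minimal sketch that needs only the standard library. It assumes the key is stored as the Latin-1 bytes of `/Saksår`, with the leading `/` of the PDF name counted as position 0; that layout is my assumption, not something confirmed from the file.

```python
# Assumed raw key bytes: "/Saksår" encoded as Latin-1 (å -> 0xE5).
key_bytes = b"/Saks\xe5r"

# Decoding as Latin-1 always succeeds and yields the intended text.
latin1_text = key_bytes.decode("latin-1")

# Decoding as UTF-8 fails: 0xE5 announces a 3-byte sequence, but the
# next byte ("r") is not a continuation byte.
try:
    key_bytes.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
else:
    decode_error = None

print(latin1_text)
print(decode_error)
```

Under that assumption, the failing offset and the "invalid continuation byte" reason line up with the traceback in the log output.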

I'm not sure whether the issue lies in OCRmyPDF's expectations about the encoding of pikepdf's output or in pikepdf itself (or qpdf), or whether this is the expected output from those libraries for this kind of input. However, as far as I can tell, the expected behavior of repair_docinfo_nuls would in either case be to skip and continue when it encounters errors, so I thought I'd start by filing an issue here.

I'm currently working around this in my use case by patching ocrmypdf/_metadata.py to catch the UnicodeDecodeError:

def repair_docinfo_nuls(pdf):
    """If the DocumentInfo block contains NUL characters, remove them.

    If the DocumentInfo block is malformed, log an error and continue.
    """
    modified = False
    try:
        if not isinstance(pdf.docinfo, Dictionary):
            raise TypeError("DocumentInfo is not a dictionary")
        try:
            for k, v in pdf.docinfo.items():
                if isinstance(v, str) and b'\x00' in bytes(v):
                    pdf.docinfo[k] = bytes(v).replace(b'\x00', b'')
                    modified = True
        except UnicodeDecodeError:
            log.exception("Unable to decode a DocumentInfo field")
    except TypeError:
        # TypeError can also be raised if dictionary items are unexpected types
        log.error("File contains a malformed DocumentInfo block - continuing anyway.")
    return modified

Nested try/except isn't exactly the prettiest, but it works as a quick fix in case anybody else is encountering this issue.
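
For what it's worth, the same handling could probably be written with a single try and two except clauses. The sketch below is not OCRmyPDF code: `strip_nuls_tolerant` and `UndecodableDocinfo` are hypothetical names, and a plain dict of bytes values stands in for `pdf.docinfo` so it runs without pikepdf.

```python
import logging

log = logging.getLogger(__name__)


class UndecodableDocinfo(dict):
    """Hypothetical stand-in that fails the way the traceback shows."""

    def items(self):
        raise UnicodeDecodeError(
            "utf-8", b"/Saks\xe5r", 5, 6, "invalid continuation byte"
        )


def strip_nuls_tolerant(docinfo):
    """Remove NUL bytes from bytes values; log and continue on bad input."""
    modified = False
    try:
        # Snapshot the items so assignments below cannot disturb iteration.
        for key, value in list(docinfo.items()):
            if isinstance(value, bytes) and b"\x00" in value:
                docinfo[key] = value.replace(b"\x00", b"")
                modified = True
    except UnicodeDecodeError:
        log.exception("Unable to decode a DocumentInfo field")
    except TypeError:
        log.error("File contains a malformed DocumentInfo block - continuing anyway.")
    return modified


clean = {"/Title": b"Re\x00port"}
changed = strip_nuls_tolerant(clean)          # strips the NUL byte
survived = strip_nuls_tolerant(UndecodableDocinfo())  # logs, does not raise
```

As in the patch above, a decode failure during iteration abandons the remaining entries rather than skipping only the bad one; skipping per entry would need a way to iterate keys without decoding them.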

Steps to reproduce

1. Run ocrmypdf test.pdf test-output.pdf

Files

test.pdf

test.pdf is a PDF file created using pikepdf, with one blank page and some docinfo copied over from another file using test_pdf.copy_foreign(input_pdf.docinfo).

How did you download and install the software?

Docker container

OCRmyPDF version

16.10.2

Relevant log output

root@04ab3d2fde5f:/usr/src/paperless/src# ocrmypdf -v1 /usr/src/paperless/media/documents/test-pdf/test.pdf /usr/src/paperless/media/documents/test-pdf/test-output.pdf
ocrmypdf 16.10.2                                                        __main__.py:59
Running: ['tesseract', '--version']                                    __init__.py:133
Found tesseract 5.3.0                                                  __init__.py:345
Running: ['tesseract', '--version']                                    __init__.py:133
Running: ['tesseract', '--version']                                    __init__.py:133
Running: ['gs', '--version']                                           __init__.py:133
Found gs 10.3.1                                                        __init__.py:345
Running: ['gs', '--version']                                           __init__.py:133
Running: ['tesseract', '--list-langs']                                 __init__.py:133
stdout/stderr = List of available languages in                          __init__.py:73
"/usr/share/tesseract-ocr/5/tessdata/" (8):                                           
deu                                                                                   
eng                                                                                   
fra                                                                                   
ita                                                                                   
nor                                                                                   
osd                                                                                   
spa                                                                                   
swe                                                                                   
                                                                                      
No language specified; assuming --language eng                       _validation.py:60
pikepdf mmap enabled                                                    helpers.py:328
os.symlink(/usr/src/paperless/media/documents/test-pdf/test.pdf,        helpers.py:179
/tmp/ocrmypdf.io.l0au_b4z/origin)                                                     
Gathering info with 1 thread workers                                       info.py:816
pikepdf mmap enabled                                                    helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                             tesseract_ocr.py:199
pikepdf mmap enabled                                                    helpers.py:328
    1 Rasterize with pngmono, rotation 0                              _pipeline.py:552
    1 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE',               __init__.py:133
'-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1',                       
'-dLastPage=1', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',                  
'/tmp/ocrmypdf.io.l0au_b4z/000001_rasterize.png', '-sstdout=%stderr',                 
'-dAutoRotatePages=/None', '-f',                                                      
'/tmp/ocrmypdf.io.l0au_b4z/origin.pdf']                                               
    1 stderr = GPL Ghostscript 10.03.1 (2024-05-02)                     __init__.py:75
Copyright (C) 2024 Artifex Software, Inc.  All rights reserved.                       
This software is supplied under the GNU AGPLv3 and comes with NO                      
WARRANTY:                                                                             
see the file COPYING for details.                                                     
Processing pages 1 through 1.                                                         
Page 1                                                                                
                                                                                      
    1 Rotating output by 0                                          ghostscript.py:161
    1 resolution (399.9992, 399.9992)                                 _pipeline.py:631
    1 Running: ['tesseract', '-l', 'eng',                              __init__.py:133
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr.png',                                           
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr_hocr', 'hocr', 'txt']                           
    1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 792)                           _hocr.py:219
    1 Text rotation: (text, autorotate, content) -> text misalignment =  _graft.py:152
(0, 0, 0) -> 0                                                                        
    1 Grafting                                                           _graft.py:263
    1 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0)                 _graft.py:306
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0               _graft.py:177
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                           ocr.py:144
An exception occurred while executing the pipeline                      _common.py:296
Traceback (most recent call last):                                                    
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py               
", line 261, in cli_exception_handler                                                 
    return fn(options, plugin_manager)                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",                 
line 181, in _run_pipeline                                                            
    optimize_messages = exec_concurrent(context, executor)                            
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                            
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",                 
line 145, in exec_concurrent                                                          
    pdf, messages = postprocess(pdf, context, executor)                               
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                               
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py               
", line 453, in postprocess                                                           
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)                          
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                          
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py",               
line 907, in convert_to_pdfa                                                          
    if repair_docinfo_nuls(pdf_file):                                                 
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_metadata.py",               
line 86, in repair_docinfo_nuls                                                       
    for k, v in pdf.docinfo.items():                                                  
                ^^^^^^^^^^^^^^^^^^^                                                   
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 5:               
invalid continuation byte
