[Bug]: Uncaught UnicodeDecodeError in repair_docinfo_nuls #1540

### Describe the bug

OCRmyPDF crashes on a PDF file whose DocumentInfo dictionary has an entry key containing what looks like `Latin-1` (`ISO/IEC 8859-1`) encoded text.

As far as I can tell, `0xe5 in position 5` refers to the `Saksår` entry, since `0xe5` is `å` in `Latin-1`.
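The byte value and offset in the error message can be checked with plain Python. This is only a sketch: it assumes the key is stored as the `Latin-1` bytes of `/Saksår`, including the leading `/` of a PDF name, which is what would put `å` at position 5. It does not touch pikepdf or OCRmyPDF.

```python
# Assumption: the key is stored as the Latin-1 bytes of "/Saksår",
# including the leading "/" of a PDF name object.
raw_key = "/Saksår".encode("latin-1")

# "å" is the single byte 0xe5 in Latin-1, at index 5 of the raw key.
assert raw_key[5] == 0xE5
assert raw_key.decode("latin-1") == "/Saksår"

decode_error = None
try:
    # Decoding the same bytes as UTF-8 fails: 0xe5 starts a multi-byte
    # sequence, but the next byte ("r") is not a continuation byte.
    raw_key.decode("utf-8")
except UnicodeDecodeError as exc:
    decode_error = exc
    print(exc)
```

Under that assumption, the printed message has the same byte, position, and reason as the last line of the traceback in the log output.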

I'm not sure whether the issue lies in OCRmyPDF's expectations about the encoding of pikepdf's output or in pikepdf itself (or qpdf), nor whether this is the expected output from those libraries for this kind of input. Either way, as far as I can tell, the expected behavior for `repair_docinfo_nuls` would be to continue or skip when it encounters errors, so I thought I'd start by filing an issue here.

For now I'm working around this in my use case by patching `ocrmypdf/_metadata.py` to catch the `UnicodeDecodeError`:

```py
def repair_docinfo_nuls(pdf):
    """If the DocumentInfo block contains NUL characters, remove them.

    If the DocumentInfo block is malformed, log an error and continue.
    """
    modified = False
    try:
        if not isinstance(pdf.docinfo, Dictionary):
            raise TypeError("DocumentInfo is not a dictionary")
        try:
            for k, v in pdf.docinfo.items():
                if isinstance(v, str) and b'\x00' in bytes(v):
                    pdf.docinfo[k] = bytes(v).replace(b'\x00', b'')
                    modified = True
        except UnicodeDecodeError:
            log.exception("Unable to decode a DocumentInfo field")
    except TypeError:
        # TypeError can also be raised if dictionary items are unexpected types
        log.error("File contains a malformed DocumentInfo block - continuing anyway.")
    return modified
```

Nested `try`/`except` isn't exactly the prettiest, but it works as a quick fix in case anybody else is encountering this issue.
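One way to avoid the nesting would be to put both handlers on a single `try`. The sketch below is not the OCRmyPDF code. It uses a plain `Mapping` in place of `pikepdf.Dictionary` and `bytes` values in place of pikepdf strings. The names `strip_nuls` and `UndecodableInfo` are hypothetical and exist only so the example runs on its own.

```python
import logging
from collections.abc import Mapping

log = logging.getLogger(__name__)


def strip_nuls(docinfo):
    """Remove NUL bytes from bytes values; log and continue on bad input."""
    modified = False
    try:
        if not isinstance(docinfo, Mapping):
            raise TypeError("DocumentInfo is not a dictionary")
        # Snapshot the items so values can be reassigned while looping.
        for key, value in list(docinfo.items()):
            if isinstance(value, bytes) and b"\x00" in value:
                docinfo[key] = value.replace(b"\x00", b"")
                modified = True
    except UnicodeDecodeError:
        # Same handling as the inner `except` in the patch above.
        log.exception("Unable to decode a DocumentInfo field")
    except TypeError:
        log.error("File contains a malformed DocumentInfo block - continuing anyway.")
    return modified


class UndecodableInfo(dict):
    """Hypothetical stand-in whose entries cannot be decoded as UTF-8."""

    def items(self):
        # Raises UnicodeDecodeError, mimicking the failure in the traceback.
        b"/Saks\xe5r".decode("utf-8")
        return super().items()


info = {"/Title": b"a\x00b", "/Author": b"ok"}
print(strip_nuls(info), info["/Title"])
print(strip_nuls(UndecodableInfo()))
print(strip_nuls(None))
```

Because nothing follows the inner `try` in the patch, two sibling `except` clauses behave the same way: a decode failure stops the loop, is logged, and the function still returns.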

### Steps to reproduce

```plain text
1. Run ocrmypdf test.pdf test-output.pdf
```

### Files

[test.pdf](https://github.com/user-attachments/files/20837369/test.pdf)

test.pdf is a PDF file created using pikepdf, with one blank page and some docinfo copied over from another file using `test_pdf.copy_foreign(input_pdf.docinfo)`.

### How did you download and install the software?

Docker container

### OCRmyPDF version

16.10.2

### Relevant log output

```plain text
root@04ab3d2fde5f:/usr/src/paperless/src# ocrmypdf -v1 /usr/src/paperless/media/documents/test-pdf/test.pdf /usr/src/paperless/media/documents/test-pdf/test-output.pdf
ocrmypdf 16.10.2                                                        __main__.py:59
Running: ['tesseract', '--version']                                    __init__.py:133
Found tesseract 5.3.0                                                  __init__.py:345
Running: ['tesseract', '--version']                                    __init__.py:133
Running: ['tesseract', '--version']                                    __init__.py:133
Running: ['gs', '--version']                                           __init__.py:133
Found gs 10.3.1                                                        __init__.py:345
Running: ['gs', '--version']                                           __init__.py:133
Running: ['tesseract', '--list-langs']                                 __init__.py:133
stdout/stderr = List of available languages in                          __init__.py:73
"/usr/share/tesseract-ocr/5/tessdata/" (8):                                           
deu                                                                                   
eng                                                                                   
fra                                                                                   
ita                                                                                   
nor                                                                                   
osd                                                                                   
spa                                                                                   
swe                                                                                   
                                                                                      
No language specified; assuming --language eng                       _validation.py:60
pikepdf mmap enabled                                                    helpers.py:328
os.symlink(/usr/src/paperless/media/documents/test-pdf/test.pdf,        helpers.py:179
/tmp/ocrmypdf.io.l0au_b4z/origin)                                                     
Gathering info with 1 thread workers                                       info.py:816
pikepdf mmap enabled                                                    helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                             tesseract_ocr.py:199
pikepdf mmap enabled                                                    helpers.py:328
    1 Rasterize with pngmono, rotation 0                              _pipeline.py:552
    1 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE',               __init__.py:133
'-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=1',                       
'-dLastPage=1', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',                  
'/tmp/ocrmypdf.io.l0au_b4z/000001_rasterize.png', '-sstdout=%stderr',                 
'-dAutoRotatePages=/None', '-f',                                                      
'/tmp/ocrmypdf.io.l0au_b4z/origin.pdf']                                               
    1 stderr = GPL Ghostscript 10.03.1 (2024-05-02)                     __init__.py:75
Copyright (C) 2024 Artifex Software, Inc.  All rights reserved.                       
This software is supplied under the GNU AGPLv3 and comes with NO                      
WARRANTY:                                                                             
see the file COPYING for details.                                                     
Processing pages 1 through 1.                                                         
Page 1                                                                                
                                                                                      
    1 Rotating output by 0                                          ghostscript.py:161
    1 resolution (399.9992, 399.9992)                                 _pipeline.py:631
    1 Running: ['tesseract', '-l', 'eng',                              __init__.py:133
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr.png',                                           
'/tmp/ocrmypdf.io.l0au_b4z/000001_ocr_hocr', 'hocr', 'txt']                           
    1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 792)                           _hocr.py:219
    1 Text rotation: (text, autorotate, content) -> text misalignment =  _graft.py:152
(0, 0, 0) -> 0                                                                        
    1 Grafting                                                           _graft.py:263
    1 Grafting with ctm pikepdf.Matrix(1, 0, 0, 1, 0, 0)                 _graft.py:306
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0               _graft.py:177
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                           ocr.py:144
An exception occurred while executing the pipeline                      _common.py:296
Traceback (most recent call last):                                                    
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py               
", line 261, in cli_exception_handler                                                 
    return fn(options, plugin_manager)                                                
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",                 
line 181, in _run_pipeline                                                            
    optimize_messages = exec_concurrent(context, executor)                            
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                            
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/ocr.py",                 
line 145, in exec_concurrent                                                          
    pdf, messages = postprocess(pdf, context, executor)                               
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                               
  File                                                                                
"/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipelines/_common.py               
", line 453, in postprocess                                                           
    pdf_out = convert_to_pdfa(pdf_out, ps_stub_out, context)                          
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                          
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_pipeline.py",               
line 907, in convert_to_pdfa                                                          
    if repair_docinfo_nuls(pdf_file):                                                 
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                  
  File "/usr/local/lib/python3.12/site-packages/ocrmypdf/_metadata.py",               
line 86, in repair_docinfo_nuls                                                       
    for k, v in pdf.docinfo.items():                                                  
                ^^^^^^^^^^^^^^^^^^^                                                   
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe5 in position 5:               
invalid continuation byte
```
