[Bug]: macOS: trouble with Spotlight indexation of some files #1522

@macdeport

Describe the bug

On certain PDF files, ocrmypdf produces output whose words Spotlight indexes as concatenated (run together), so the document is not searchable. Why does this happen?

testredocr.pdf (input): [screenshot]

testredocr_.pdf (output): [screenshot]

However, copying and pasting the text works normally.
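One way to quantify the symptom, assuming the text of each file has already been extracted to a plain string by some other tool, is to check whether the two strings contain the same characters once whitespace is removed while the second has far fewer separate words. The function name `looks_concatenated` and the 0.5 threshold are hypothetical choices for illustration; they are not part of ocrmypdf or Spotlight.

```python
def looks_concatenated(reference_text: str, indexed_text: str, ratio: float = 0.5) -> bool:
    """Return True if indexed_text has the same characters as reference_text
    but far fewer whitespace-separated words (words were run together)."""
    ref_words = reference_text.split()
    idx_words = indexed_text.split()
    if not ref_words:
        return False
    # Same content once all whitespace is dropped?
    same_characters = "".join(ref_words) == "".join(idx_words)
    # Assumption: "far fewer" means fewer than `ratio` times the reference count.
    return same_characters and len(idx_words) < ratio * len(ref_words)


if __name__ == "__main__":
    # Made-up sample strings, not taken from the attached files.
    print(looks_concatenated("Bonjour tout le monde", "Bonjourtoutlemonde"))
```

A check like this only confirms that spaces are missing from whatever text was extracted; it does not say whether the PDF text layer or the indexer is responsible.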

Steps to reproduce

Run `ocrmypdf --language fra --redo-ocr 'testredocr.pdf' 'testredocr_.pdf'`

Installed with `MacPorts`.
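For scripted reproduction, the same command can be assembled and launched from Python. This is a minimal sketch that only uses the flags shown in this report (`--language`, `--redo-ocr`, `-v1`); the helper names `build_command` and `run_ocr` are hypothetical, and it assumes the `ocrmypdf` executable is on `PATH`.

```python
import subprocess


def build_command(input_pdf: str, output_pdf: str,
                  language: str = "fra", verbosity: int = 1) -> list[str]:
    """Build the argument list for the command used in this report."""
    return [
        "ocrmypdf",
        "--language", language,
        "--redo-ocr",
        f"-v{verbosity}",
        input_pdf,
        output_pdf,
    ]


def run_ocr(input_pdf: str, output_pdf: str) -> int:
    """Run ocrmypdf and return its exit code (assumes ocrmypdf is installed)."""
    completed = subprocess.run(build_command(input_pdf, output_pdf))
    return completed.returncode
```

Calling `run_ocr("testredocr.pdf", "testredocr_.pdf")` would run the reproduction step. Passing the arguments as a list avoids shell quoting issues with file names.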

Files

testredocr.pdf

testredocr_.pdf

How did you download and install the software?

No response

OCRmyPDF version

16.10.1

Relevant log output

ocrmypdf --language fra --redo-ocr -v1 'testredocr.pdf' 'testredocr_.pdf'
ocrmypdf 16.10.1                                                                                                                 __main__.py:59
Running: ['tesseract', '--version']                                                                                             __init__.py:133
Found tesseract 5.4.1                                                                                                           __init__.py:345
Running: ['tesseract', '--version']                                                                                             __init__.py:133
Running: ['tesseract', '--version']                                                                                             __init__.py:133
Running: ['gs', '--version']                                                                                                    __init__.py:133
Found gs 10.5.0                                                                                                                 __init__.py:345
Running: ['gs', '--version']                                                                                                    __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                          __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (4):                                                 __init__.py:73
deu                                                                                                                                            
eng                                                                                                                                            
fra                                                                                                                                            
osd                                                                                                                                            
                                                                                                                                               
pikepdf mmap enabled                                                                                                             helpers.py:328
os.symlink(testredocr.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin)                         helpers.py:179
Gathering info with 1 thread workers                                                                                                info.py:816
pikepdf mmap enabled                                                                                                             helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 3                                                                                      tesseract_ocr.py:199
Start processing 2 pages concurrently                                                                                                 ocr.py:96
pikepdf mmap enabled                                                                                                             helpers.py:328
pikepdf mmap enabled                                                                                                             helpers.py:328
    1 redoing OCR                                                                                                              _pipeline.py:340
    2 redoing OCR                                                                                                              _pipeline.py:340
    1 Rasterize with png16m, rotation 0                                                                                        _pipeline.py:552
    2 Rasterize with pngmono, rotation 0                                                                                       _pipeline.py:552
    1 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1',         __init__.py:133
'-dLastPage=1', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',                                                                           
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_rasterize.png', '-sstdout=%stderr',                              
'-dAutoRotatePages=/None', '-f', '/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin.pdf']                           
    2 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=2',        __init__.py:133
'-dLastPage=2', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',                                                                           
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_rasterize.png', '-sstdout=%stderr',                              
'-dAutoRotatePages=/None', '-f', '/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin.pdf']                           
    2 stderr = GPL Ghostscript 10.05.0 (2025-03-12)                                                                              __init__.py:75
Copyright (C) 2025 Artifex Software, Inc.  All rights reserved.                                                                                
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                     
see the file COPYING for details.                                                                                                              
Processing pages 2 through 2.                                                                                                                  
Page 2                                                                                                                                         
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular                              
                                                                                                                                               
    2 Rotating output by 0                                                                                                   ghostscript.py:161
    2 resolution (399.9992, 399.9992)                                                                                          _pipeline.py:631
    2 blanking (560.7051563652223, 324.00398378722275, 738.5048007652222, 390.6705171205558)                                   _pipeline.py:655
    2 blanking (1254.061769649222, 324.00398378722275, 1383.7281769825554, 390.6705171205558)                                  _pipeline.py:655
    2 blanking (1943.751723599889, 324.00398378722275, 2032.618212533222, 390.6705171205558)                                   _pipeline.py:655
    2 blanking (2592.675092417222, 324.00398378722275, 2722.274833217222, 390.6705171205558)                                   _pipeline.py:655
    2 blanking (520.8385694318888, 428.84821854277743, 778.3713876985554, 495.5147518761114)                                   _pipeline.py:655
    2 blanking (1222.9284985825554, 428.84821854277743, 1371.1948687158888, 495.5147518761114)                                 _pipeline.py:655
    2 blanking (1873.6851970665555, 428.84821854277743, 2059.018159733222, 495.5147518761114)                                  _pipeline.py:655
    2 blanking (2542.975191817222, 428.84821854277743, 2728.3081544838888, 495.5147518761114)                                  _pipeline.py:655
    2 blanking (314.9599811898889, 617.0256199649998, 920.5587699898888, 683.6921532983333)                                    _pipeline.py:655
    2 blanking (2763.1204181925555, 617.0256199649998, 2948.453380859222, 683.6921532983333)                                   _pipeline.py:655
    2 blanking (1053.2745045578888, 805.2030213872226, 2253.8054368245557, 948.9360672538887)                                  _pipeline.py:655
    2 blanking (314.9599811898889, 959.3360464538891, 2504.8222681232223, 1026.0025797872227)                                  _pipeline.py:655
    2 blanking (2899.4534788592223, 4294.1734878767775, 2992.119960192556, 4360.840021210111)                                  _pipeline.py:655
    2 Running: ['tesseract', '-l', 'fra',                                                                                       __init__.py:133
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_ocr.png',                                                        
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_ocr_hocr', 'hocr', 'txt']                                        
    1 stderr = GPL Ghostscript 10.05.0 (2025-03-12)                                                                              __init__.py:75
Copyright (C) 2025 Artifex Software, Inc.  All rights reserved.                                                                                
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                     
see the file COPYING for details.                                                                                                              
Processing pages 1 through 1.                                                                                                                  
Page 1                                                                                                                                         
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular                              
Loading font Helvetica-Bold (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Bold                            
                                                                                                                                               
    2 [tesseract] Empty page!!                                                                                                 tesseract.py:259
    2 [tesseract] Empty page!!                                                                                                 tesseract.py:259
    2 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86)                                                                                 _hocr.py:219
    2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                            _graft.py:152
    2 Grafting                                                                                                                    _graft.py:263
    2 Grafting with ctm pikepdf.Matrix(1.00003, 0, 0, 1.00004, 0, -5.68434e-14)                                                   _graft.py:306
    2 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                        _graft.py:177
    1 resolution (399.9992, 399.9992)                                                                                          _pipeline.py:631
    1 blanking (407.3873518903333, 326.0706463205561, 1114.919270157, 628.6700411205561)                                       _pipeline.py:655
    1 blanking (2175.1273719587775, 441.67041512055584, 2916.7258887587777, 744.0698103205564)                                 _pipeline.py:655
    1 blanking (314.9599811898889, 639.0700203205565, 1078.0251217232221, 782.8030661872231)                                   _pipeline.py:655
    1 blanking (394.4873776903333, 793.2030453872226, 1127.8192443569997, 859.8695787205561)                                   _pipeline.py:655
    1 blanking (1045.8745193578889, 981.3804468094445, 2261.2054220245554, 1125.113492676111)                                  _pipeline.py:655
    1 blanking (1092.1744267578888, 1248.6910232983332, 2214.9055146245555, 1700.8901188983332)                                _pipeline.py:655
    1 blanking (314.9599811898889, 1824.4676495205563, 463.0930182565555, 1891.1341828538893)                                  _pipeline.py:655
    1 blanking (1685.6627397788886, 1824.4676495205563, 1956.0621989788888, 1891.1341828538893)                                _pipeline.py:655
    1 blanking (2090.6742630875556, 1824.4676495205563, 2220.3406704208887, 1891.1341828538893)                                _pipeline.py:655
    1 blanking (2427.1525901295554, 1824.4676495205563, 2553.1523381295556, 1891.1341828538893)                                _pipeline.py:655
    1 blanking (2767.364243038222, 1824.4676495205563, 2882.230679971555, 1891.1341828538893)                                  _pipeline.py:655
    1 blanking (2394.1859893962223, 1981.9340012538892, 2542.4523595295555, 2048.6005345872227)                                _pipeline.py:655
    1 blanking (2800.7308429715554, 1981.9340012538892, 2848.8640800382223, 2048.6005345872227)                                _pipeline.py:655
    1 blanking (314.9599811898889, 1957.2896060983335, 1256.0247657232221, 2101.022651965)                                     _pipeline.py:655
    1 blanking (577.5127338611111, 2111.422631165, 833.2455557277777, 2178.0891644983335)                                      _pipeline.py:655
    1 blanking (911.809176378, 2111.422631165, 1248.3389477615556, 2178.0891644983335)                                         _pipeline.py:655
    1 blanking (2394.1859893962223, 2202.3780048094445, 2542.4523595295555, 2269.0445381427776)                                _pipeline.py:655
    1 blanking (2800.7308429715554, 2202.3780048094445, 2848.8640800382223, 2269.0445381427776)                                _pipeline.py:655
    1 blanking (314.9599811898889, 2216.2668659205556, 518.7595735898889, 2282.9333992538886)                                  _pipeline.py:655
    1 blanking (577.5127338611111, 2293.333378453889, 833.2455557277777, 2359.999911787222)                                    _pipeline.py:655
    1 blanking (911.809176378, 2293.333378453889, 1248.3389477615556, 2359.999911787222)                                       _pipeline.py:655
    1 blanking (2394.1859893962223, 2384.2887520983336, 2542.4523595295555, 2450.9552854316667)                                _pipeline.py:655
    1 blanking (2800.7308429715554, 2384.2887520983336, 2848.8640800382223, 2450.9552854316667)                                _pipeline.py:655
    1 blanking (314.9599811898889, 2398.1776132094446, 533.5595439898889, 2464.8441465427777)                                  _pipeline.py:655
    1 blanking (577.5127338611111, 2475.244125742778, 833.2455557277777, 2541.910659076111)                                    _pipeline.py:655
    1 blanking (911.809176378, 2475.244125742778, 1248.3389477615556, 2541.910659076111)                                       _pipeline.py:655
    1 blanking (2394.1859893962223, 2566.199499387222, 2542.4523595295555, 2632.8660327205553)                                 _pipeline.py:655
    1 blanking (2800.7308429715554, 2566.199499387222, 2848.8640800382223, 2632.8660327205553)                                 _pipeline.py:655
    1 blanking (314.9599811898889, 2580.0883604983333, 833.2455557277777, 2723.8214063650003)                                  _pipeline.py:655
    1 blanking (911.809176378, 2657.1548730316667, 1248.3389477615556, 2723.8214063650003)                                     _pipeline.py:655
    1 blanking (2394.1859893962223, 2748.1102466761113, 2542.4523595295555, 2814.7767800094443)                                _pipeline.py:655
    1 blanking (2800.7308429715554, 2748.1102466761113, 2848.8640800382223, 2814.7767800094443)                                _pipeline.py:655
    1 blanking (314.9599811898889, 2761.9991077872223, 774.4257289232221, 2828.665641120556)                                   _pipeline.py:655
    1 blanking (577.5127338611111, 2839.065620320556, 833.2455557277777, 2905.7321536538893)                                   _pipeline.py:655
    1 blanking (911.809176378, 2839.065620320556, 1248.3389477615556, 2905.7321536538893)                                      _pipeline.py:655
    1 blanking (2394.1859893962223, 2930.020993965, 2542.4523595295555, 2996.6875272983334)                                    _pipeline.py:655
    1 blanking (2800.7308429715554, 2930.020993965, 2848.8640800382223, 2996.6875272983334)                                    _pipeline.py:655
    1 blanking (314.9599811898889, 2943.909855076111, 533.5595439898889, 3010.5763884094445)                                   _pipeline.py:655
    1 blanking (577.5127338611111, 3020.9763676094444, 833.2455557277777, 3087.6429009427775)                                  _pipeline.py:655
    1 blanking (911.809176378, 3020.9763676094444, 1248.3389477615556, 3087.6429009427775)                                     _pipeline.py:655
    1 blanking (1802.329173112222, 3111.931741253889, 1839.3957656455555, 3178.5982745872225)                                  _pipeline.py:655
    1 blanking (2059.540992020889, 3111.931741253889, 2207.807362154222, 3178.5982745872225)                                   _pipeline.py:655
    1 blanking (2394.1859893962223, 3111.931741253889, 2542.4523595295555, 3178.5982745872225)                                 _pipeline.py:655
    1 blanking (2800.7308429715554, 3111.931741253889, 2848.8640800382223, 3178.5982745872225)                                 _pipeline.py:655
    1 blanking (314.9599811898889, 3125.820602365, 1163.4916174565553, 3192.487135698333)                                      _pipeline.py:655
    1 blanking (1155.3743003578886, 3316.0646663205557, 2151.7056410245555, 3382.7311996538892)                                _pipeline.py:655
    1 blanking (1481.3069818245554, 3506.508729876111, 1825.7729595578885, 3573.1752632094444)                                 _pipeline.py:655
    1 blanking (2763.1204181925555, 3694.8861308983337, 2948.453380859222, 3915.6856892983333)                                 _pipeline.py:655
    1 blanking (314.9599811898889, 3694.8861308983337, 918.8254401232223, 3992.7522018316668)                                  _pipeline.py:655
    1 blanking (2955.053367659222, 3926.0856684983332, 2992.1199601925555, 3992.7522018316668)                                 _pipeline.py:655
    1 blanking (2899.4534788592223, 4294.1734878767775, 2992.119960192556, 4360.840021210111)                                  _pipeline.py:655
    1 Running: ['tesseract', '-l', 'fra',                                                                                       __init__.py:133
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_ocr.png',                                                        
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_ocr_hocr', 'hocr', 'txt']                                        
    1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86)                                                                                 _hocr.py:219
    1 fra                                                                                                                          _hocr.py:287
    1 fra                                                                                                                          _hocr.py:287
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                            _graft.py:152
    1 Grafting                                                                                                                    _graft.py:263
    1 Grafting with ctm pikepdf.Matrix(1.00003, 0, 0, 1.00004, 0, -5.68434e-14)                                                   _graft.py:306
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                        _graft.py:177
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing...                                                                                                                    ocr.py:144
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/graft_layers.pdf,                               helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/fix_docinfo.pdf)                                                         
Running: ['gs', '--version']                                                                                                    __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None',  __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true',                                           
'-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o',                                                  
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/pdfa.pdf', '-sstdout=%stderr',                                          
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/pdfa.ps',                                                               
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/fix_docinfo.pdf']                                                       
GPL Ghostscript 10.05.0 (2025-03-12)                                                                                            __init__.py:108
Copyright (C) 2025 Artifex Software, Inc.  All rights reserved.                                                                 __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                      __init__.py:108
see the file COPYING for details.                                                                                               __init__.py:108
Processing pages 1 through 2.                                                                                                   __init__.py:108
Page 1                                                                                                                          __init__.py:108
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular               __init__.py:108
Loading font Helvetica-Bold (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Bold             __init__.py:108
Page 2                                                                                                                          __init__.py:108
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Running: ['tesseract', '--version']                                                                                             __init__.py:133
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 27: treating as an optimization candidate                                                                                  optimize.py:290
XrefExt(xref=27, ext='.png')                                                                                                    optimize.py:355
Optimizable images: JPEGs: 0 PNGs: 1                                                                                            optimize.py:360
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 27: treating as an optimization candidate                                                                                  optimize.py:290
xref 27: marking this JPEG as deflatable                                                                                        optimize.py:555
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 27: treating as an optimization candidate                                                                                  optimize.py:290
xref 27: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                                        optimize.py:103
Optimizable images: JBIG2 groups: 0                                                                                             optimize.py:371
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.opt.pdf,                               helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.pdf)                                                            
Running: ['jbig2', '--version']                                                                                                 __init__.py:133
Running: ['pngquant', '--version']                                                                                              __init__.py:133
Image optimization ratio: 1.26 savings: 20.7%                                                                                 _pipeline.py:1002
Total file size ratio: 0.82 savings: -21.6%                                                                                   _pipeline.py:1005
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.pdf -> testredocr_.pdf                         _pipeline.py:1077
Output file is a PDF/A-2B (as expected)                                                                                          _common.py:474
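
The log reports `pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86)` for both pages, and 0.18 equals 72/400, the ratio of PDF points per inch to the 400 dpi rasterization resolution. Assuming this matrix maps pixel coordinates of the rasterized page to PDF points (an interpretation of the log, not something confirmed in this report), the `blanking` rectangles can be converted to page coordinates with plain arithmetic. The helper names below are hypothetical, and pikepdf is not needed for this sketch.

```python
def apply_matrix(matrix, x, y):
    """Apply a PDF-style matrix (a, b, c, d, e, f) to the point (x, y)."""
    a, b, c, d, e, f = matrix
    return (a * x + c * y + e, b * x + d * y + f)


def rect_to_points(matrix, rect):
    """Transform a rectangle (x0, y0, x1, y1) and re-order its corners,
    since the negative y scale flips top and bottom."""
    x0, y0, x1, y1 = rect
    px0, py0 = apply_matrix(matrix, x0, y0)
    px1, py1 = apply_matrix(matrix, x1, y1)
    return (min(px0, px1), min(py0, py1), max(px0, px1), max(py0, py1))


# Matrix values copied from the log; the pixel-to-point reading is an assumption.
PIXELS_TO_POINTS = (0.18, 0, 0, -0.18, 0, 841.86)

# First "blanking" rectangle logged for page 2.
FIRST_BLANKED = (560.7051563652223, 324.00398378722275,
                 738.5048007652222, 390.6705171205558)

if __name__ == "__main__":
    print(rect_to_points(PIXELS_TO_POINTS, FIRST_BLANKED))
```

Under that assumption, the first blanked region on page 2 is about 12 points tall, which would be consistent with a single line of body text.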

Labels: triage (Issue needs triage)
