Open
Labels
triage (Issue needs triage)
Description
Describe the bug
On certain PDF files, ocrmypdf produces output whose words Spotlight indexes as concatenated, so the document is not searchable in Spotlight. Why does this happen?
Copying and pasting text from the output PDF works normally, however.
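One rough way to quantify the symptom is to compare token lengths in text taken from the PDF by two different routes (for example, pasted text versus whatever an indexer or text extractor returns). The sketch below is only a heuristic: the function names and the 12-character threshold are assumptions made for illustration, not anything OCRmyPDF or Spotlight provides.

```python
import re
from statistics import mean


def word_stats(text: str) -> dict:
    """Summarise whitespace-separated token lengths in extracted text."""
    tokens = re.findall(r"\S+", text)
    lengths = [len(t) for t in tokens]
    return {
        "tokens": len(tokens),
        "mean_length": mean(lengths) if lengths else 0.0,
        "longest": max(tokens, key=len) if tokens else "",
    }


def looks_concatenated(text: str, max_mean_length: float = 12.0) -> bool:
    """Heuristic flag: an unusually high mean token length suggests missing spaces.

    The default threshold is an arbitrary assumption; tune it for the document.
    """
    return word_stats(text)["mean_length"] > max_mean_length
```

Running `word_stats` on both extractions and comparing the `tokens` counts should make the difference visible: if one route sees far fewer, far longer tokens than the other, the spaces are being lost on that route.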
Steps to reproduce
Run `ocrmypdf --language fra --redo-ocr 'testredocr.pdf' 'testredocr_.pdf'`
`MacPorts` installed
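For repeated runs against several input files, the same invocation can be scripted. This is a minimal sketch that only uses the flags already shown in this report (`--language`, `--redo-ocr`, `-v1`); the helper names are hypothetical, and it assumes `ocrmypdf` is on `PATH`.

```python
import subprocess


def build_ocrmypdf_command(src: str, dst: str, language: str = "fra", verbosity: int = 1) -> list:
    """Build the argument list for the command used in this report."""
    return ["ocrmypdf", "--language", language, "--redo-ocr", f"-v{verbosity}", src, dst]


def run_ocrmypdf(src: str, dst: str, **kwargs) -> subprocess.CompletedProcess:
    """Run ocrmypdf and capture its output so it can be attached to a report."""
    cmd = build_ocrmypdf_command(src, dst, **kwargs)
    # check=False: return the result even on a non-zero exit code.
    return subprocess.run(cmd, capture_output=True, text=True, check=False)
```

Calling `run_ocrmypdf("testredocr.pdf", "testredocr_.pdf")` would reproduce the run above; the captured `stdout` and `stderr` can then be saved alongside the output file.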
Files
How did you download and install the software?
No response
OCRmyPDF version
16.10.1
Relevant log output
ocrmypdf --language fra --redo-ocr -v1 'testredocr.pdf' 'testredocr_.pdf'
ocrmypdf 16.10.1 __main__.py:59
Running: ['tesseract', '--version'] __init__.py:133
Found tesseract 5.4.1 __init__.py:345
Running: ['tesseract', '--version'] __init__.py:133
Running: ['tesseract', '--version'] __init__.py:133
Running: ['gs', '--version'] __init__.py:133
Found gs 10.5.0 __init__.py:345
Running: ['gs', '--version'] __init__.py:133
Running: ['tesseract', '--list-langs'] __init__.py:133
stdout/stderr = List of available languages in "/opt/local/share/tessdata/" (4): __init__.py:73
deu
eng
fra
osd
pikepdf mmap enabled helpers.py:328
os.symlink(testredocr.pdf, /var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin) helpers.py:179
Gathering info with 1 thread workers info.py:816
pikepdf mmap enabled helpers.py:328
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Using Tesseract OpenMP thread limit 3 tesseract_ocr.py:199
Start processing 2 pages concurrently ocr.py:96
pikepdf mmap enabled helpers.py:328
pikepdf mmap enabled helpers.py:328
1 redoing OCR _pipeline.py:340
2 redoing OCR _pipeline.py:340
1 Rasterize with png16m, rotation 0 _pipeline.py:552
2 Rasterize with pngmono, rotation 0 _pipeline.py:552
1 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=png16m', '-dFirstPage=1', __init__.py:133
'-dLastPage=1', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_rasterize.png', '-sstdout=%stderr',
'-dAutoRotatePages=/None', '-f', '/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin.pdf']
2 Running: ['gs', '-dSAFER', '-dBATCH', '-dNOPAUSE', '-dInterpolateControl=-1', '-sDEVICE=pngmono', '-dFirstPage=2', __init__.py:133
'-dLastPage=2', '-r400.000000x400.000000', '-dPDFSTOPONERROR', '-o',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_rasterize.png', '-sstdout=%stderr',
'-dAutoRotatePages=/None', '-f', '/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/origin.pdf']
2 stderr = GPL Ghostscript 10.05.0 (2025-03-12) __init__.py:75
Copyright (C) 2025 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 2 through 2.
Page 2
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular
2 Rotating output by 0 ghostscript.py:161
2 resolution (399.9992, 399.9992) _pipeline.py:631
2 blanking (560.7051563652223, 324.00398378722275, 738.5048007652222, 390.6705171205558) _pipeline.py:655
2 blanking (1254.061769649222, 324.00398378722275, 1383.7281769825554, 390.6705171205558) _pipeline.py:655
2 blanking (1943.751723599889, 324.00398378722275, 2032.618212533222, 390.6705171205558) _pipeline.py:655
2 blanking (2592.675092417222, 324.00398378722275, 2722.274833217222, 390.6705171205558) _pipeline.py:655
2 blanking (520.8385694318888, 428.84821854277743, 778.3713876985554, 495.5147518761114) _pipeline.py:655
2 blanking (1222.9284985825554, 428.84821854277743, 1371.1948687158888, 495.5147518761114) _pipeline.py:655
2 blanking (1873.6851970665555, 428.84821854277743, 2059.018159733222, 495.5147518761114) _pipeline.py:655
2 blanking (2542.975191817222, 428.84821854277743, 2728.3081544838888, 495.5147518761114) _pipeline.py:655
2 blanking (314.9599811898889, 617.0256199649998, 920.5587699898888, 683.6921532983333) _pipeline.py:655
2 blanking (2763.1204181925555, 617.0256199649998, 2948.453380859222, 683.6921532983333) _pipeline.py:655
2 blanking (1053.2745045578888, 805.2030213872226, 2253.8054368245557, 948.9360672538887) _pipeline.py:655
2 blanking (314.9599811898889, 959.3360464538891, 2504.8222681232223, 1026.0025797872227) _pipeline.py:655
2 blanking (2899.4534788592223, 4294.1734878767775, 2992.119960192556, 4360.840021210111) _pipeline.py:655
2 Running: ['tesseract', '-l', 'fra', __init__.py:133
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_ocr.png',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000002_ocr_hocr', 'hocr', 'txt']
1 stderr = GPL Ghostscript 10.05.0 (2025-03-12) __init__.py:75
Copyright (C) 2025 Artifex Software, Inc. All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 1.
Page 1
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular
Loading font Helvetica-Bold (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Bold
2 [tesseract] Empty page!! tesseract.py:259
2 [tesseract] Empty page!! tesseract.py:259
2 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86) _hocr.py:219
2 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:152
2 Grafting _graft.py:263
2 Grafting with ctm pikepdf.Matrix(1.00003, 0, 0, 1.00004, 0, -5.68434e-14) _graft.py:306
2 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:177
1 resolution (399.9992, 399.9992) _pipeline.py:631
1 blanking (407.3873518903333, 326.0706463205561, 1114.919270157, 628.6700411205561) _pipeline.py:655
1 blanking (2175.1273719587775, 441.67041512055584, 2916.7258887587777, 744.0698103205564) _pipeline.py:655
1 blanking (314.9599811898889, 639.0700203205565, 1078.0251217232221, 782.8030661872231) _pipeline.py:655
1 blanking (394.4873776903333, 793.2030453872226, 1127.8192443569997, 859.8695787205561) _pipeline.py:655
1 blanking (1045.8745193578889, 981.3804468094445, 2261.2054220245554, 1125.113492676111) _pipeline.py:655
1 blanking (1092.1744267578888, 1248.6910232983332, 2214.9055146245555, 1700.8901188983332) _pipeline.py:655
1 blanking (314.9599811898889, 1824.4676495205563, 463.0930182565555, 1891.1341828538893) _pipeline.py:655
1 blanking (1685.6627397788886, 1824.4676495205563, 1956.0621989788888, 1891.1341828538893) _pipeline.py:655
1 blanking (2090.6742630875556, 1824.4676495205563, 2220.3406704208887, 1891.1341828538893) _pipeline.py:655
1 blanking (2427.1525901295554, 1824.4676495205563, 2553.1523381295556, 1891.1341828538893) _pipeline.py:655
1 blanking (2767.364243038222, 1824.4676495205563, 2882.230679971555, 1891.1341828538893) _pipeline.py:655
1 blanking (2394.1859893962223, 1981.9340012538892, 2542.4523595295555, 2048.6005345872227) _pipeline.py:655
1 blanking (2800.7308429715554, 1981.9340012538892, 2848.8640800382223, 2048.6005345872227) _pipeline.py:655
1 blanking (314.9599811898889, 1957.2896060983335, 1256.0247657232221, 2101.022651965) _pipeline.py:655
1 blanking (577.5127338611111, 2111.422631165, 833.2455557277777, 2178.0891644983335) _pipeline.py:655
1 blanking (911.809176378, 2111.422631165, 1248.3389477615556, 2178.0891644983335) _pipeline.py:655
1 blanking (2394.1859893962223, 2202.3780048094445, 2542.4523595295555, 2269.0445381427776) _pipeline.py:655
1 blanking (2800.7308429715554, 2202.3780048094445, 2848.8640800382223, 2269.0445381427776) _pipeline.py:655
1 blanking (314.9599811898889, 2216.2668659205556, 518.7595735898889, 2282.9333992538886) _pipeline.py:655
1 blanking (577.5127338611111, 2293.333378453889, 833.2455557277777, 2359.999911787222) _pipeline.py:655
1 blanking (911.809176378, 2293.333378453889, 1248.3389477615556, 2359.999911787222) _pipeline.py:655
1 blanking (2394.1859893962223, 2384.2887520983336, 2542.4523595295555, 2450.9552854316667) _pipeline.py:655
1 blanking (2800.7308429715554, 2384.2887520983336, 2848.8640800382223, 2450.9552854316667) _pipeline.py:655
1 blanking (314.9599811898889, 2398.1776132094446, 533.5595439898889, 2464.8441465427777) _pipeline.py:655
1 blanking (577.5127338611111, 2475.244125742778, 833.2455557277777, 2541.910659076111) _pipeline.py:655
1 blanking (911.809176378, 2475.244125742778, 1248.3389477615556, 2541.910659076111) _pipeline.py:655
1 blanking (2394.1859893962223, 2566.199499387222, 2542.4523595295555, 2632.8660327205553) _pipeline.py:655
1 blanking (2800.7308429715554, 2566.199499387222, 2848.8640800382223, 2632.8660327205553) _pipeline.py:655
1 blanking (314.9599811898889, 2580.0883604983333, 833.2455557277777, 2723.8214063650003) _pipeline.py:655
1 blanking (911.809176378, 2657.1548730316667, 1248.3389477615556, 2723.8214063650003) _pipeline.py:655
1 blanking (2394.1859893962223, 2748.1102466761113, 2542.4523595295555, 2814.7767800094443) _pipeline.py:655
1 blanking (2800.7308429715554, 2748.1102466761113, 2848.8640800382223, 2814.7767800094443) _pipeline.py:655
1 blanking (314.9599811898889, 2761.9991077872223, 774.4257289232221, 2828.665641120556) _pipeline.py:655
1 blanking (577.5127338611111, 2839.065620320556, 833.2455557277777, 2905.7321536538893) _pipeline.py:655
1 blanking (911.809176378, 2839.065620320556, 1248.3389477615556, 2905.7321536538893) _pipeline.py:655
1 blanking (2394.1859893962223, 2930.020993965, 2542.4523595295555, 2996.6875272983334) _pipeline.py:655
1 blanking (2800.7308429715554, 2930.020993965, 2848.8640800382223, 2996.6875272983334) _pipeline.py:655
1 blanking (314.9599811898889, 2943.909855076111, 533.5595439898889, 3010.5763884094445) _pipeline.py:655
1 blanking (577.5127338611111, 3020.9763676094444, 833.2455557277777, 3087.6429009427775) _pipeline.py:655
1 blanking (911.809176378, 3020.9763676094444, 1248.3389477615556, 3087.6429009427775) _pipeline.py:655
1 blanking (1802.329173112222, 3111.931741253889, 1839.3957656455555, 3178.5982745872225) _pipeline.py:655
1 blanking (2059.540992020889, 3111.931741253889, 2207.807362154222, 3178.5982745872225) _pipeline.py:655
1 blanking (2394.1859893962223, 3111.931741253889, 2542.4523595295555, 3178.5982745872225) _pipeline.py:655
1 blanking (2800.7308429715554, 3111.931741253889, 2848.8640800382223, 3178.5982745872225) _pipeline.py:655
1 blanking (314.9599811898889, 3125.820602365, 1163.4916174565553, 3192.487135698333) _pipeline.py:655
1 blanking (1155.3743003578886, 3316.0646663205557, 2151.7056410245555, 3382.7311996538892) _pipeline.py:655
1 blanking (1481.3069818245554, 3506.508729876111, 1825.7729595578885, 3573.1752632094444) _pipeline.py:655
1 blanking (2763.1204181925555, 3694.8861308983337, 2948.453380859222, 3915.6856892983333) _pipeline.py:655
1 blanking (314.9599811898889, 3694.8861308983337, 918.8254401232223, 3992.7522018316668) _pipeline.py:655
1 blanking (2955.053367659222, 3926.0856684983332, 2992.1199601925555, 3992.7522018316668) _pipeline.py:655
1 blanking (2899.4534788592223, 4294.1734878767775, 2992.119960192556, 4360.840021210111) _pipeline.py:655
1 Running: ['tesseract', '-l', 'fra', __init__.py:133
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_ocr.png',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/000001_ocr_hocr', 'hocr', 'txt']
1 pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86) _hocr.py:219
1 fra _hocr.py:287
1 fra _hocr.py:287
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:152
1 Grafting _graft.py:263
1 Grafting with ctm pikepdf.Matrix(1.00003, 0, 0, 1.00004, 0, -5.68434e-14) _graft.py:306
1 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:177
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Postprocessing... ocr.py:144
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/graft_layers.pdf, helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/fix_docinfo.pdf)
Running: ['gs', '--version'] __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', __init__.py:133
'-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true',
'-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=2', '-dPDFACompatibilityPolicy=1', '-o',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/pdfa.pdf', '-sstdout=%stderr',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/pdfa.ps',
'/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/fix_docinfo.pdf']
GPL Ghostscript 10.05.0 (2025-03-12) __init__.py:108
Copyright (C) 2025 Artifex Software, Inc. All rights reserved. __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: __init__.py:108
see the file COPYING for details. __init__.py:108
Processing pages 1 through 2. __init__.py:108
Page 1 __init__.py:108
Loading font Helvetica (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Regular __init__.py:108
Loading font Helvetica-Bold (or substitute) from /opt/local/share/ghostscript/10.05.0/Resource/Font/NimbusSans-Bold __init__.py:108
Page 2 __init__.py:108
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
Running: ['tesseract', '--version'] __init__.py:133
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 27: treating as an optimization candidate optimize.py:290
XrefExt(xref=27, ext='.png') optimize.py:355
Optimizable images: JPEGs: 0 PNGs: 1 optimize.py:360
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
xref 27: treating as an optimization candidate optimize.py:290
xref 27: marking this JPEG as deflatable optimize.py:555
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
xref 27: treating as an optimization candidate optimize.py:290
xref 27: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization optimize.py:103
Optimizable images: JBIG2 groups: 0 optimize.py:371
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
os.symlink(/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.opt.pdf, helpers.py:179
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.pdf)
Running: ['jbig2', '--version'] __init__.py:133
Running: ['pngquant', '--version'] __init__.py:133
Image optimization ratio: 1.26 savings: 20.7% _pipeline.py:1002
Total file size ratio: 0.82 savings: -21.6% _pipeline.py:1005
/var/folders/ps/z7flxvdj3b97p9_07lknl6dc0000gn/T/ocrmypdf.io.7jpmkds2/optimize.pdf -> testredocr_.pdf _pipeline.py:1077
Output file is a PDF/A-2B (as expected) _common.py:474
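Most of the log consists of `blanking (...)` lines, given in pixels of the rasterized page at the logged `resolution`. My reading, which the log itself does not confirm, is that these are regions masked out before Tesseract runs in `--redo-ocr` mode, which would also be consistent with the `Empty page!!` messages for page 2. The sketch below parses those lines per page and scales the boxes to PDF points (72 per inch); the function names are hypothetical, and it assumes the four numbers are `(x0, y0, x1, y1)`.

```python
import re
from collections import defaultdict

# Lines look like: "2 blanking (x0, y0, x1, y1) _pipeline.py:655"
BLANKING = re.compile(r"^(\d+)\s+blanking \(([^)]*)\)")
# Lines look like: "2 resolution (399.9992, 399.9992) _pipeline.py:631"
RESOLUTION = re.compile(r"^(\d+)\s+resolution \(([^)]*)\)")


def _floats(csv: str) -> tuple:
    return tuple(float(v) for v in csv.split(","))


def parse_log(log_text: str) -> dict:
    """Collect the logged resolution and blanking boxes for each page number."""
    pages = defaultdict(lambda: {"dpi": None, "boxes": []})
    for line in log_text.splitlines():
        line = line.strip()
        if m := RESOLUTION.match(line):
            pages[int(m[1])]["dpi"] = _floats(m[2])
        elif m := BLANKING.match(line):
            pages[int(m[1])]["boxes"].append(_floats(m[2]))
    return dict(pages)


def box_to_points(box: tuple, dpi: tuple) -> tuple:
    """Scale a pixel box to PDF points. Only scales; the y axis is not flipped."""
    x0, y0, x1, y1 = box
    sx, sy = 72.0 / dpi[0], 72.0 / dpi[1]
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)
```

At roughly 400 dpi the scale factor is 72/400 = 0.18, which matches the `pikepdf.Matrix(0.18, 0, 0, -0.18, 0, 841.86)` lines in the log. Comparing the converted boxes with the text positions in the input PDF may help show which words were masked and which were sent to OCR.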