[Bug]: Text is garbled using --skip-text #1565

@albertmatyi

Describe the bug

Text in the output PDF is garbled after running OCRmyPDF with `--skip-text`.

Before

Image

After

Image

Steps to reproduce

1. Run `ocrmypdf --skip-text --output-type pdfa-1 ~/tmp/input.pdf ~/tmp/output.pdf`
2. Open `output.pdf` in `evince` or `google-chrome`
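
For anyone who wants to script the reproduction, here is a minimal sketch that wraps the command from step 1 using only the Python standard library. The helper names `build_repro_command` and `run_repro` are hypothetical and made up for this illustration; they are not part of OCRmyPDF. The sketch assumes the `ocrmypdf` executable is on `PATH` and does nothing if it is not found or if the input file is missing.

```python
import shutil
import subprocess
from pathlib import Path


def build_repro_command(input_pdf, output_pdf):
    """Return the argument list for the command shown in step 1.

    Hypothetical helper for illustration only; the flags are the ones
    used in this report.
    """
    return [
        "ocrmypdf",
        "--skip-text",
        "--output-type", "pdfa-1",
        str(input_pdf),
        str(output_pdf),
    ]


def run_repro(input_pdf, output_pdf):
    """Run the command and return its exit code.

    Returns None without running anything when ocrmypdf is not on PATH
    or the input file does not exist.
    """
    if shutil.which("ocrmypdf") is None or not Path(input_pdf).is_file():
        return None
    completed = subprocess.run(build_repro_command(input_pdf, output_pdf))
    return completed.returncode
```

Calling `run_repro(Path.home() / "tmp" / "input.pdf", Path.home() / "tmp" / "output.pdf")` should be equivalent to step 1; the resulting file still has to be opened in a viewer to see the garbled text.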

Files

input.pdf

output.pdf

How did you download and install the software?

Linux package manager (apt, dnf, etc.)

OCRmyPDF version

v16.10.4 (the problem was also present in 13.x.x)

Relevant log output

ocrmypdf 0.0.0                                                                                                                                                                                         __main__.py:59
Running: ['tesseract', '--version']                                                                                                                                                                   __init__.py:133
Found tesseract 4.1.1                                                                                                                                                                                 __init__.py:343
Running: ['tesseract', '--version']                                                                                                                                                                   __init__.py:133
Running: ['tesseract', '--version']                                                                                                                                                                   __init__.py:133
Running: ['gs', '--version']                                                                                                                                                                          __init__.py:133
Found gs 9.55.0                                                                                                                                                                                       __init__.py:343
Running: ['gs', '--version']                                                                                                                                                                          __init__.py:133
Running: ['tesseract', '--list-langs']                                                                                                                                                                __init__.py:133
stdout/stderr = List of available languages (161):                                                                                                                                                     __init__.py:73
Arabic                                                                                                                                                                                                               
Armenian                                                                                                                                                                                                             
Bengali                                                                                                                                                                                                              
Canadian_Aboriginal                                                                                                                                                                                                  
Cherokee                                                                                                                                                                                                             
Cyrillic                                                                                                                                                                                                             
Devanagari                                                                                                                                                                                                           
Ethiopic                                                                                                                                                                                                             
Fraktur                                                                                                                                                                                                              
Georgian                                                                                                                                                                                                             
Greek                                                                                                                                                                                                                
Gujarati                                                                                                                                                                                                             
Gurmukhi                                                                                                                                                                                                             
HanS                                                                                                                                                                                                                 
HanS_vert                                                                                                                                                                                                            
HanT                                                                                                                                                                                                                 
HanT_vert                                                                                                                                                                                                            
Hangul                                                                                                                                                                                                               
Hangul_vert                                                                                                                                                                                                          
Hebrew                                                                                                                                                                                                               
Japanese                                                                                                                                                                                                             
Japanese_vert                                                                                                                                                                                                        
Kannada                                                                                                                                                                                                              
Khmer                                                                                                                                                                                                                
Lao                                                                                                                                                                                                                  
Latin                                                                                                                                                                                                                
Malayalam                                                                                                                                                                                                            
Myanmar                                                                                                                                                                                                              
Oriya                                                                                                                                                                                                                
Sinhala                                                                                                                                                                                                              
Syriac                                                                                                                                                                                                               
Tamil                                                                                                                                                                                                                
Telugu                                                                                                                                                                                                               
Thaana                                                                                                                                                                                                               
Thai                                                                                                                                                                                                                 
Tibetan                                                                                                                                                                                                              
Vietnamese                                                                                                                                                                                                           
afr                                                                                                                                                                                                                  
amh                                                                                                                                                                                                                  
ara                                                                                                                                                                                                                  
asm                                                                                                                                                                                                                  
aze                                                                                                                                                                                                                  
aze_cyrl                                                                                                                                                                                                             
bel                                                                                                                                                                                                                  
ben                                                                                                                                                                                                                  
bod                                                                                                                                                                                                                  
bos                                                                                                                                                                                                                  
bre                                                                                                                                                                                                                  
bul                                                                                                                                                                                                                  
cat                                                                                                                                                                                                                  
ceb                                                                                                                                                                                                                  
ces                                                                                                                                                                                                                  
chi_sim                                                                                                                                                                                                              
chi_sim_vert                                                                                                                                                                                                         
chi_tra                                                                                                                                                                                                              
chi_tra_vert                                                                                                                                                                                                         
chr                                                                                                                                                                                                                  
cos                                                                                                                                                                                                                  
cym                                                                                                                                                                                                                  
dan                                                                                                                                                                                                                  
deu                                                                                                                                                                                                                  
div                                                                                                                                                                                                                  
dzo                                                                                                                                                                                                                  
ell                                                                                                                                                                                                                  
eng                                                                                                                                                                                                                  
enm                                                                                                                                                                                                                  
epo                                                                                                                                                                                                                  
est                                                                                                                                                                                                                  
eus                                                                                                                                                                                                                  
fao                                                                                                                                                                                                                  
fas                                                                                                                                                                                                                  
fil                                                                                                                                                                                                                  
fin                                                                                                                                                                                                                  
fra                                                                                                                                                                                                                  
frk                                                                                                                                                                                                                  
frm                                                                                                                                                                                                                  
fry                                                                                                                                                                                                                  
gla                                                                                                                                                                                                                  
gle                                                                                                                                                                                                                  
glg                                                                                                                                                                                                                  
grc                                                                                                                                                                                                                  
guj                                                                                                                                                                                                                  
hat                                                                                                                                                                                                                  
heb                                                                                                                                                                                                                  
hin                                                                                                                                                                                                                  
hrv                                                                                                                                                                                                                  
hun                                                                                                                                                                                                                  
hye                                                                                                                                                                                                                  
iku                                                                                                                                                                                                                  
ind                                                                                                                                                                                                                  
isl                                                                                                                                                                                                                  
ita                                                                                                                                                                                                                  
ita_old                                                                                                                                                                                                              
jav                                                                                                                                                                                                                  
jpn                                                                                                                                                                                                                  
jpn_vert                                                                                                                                                                                                             
kan                                                                                                                                                                                                                  
kat                                                                                                                                                                                                                  
kat_old                                                                                                                                                                                                              
kaz                                                                                                                                                                                                                  
khm                                                                                                                                                                                                                  
kir                                                                                                                                                                                                                  
kmr                                                                                                                                                                                                                  
kor                                                                                                                                                                                                                  
kor_vert                                                                                                                                                                                                             
lao                                                                                                                                                                                                                  
lat                                                                                                                                                                                                                  
lav                                                                                                                                                                                                                  
lit                                                                                                                                                                                                                  
ltz                                                                                                                                                                                                                  
mal                                                                                                                                                                                                                  
mar                                                                                                                                                                                                                  
mkd                                                                                                                                                                                                                  
mlt                                                                                                                                                                                                                  
mon                                                                                                                                                                                                                  
mri                                                                                                                                                                                                                  
msa                                                                                                                                                                                                                  
mya                                                                                                                                                                                                                  
nep                                                                                                                                                                                                                  
nld                                                                                                                                                                                                                  
nor                                                                                                                                                                                                                  
oci                                                                                                                                                                                                                  
ori                                                                                                                                                                                                                  
osd                                                                                                                                                                                                                  
pan                                                                                                                                                                                                                  
pol                                                                                                                                                                                                                  
por                                                                                                                                                                                                                  
pus                                                                                                                                                                                                                  
que                                                                                                                                                                                                                  
ron                                                                                                                                                                                                                  
rus                                                                                                                                                                                                                  
san                                                                                                                                                                                                                  
sin                                                                                                                                                                                                                  
slk                                                                                                                                                                                                                  
slv                                                                                                                                                                                                                  
snd                                                                                                                                                                                                                  
spa                                                                                                                                                                                                                  
spa_old                                                                                                                                                                                                              
sqi                                                                                                                                                                                                                  
srp                                                                                                                                                                                                                  
srp_latn                                                                                                                                                                                                             
sun                                                                                                                                                                                                                  
swa                                                                                                                                                                                                                  
swe                                                                                                                                                                                                                  
syr                                                                                                                                                                                                                  
tam                                                                                                                                                                                                                  
tat                                                                                                                                                                                                                  
tel                                                                                                                                                                                                                  
tgk                                                                                                                                                                                                                  
tha                                                                                                                                                                                                                  
tir                                                                                                                                                                                                                  
ton                                                                                                                                                                                                                  
tur                                                                                                                                                                                                                  
uig                                                                                                                                                                                                                  
ukr                                                                                                                                                                                                                  
urd                                                                                                                                                                                                                  
uzb                                                                                                                                                                                                                  
uzb_cyrl                                                                                                                                                                                                             
vie                                                                                                                                                                                                                  
yid                                                                                                                                                                                                                  
yor                                                                                                                                                                                                                  
                                                                                                                                                                                                                     
pikepdf mmap enabled                                                                                                                                                                                   helpers.py:328
os.symlink(/home/matyas/tmp/outi2.pdf, /tmp/ocrmypdf.io.g8nswrgm/origin)                                                                                                                               helpers.py:179
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/origin, /tmp/ocrmypdf.io.g8nswrgm/origin.pdf)                                                                                                                     helpers.py:179
Gathering info with 1 thread workers                                                                                                                                                                      info.py:788
pikepdf mmap enabled                                                                                                                                                                                   helpers.py:328
Scanning contents     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3                                                                                                                                                            tesseract_ocr.py:199
pikepdf mmap enabled                                                                                                                                                                                   helpers.py:328
    1 skipping all processing on this page                                                                                                                                                           _pipeline.py:330
    1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0                                                                                                                  _graft.py:140
    1 Page rotation: (content, auto) -> page = (0, 0) -> 0                                                                                                                                              _graft.py:165
OCR                   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing...                                                                                                                                                                                          ocr.py:144
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/graft_layers.pdf, /tmp/ocrmypdf.io.g8nswrgm/fix_docinfo.pdf)                                                                                                      helpers.py:179
Running: ['gs', '--version']                                                                                                                                                                          __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR',   __init__.py:133
'-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=1', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.g8nswrgm/pdfa.ps',                          
'/tmp/ocrmypdf.io.g8nswrgm/fix_docinfo.pdf']                                                                                                                                                                         
GPL Ghostscript 9.55.0 (2021-09-27)                                                                                                                                                                   __init__.py:108
Copyright (C) 2021 Artifex Software, Inc.  All rights reserved.                                                                                                                                       __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:                                                                                                                            __init__.py:108
see the file COPYING for details.                                                                                                                                                                     __init__.py:108
Warning: the map file cidfmap was not found.                                                                                                                                                          __init__.py:108
Processing pages 1 through 1.                                                                                                                                                                         __init__.py:108
Page 1                                                                                                                                                                                                __init__.py:108
PDF/A conversion      ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Running: ['tesseract', '--version']                                                                                                                                                                   __init__.py:133
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata.                                                                  _metadata.py:62
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/mm/}InstanceID', '{http://www.aiim.org/pdfa/ns/extension/}schemas', '{http://ns.adobe.com/xap/1.0/mm/}VersionID',       _metadata.py:67
'{http://purl.org/dc/elements/1.1/}subject', '{http://ns.adobe.com/xap/1.0/}MetadataDate'}                                                                                                                           
Linearizing           ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 15: treating as an optimization candidate                                                                                                                                                        optimize.py:282
xref 14: treating as an optimization candidate                                                                                                                                                        optimize.py:282
XrefExt(xref=14, ext='.png')                                                                                                                                                                          optimize.py:347
XrefExt(xref=15, ext='.png')                                                                                                                                                                          optimize.py:347
Optimizable images: JPEGs: 0 PNGs: 2                                                                                                                                                                  optimize.py:352
Recompressing JPEGs   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
xref 15: treating as an optimization candidate                                                                                                                                                        optimize.py:282
xref 14: treating as an optimization candidate                                                                                                                                                        optimize.py:282
xref 14: marking this JPEG as deflatable                                                                                                                                                              optimize.py:547
xref 15: marking this JPEG as deflatable                                                                                                                                                              optimize.py:547
Deflating JPEGs       ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
xref 15: treating as an optimization candidate                                                                                                                                                        optimize.py:282
xref 14: treating as an optimization candidate                                                                                                                                                        optimize.py:282
xref 14: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                                                                                                               optimize.py:98
xref 15: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization                                                                                                               optimize.py:98
Optimizable images: JBIG2 groups: 0                                                                                                                                                                   optimize.py:363
JBIG2                 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━   0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/optimize.opt.pdf, /tmp/ocrmypdf.io.g8nswrgm/optimize.pdf)                                                                                                         helpers.py:179
Running: ['jbig2', '--version']                                                                                                                                                                       __init__.py:133
Running: ['pngquant', '--version']                                                                                                                                                                    __init__.py:133
Image optimization ratio: 1.02 savings: 2.1%                                                                                                                                                         _pipeline.py:989
Total file size ratio: 11.15 savings: 91.0%                                                                                                                                                          _pipeline.py:992
/tmp/ocrmypdf.io.g8nswrgm/optimize.pdf -> /home/matyas/tmp/out.pdf 
