Description
Describe the bug
Text in the output PDF is garbled after the transformation.
Before

After

Steps to reproduce
1. Run `ocrmypdf --skip-text --output-type pdfa-1 ~/tmp/input.pdf ~/tmp/output.pdf`
2. Open `output.pdf` in `evince` or `google-chrome`
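
The steps above can also be scripted, which makes it easier to compare the reported `pdfa-1` run against a run with `--output-type pdf`. As far as I understand, the `pdf` output type avoids the Ghostscript PDF/A conversion step, so it may help narrow down where the text changes. This is only a sketch: the helper names `build_commands` and `run_all` and the output file names are made up for illustration and are not part of OCRmyPDF.

```python
import shutil
import subprocess
from pathlib import Path


def build_commands(src: Path, out_dir: Path) -> dict[str, list[str]]:
    """Build the reported command plus a variant with plain PDF output.

    The output file names (output-<type>.pdf) are an assumption made for
    this sketch so the two results can be opened side by side.
    """
    commands = {}
    for output_type in ("pdfa-1", "pdf"):
        target = out_dir / f"output-{output_type}.pdf"
        commands[output_type] = [
            "ocrmypdf",
            "--skip-text",
            "--output-type",
            output_type,
            str(src),
            str(target),
        ]
    return commands


def run_all(commands: dict[str, list[str]]) -> None:
    """Run each command if ocrmypdf is on PATH and report its exit code."""
    if shutil.which("ocrmypdf") is None:
        print("ocrmypdf not found on PATH; commands were not run")
        return
    for name, argv in commands.items():
        result = subprocess.run(argv, capture_output=True, text=True)
        print(f"{name}: exit code {result.returncode}")
```

For example, `run_all(build_commands(Path.home() / "tmp" / "input.pdf", Path.home() / "tmp"))` would write both variants next to the input, after which each can be opened in `evince` or `google-chrome` as in step 2.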
Files
How did you download and install the software?
Linux package manager (apt, dnf, etc.)
OCRmyPDF version
v16.10.4 (the problem was also present in 13.x.x)
Relevant log output
ocrmypdf 0.0.0 __main__.py:59
Running: ['tesseract', '--version'] __init__.py:133
Found tesseract 4.1.1 __init__.py:343
Running: ['tesseract', '--version'] __init__.py:133
Running: ['tesseract', '--version'] __init__.py:133
Running: ['gs', '--version'] __init__.py:133
Found gs 9.55.0 __init__.py:343
Running: ['gs', '--version'] __init__.py:133
Running: ['tesseract', '--list-langs'] __init__.py:133
stdout/stderr = List of available languages (161): __init__.py:73
Arabic
Armenian
Bengali
Canadian_Aboriginal
Cherokee
Cyrillic
Devanagari
Ethiopic
Fraktur
Georgian
Greek
Gujarati
Gurmukhi
HanS
HanS_vert
HanT
HanT_vert
Hangul
Hangul_vert
Hebrew
Japanese
Japanese_vert
Kannada
Khmer
Lao
Latin
Malayalam
Myanmar
Oriya
Sinhala
Syriac
Tamil
Telugu
Thaana
Thai
Tibetan
Vietnamese
afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bre
bul
cat
ceb
ces
chi_sim
chi_sim_vert
chi_tra
chi_tra_vert
chr
cos
cym
dan
deu
div
dzo
ell
eng
enm
epo
est
eus
fao
fas
fil
fin
fra
frk
frm
fry
gla
gle
glg
grc
guj
hat
heb
hin
hrv
hun
hye
iku
ind
isl
ita
ita_old
jav
jpn
jpn_vert
kan
kat
kat_old
kaz
khm
kir
kmr
kor
kor_vert
lao
lat
lav
lit
ltz
mal
mar
mkd
mlt
mon
mri
msa
mya
nep
nld
nor
oci
ori
osd
pan
pol
por
pus
que
ron
rus
san
sin
slk
slv
snd
spa
spa_old
sqi
srp
srp_latn
sun
swa
swe
syr
tam
tat
tel
tgk
tha
tir
ton
tur
uig
ukr
urd
uzb
uzb_cyrl
vie
yid
yor
pikepdf mmap enabled helpers.py:328
os.symlink(/home/matyas/tmp/outi2.pdf, /tmp/ocrmypdf.io.g8nswrgm/origin) helpers.py:179
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/origin, /tmp/ocrmypdf.io.g8nswrgm/origin.pdf) helpers.py:179
Gathering info with 1 thread workers info.py:788
pikepdf mmap enabled helpers.py:328
Scanning contents ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Using Tesseract OpenMP thread limit 3 tesseract_ocr.py:199
pikepdf mmap enabled helpers.py:328
1 skipping all processing on this page _pipeline.py:330
1 Text rotation: (text, autorotate, content) -> text misalignment = (0, 0, 0) -> 0 _graft.py:140
1 Page rotation: (content, auto) -> page = (0, 0) -> 0 _graft.py:165
OCR ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Postprocessing... ocr.py:144
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/graft_layers.pdf, /tmp/ocrmypdf.io.g8nswrgm/fix_docinfo.pdf) helpers.py:179
Running: ['gs', '--version'] __init__.py:133
Running: ['gs', '-dBATCH', '-dNOPAUSE', '-dSAFER', '-dCompatibilityLevel=1.6', '-sDEVICE=pdfwrite', '-dAutoRotatePages=/None', '-sColorConversionStrategy=LeaveColorUnchanged', '-dPDFSTOPONERROR', '-dAutoFilterColorImages=true', '-dAutoFilterGrayImages=true', '-dJPEGQ=95', '-dPDFA=1', '-dPDFACompatibilityPolicy=1', '-o', '-', '-sstdout=%stderr', '/tmp/ocrmypdf.io.g8nswrgm/pdfa.ps', '/tmp/ocrmypdf.io.g8nswrgm/fix_docinfo.pdf'] __init__.py:133
GPL Ghostscript 9.55.0 (2021-09-27) __init__.py:108
Copyright (C) 2021 Artifex Software, Inc. All rights reserved. __init__.py:108
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY: __init__.py:108
see the file COPYING for details. __init__.py:108
Warning: the map file cidfmap was not found. __init__.py:108
Processing pages 1 through 1. __init__.py:108
Page 1 __init__.py:108
PDF/A conversion ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 1/1 0:00:00
Running: ['tesseract', '--version'] __init__.py:133
Some input metadata could not be copied because it is not permitted in PDF/A. You may wish to examine the output PDF's XMP metadata. _metadata.py:62
The following metadata fields were not copied: {'{http://ns.adobe.com/xap/1.0/mm/}InstanceID', '{http://www.aiim.org/pdfa/ns/extension/}schemas', '{http://ns.adobe.com/xap/1.0/mm/}VersionID', '{http://purl.org/dc/elements/1.1/}subject', '{http://ns.adobe.com/xap/1.0/}MetadataDate'} _metadata.py:67
Linearizing ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 100/100 0:00:00
xref 15: treating as an optimization candidate optimize.py:282
xref 14: treating as an optimization candidate optimize.py:282
XrefExt(xref=14, ext='.png') optimize.py:347
XrefExt(xref=15, ext='.png') optimize.py:347
Optimizable images: JPEGs: 0 PNGs: 2 optimize.py:352
Recompressing JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
xref 15: treating as an optimization candidate optimize.py:282
xref 14: treating as an optimization candidate optimize.py:282
xref 14: marking this JPEG as deflatable optimize.py:547
xref 15: marking this JPEG as deflatable optimize.py:547
Deflating JPEGs ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 2/2 0:00:00
xref 15: treating as an optimization candidate optimize.py:282
xref 14: treating as an optimization candidate optimize.py:282
xref 14: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization optimize.py:98
xref 15: found image compressed as /FlateDecode /DCTDecode, marked for JPEG optimization optimize.py:98
Optimizable images: JBIG2 groups: 0 optimize.py:363
JBIG2 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0% 0/0 -:--:--
os.symlink(/tmp/ocrmypdf.io.g8nswrgm/optimize.opt.pdf, /tmp/ocrmypdf.io.g8nswrgm/optimize.pdf) helpers.py:179
Running: ['jbig2', '--version'] __init__.py:133
Running: ['pngquant', '--version'] __init__.py:133
Image optimization ratio: 1.02 savings: 2.1% _pipeline.py:989
Total file size ratio: 11.15 savings: 91.0% _pipeline.py:992
/tmp/ocrmypdf.io.g8nswrgm/optimize.pdf -> /home/matyas/tmp/out.pdf