--user-patterns can cause assertion failure in UNICHARSET::get_isalpha #4425

Open
krumelmonster opened this issue Jun 3, 2025 · 2 comments

krumelmonster commented Jun 3, 2025

Current Behavior

tesseract -l eng --user-patterns patterns.txt in.png out.txt hocr txt causes an assertion failure, but only on one specific document page, and regardless of the contents of patterns.txt.

The image is OCR'd successfully without --user-patterns, even when --user-words is used.

I cannot share the image.

It is reproducible, and I have core dumps working in GDB; details below.

Expected Behavior

tesseract should work on any valid PNG image, whether or not a patterns file is used.

Suggested Fix

No response

tesseract -v

tesseract 5.5.1
 leptonica-1.85.0
  libgif 5.2.2 : libjpeg 8d (libjpeg-turbo 3.0.4) : libpng 1.6.48 : libtiff 4.7.0 : zlib 1.3.1 : libwebp 1.5.0 : libopenjp2 2.5.3
 Found AVX
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.8.0 zlib/1.3.1 liblzma/5.8.1 bz2lib/1.0.8 liblz4/1.10.0 libzstd/1.5.7 openssl/3.5.0 libb2/bundled libacl/2.3.2 libattr/2.3.2
 Found libcurl/8.13.0 OpenSSL/3.5.0 zlib/1.3.1 brotli/1.1.0 zstd/1.5.7 libidn2/2.3.7 libpsl/0.21.5 libssh2/1.11.1 nghttp2/1.65.0 nghttp3/1.9.0

Operating System

No response

Other Operating System

Arch Linux with tesseract system package 5.5.1-1

uname -a

6.14.7-arch2-1 #1 SMP PREEMPT_DYNAMIC Thu, 22 May 2025 05:37:49 +0000 x86_64 GNU/Linux

Compiler

No response

CPU

Intel Core i7-3520M CPU @ 2.90GHz

Virtualization / Containers

No response

Other Information

tesseract -l eng --user-patterns ocrpat /tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png /tmp/ocrmypdf.io.orgi4dfg/000007_ocr_hocr hocr txt

contains_unichar_id(unichar_id):Error:Assert failed:in file ./src/ccutil/unicharset.h, line 501
zsh: IOT instruction (core dumped)  tesseract -l eng --user-patterns ocrpat   hocr txt
(gdb) bt
#0  __pthread_kill_implementation (threadid=<optimized out>, signo=signo@entry=6, no_tid=no_tid@entry=0) at pthread_kill.c:44
#1  0x00007f03e0faf813 in __pthread_kill_internal (threadid=<optimized out>, signo=6) at pthread_kill.c:89
#2  0x00007f03e0f55dc0 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#3  0x00007f03e0f3d57a in __GI_abort () at abort.c:73
#4  0x00007f03e1a8cfe2 in tesseract::ERRCODE::error (this=<optimized out>, caller=<optimized out>, action=tesseract::ABORT, format=<optimized out>)
    at src/ccutil/errcode.cpp:83
#5  0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
#6  tesseract::Trie::unichar_id_to_patterns (this=0x55e2616ea8b0, unichar_id=216, unicharset=..., vec=0x7fffbbc5d9c0) at src/dict/trie.cpp:351
#7  0x00007f03e1ba2b3e in tesseract::Dict::ProcessPatternEdges (this=this@entry=0x55e261b5b120, dawg=dawg@entry=0x55e2616ea8b0, pos=...,
    unichar_id=unichar_id@entry=216, word_end=word_end@entry=true, dawg_args=dawg_args@entry=0x7fffbbc5db50, curr_perm=0x7fffbbc5da9c) at src/dict/dict.cpp:579
#8  0x00007f03e1ba471c in tesseract::Dict::def_letter_is_okay (this=0x55e261b5b120, void_dawg_args=<optimized out>, unicharset=..., unichar_id=216, word_end=true)
    at src/dict/dict.cpp:519
#9  0x00007f03e1ba8dca in tesseract::Dict::valid_word (this=0x55e261b5b120, word=..., numbers_ok=false) at ./src/ccstruct/ratngs.h:282
#10 0x00007f03e1b32b8f in tesseract::Tesseract::recog_word (this=0x7f03e1dc2010, word=0x55e262943e00) at src/ccmain/tfacepp.cpp:63
#11 0x00007f03e1b32e77 in tesseract::Tesseract::tess_segment_pass_n (this=0x7f03e1dc2010, pass_n=<optimized out>, word=0x55e262943e00) at src/ccmain/tessbox.cpp:47
#12 0x00007f03e1adcb42 in tesseract::Tesseract::match_word_pass_n (this=0x7f03e1dc2010, pass_n=1, word=0x55e262943e00, row=0x55e2628c71e0, block=<optimized out>)
    at src/ccmain/control.cpp:1600
#13 0x00007f03e1adccf2 in tesseract::Tesseract::classify_word_pass1 (this=0x7f03e1dc2010, word_data=..., in_word=0x55e262943fe0, out_words=<optimized out>)
    at src/ccmain/control.cpp:1420
#14 0x00007f03e1add21f in tesseract::Tesseract::RetryWithLanguage (this=0x7f03e1dc2010, word_data=...,
    recognizer=(void (tesseract::Tesseract::*)(tesseract::Tesseract * const, const tesseract::WordData &, tesseract::WERD_RES **, tesseract::PointerVector<tesseract::WERD_RES> *)) 0x7f03e1adcc80 <tesseract::Tesseract::classify_word_pass1(tesseract::WordData const&, tesseract::WERD_RES**, tesseract::PointerVector<tesseract::WERD_RES>*)>,
    debug=debug@entry=false, in_word=0x55e262943fe0, best_words=0x7fffbbc5df60) at src/ccmain/control.cpp:883
#15 0x00007f03e1ade0a5 in tesseract::Tesseract::classify_word_and_language (this=0x7f03e1dc2010, pass_n=<optimized out>, pr_it=0x7fffbbc5e0e0, word_data=0x55e262997870)
    at ./src/ccutil/genericvector.h:510
#16 0x00007f03e1ad8b2d in tesseract::Tesseract::RecogAllWordsPassN (this=0x7f03e1dc2010, pass_n=1, monitor=0x0, pr_it=0x7fffbbc5e0e0, words=0x7fffbbc5e0c0)
    at src/ccmain/control.cpp:255
#17 0x00007f03e1ae439f in tesseract::Tesseract::recog_all_words (this=0x7f03e1dc2010, page_res=0x55e2628d6ba0, monitor=0x0, target_word_box=0x0, word_config=0x0,
    dopasses=0) at src/ccmain/control.cpp:345
#18 0x00007f03e1a9e45a in tesseract::TessBaseAPI::Recognize (this=this@entry=0x7fffbbc5e980, monitor=monitor@entry=0x0) at src/api/baseapi.cpp:832
#19 0x00007f03e1aa1b73 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffbbc5e980, pix=0x55e2628952c0, page_index=0, filename=<optimized out>, retry_config=0x0,
    timeout_millisec=<optimized out>, renderer=0x55e262895250) at src/api/baseapi.cpp:1217
#20 0x00007f03e1aa2fcc in tesseract::TessBaseAPI::ProcessPagesInternal (this=this@entry=0x7fffbbc5e980,
    filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=0x55e262895250) at src/api/baseapi.cpp:1180
#21 0x00007f03e1aa3236 in tesseract::TessBaseAPI::ProcessPages (this=this@entry=0x7fffbbc5e980,
    filename=filename@entry=0x7fffbbc5f75b "/tmp/ocrmypdf.io.orgi4dfg/000007_ocr.png", retry_config=retry_config@entry=0x0, timeout_millisec=timeout_millisec@entry=0,
    renderer=<optimized out>) at src/api/baseapi.cpp:997
#22 0x000055e242442939 in main1 (argc=<optimized out>, argv=<optimized out>) at /usr/include/c++/15.1.1/bits/unique_ptr.h:193
#23 0x000055e24243f582 in main (argc=<optimized out>, argv=<optimized out>) at src/tesseract.cpp:858
(gdb) f 5
#5  0x00007f03e1baf003 in tesseract::UNICHARSET::get_isalpha (this=0x55e2617109c0, unichar_id=216) at ./src/ccutil/unicharset.h:501
501	    ASSERT_HOST(contains_unichar_id(unichar_id));
(gdb) l
496	  // Return the isalpha property of the given unichar.
497	  bool get_isalpha(UNICHAR_ID unichar_id) const {
498	    if (INVALID_UNICHAR_ID == unichar_id) {
499	      return false;
500	    }
501	    ASSERT_HOST(contains_unichar_id(unichar_id));
502	    return unichars[unichar_id].properties.isalpha;
503	  }
504	
505	  // Return the islower property of the given unichar.
(gdb) p unichar_id
$2 = 216
(gdb)
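
To make the failing check easier to read, here is a toy Python model of the listing at lines 497-503. It is only a sketch: `ToyUnicharset`, the range check inside `contains_unichar_id`, and the `-1` sentinel are assumptions for illustration, not Tesseract's actual code. It shows how an id such as 216 trips the assertion when the character set it is looked up in has fewer entries than that.

```python
INVALID_UNICHAR_ID = -1  # assumption: sentinel value used only in this toy model


class ToyUnicharset:
    """Hypothetical stand-in for a character set; not Tesseract's UNICHARSET."""

    def __init__(self, unichars):
        self.unichars = list(unichars)

    def contains_unichar_id(self, unichar_id):
        # Assumed semantics: an id is valid if it indexes an existing entry.
        return 0 <= unichar_id < len(self.unichars)

    def get_isalpha(self, unichar_id):
        # Mirrors the control flow shown in the gdb listing.
        if unichar_id == INVALID_UNICHAR_ID:
            return False
        if not self.contains_unichar_id(unichar_id):
            raise AssertionError("contains_unichar_id(unichar_id)")
        return self.unichars[unichar_id].isalpha()


small_set = ToyUnicharset("abc1")
print(small_set.get_isalpha(0))   # valid id, alphabetic entry
print(small_set.get_isalpha(-1))  # invalid id is tolerated
try:
    small_set.get_isalpha(216)    # id outside this set
except AssertionError as err:
    print("Assert failed:", err)
```

If this model is close to the real behaviour, the interesting question is which character set id 216 came from, and which one it is being checked against in frame 6.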
Member

stweil commented Jun 3, 2025

I'm afraid we cannot do anything here unless there is some way for a Tesseract developer to reproduce this.
Can you share the image that triggers this assertion in a personal e-mail? If that is not possible, you will have to find a solution yourself.

@krumelmonster
Author

@stweil I believe this is triggered by a certain Unicode character or string that is (mis-)recognized by the OCR engine. So I guess it isn't very specific to the image, which I cannot share even privately.

I can look into this further myself and, if needed, probably craft an image that triggers the same issue, either by slicing my image and seeing which slice causes the issue, or by finding out which assumed character(s) cause it and then crafting an image containing just those.
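
A rough sketch of the slicing approach, in case it is useful to anyone hitting the same assertion. It cuts the page into horizontal strips and reruns the command line from the report on each one. Assumptions: Pillow is installed, `tesseract` is on the PATH, and the helper names (`strip_boxes`, `crashes`, `find_crashing_strips`) plus the default strip count and overlap are made up for this example.

```python
import signal
import subprocess
import tempfile
from pathlib import Path


def strip_boxes(width, height, n_strips, overlap=0):
    """Return (left, upper, right, lower) crop boxes for horizontal strips."""
    step = height / n_strips
    boxes = []
    for i in range(n_strips):
        upper = max(0, int(i * step) - overlap)
        lower = min(height, int((i + 1) * step) + overlap)
        boxes.append((0, upper, width, lower))
    return boxes


def crashes(png_path, patterns_path):
    """Run the reported command; True if tesseract was killed by SIGABRT."""
    with tempfile.TemporaryDirectory() as tmp:
        result = subprocess.run(
            ["tesseract", "-l", "eng", "--user-patterns", str(patterns_path),
             str(png_path), str(Path(tmp) / "out"), "hocr", "txt"],
            capture_output=True,
        )
    # On POSIX, a negative return code is the number of the killing signal.
    return result.returncode == -signal.SIGABRT


def find_crashing_strips(image_path, patterns_path, n_strips=8, overlap=20):
    """Crop each strip to a temporary PNG and collect the boxes that abort."""
    from PIL import Image  # assumption: Pillow is available

    hits = []
    with Image.open(image_path) as img, tempfile.TemporaryDirectory() as tmp:
        boxes = strip_boxes(img.width, img.height, n_strips, overlap)
        for i, box in enumerate(boxes):
            strip_path = Path(tmp) / f"strip_{i}.png"
            img.crop(box).save(strip_path)
            if crashes(strip_path, patterns_path):
                hits.append(box)
    return hits
```

Calling `find_crashing_strips("000007_ocr.png", "ocrpat")` would return the crop boxes of the strips that still abort. A strip boundary can cut through a text line, and layout analysis may behave differently on a cropped image, so the crash might not reproduce in any strip; the overlap is there to reduce that risk, not to eliminate it.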

But I'll wait a bit before putting in this effort. Maybe someone recognizes the issue from this report alone, or with the help of one more GDB query.
