Segfault with certain language packs on certain images #4146

Open
cchadowitz opened this issue Oct 19, 2023 · 9 comments

@cchadowitz

cchadowitz commented Oct 19, 2023

Current Behavior

When running the tesseract binary with specific language packs enabled, the binary segfaults (different places for different combinations).

  1. Using the chi_sim and the mrz language packs together for this input image: chi_sim-mrz-segfault
root@78ce6cbf2cc5:/opt/tesseract/bin# gdb ./tesseract 
GNU gdb (Ubuntu 14.0.50.20230907-0ubuntu1) 14.0.50.20230907-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tesseract...
(No debugging symbols found in ./tesseract)
(gdb) run sample_012631.jpg - --tessdata-dir <snip>/tessdata/ -l chi_sim+mrz
Starting program: /opt/tesseract/bin/tesseract sample_012631.jpg - --tessdata-dir <snip>/tessdata/ -l chi_sim+mrz
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 548
Detected 108 diacritics

Program received signal SIGSEGV, Segmentation fault.
0x00005619906db4e8 in tesseract::ELIST_ITERATOR::forward() ()
(gdb) bt
#0  0x00005619906db4e8 in tesseract::ELIST_ITERATOR::forward() ()
#1  0x00005619906bc1ff in tesseract::PAGE_RES_IT::ReplaceCurrentWord(tesseract::PointerVector<tesseract::WERD_RES>*) ()
#2  0x000056199074ea84 in tesseract::Tesseract::classify_word_and_language(int, tesseract::PAGE_RES_IT*, tesseract::WordData*) ()
#3  0x0000561990752adc in tesseract::Tesseract::RecogAllWordsPassN(int, tesseract::ETEXT_DESC*, tesseract::PAGE_RES_IT*, std::vector<tesseract::WordData, std::allocator<tesseract::WordData> >*) ()
#4  0x00005619907537ba in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) ()
#5  0x0000561990719b46 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
#6  0x000056199071a3d2 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) ()
#7  0x000056199071b4ea in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#8  0x000056199071bb92 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#9  0x00005619906a6331 in main ()
  2. Using the eng, ara, and ocrb_int language packs together for this input image: eng-ara-orcb_int-segfault
root@78ce6cbf2cc5:/opt/tesseract/bin# gdb ./tesseract 
GNU gdb (Ubuntu 14.0.50.20230907-0ubuntu1) 14.0.50.20230907-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./tesseract...
(No debugging symbols found in ./tesseract)
(gdb) run sample_000111.jpg - --tessdata-dir <snip>/tessdata/ -l eng+ara+ocrb_int
Starting program: /opt/tesseract/bin/tesseract sample_000111.jpg - --tessdata-dir <snip>/tessdata/ -l eng+ara+ocrb_int
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Estimating resolution as 975

Program received signal SIGSEGV, Segmentation fault.
0x000055e53cf153c6 in tesseract::PAGE_RES_IT::DeleteCurrentWord() ()
(gdb) bt
#0  0x000055e53cf153c6 in tesseract::PAGE_RES_IT::DeleteCurrentWord() ()
#1  0x000055e53cfaf5a8 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) ()
#2  0x000055e53cf75b46 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
#3  0x000055e53cf763d2 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) ()
#4  0x000055e53cf774ea in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#5  0x000055e53cf77b92 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#6  0x000055e53cf02331 in main ()
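
Since the crash depends on which language packs are combined, one way to narrow it down is to try every sub-combination of the failing -l value. Below is a minimal C++ sketch of that enumeration. The helper name language_subsets is hypothetical and not part of tesseract; it only builds the strings to pass to -l and does not run anything.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not a tesseract API): build every non-empty
// "+"-joined subset of the given language names, keeping the original
// order of the names inside each subset.
std::vector<std::string> language_subsets(const std::vector<std::string>& langs) {
    std::vector<std::string> result;
    const std::size_t n = langs.size();
    // Each bit of `mask` selects one language; mask 0 (empty set) is skipped.
    for (std::size_t mask = 1; mask < (std::size_t{1} << n); ++mask) {
        std::string joined;
        for (std::size_t i = 0; i < n; ++i) {
            if (mask & (std::size_t{1} << i)) {
                if (!joined.empty()) joined += '+';
                joined += langs[i];
            }
        }
        result.push_back(joined);
    }
    return result;
}
```

For {"eng", "ara", "ocrb_int"} this gives seven values, from eng up to eng+ara+ocrb_int, and each one can be tried as tesseract sample_000111.jpg - -l followed by that value.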

Expected Behavior

No segfaults.

Suggested Fix

No response

tesseract -v

root@78ce6cbf2cc5:/opt/tesseract/bin# ./tesseract -v
tesseract 5.3.3-1-gdc22
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.2) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.4.0 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.2
 Found libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh/0.10.5/openssl/zlib nghttp2/1.55.1 librtmp/2.3 OpenLDAP/2.6.6

Operating System

No response

Other Operating System

Ubuntu 23.10

uname -a

root@78ce6cbf2cc5:/opt/tesseract/bin# uname -a
Linux 78ce6cbf2cc5 5.4.0-150-generic #167~18.04.1-Ubuntu SMP Wed May 24 00:51:42 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Compiler

GCC 13.2.0

CPU

Intel Xeon CPU E5-2609 v4 @ 1.70GHz x16

Virtualization / Containers

I'm building and running tesseract inside a Docker (v24.0.2) container, which runs the OS and compiler versions listed above.

Other Information

No response

@zdenop
Contributor

zdenop commented Jan 13, 2024

Similar to #4148 (but the crash is in a different place):

  1. You should be able to demonstrate a problem with official data only, and it looks like you use custom-generated languages (mrz, ocrb_int).

@stweil
Contributor

stweil commented Jan 13, 2024

@cchadowitz, please provide sample_012631.jpg, mrz.traineddata and ocrb_int.traineddata. Those files are needed to reproduce the issue.

@zdenop
Contributor

zdenop commented Jan 13, 2024

@stweil: it seems that ocrb_int is @Shreeshrii's training from https://github.com/Shreeshrii/tessdata_ocrb and mrz is from https://github.com/DoubangoTelecom/tesseractMRZ

@stweil
Contributor

stweil commented Jan 14, 2024

I can reproduce one of the segmentation faults:

tesseract sample_000111.jpg - -l tessdata_fast/eng+tessdata_fast/ara+ocrb_int
Estimating resolution as 975
Segmentation fault

The segmentation fault does not occur with eng and ara from tessdata_best or tessdata.

But I got another issue with a non-existent second language model (orcb_int instead of ocrb_int):

tesseract sample_000111.jpg - -l ara+orcb_int
Error: LSTM requested, but not present!! Loading tesseract.
mgr->GetComponent(TESSDATA_INTTEMP, &fp):Error:Assert failed:in file ../../../src/classify/adaptmatch.cpp, line 539
Aborted
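
The two runs above end differently: the first is killed by a segmentation fault, the second by an abort after a failed assertion. When scripting many runs it can help to tell these apart automatically. Here is a minimal POSIX-only sketch, assuming a hypothetical helper run_and_classify (not part of tesseract) that starts a command and reports how it terminated.

```cpp
#include <csignal>
#include <string>
#include <sys/wait.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: run args[0] with the remaining arguments and describe
// how the child process ended: "SIGSEGV", "SIGABRT", "signal N" or "exit N".
std::string run_and_classify(const std::vector<std::string>& args) {
    if (args.empty()) return "no command";

    // execvp wants a null-terminated array of char*.
    std::vector<char*> argv;
    for (const std::string& a : args) argv.push_back(const_cast<char*>(a.c_str()));
    argv.push_back(nullptr);

    const pid_t pid = fork();
    if (pid < 0) return "fork failed";
    if (pid == 0) {
        execvp(argv[0], argv.data());
        _exit(127);  // only reached if exec itself failed
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return "waitpid failed";
    if (WIFSIGNALED(status)) {
        const int sig = WTERMSIG(status);
        if (sig == SIGSEGV) return "SIGSEGV";  // segmentation fault
        if (sig == SIGABRT) return "SIGABRT";  // abort(), e.g. failed assert
        return "signal " + std::to_string(sig);
    }
    if (WIFEXITED(status)) return "exit " + std::to_string(WEXITSTATUS(status));
    return "unknown";
}
```

A call such as run_and_classify({"tesseract", "sample_000111.jpg", "-", "-l", "ara+orcb_int"}) would then separate the assertion failure from a real segfault without parsing localized shell messages.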

stweil added the bug label Jan 14, 2024
@cchadowitz
Author

cchadowitz commented Jan 17, 2024

Thanks for taking a look!

@stweil looks like you were able to reproduce, but just FYI, the images I referenced are the same as the attached ones (GitHub just changed the filenames). I shared the example run after the image for each example, so the first image is the referenced sample_012631.jpg and the second is sample_000111.jpg. And @zdenop was correct about the sources of the language packs (I originally found them via https://tesseract-ocr.github.io/tessdoc/Data-Files-Contributions.html).

Anyway, let me know if I can provide anything else in the meantime.

@farfromrefug

I am getting a lot of reports of crashes in recog_all_words (Android app). I am trying to get better reports with exact call stacks.

@marcreichman-pfi

@stweil are you able to reproduce the segfaults from this issue and from #4148 against 5.3.4 or current code? We tested against the 5.3.4 package distributed with Ubuntu 24.04 and can no longer reproduce them. However, I cannot figure out what in the branch diffs would have changed this. If you have a moment to go over it, I'd love to know your thoughts.

@stweil
Contributor

stweil commented Nov 8, 2024

This bug still occurs with the latest code from the main branch. It should be fixed before the next release is tagged.

@stweil
Contributor

stweil commented Nov 8, 2024

Call stack with debug code (which stops earlier):

(gdb) i s
#0  tesseract::ERRCODE::error (this=0x5555559d03a0 <tesseract::STILL_LINKED>, caller=0x55555585bdc0 "ELIST2_ITERATOR::add_before_stay_put", action=tesseract::ABORT, format=0x0)
    at ../../../src/ccutil/errcode.cpp:78
#1  0x0000555555799340 in tesseract::ERRCODE::error (this=0x5555559d03a0 <tesseract::STILL_LINKED>, caller=0x55555585bdc0 "ELIST2_ITERATOR::add_before_stay_put", action=tesseract::ABORT)
    at ../../../src/ccutil/errcode.cpp:90
#2  0x00005555555eb919 in tesseract::ELIST2_ITERATOR::add_before_stay_put (this=0x7fffffffd5b0, new_element=0x555556ff5420) at ../../../src/ccutil/elst2.h:462
#3  0x0000555555694bf1 in tesseract::PAGE_RES_IT::ReplaceCurrentWord (this=0x7fffffffd860, words=0x7fffffffd6a0) at ../../../src/ccstruct/pageres.cpp:1469
#4  0x00005555555f9be2 in tesseract::Tesseract::classify_word_and_language (this=0x7ffff41b5010, pass_n=1, pr_it=0x7fffffffd860, word_data=0x555557006720) at ../../../src/ccmain/control.cpp:1367
#5  0x00005555555f441a in tesseract::Tesseract::RecogAllWordsPassN (this=0x7ffff41b5010, pass_n=1, monitor=0x0, pr_it=0x7fffffffd860, words=0x7fffffffd840) at ../../../src/ccmain/control.cpp:255
#6  0x00005555555f4876 in tesseract::Tesseract::recog_all_words (this=0x7ffff41b5010, page_res=0x555557001230, monitor=0x0, target_word_box=0x0, word_config=0x0, dopasses=0)
    at ../../../src/ccmain/control.cpp:345
#7  0x0000555555597cb0 in tesseract::TessBaseAPI::Recognize (this=0x7fffffffdf90, monitor=0x0) at ../../../src/api/baseapi.cpp:833
#8  0x0000555555599285 in tesseract::TessBaseAPI::ProcessPage (this=0x7fffffffdf90, pix=0x555556ed5130, page_index=0, filename=0x7fffffffe53e "sample_000111.jpg", retry_config=0x0, timeout_millisec=0, 
    renderer=0x555555a2a0d0) at ../../../src/api/baseapi.cpp:1218
#9  0x0000555555599003 in tesseract::TessBaseAPI::ProcessPagesInternal (this=0x7fffffffdf90, filename=0x7fffffffe53e "sample_000111.jpg", retry_config=0x0, timeout_millisec=0, renderer=0x555555a2a0d0)
    at ../../../src/api/baseapi.cpp:1181
#10 0x00005555555984d8 in tesseract::TessBaseAPI::ProcessPages (this=0x7fffffffdf90, filename=0x7fffffffe53e "sample_000111.jpg", retry_config=0x0, timeout_millisec=0, renderer=0x555555a2a0d0)
    at ../../../src/api/baseapi.cpp:998
#11 0x000055555558ab1f in main (argc=5, argv=0x7fffffffe248) at ../../../src/tesseract.cpp:868
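
The debug build stops in add_before_stay_put with STILL_LINKED, which suggests that an element was inserted into a list while it still belonged to one. Whether that is the actual root cause here is an assumption on my part. The toy C++ model below is not Tesseract's ELIST2 code, and the names Link and ToyList are made up. It only illustrates why such a check fires early, and how skipping it can leave a list that a later forward walk cannot traverse correctly.

```cpp
#include <stdexcept>

// Toy intrusive link; an element counts as "linked" while next != nullptr.
struct Link {
    Link* prev = nullptr;
    Link* next = nullptr;
};

// Toy circular doubly linked list with a sentinel node (illustrative only).
class ToyList {
public:
    ToyList() { head_.prev = head_.next = &head_; }
    ToyList(const ToyList&) = delete;
    ToyList& operator=(const ToyList&) = delete;

    // Append e at the tail. With checked == true, refuse an element that is
    // still part of a list (the analogue of the STILL_LINKED error above).
    void push_back(Link* e, bool checked = true) {
        if (checked && e->next != nullptr) {
            throw std::logic_error("STILL_LINKED");
        }
        e->prev = head_.prev;
        e->next = &head_;
        head_.prev->next = e;
        head_.prev = e;
    }

    // Unlink e and clear its pointers so it may be inserted elsewhere.
    static void extract(Link* e) {
        e->prev->next = e->next;
        e->next->prev = e->prev;
        e->prev = e->next = nullptr;
    }

    // Walk forward from the sentinel and count elements, giving up after
    // `limit` steps so a corrupted list cannot loop forever.
    int size(int limit = 1000) const {
        int n = 0;
        for (const Link* p = head_.next; p != &head_ && n < limit; p = p->next) ++n;
        return n;
    }

private:
    Link head_;
};
```

With the check enabled, pushing an element that is still in another list throws immediately, much like the debug trace. With checked set to false, the first list's forward walk no longer gets back to its own sentinel, which is the kind of damage that could surface later in an iterator's forward step in a release build.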
