Assert fail in src/ccstruct/pageres.cpp, line 1502 with specific image and language combination #4148

Open
marcreichman-pfi opened this issue Oct 23, 2023 · 8 comments

marcreichman-pfi commented Oct 23, 2023

Current Behavior

When running this command line:

$ tesseract sample_013741.jpg - --tessdata-dir <snip>/tessdata -l ara+ocrb_int

The following occurs:

Estimating resolution as 303
!w_it.cycled_list():Error:Assert failed:in file src/ccstruct/pageres.cpp, line 1502
Aborted

This was first detected through API usage in an internal app, but it is reproducible with the tesseract command line.
[Attached image: sample_013741]

Backtrace:

__pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
44	./nptl/pthread_kill.c: No such file or directory.
(gdb) bt
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=<optimized out>) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=<optimized out>, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f14a8257866 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f14a823b8b7 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055f2706dac44 in tesseract::ERRCODE::error(char const*, tesseract::TessErrorLogCode, char const*, ...) const [clone .cold] ()
#6  0x000055f270703676 in tesseract::PAGE_RES_IT::DeleteCurrentWord() ()
#7  0x000055f27079d5a8 in tesseract::Tesseract::recog_all_words(tesseract::PAGE_RES*, tesseract::ETEXT_DESC*, tesseract::TBOX const*, char const*, int) ()
#8  0x000055f270763b46 in tesseract::TessBaseAPI::Recognize(tesseract::ETEXT_DESC*) ()
#9  0x000055f2707643d2 in tesseract::TessBaseAPI::ProcessPage(Pix*, int, char const*, char const*, int, tesseract::TessResultRenderer*) ()
#10 0x000055f2707654ea in tesseract::TessBaseAPI::ProcessPagesInternal(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#11 0x000055f270765b92 in tesseract::TessBaseAPI::ProcessPages(char const*, char const*, int, tesseract::TessResultRenderer*) ()
#12 0x000055f2706f0331 in main ()

Expected Behavior

The frame should be analyzed for OCR data without aborting. Other language combinations that have run without crashing are:

  • eng+ara+fra+spa+mrz+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+rus
  • ocrb_int+eng+fra+spa+mrz+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+rus
  • ocrb_int+eng
  • Many other similar combinations

The combination of ara and ocrb_int (whether on their own or included in a larger list such as the ones above) triggers the abort every time.
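The pattern described above (abort only when both ara and ocrb_int are loaded) could be checked systematically with a small driver script. The following is a minimal sketch, not part of the original report: the helper names `build_command`, `aborted`, and `crashing_pairs` are hypothetical, and it assumes the failed assertion ends the process with SIGABRT, as the "Aborted" line and the backtrace suggest. It tries ordered pairs because the report does not say whether language order matters.

```python
import itertools
import signal
import subprocess


def build_command(image, tessdata_dir, langs):
    """Build the same CLI call as in the report for a sequence of languages."""
    return ["tesseract", image, "-", "--tessdata-dir", tessdata_dir,
            "-l", "+".join(langs)]


def aborted(returncode):
    """True if the child was killed by SIGABRT.

    On POSIX, subprocess reports death by signal N as return code -N.
    """
    return returncode == -signal.SIGABRT


def crashing_pairs(image, tessdata_dir, langs, run=subprocess.run):
    """Return every ordered language pair whose run ends in SIGABRT.

    `run` is injectable so the logic can be exercised without tesseract.
    """
    bad = []
    for pair in itertools.permutations(langs, 2):
        result = run(build_command(image, tessdata_dir, pair),
                     capture_output=True)
        if aborted(result.returncode):
            bad.append(pair)
    return bad
```

Calling `crashing_pairs("sample_013741.jpg", "/path/to/tessdata", ["eng", "ara", "ocrb_int"])` (the directory path is a placeholder) would list the pairs that abort on a given installation.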

Suggested Fix

No known suggested fixes at this time.

tesseract -v

$ tesseract -v
tesseract 5.3.0
 leptonica-1.82.0
  libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.2) : libpng 1.6.40 : libtiff 4.5.1 : zlib 1.2.13 : libwebp 1.2.4 : libopenjp2 2.5.0
 Found AVX2
 Found AVX
 Found FMA
 Found SSE4.1
 Found OpenMP 201511
 Found libarchive 3.6.2 zlib/1.2.13 liblzma/5.4.0 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.2
 Found libcurl/8.2.1 OpenSSL/3.0.10 zlib/1.2.13 brotli/1.0.9 zstd/1.5.5 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.3) libssh/0.10.5/openssl/zlib nghttp2/1.55.1 librtmp/2.3 OpenLDAP/2.6.6

This has also been confirmed with the latest git main.

Operating System

No response

Other Operating System

Ubuntu 23.10-based Docker image, running under a CentOS 7 host (all amd64). The crash has also been reproduced in development setups, including Ubuntu 20.04, 22.04, and WSL2.

uname -a

Linux e04873eb47b5 3.10.0-1160.95.1.el7.x86_64 #1 SMP Mon Jul 24 13:59:37 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux (container view - host view is the same aside from the hostname)

Compiler

Using built-in specs.
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/libexec/gcc/x86_64-linux-gnu/13/lto-wrapper
OFFLOAD_TARGET_NAMES=nvptx-none:amdgcn-amdhsa
OFFLOAD_TARGET_DEFAULT=1
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu 13.2.0-4ubuntu3' --with-bugurl=file:///usr/share/doc/gcc-13/README.Bugs --enable-languages=c,ada,c++,go,d,fortran,objc,obj-c++,m2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-13 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/libexec --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-bootstrap --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --enable-libphobos-checking=release --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --enable-cet --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none=/build/gcc-13-XYspKM/gcc-13-13.2.0/debian/tmp-nvptx/usr,amdgcn-amdhsa=/build/gcc-13-XYspKM/gcc-13-13.2.0/debian/tmp-gcn/usr --enable-offload-defaulted --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu --with-build-config=bootstrap-lto-lean --enable-link-serialization=2
Thread model: posix
Supported LTO compression algorithms: zlib zstd
gcc version 13.2.0 (Ubuntu 13.2.0-4ubuntu3)

CPU

vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
stepping        : 1
microcode       : 0xb00002a
cpu MHz         : 1291.992
cache size      : 20480 KB
physical id     : 0
siblings        : 16
core id         : 0
cpu cores       : 8
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin rsb_ctxsw ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts spec_ctrl intel_stibp
bogomips        : 4200.12
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Virtualization / Containers

Docker 24.0.6

Other Information

No response

Contributor

zdenop commented Jan 13, 2024

I see several problems here:

  1. First of all: use the latest version (leptonica 1.84.1, tesseract 5.3.3) when reporting an issue.
  2. I agree Tesseract should not crash, but your image is not suitable input.
  3. You should be able to demonstrate a problem with official data only, and it looks like you use custom-generated languages (mrz, ocrb_int). For example, when I run tesseract i4148.jpg - -l eng+ara+fra+spa+chi_sim+chi_sim_vert+chi_tra+chi_tra_vert+rus or tesseract i4148.jpg - -l ara, it does not crash, i.e. I cannot reproduce the issue.

Contributor

stweil commented Jan 13, 2024

@marcreichman-pfi, it looks like at least ocrb_int.traineddata is needed to reproduce the issue. Can you provide more information about that model and add it to the issue?

I can reproduce the issue with ara from tessdata_fast, ocrb_int from https://github.com/Shreeshrii/tessdata_ocrb/raw/master/ocrb_int.traineddata (thanks @zdenop) and the image above.

It does not crash when I use ara from tessdata or tessdata_best.

@marcreichman-pfi
Author

@GerHobbelt Is the fix you made in your fork something that will be merged into mainline at some point? I'm just trying to figure out whether it's worth building a custom package to include that one-line change. Thanks for your time and attention!

Contributor

stweil commented Nov 8, 2024

Issue #4270 is a duplicate of this bug report and contains additional information.

Contributor

stweil commented Nov 8, 2024

Removing the assertion as in commit 407c165 is not a valid solution because this causes a later heap-use-after-free bug.

@marcreichman-pfi
Author

@stweil Thanks for the activity on these tickets. I agree - this did sidestep the assertion issue (we're using it in a local fork), but there are still segfaults remaining - I assume because of this change.

Any idea on what to look at to fix the original issue?

Contributor

stweil commented Nov 8, 2024

No, I still have no idea. I am afraid that the planned next release won't fix this issue unless someone finds a solution fast.

Contributor

stweil commented Nov 8, 2024

The issue does not occur with all language pairs. It is even sufficient to replace tessdata_fast/ara by tessdata_best/ara. So there exist workarounds which can be used to avoid the crash.
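One way to apply this workaround from an application is to run tesseract as a child process and retry with a different model directory when the run aborts. The sketch below is only an illustration under stated assumptions: `ocr_with_fallback` is a hypothetical helper, and each directory passed in is assumed to contain every required traineddata file (for example, one directory with ara from tessdata_fast and another with ara from tessdata_best, both also holding ocrb_int). Running the CLI in a child process also keeps the abort from taking down the calling application.

```python
import signal
import subprocess


def ocr_with_fallback(image, langs, tessdata_dirs, run=subprocess.run):
    """Try each tessdata directory in order.

    Returns (tessdata_dir, text) for the first run that exits cleanly.
    `run` is injectable so the logic can be exercised without tesseract.
    """
    for tessdata_dir in tessdata_dirs:
        cmd = ["tesseract", image, "-", "--tessdata-dir", tessdata_dir,
               "-l", langs]
        result = run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return tessdata_dir, result.stdout
        if result.returncode != -signal.SIGABRT:
            # Not the assertion abort: surface the error instead of hiding it.
            raise RuntimeError(
                f"tesseract failed ({result.returncode}): {result.stderr}")
        # SIGABRT (the failed assertion): fall through to the next directory.
    raise RuntimeError("every tessdata directory ended in SIGABRT")
```

A call such as `ocr_with_fallback("sample_013741.jpg", "ara+ocrb_int", ["/models/fast", "/models/best"])` (both paths are placeholders) would only touch the second directory if the first run aborts. This avoids the crash but does not address the underlying bug.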
