Characters assigned to wrong RIL_WORD block, 0 % confidence. #4175
Comments
Please provide input images and example C++ code that demonstrate your problem.
Input image

Code

Output

Remarks

The program above reproduces the error shown in the original issue post, but in a self-contained program; hence coloring, fonts, etc. deviate. Each word or symbol comes with its bounding box (green), the recognized text (red) and the confidence (blue).

Dependencies

Setup

To execute the program, you need to put the input image into the executable's current directory as "input.png".

Discussion

As you can see in the output image provided, the word "29M1" is recognized as "29M" with 0 % confidence, despite consisting of the three characters '2', '9' and 'M' with above 90 % confidence each. The 'M' is a misdetection of the actual printed "M1". Notably, the next character might be what screws things up: the first '1' of "10210A" gets detected as three different symbols, '1', '1' and 'T', where the glitched '1' and 'T' seem to share the exact same location. Their bounding boxes are taller than those of the neighboring characters but only 1 px wide. It seems those glitched symbols screw up the word "29M110210A", divide it into two parts and subsequently set their confidences to zero. And just for funsies, on the left side the word "paper" is recognized from random cracks — with 16 % confidence, which is infinitely more than the 0 % for the second line of the actual printed text.
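For reference, the reproduction program can be sketched roughly like this (an illustrative sketch, not the exact code from the attachment: it assumes tesseract and leptonica are installed, that eng.traineddata is findable via the tessdata path, and that the input image is named "input.png"; it prints the per-word and per-symbol results instead of drawing them):

```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  // Assumes eng.traineddata is findable (e.g. via TESSDATA_PREFIX).
  if (api.Init(nullptr, "eng") != 0) {
    fprintf(stderr, "Could not initialize tesseract\n");
    return 1;
  }

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png\n");
    return 1;
  }
  api.SetImage(image);
  api.Recognize(nullptr);

  // Walk the result twice: once per word, once per symbol, dumping
  // text, confidence and bounding box for each element.
  for (tesseract::PageIteratorLevel level :
       {tesseract::RIL_WORD, tesseract::RIL_SYMBOL}) {
    tesseract::ResultIterator* ri = api.GetIterator();
    if (ri == nullptr) continue;
    do {
      const char* text = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("%s '%s' conf=%.1f box=(%d,%d)-(%d,%d)\n",
             level == tesseract::RIL_WORD ? "word" : "symbol",
             text ? text : "", conf, x1, y1, x2, y2);
      delete[] text;
    } while (ri->Next(level));
    delete ri;
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```

If the bug reproduces, the word-level pass should report 0 % for "29M" while the symbol-level pass still reports above 90 % for the individual '2', '9' and 'M'.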
Thank you. As you guessed, text detection is what we aimed for. Just to reemphasize, I was thrown off neither by the random junk being detected outside the obvious text label nor by the inserted blank between '1' and '1'. What caught my attention was that
I think this is dangerous: there is a continuous sweep of angles the image can be rotated through with good results, and then, amongst those, there is a discontinuity in the results where obvious recognition artifacts screw up the result. But I understand, there is machine learning behind the scenes, and those models tend to have that kind of discontinuity issue.
One last question: do you have any educated guess as to why this is happening? As far as I understand the documentation, the image acquired by GetThresholdedImage() is the true image presented to the OCR. How come there is a character, 'a', recognised in a pitch-black area without a single white pixel? To me this looks as if the character recognition model has not been trained with empty images as part of the rejection class(es).
Current Behavior
Upon recognition with the language model tessdata 4.1.0 eng.traineddata, characters that are a bit too far apart sometimes get sorted into two words (false whitespace), depending on the ROI that was set with TessBaseAPI::SetRectangle.
This is an expected edge case, since segmentation differs with the placement of the ROI. The thresholding method is Tiled Sauvola.
The recognized characters, however, seem not to be assigned to the correct word block (RIL_WORD):
Below you can see the full image and the ROI (orange rectangle). The black area in the center is a customer logo that had to be removed for uploading. During OCR, this area was not black, but printed like the other text.
Changing the ROI slightly (moving the bottom right corner a bit more outward) removes the split of the "29M110210A" line, though the confidence is low (22.3 %).
Side note:
And suddenly an additional "0" has sneaked in, making it "290M110210A"!
Layout analysis (TessBaseAPI::AnalyseLayout) shows a tiny fragment inside the "M". Is that the "0"?
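The layout fragments can also be inspected programmatically. A minimal sketch (assuming tesseract and leptonica are installed and eng.traineddata is available; the filename and ROI coordinates are placeholders, not the actual values) that lists the layout blobs and flags suspiciously thin ones:

```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) return 1;

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) return 1;
  api.SetImage(image);
  // Placeholder ROI; substitute the rectangle from the screenshots.
  api.SetRectangle(100, 100, 800, 200);

  // Layout analysis only, no recognition. At this stage the symbol
  // level walks the candidate blobs found by segmentation.
  tesseract::PageIterator* it = api.AnalyseLayout();
  if (it != nullptr) {
    do {
      int x1, y1, x2, y2;
      it->BoundingBox(tesseract::RIL_SYMBOL, &x1, &y1, &x2, &y2);
      int w = x2 - x1, h = y2 - y1;
      // Flag very thin blobs like the suspected fragment inside the "M".
      printf("blob %dx%d at (%d,%d)%s\n", w, h, x1, y1,
             (w <= 2 || h <= 2) ? "  <-- suspicious fragment" : "");
    } while (it->Next(tesseract::RIL_SYMBOL));
    delete it;
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```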
Changing the ROI a bit more (moving the bottom right corner further outward) then gives a good confidence (80.4 %).
I thought that maybe I am observing an edge case where the segmentation goes from OK to really bad, but this seems not to be the case. TessBaseAPI::GetThresholdedImage gives these nearly identical images for the bad and the good case:
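The bad/good comparison above can be reproduced with a small driver that recognizes the same image under both ROIs and dumps what the OCR actually saw. This is a sketch under the same assumptions as before (tesseract and leptonica installed, eng.traineddata available); the filename and both sets of ROI coordinates are placeholders:

```cpp
#include <cstdio>
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) return 1;

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) return 1;

  // Two nearly identical ROIs: the "bad" one and the slightly enlarged
  // "good" one (coordinates are placeholders, not the actual values).
  const int rois[2][4] = {{100, 100, 800, 200},   // bad: splits the line
                          {100, 100, 820, 210}};  // good: high confidence

  for (int i = 0; i < 2; ++i) {
    api.SetImage(image);
    api.SetRectangle(rois[i][0], rois[i][1], rois[i][2], rois[i][3]);

    char* text = api.GetUTF8Text();  // triggers recognition
    printf("ROI %d -> %s (mean conf %d)\n",
           i, text ? text : "", api.MeanTextConf());
    delete[] text;

    // Dump the thresholded image the OCR was given, for pixel-level diffing.
    Pix* thresh = api.GetThresholdedImage();
    std::string name = "thresholded_" + std::to_string(i) + ".png";
    pixWrite(name.c_str(), thresh, IFF_PNG);
    pixDestroy(&thresh);
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```

Diffing the two thresholded_*.png files shows whether the confidence cliff is explained by the binarized input or happens later, inside recognition.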
Expected Behavior
This should either read "20M1" and "10210A" or "20M110210A" (neglect the "9" as "0" misrecognition) and not have zero confidence.
Suggested Fix
No response
tesseract -v
tesseract 5.3.2
leptonica-1.83.1 (Oct 27 2023, 14:15:52) [MSC v.1929 LIB Release x64]
(null)
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Operating System
Windows 10
Other Operating System
No response
uname -a
No response
Compiler
MSVC 16.11.26
CPU
No response
Virtualization / Containers
No response
Other Information
C++ Application, TessBaseAPI. Custom GDI visualization, hence only cropped screenshots with slightly differing alignment.