Characters assigned to wrong RIL_WORD block, 0 % confidence. #4175
Comments
Please provide input images and example C++ code that demonstrate your problem.
Input image

Code

Output

Remarks

The program above reproduces the error shown in the original issue post, but in a self-contained program; hence coloring, fonts, etc. deviate. Each word or symbol comes with its bounding box (green), the recognized text (red) and the confidence (blue).

Dependencies

Setup

To execute the program, you need to put the input image into the executable's current directory as "input.png".

Discussion

As you can see in the output image provided, the word "29M1" is recognized as "29M" with 0 % confidence, despite consisting of the three characters '2', '9' and 'M' with above 90 % confidence each. The 'M' is a misdetection of the actual printed "M1". Notably, the next character might be what screws things up: the first '1' of "10210A" gets detected as three different symbols, '1', '1' and 'T', where the glitched '1' and 'T' seem to share the exact same location. Their bounding boxes are taller than those of the neighboring characters but only 1 px wide. It seems those glitched symbols screw up the word "29M110210A", divide it into two parts and subsequently set their confidences to zero. And just for funsies, on the left side the word "paper" is recognized from random cracks — with 16 % confidence, which is infinitely more than the 0 % for the second line of the actual printed text.
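For reference, the reproduction program can be sketched roughly like this (an illustrative sketch, not the exact code from the attachment: it assumes tesseract and leptonica are installed, that eng.traineddata is findable via the tessdata path, and that the input image is named "input.png"; it prints the per-word and per-symbol results instead of drawing them):

```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  // Assumes eng.traineddata is findable (e.g. via TESSDATA_PREFIX).
  if (api.Init(nullptr, "eng") != 0) {
    fprintf(stderr, "Could not initialize tesseract\n");
    return 1;
  }

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) {
    fprintf(stderr, "Could not read input.png\n");
    return 1;
  }
  api.SetImage(image);
  api.Recognize(nullptr);

  // Walk the result twice: once per word, once per symbol, dumping
  // text, confidence and bounding box for each element.
  for (tesseract::PageIteratorLevel level :
       {tesseract::RIL_WORD, tesseract::RIL_SYMBOL}) {
    tesseract::ResultIterator* ri = api.GetIterator();
    if (ri == nullptr) continue;
    do {
      const char* text = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("%s '%s' conf=%.1f box=(%d,%d)-(%d,%d)\n",
             level == tesseract::RIL_WORD ? "word" : "symbol",
             text ? text : "", conf, x1, y1, x2, y2);
      delete[] text;
    } while (ri->Next(level));
    delete ri;
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```

If the bug reproduces, the word-level pass should report 0 % for "29M" while the symbol-level pass still reports above 90 % for the individual '2', '9' and 'M'.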
Thank you. As you guessed, text detection is what we aimed for. Just to reemphasize, I was thrown off neither by the random junk being detected outside the obvious text label nor by the inserted blank between '1' and '1'. What caught my attention was that
I think this is dangerous: there is a continuous sweep of angles the image can be rotated through with good results, and then, amongst those, there is a discontinuity in the results where obvious recognition artifacts screw up the result. But I understand, there is machine learning behind the scenes, and those models tend to have that kind of discontinuity issue.
One last question: do you have any educated guess as to why this is happening? As far as I understand the documentation, the image acquired by GetThresholdedImage() is the true image presented to the OCR. How come there is a character, 'a', recognised in a pitch-black area without a single white pixel? To me this looks as if the character recognition model has not been trained with empty images as part of the rejection class(es).
Current Behavior
Upon recognition with the language model tessdata 4.1.0 eng.traineddata, characters that are a bit too far apart sometimes get sorted into two words (false whitespace), depending on the ROI that was set with TessBaseAPI::SetRectangle.
This is an expected edge case, since segmentation differs with the placement of the ROI. The thresholding method is Tiled Sauvola.
The recognized characters, however, seem not to be assigned to the correct word block (RIL_WORD):
Below you can see the full image and the ROI (orange rectangle). The black area in the center is a customer logo that had to be removed for uploading. During OCR, this area was not black, but printed like the other text.
Changing the ROI slightly (moving the bottom right corner a bit more outward) removes the split of the "29M110210A" line, though the confidence is low (22.3 %).
Side note:
And suddenly an additional "0" has sneaked in, making it "290M110210A"!
Layout analysis (TessBaseAPI::AnalyseLayout) shows a tiny fragment inside the "M". Is that the "0"?
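The layout fragments can also be inspected programmatically. A minimal sketch (assuming tesseract and leptonica are installed and eng.traineddata is available; the filename and ROI coordinates are placeholders, not the actual values) that lists the layout blobs and flags suspiciously thin ones:

```cpp
#include <cstdio>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) return 1;

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) return 1;
  api.SetImage(image);
  // Placeholder ROI; substitute the rectangle from the screenshots.
  api.SetRectangle(100, 100, 800, 200);

  // Layout analysis only, no recognition. At this stage the symbol
  // level walks the candidate blobs found by segmentation.
  tesseract::PageIterator* it = api.AnalyseLayout();
  if (it != nullptr) {
    do {
      int x1, y1, x2, y2;
      it->BoundingBox(tesseract::RIL_SYMBOL, &x1, &y1, &x2, &y2);
      int w = x2 - x1, h = y2 - y1;
      // Flag very thin blobs like the suspected fragment inside the "M".
      printf("blob %dx%d at (%d,%d)%s\n", w, h, x1, y1,
             (w <= 2 || h <= 2) ? "  <-- suspicious fragment" : "");
    } while (it->Next(tesseract::RIL_SYMBOL));
    delete it;
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```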
Changing the ROI a bit more (moving the bottom right corner further outward) then gives a good confidence (80.4 %).
I thought that maybe I am observing an edge case where the segmentation goes from OK to really bad, but this seems not to be the case. TessBaseAPI::GetThresholdedImage gives these nearly identical images for the bad and the good case:
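The bad/good comparison above can be reproduced with a small driver that recognizes the same image under both ROIs and dumps what the OCR actually saw. This is a sketch under the same assumptions as before (tesseract and leptonica installed, eng.traineddata available); the filename and both sets of ROI coordinates are placeholders:

```cpp
#include <cstdio>
#include <string>
#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main() {
  tesseract::TessBaseAPI api;
  if (api.Init(nullptr, "eng") != 0) return 1;

  Pix* image = pixRead("input.png");  // placeholder filename
  if (image == nullptr) return 1;

  // Two nearly identical ROIs: the "bad" one and the slightly enlarged
  // "good" one (coordinates are placeholders, not the actual values).
  const int rois[2][4] = {{100, 100, 800, 200},   // bad: splits the line
                          {100, 100, 820, 210}};  // good: high confidence

  for (int i = 0; i < 2; ++i) {
    api.SetImage(image);
    api.SetRectangle(rois[i][0], rois[i][1], rois[i][2], rois[i][3]);

    char* text = api.GetUTF8Text();  // triggers recognition
    printf("ROI %d -> %s (mean conf %d)\n",
           i, text ? text : "", api.MeanTextConf());
    delete[] text;

    // Dump the thresholded image the OCR was given, for pixel-level diffing.
    Pix* thresh = api.GetThresholdedImage();
    std::string name = "thresholded_" + std::to_string(i) + ".png";
    pixWrite(name.c_str(), thresh, IFF_PNG);
    pixDestroy(&thresh);
  }

  api.End();
  pixDestroy(&image);
  return 0;
}
```

Diffing the two thresholded_*.png files shows whether the confidence cliff is explained by the binarized input or happens later, inside recognition.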
Expected Behavior
This should either read "20M1" and "10210A" or "20M110210A" (neglect the "9" as "0" misrecognition) and not have zero confidence.
Suggested Fix
No response
tesseract -v
tesseract 5.3.2
leptonica-1.83.1 (Oct 27 2023, 14:15:52) [MSC v.1929 LIB Release x64]
(null)
Found AVX2
Found AVX
Found FMA
Found SSE4.1
Operating System
Windows 10
Other Operating System
No response
uname -a
No response
Compiler
MSVC 16.11.26
CPU
No response
Virtualization / Containers
No response
Other Information
C++ Application, TessBaseAPI. Custom GDI visualization, hence only cropped screenshots with slightly differing alignment.