RFC: Improve positioning of symbol bounding boxes #3787
Open: p12tic wants to merge 2 commits into tesseract-ocr:main from p12tic:improve-symbol-positions
@stweil Just a friendly ping :-)
p12tic force-pushed the improve-symbol-positions branch from cbe83ec to 51a3398 (May 15, 2022 21:15)
Why is this still not merged?
I guess it is a lack of testing capacity for core functionality. Since I discovered this and some other low-level segmentation bugs, my focus has shifted to EasyOCR.
(Sent by email by rmast, in reply to the comment by 海刚 of Thursday, October 26, 2023.)
This PR improves the positions of symbol bounding boxes when an LSTM model is used. Up to 20 times fewer errors have been observed in sample images.
This PR still requires a potentially significant amount of work. Please let me know whether the approach is sensible in principle; if the PR makes sense, I will spend time polishing it.
When using LSTM models, the accuracy of character bounding boxes is low, with many blobs assigned to the wrong characters. The cause is that the LSTM model output contains only approximate character positions without boundary data. As a result, the input blobs cannot be accurately mapped to characters, which compromises the accuracy of the character bounding boxes.
Currently this problem is solved as follows. The character boundaries are computed from the character positions in the LSTM output by placing each boundary midway between two adjacent character positions. Each blob is then assigned to the character whose interval contains the center of the blob. In other words, the blobs are assigned to the nearest characters.
Unfortunately, this produces a lot of errors because the character positions in the LSTM output have a tendency to drift, so the nearest character is often not the right one.
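The baseline described above, and the way drift breaks it, can be sketched in a few lines of Python. This is an illustration only, not Tesseract's code: the function name `assign_blobs_nearest`, the representation of blobs as `(left, right)` pixel ranges, and the sample coordinates are all assumptions made for the example.

```python
from bisect import bisect_right

def assign_blobs_nearest(char_positions, blob_boxes):
    """Assign each blob to the character whose interval contains its center.

    char_positions: sorted x coordinates reported per character (assumed input).
    blob_boxes: (left, right) horizontal extents of the blobs (assumed input).
    Returns one character index per blob.
    """
    # Boundaries sit midway between adjacent character positions.
    boundaries = [(a + b) / 2 for a, b in zip(char_positions, char_positions[1:])]
    # The number of boundaries at or left of the blob center is the character index.
    return [bisect_right(boundaries, (left + right) / 2) for left, right in blob_boxes]

blobs = [(5, 15), (25, 35), (45, 55)]  # three well-separated glyphs

# Accurate positions: every blob lands on its own character.
accurate = assign_blobs_nearest([10, 30, 50], blobs)

# Drifted positions: the second blob is pulled onto the third character,
# and the second character ends up with no blob at all.
drifted = assign_blobs_nearest([10, 22, 34], blobs)
```

With the drifted positions the boundaries fall at 16 and 28, so the blob centered at 30 is attributed to the third character even though the segmentation itself is clean.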
Fortunately, while the LSTM model produces only approximate positions, the blob boundaries produced by the regular segmenter are pretty good. Most of the time a single blob corresponds to a single character and vice versa.
This observation is used to create an optimization algorithm that treats the output of the regular segmenter as a template to which the LSTM model output is matched. The best match is selected by assigning a cost to each unwanted property of the outcome and then minimizing the total cost of the solution.
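To make the cost-minimization idea concrete, here is a minimal Python sketch. It is not the implementation in this PR, and the description above does not list the actual cost terms. The two terms used here (distance between a character position and the center of its blobs, and a penalty for any character that does not get exactly one blob), the dynamic program, and the names `group_cost`, `match_blobs_to_chars`, and `size_penalty` are assumptions chosen for illustration.

```python
def group_cost(position, group, size_penalty):
    """Cost of giving the blobs in `group` to a character reported at `position`."""
    if not group:
        return size_penalty  # a character left without any blob
    left = min(l for l, _ in group)
    right = max(r for _, r in group)
    # Distance from the reported position plus a penalty for each extra blob.
    return abs((left + right) / 2 - position) + size_penalty * (len(group) - 1)

def match_blobs_to_chars(char_positions, blob_boxes, size_penalty=20.0):
    """Return one character index per blob, minimizing the total cost.

    Blobs are assumed to be sorted left to right, and at least one character
    is assumed. Each character receives a contiguous, possibly empty, run of blobs.
    """
    n, m = len(char_positions), len(blob_boxes)
    inf = float("inf")
    # best[i][j]: minimal cost of giving the first j blobs to the first i characters.
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(m + 1):
            for k in range(j + 1):  # character i-1 takes blobs k .. j-1
                cost = best[i - 1][k] + group_cost(
                    char_positions[i - 1], blob_boxes[k:j], size_penalty)
                if cost < best[i][j]:
                    best[i][j], back[i][j] = cost, k
    # Walk the back-pointers to recover which character owns each blob.
    assignment = [0] * m
    j = m
    for i in range(n, 0, -1):
        k = back[i][j]
        assignment[k:j] = [i - 1] * (j - k)
        j = k
    return assignment

# Drifted character positions, clean segmentation: one blob per character wins.
drift_fixed = match_blobs_to_chars([10, 22, 34], [(5, 15), (25, 35), (45, 55)])

# A character made of two blobs (for example an "i" with a separate dot).
two_part = match_blobs_to_chars([10, 30], [(5, 15), (27, 33), (28, 32)])
```

Because leaving a character empty or merging blobs is penalized, the one-blob-per-character solution is preferred even when the reported positions have drifted, while genuinely multi-blob characters can still be grouped when no one-to-one solution exists.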
This reliably solves the most frequent error in the current solution, where blobs are simply assigned to the wrong character. As a result, the algorithm in this PR produces up to 20 times fewer errors.
This can be improved further, because the root cause of most of the remaining errors is the segmenter producing a single blob for multiple characters. The algorithm could be improved by biasing it to split blobs in the places where the segmenter often makes errors, such as near the character "t".
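One way such a bias could be expressed is as a discount on the cost of splitting a blob between two characters. The sketch below is purely hypothetical: the set `MERGE_PRONE`, the function `shared_blob_penalty`, and the numeric values are invented for illustration and are not part of this PR.

```python
# Hypothetical set of characters that the segmenter tends to merge with neighbors.
MERGE_PRONE = frozenset({"t"})

def shared_blob_penalty(left_char, right_char, base_penalty=20.0, discount=0.5):
    """Penalty for splitting one blob between two adjacent characters.

    The penalty is reduced when either character is known to be merge-prone,
    which makes the optimizer more willing to split the blob there.
    """
    if left_char in MERGE_PRONE or right_char in MERGE_PRONE:
        return base_penalty * discount
    return base_penalty

near_t = shared_blob_penalty("r", "t")   # discounted
elsewhere = shared_blob_penalty("a", "b")  # full penalty
```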
Fixes #1712.
One of the example images I've used:
Before this PR, Tesseract produced 116 errors in determining character bounding boxes (the count may be inaccurate because it was done manually).
After this PR, Tesseract produced only 5 errors in determining character bounding boxes.