Telugu unicode ambiguities #32
Comments
1. Please also test with tessdata_fast.
2. Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to
ensure that rarely used letters are included.
3. Take a look at the training source files in langdata_lstm repo under tel.
4. Verify that the indic/telugu validation rules are correct.
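For point 2, the check could be scripted along these lines. This is a minimal sketch, not a Tesseract tool. It assumes the unicharset has already been unpacked from the traineddata file (for example with `combine_tessdata -u tel.traineddata tel.`), that the first line of the file is an entry count, and that each later line starts with the unichar followed by whitespace-separated properties. The names `parse_unicharset`, `missing_letters` and `RARE_LETTERS` are illustrative only; the list holds the rarely used code points mentioned in the report.

```python
from typing import Iterable, List

# Rarely used Telugu code points named in the report (illustrative list).
RARE_LETTERS = [
    "\u0c58",  # TELUGU LETTER TSA
    "\u0c59",  # TELUGU LETTER DZA
    "\u0c19",  # TELUGU LETTER NGA
    "\u0c1e",  # TELUGU LETTER NYA
    "\u0c60",  # TELUGU LETTER VOCALIC RR
    "\u0c44",  # TELUGU VOWEL SIGN VOCALIC RR
]


def parse_unicharset(text: str) -> List[str]:
    """Return the unichar strings from unicharset file content.

    Assumption: line 1 is the entry count and every later line begins
    with the unichar, followed by whitespace-separated properties.
    """
    lines = text.splitlines()[1:]
    return [line.split()[0] for line in lines if line.strip()]


def missing_letters(unichars: Iterable[str], letters: Iterable[str]) -> List[str]:
    """Return the letters that occur in no unichar entry.

    A letter counts as present if it appears anywhere inside an entry,
    because entries may be multi-code-point sequences.
    """
    seen = set()
    for entry in unichars:
        seen.update(entry)
    return [ch for ch in letters if ch not in seen]
```

To use it, read `tel.lstm-unicharset` as UTF-8 text, pass the content to `parse_unicharset`, and call `missing_letters(entries, RARE_LETTERS)` once for the tessdata_best file and once for the tessdata_fast file.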
On Thu 13 Sep, 2018, 8:53 AM Manas Marthi wrote:
Hi,
I created test text made up mostly of individual characters (see
attachment) and converted it to a TIFF file using 'jTessBoxEditorFX'. I then
ran it through the tessdata_best Telugu trained data.
I noticed a few recognition errors. I believe these are due to
ambiguous glyphs.
*Ambiguity 1*: Telugu has three vowels that look similar to a
consonant (and a second consonant that looks close enough)
vowel 1) ఒ (pronounced as 'o' in 'so')
vowel 2) ఓ (pronounced as 'oa' in 'goal' )
vowel 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')
similar looking consonant 1) బ (pronounced as 'bu' in 'bus')
consonant 2) భ (this is same as above but uttered with stress and
aspiration. Imagine saying 'bus' as 'bhus')
*Ambiguity 2*: Consonant చ (pronounced as 'ch' in 'church') is similar
to another, rarely used consonant ౘ (closest transliteration 'tsa')
*Ambiguity 3*: Consonant ర (pronounced as 'ru' in 'run') is similar to
another consonant ఠ (hard 't', close to the 't' in 'stone')
*Ambiguity 4*: Consonant జ (pronounced as 'ju' in 'justice') is
similar to another, rarely used consonant ౙ (closest transliteration 'za')
and also similar to ఙ ('jna')
*Ambiguity 5*: Consonant ఝ (pronounced as 'jha', a hard జ with aspiration)
was interpreted as 'రు' (pronounced as 'ru' in 'rupee'), which is a
combination of consonant ర ('ru') + vowel ఉ (pronounced as 'u' in 'push')
*Ambiguity 6*: Vowel ఇ (pronounced as 'i' in 'ink') is close to
consonant ఞ (pronounced as 'inya'). The 'inya' was not recognized at
all in my test data.
*Ambiguity 7*: కౄ ('cru' as in 'cruel') and గౄ ('grue' as in 'gruesome')
were converted to క్య ('kya') and గ్యూ ('gyoo').
*Ambiguity 8*: ౠ ('rroo') became బూ ('boo')
I guess some of these could be due to my poor TIFF, but I think some of the
ambiguities are genuine and need to be handled.
Please help resolve these ambiguities.
tesseract-telugu.txt
<https://github.com/tesseract-ocr/tessdata_best/files/2377575/tesseract-telugu.txt>
|
Please test with real text not just syllables.
On Thu 13 Sep, 2018, 12:22 PM Shree Devi Kumar wrote:
> 1. Please also test with tessdata_fast.
> 2. Check tel.lstm-unicharset in both tessdata_best and tessdata_fast to
> ensure that rarely used letters are included.
> 3. Take a look at the training source files in langdata_lstm repo under tel.
> 4. Verify that the indic/telugu validation rules are correct.
|
Thank you. I will try and update |
I created a Word document with valid text, converted it to PDF and then to TIFF using ImageMagick, and ran tesseract with the fast trained data (tessdata_fast). Recognition was mostly okay. A newspaper clipping had some errors, but that's fine. That said, the ambiguities described in items 1 and 7 are still a problem. |
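To compare the two model sets on the same image, the runs could be wrapped as below. This is only a sketch: it assumes the `tesseract` executable is on the PATH and that tessdata_best and tessdata_fast have been downloaded into separate directories. The function names and any paths passed in are hypothetical. `stdout` as the output base, `--tessdata-dir` and `-l` are standard tesseract command-line options.

```python
import subprocess
from typing import List


def tesseract_cmd(image: str, tessdata_dir: str, lang: str = "tel") -> List[str]:
    """Build a tesseract command that writes recognized text to stdout."""
    return ["tesseract", image, "stdout", "--tessdata-dir", tessdata_dir, "-l", lang]


def recognize(image: str, tessdata_dir: str, lang: str = "tel") -> str:
    """Run tesseract and return its text output (needs tesseract installed)."""
    result = subprocess.run(
        tesseract_cmd(image, tessdata_dir, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout
```

Calling `recognize` once per model directory on the same TIFF gives two strings that can be compared directly.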
I will do more testing and update here |
Hi,
I created test text made up mostly of individual characters (see attachment) and converted it to a TIFF file using 'jTessBoxEditorFX' with the font 'Noto Sans Telugu' at 8pt. I then ran it through the tessdata_best Telugu trained data.
I noticed a few recognition errors. I believe these are due to ambiguous glyphs.
Ambiguity 1: Telugu has three vowels that look similar to a consonant (and a second consonant that looks close enough)
vowel 1) ఒ (pronounced as 'o' in 'so')
vowel 2) ఓ (pronounced as 'oa' in 'goal' )
vowel 3) ఔ (pronounced as 'ou' in 'ounce' or 'pound')
similar looking consonant 1) బ (pronounced as 'bu' in 'bus')
consonant 2) భ (this is same as above but uttered with stress and aspiration. Imagine saying 'bus' as 'bhus')
Ambiguity 2: Consonant చ (pronounced as 'ch' in 'church') is similar to another, rarely used consonant ౘ (closest transliteration 'tsa')
Ambiguity 3: Consonant ర (pronounced as 'ru' in 'run') is similar to another consonant ఠ (hard 't', close to the 't' in 'stone')
Ambiguity 4: Consonant జ (pronounced as 'ju' in 'justice') is similar to another, rarely used consonant ౙ (closest transliteration 'za') and also similar to ఙ ('jna')
Ambiguity 5: Consonant ఝ (pronounced as 'jha', a hard జ with aspiration) was interpreted as 'రు' (pronounced as 'ru' in 'rupee'), which is a combination of consonant ర ('ru') + vowel ఉ (pronounced as 'u' in 'push')
Ambiguity 6: Vowel ఇ (pronounced as 'i' in 'ink') is close to consonant ఞ (pronounced as 'inya'). The 'inya' was not recognized at all in my test data.
Ambiguity 7: కౄ ('cru' as in 'cruel') and గౄ ('grue' as in 'gruesome') were converted to క్య ('kya') and గ్యూ ('gyoo').
Ambiguity 8: ౠ ('rroo') became బూ ('boo')
I guess some of these could be due to my poor TIFF, but I think some of the ambiguities are genuine and need to be handled.
Please help resolve these ambiguities.
tesseract-telugu.txt
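A list of confusions like the one above could also be tallied automatically by aligning the ground-truth text with the OCR output. The sketch below uses Python's standard `difflib` for the alignment. The function name `confusion_pairs` is illustrative, and both texts are assumed to be available as Unicode strings. Only substitutions are counted; insertions and deletions are ignored.

```python
import difflib
from collections import Counter


def confusion_pairs(expected: str, recognized: str) -> Counter:
    """Count (expected, recognized) substring pairs that differ.

    Only 'replace' opcodes are counted; pure insertions and deletions
    are skipped in this sketch.
    """
    matcher = difflib.SequenceMatcher(None, expected, recognized, autojunk=False)
    pairs = Counter()
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            pairs[(expected[i1:i2], recognized[j1:j2])] += 1
    return pairs
```

The alignment works on code points, so one visible glyph may span several of them (రు, for instance, is ర followed by the vowel sign ు). A substitution is therefore reported as a pair of short strings, not a pair of single characters.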