Support optional stopword list #17

Open
cneud opened this issue Jun 9, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@cneud
Member

cneud commented Jun 9, 2020

A common use case for OCR evaluation (e.g. for search engine indexing or text and data mining) is to omit stopwords from the word-level evaluation, so that the result reflects the correctness of "significant words" only.

It would therefore be useful if dinglehopper also supported an optional stopword list, supplied via a parameter or config file. This is already supported in ocrevalUAtion.
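
For illustration, reading such a list could look roughly like the sketch below. The file format (one stopword per line, `#` starting a comment) and the function name `load_stopwords` are assumptions made for this example; neither dinglehopper nor ocrevalUAtion is claimed to work this way.

```python
from pathlib import Path


def load_stopwords(path, encoding="utf-8"):
    """Read a stopword file into a set of lowercased words.

    Assumed format (hypothetical): one word per line, blank lines
    ignored, everything after '#' treated as a comment.
    """
    words = set()
    for line in Path(path).read_text(encoding=encoding).splitlines():
        word = line.split("#", 1)[0].strip()  # drop comment and whitespace
        if word:
            words.add(word.lower())
    return words
```

Lowercasing on load keeps the later lookup case-insensitive, which is a design choice for this sketch rather than a requirement from the issue.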

@cneud cneud added the enhancement New feature or request label Jun 9, 2020
@mikegerber mikegerber self-assigned this Jun 10, 2020
@mikegerber
Member

Given

GT: the quick brown fox jumps over the lazy dog
OCR: the quick brown fox jumps over they lazer dog

with a stop word list:

the

would only count 1 error (lazy vs lazer).

@cneud
Member Author

cneud commented Sep 25, 2020

Just leaving this here for documentation - this is also sometimes referred to as "significant words" evaluation.

The number of occurrences of content words for which users might be interested in searching, excluding stop-listed words, such as "the", "he", "it", etc.
Measuring Mass Text Digitization Quality and Usefulness
Measuring the OCR Accuracy across The British Library’s Newspaper Archive

@cneud
Member Author

cneud commented Sep 25, 2020

> would only count 1 error (lazy vs lazer).

Exactly. Any words appearing in the GT and also in the stopword list are ignored when computing the "significant words" accuracy rate.
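
A minimal sketch of that rule, under stated assumptions: words are split on whitespace, GT and OCR are aligned with a plain word-level Levenshtein alignment, and every aligned pair whose GT word is on the stopword list is skipped. The names `align_words` and `significant_word_errors` are hypothetical and not part of dinglehopper; counting words that appear only in the OCR as errors is likewise an assumption of this sketch.

```python
def align_words(gt, ocr):
    """Align two word lists by edit distance.

    Returns (gt_word, ocr_word) pairs; None marks a gap on that side.
    """
    n, m = len(gt), len(ocr)
    # dist[i][j] = edit distance between gt[:i] and ocr[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gt[i - 1] == ocr[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match / substitution
                             dist[i - 1][j] + 1,         # word missing in OCR
                             dist[i][j - 1] + 1)         # extra word in OCR
    # Walk back from the bottom-right corner to recover the alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        cost = 0 if (i > 0 and j > 0 and gt[i - 1] == ocr[j - 1]) else 1
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + cost:
            pairs.append((gt[i - 1], ocr[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            pairs.append((gt[i - 1], None))
            i -= 1
        else:
            pairs.append((None, ocr[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def significant_word_errors(gt_text, ocr_text, stopwords):
    """Return (errors, significant_gt_words), ignoring GT stopwords."""
    stop = {w.lower() for w in stopwords}
    errors = 0
    significant = 0
    for gt_word, ocr_word in align_words(gt_text.split(), ocr_text.split()):
        if gt_word is not None and gt_word.lower() in stop:
            continue  # GT word is a stopword: skip the whole aligned pair
        if gt_word is not None:
            significant += 1
        if gt_word != ocr_word:
            errors += 1
    return errors, significant


gt = "the quick brown fox jumps over the lazy dog"
ocr = "the quick brown fox jumps over they lazer dog"
print(significant_word_errors(gt, ocr, ["the"]))  # (1, 7)
print(significant_word_errors(gt, ocr, []))       # (2, 9)
```

With the stopword list, the pair the/they is dropped because the GT word is "the", which leaves lazy/lazer as the single error over 7 significant words. Without the list, both substitutions count.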
