Add batch processing and report summaries #83

Merged
merged 1 commit into from
May 26, 2023


rfdj
Contributor

@rfdj rfdj commented May 12, 2023

In this PR, we added several features:

  • batch processing of entire directories
  • producing a list of the differences between GT and OCR together with their number of occurrences. This list is shown in both the JSON and the HTML reports (sortable).
  • parsing sets of generated JSON reports, calculating average scores and aggregating a final list of differences. A threshold option was added to prevent the HTML report from becoming too large to open.

The README has been updated accordingly.
This allows us to easily evaluate OCR results in a more systematic way. Please let us know if it's of any interest to you.
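
For readers who want a concrete picture of the aggregation step, here is a minimal sketch, not the code from this PR. It assumes each parsed JSON report is a dict with a hypothetical flat `differences` mapping from a difference label to its count, and that the threshold means "keep only differences occurring at least this many times". Both are assumptions made for illustration.

```python
from collections import Counter


def aggregate_differences(reports, threshold=0):
    """Sum difference counts over several reports and drop rare entries.

    `reports` is a list of dicts; the "differences" key and its flat
    {label: count} layout are assumptions, not the confirmed report schema.
    """
    totals = Counter()
    for report in reports:
        # Reports without difference statistics simply contribute nothing.
        totals.update(report.get("differences", {}))
    # most_common() yields entries sorted by descending count.
    return {label: n for label, n in totals.most_common() if n >= threshold}


if __name__ == "__main__":
    example_reports = [
        {"differences": {"ſ :: s": 3, "e :: c": 1}},
        {"differences": {"ſ :: s": 2}},
        {},  # a report that carries no difference statistics
    ]
    print(aggregate_differences(example_reports, threshold=2))
```

Filtering after summing (rather than per report) keeps a difference that is rare in each single report but frequent overall.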

@mikegerber mikegerber self-assigned this May 24, 2023
@mikegerber mikegerber added the enhancement New feature or request label May 24, 2023
@mikegerber
Member

Hi Ruud,

thanks for the PR! Very good ideas!

I think I am going to add this, but I also believe it needs some more work:

A)

  • The average CER calculated here is what I would call a macro average: it does not weigh in the length of the texts, which, depending on the intended use case, may be what is wanted instead ("micro average"). This is the reason why the JSON output includes the length, so that this can be calculated.
  • It also does not account for possible "infinite CER" cases (may happen when len(GT)==0)
  • We also need support for summarizing inside and over OCR-D workspaces
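
The difference between the two averages, and the effect of an infinite CER, can be sketched as follows. This is an illustration under assumptions, not dinglehopper's implementation: each report is taken to be a dict with a `cer` value and an `n_characters` length (assumed field names), and documents with a non-finite CER or an empty GT are simply skipped, which loses their errors.

```python
import math


def macro_average_cer(reports):
    """Unweighted mean of per-document CERs: every document counts equally."""
    finite = [r["cer"] for r in reports if math.isfinite(r["cer"])]
    return sum(finite) / len(finite) if finite else math.nan


def micro_average_cer(reports):
    """Length-weighted mean: estimated total errors / total GT characters."""
    usable = [
        r for r in reports
        if r["n_characters"] > 0 and math.isfinite(r["cer"])
    ]
    total_chars = sum(r["n_characters"] for r in usable)
    if total_chars == 0:
        return math.nan
    # cer * n_characters recovers the (approximate) error count per document.
    total_errors = sum(r["cer"] * r["n_characters"] for r in usable)
    return total_errors / total_chars


if __name__ == "__main__":
    example_reports = [
        {"cer": 0.5, "n_characters": 10},       # short, bad page
        {"cer": 0.1, "n_characters": 90},       # long, good page
        {"cer": math.inf, "n_characters": 0},   # empty GT: "infinite CER"
    ]
    print(macro_average_cer(example_reports))
    print(micro_average_cer(example_reports))
```

In the example data the short, poorly recognized page pulls the macro average up much more than the micro average, because the micro average weighs it by its ten characters only.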

B)

I'll work on A first (possibly leaving out OCR-D for now)

@mikegerber
Member

Working in https://github.com/qurator-spk/dinglehopper/tree/pr-83 (can't push to INL:feat/batch-processing, it seems).

@mikegerber
Member

@rfdj Could you check if you can "Allow edits from maintainers" for this PR? (It may not be possible due to https://github.com/orgs/community/discussions/5634, though)

@mikegerber
Member

mikegerber commented May 25, 2023

I've fixed a bug in dinglehopper-summarize that occurred when the reports do not contain any difference statistics. The fix is here:

7c323e1

@mikegerber mikegerber merged commit 35be58c into qurator-spk:master May 26, 2023
@mikegerber
Member

@rfdj Because I'm going on vacation, I decided to merge the PR as is (so as not to block it for more weeks) and will take care of the points I mentioned after my vacation :)

Thanks for the contribution, I think this will be useful for the users!

@mikegerber mikegerber mentioned this pull request May 26, 2023
5 tasks
@rfdj
Contributor Author

rfdj commented May 30, 2023

I was away for a few days myself, but let me know if I can still be of assistance somewhere when you come back.

@mikegerber
Member

mikegerber commented Oct 27, 2023

I've noticed that I had a small fix for this in 7c323e1 (= f077ce2 in master) that I hadn't merged yet and did so today. (Summarizing threw an Exception if the reports didn't have the difference stats.)

@mikegerber mikegerber mentioned this pull request Jan 2, 2024
6 tasks
Labels
enhancement New feature or request

2 participants