Inconsistent OCR #350

Open
equationcrunchor opened this issue Apr 29, 2019 · 5 comments
Labels
bug Something isn't working

Comments

@equationcrunchor
Collaborator

equationcrunchor commented Apr 29, 2019

Example:

  • Searching document text for meme. http://127.0.0.1:8000/archives/doc/3_19_pmm_memo_re_709_1960_04_29_1_19 is first result.
  • Looking at PDF preview online, there is no meme in text, only memo. Highlighting the sentence Status of programming memo and revision of machine shut-down date to late July. and copy pasting elsewhere gives correct text.
  • Check OCR text in data/processed_pdfs folder. It says Status of programming meme, probably due to OCR error.

Seems like PDF preview and search have different opinions on the OCR?
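
One way to surface this kind of mismatch is to diff the text copied from the PDF against the stored search text, word by word. The sketch below is only an illustration: the function name `differing_words` is hypothetical, and the two strings are the sentence quoted above in its PDF and `data/processed_pdfs` variants.

```python
import difflib


def differing_words(pdf_text, search_text):
    """Return (pdf_words, search_words) pairs where the two OCR outputs disagree."""
    a = pdf_text.split()
    b = search_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Collect the mismatching word runs from each side.
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


pdf_layer = "Status of programming memo and revision of machine shut-down date to late July."
search_copy = "Status of programming meme and revision of machine shut-down date to late July."
print(differing_words(pdf_layer, search_copy))  # [('memo', 'meme')]
```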

equationcrunchor added the bug label Apr 29, 2019
@srisi
Member

srisi commented Apr 30, 2019

That's unfortunately correct.
Tesseract is non-deterministic, meaning that OCR-ing the same document twice can lead to subtly different results (e.g. reading an "o" as an "e").
In our current pipeline, we OCR each document twice: the first time to generate the PDF, the second time to generate the text file for the search. I hadn't thought about that difference between text and PDF before, but clearly we should do it all in one pass. @samimak37: Ideas? We could use PyPDF2 to extract the text once the document has been OCRed (https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file).
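
As a rough sketch of the "all in one" idea: the Tesseract command line accepts more than one output config in a single invocation, so one recognition pass can write both the searchable PDF and the plain-text file. This is not the project's `ocr.py`; the function names, the `eng` default, and the paths are assumptions for illustration, and it is worth checking the behavior against the installed Tesseract version.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image_path, output_base, lang="eng"):
    """Build one Tesseract call that emits <output_base>.pdf and <output_base>.txt."""
    # "pdf" and "txt" are output configs; listing both requests both files
    # from the same recognition pass.
    return ["tesseract", str(image_path), str(output_base), "-l", lang, "pdf", "txt"]


def ocr_once(image_path, output_base, lang="eng"):
    """Run OCR a single time and return the paths of the PDF and text outputs."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_cmd(image_path, output_base, lang), check=True)
    # Tesseract appends the extension to the output base itself.
    return Path(str(output_base) + ".pdf"), Path(str(output_base) + ".txt")
```

Because both files come from the same pass, the search text and the PDF text layer should no longer be able to disagree.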

@samimak37
Collaborator

I think that's the best option at the moment. Because of how Tesseract handles the OCR, there is no way to guarantee the same result across separate runs (although it is usually very consistent). I can work on the fix today.

@samimak37
Collaborator

samimak37 commented Apr 30, 2019

I’ve been poking around with ocr.py and I’ve found a few things.

  1. Tesseract apparently encodes the text within the PDF in a way that PyPDF2 cannot read. This results in a string that is nothing but newlines, and is therefore not very helpful.

  2. Other packages exist that can pull text from PDF files, but most of them are very Unix-centric. I have had the most success with pdftotext, although it has Poppler as a dependency. This is not a terrible problem for macOS and Linux users, but there is no "easy" way to install Poppler on Windows.

Is this a good route to pursue? Other packages include textract and tika, but they have similar problems (it should also be noted that tika runs through a server, which massively increases runtime).
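
For reference, a minimal sketch of the Poppler route. Note that it calls Poppler's `pdftotext` command-line tool through `subprocess` rather than the `pdftotext` Python package discussed above, and it returns `None` when the tool is missing so a caller could fall back to something else on machines without Poppler. The function names and the `executable` parameter are hypothetical.

```python
import shutil
import subprocess


def build_pdftotext_cmd(pdf_path, executable="pdftotext"):
    """Build a pdftotext call that writes UTF-8 text to stdout ('-')."""
    return [executable, "-layout", "-enc", "UTF-8", str(pdf_path), "-"]


def extract_text_layer(pdf_path, executable="pdftotext"):
    """Return the text layer of an OCRed PDF, or None if Poppler is unavailable."""
    if shutil.which(executable) is None:
        # Poppler is not installed (the Windows case raised above).
        return None
    result = subprocess.run(
        build_pdftotext_cmd(pdf_path, executable),
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8")
```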

@mscuthbert
Member

@erica02139 has been editing some important handwritten documents' OCR text by hand to get these important documents included. How do we ensure that they don't get overwritten the next time we run the OCR mechanism? I've made a few edits myself.

@srisi
Member

srisi commented May 2, 2019

> @erica02139 has been editing some important handwritten documents' OCR text by hand to get these important documents included. How do we ensure that they don't get overwritten the next time we run the OCR mechanism? I've made a few edits myself.

I've thought about that briefly. Erica's documents should be easy to track: we could add a column to the metadata sheet that is checked when we have hand-corrected text for a document, so that the text doesn't get overwritten when the OCR is re-run. We could maybe even store the hand-corrected OCR in the Google Sheet.
Your edits would be harder to track because (I think) they were more dispersed.
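
A minimal sketch of how such a flag could guard the text files, assuming the metadata sheet has been loaded into a dict keyed by document ID. The `hand_corrected` column name, the function name, and the directory layout are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_ocr_text(doc_id, ocr_text, metadata, out_dir):
    """Write fresh OCR text unless the document is flagged as hand-corrected.

    Returns True if the file was written, False if it was left untouched.
    """
    target = Path(out_dir) / f"{doc_id}.txt"
    flagged = bool(metadata.get(doc_id, {}).get("hand_corrected"))
    if flagged and target.exists():
        # Keep the manually edited text; do not overwrite it with new OCR output.
        return False
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(ocr_text, encoding="utf-8")
    return True


# Usage: a flagged document keeps its corrected text on a re-run.
with tempfile.TemporaryDirectory() as tmp:
    meta = {"doc_a": {"hand_corrected": True}, "doc_b": {"hand_corrected": False}}
    (Path(tmp) / "doc_a.txt").write_text("corrected by hand", encoding="utf-8")
    print(write_ocr_text("doc_a", "new ocr output", meta, tmp))  # False
    print(write_ocr_text("doc_b", "new ocr output", meta, tmp))  # True
```

Dispersed edits that were never flagged would still be overwritten under this scheme, so they would need to be identified and marked first.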
