Inconsistent OCR #350

equationcrunchor · 2019-04-29T21:41:03Z

Example:

Searching document text for "meme" returns http://127.0.0.1:8000/archives/doc/3_19_pmm_memo_re_709_1960_04_29_1_19 as the first result.
Looking at the PDF preview online, there is no "meme" in the text, only "memo". Highlighting the sentence "Status of programming memo and revision of machine shut-down date to late July." and copy-pasting it elsewhere gives the correct text.
Checking the OCR text in the data/processed_pdfs folder: it says "Status of programming meme", probably due to an OCR error.

Seems like PDF preview and search have different opinions on the OCR?
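
A quick way to confirm this kind of mismatch is to diff the two texts word by word. The following is only a minimal sketch using the standard library; the function name `word_differences` is made up for illustration, and it assumes both the PDF text layer and the search text are already available as strings.

```python
import difflib


def word_differences(pdf_text, search_text):
    """Return (pdf_words, search_words) pairs wherever the two texts disagree."""
    a = pdf_text.split()
    b = search_text.split()
    # Compare word lists rather than characters so each OCR slip shows up as one pair.
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs


if __name__ == "__main__":
    print(word_differences(
        "Status of programming memo and revision",
        "Status of programming meme and revision",
    ))
```

Running this over every document would give a rough list of where the preview and the search index disagree.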


srisi · 2019-04-30T13:43:10Z

That's unfortunately correct.
Tesseract is non-deterministic, meaning that OCR-ing the same document twice will lead to subtly different results (e.g. reading an "o" as an "e").
In our current pipeline, we OCR each document twice: the first time to generate the PDF, the second time to generate the text file for the search. I hadn't thought about that difference between text and PDF before, but clearly we should do it all in one pass. @samimak37 : Ideas? We could use PyPDF2 to extract the text once the document has been OCR'd (https://stackoverflow.com/questions/34837707/how-to-extract-text-from-a-pdf-file)

samimak37 · 2019-04-30T14:59:49Z

I think that's the best option at the moment. Because of how Tesseract handles the OCR, there is no way to guarantee the same result across different scans (although it is usually very consistent). I can work on the fix today.

samimak37 · 2019-04-30T17:50:11Z

I’ve been poking around with ocr.py and I’ve found a few things.

Tesseract apparently encodes the text within the PDF in a way that PyPDF2 cannot read. This results in a string that is nothing but newlines, and is therefore not very helpful.
Other packages exist that can pull text from PDF files, but most of them are very Unix-centric. I have had the most success with pdftotext, although it has Poppler as a dependency. This is not a terrible problem for macOS and Linux users, but there is no "easy" way to install Poppler on Windows.

Is this a good route to pursue? Other packages include textract and tika, but they have similar problems (it should also be noted that tika runs through a server, which massively increases runtime).
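
For reference, a rough sketch of the pdftotext route. It assumes the Poppler `pdftotext` command-line tool is on the PATH; the function names `extract_text_layer` and `is_blank_extraction` are hypothetical and not part of ocr.py. The blank check guards against the newline-only result described above.

```python
import shutil
import subprocess


def is_blank_extraction(text):
    """True when extraction produced nothing but whitespace (e.g. only newlines)."""
    return text.strip() == ""


def extract_text_layer(pdf_path, run=subprocess.run):
    """Return the text layer of an already-OCR'd PDF via the pdftotext CLI.

    `run` is injectable so the function can be exercised without Poppler installed.
    """
    if run is subprocess.run and shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext (Poppler) is not installed or not on PATH")
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = run(
        ["pdftotext", "-enc", "UTF-8", str(pdf_path), "-"],
        capture_output=True,
        check=True,
    )
    text = result.stdout.decode("utf-8", errors="replace")
    if is_blank_extraction(text):
        raise ValueError(f"no usable text layer found in {pdf_path}")
    return text
```

Because the search text would then come from the same PDF that the preview shows, the two should no longer disagree; the Windows install problem for Poppler still applies.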

mscuthbert · 2019-05-01T17:12:55Z

@erica02139 has been editing the OCR text of some important handwritten documents by hand to get them included. How do we ensure that these edits don't get overwritten the next time we run the OCR mechanism? I've made a few edits myself.

srisi · 2019-05-02T13:44:31Z

> @erica02139 has been editing some important handwritten document's OCR text by hand to get these important documents included. How do we ensure that they don't get overwritten next time we run the OCR mechanism? I've made a few edits myself.

I've thought of that briefly. Erica's documents should be easy to track, and we could add a column to the metadata sheet that's checked if we have hand-corrected text for a document, so that text doesn't get overwritten when the OCR is rerun. We could maybe even store the hand-corrected OCR in the Google Sheet.
Your edits would be harder to track because (I think) they were more dispersed.
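
A minimal sketch of that metadata flag, assuming the sheet is exported as CSV. The column names `doc_id` and `hand_corrected` and both function names are hypothetical, chosen only for illustration.

```python
import csv
import io

# Cell values treated as "checked" in the hypothetical hand_corrected column.
TRUTHY = {"true", "yes", "1", "x"}


def hand_corrected_ids(metadata_file):
    """Collect the ids of documents flagged as hand-corrected in the metadata CSV."""
    reader = csv.DictReader(metadata_file)
    return {
        row["doc_id"]
        for row in reader
        if (row.get("hand_corrected") or "").strip().lower() in TRUTHY
    }


def merge_ocr_text(doc_id, new_text, existing_text, protected):
    """Keep the existing text for protected documents, otherwise take the new OCR."""
    if doc_id in protected and existing_text is not None:
        return existing_text
    return new_text


if __name__ == "__main__":
    sheet = io.StringIO("doc_id,hand_corrected\nletter_a,x\nmemo_b,\n")
    protected = hand_corrected_ids(sheet)
    print(merge_ocr_text("letter_a", "fresh OCR", "fixed by hand", protected))
```

The OCR step would call `merge_ocr_text` before writing each text file, so a rerun only replaces text for documents that are not flagged.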

equationcrunchor added the "bug" label on Apr 29, 2019
