autodetect pdf type #343

Open
gregoribic opened this issue May 10, 2021 · 4 comments

@gregoribic

Is there already a way to detect whether a PDF is searchable (pdftotext) or an image (OCR, tesseract), and then use the appropriate method for text extraction?

@nayyhah

nayyhah commented Jan 23, 2022

This can be done with a simple if/else condition.
First try to extract the PDF with pdftotext; if the result is False, try again with OCR (tesseract):

result = extract_data(filename, templates=templates)
if not result:
    result = extract_data(filename, templates=templates, input_module=tesseract)

@manuel-barreiro

Hello, I'm trying to apply what you are saying, but I'm getting the following error: "NameError: name 'tesseract' is not defined"

It also happens when I set input_module to "pdftotext" or any of the other ones. invoice2data works well for me with normal PDFs, but in this case I'm trying to process a scanned PDF, which is why I need to specify tesseract as the input_module.

Hope you can help me.
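
A NameError like this is what Python raises when `tesseract` is used as a bare name that was never imported. The following is a minimal sketch of the fallback with the imports in place. It assumes the input modules can be imported from `invoice2data.input` and `extract_data` from `invoice2data`; verify those paths against your installed version. The helper name `extract_with_ocr_fallback` and its `extract`, `text_module`, and `ocr_module` parameters are hypothetical, added only so the logic can be exercised without the library.

```python
def extract_with_ocr_fallback(filename, templates=None, extract=None,
                              text_module=None, ocr_module=None):
    """Try the PDF text layer first, then fall back to OCR."""
    # Assumption: these import paths match the installed invoice2data
    # version. They are resolved lazily so stand-ins can be passed instead.
    if extract is None:
        from invoice2data import extract_data as extract
    if text_module is None:
        from invoice2data.input import pdftotext as text_module
    if ocr_module is None:
        from invoice2data.input import tesseract as ocr_module

    # First attempt: read the embedded text layer.
    result = extract(filename, templates=templates, input_module=text_module)
    if not result:
        # Nothing matched, so retry with OCR for scanned documents.
        result = extract(filename, templates=templates, input_module=ocr_module)
    return result
```

With invoice2data installed, it could be called as `extract_with_ocr_fallback("invoice.pdf", templates=templates)`, where the file name is a placeholder.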

@bosd
Collaborator

bosd commented Aug 26, 2022

Could be, but how would we handle corner cases?
I've got a couple of invoices where the company info is placed in a header image of the invoice.

The invoice lines part is the same in both:
Branch A --> Shows header image with Branch A business info
Branch B --> Shows header image with Branch B business info

(Or another company that issues invoices with its company info as a flat image, and the rest of the invoice as text.)

@bosd
Collaborator

bosd commented Aug 26, 2022

Previously there was a function in invoice2data that checked the PDF output. It was something like: if the output is less than 80 characters, fall back on Tesseract to OCR the PDF.
It was removed because of stability issues??
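
A rough sketch of that kind of length check, for illustration only. This is not the removed invoice2data function; the name `text_with_ocr_fallback` and its parameters are hypothetical, and only the 80-character threshold comes from the description above. The two readers are passed in as callables that take a path and return the extracted text; `pdftotext.to_text` and `tesseract.to_text` from `invoice2data.input` are likely candidates, but that is an assumption to check against your installed version.

```python
def text_with_ocr_fallback(path, text_reader, ocr_reader, min_chars=80):
    """Return (text, source), where source is "text" or "ocr"."""
    text = text_reader(path)
    # Some readers may return bytes; normalise to str before measuring.
    if isinstance(text, bytes):
        text = text.decode("utf-8", "ignore")
    # Too little text suggests a scanned PDF without a usable text layer.
    if len(text.strip()) < min_chars:
        return ocr_reader(path), "ocr"
    return text, "text"
```

A fixed threshold like this would not catch the mixed case mentioned earlier, where only the header is an image and the rest of the invoice is text.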

Maybe this does not need to be solved in invoice2data,
as pdfminer supports hOCR now.
pdfminer/pdfminer.six#651

Maybe we need to update the documentation on how to use it.
