Corrects text extracted from PDF files. The PDF is typically an OCR of scanned paper.

Shoresh613/proofreadTextFromPDF

proofreadTextFromPDF

Proofreads text extracted from PDF files. The path to the files is specified in the PDFpath variable. By default the proofreading is done in Swedish; this can be changed in the prompt sent to OpenAI. You can also change what "Page" is called in your language through the variable page_name.

Initially, the idea was to increase the likelihood of finding what you search for in a large library of PDF files where the OCR process has been less than perfect.

Put all PDF files in a folder named PDF, or specify the path in the variable PDFpath.
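As a rough illustration of the folder and page-label settings, the sketch below collects the PDF files from the configured folder and prefixes each page's text with the page label. The names PDFpath and page_name come from this README; the helper functions find_pdfs and label_pages, the value "Sida" (Swedish for "Page"), and the way the label is attached to the text are assumptions made for the example, not necessarily how the script does it.

```python
from pathlib import Path

PDFpath = "PDF"      # folder name taken from the README
page_name = "Sida"   # assumed value: Swedish for "Page"


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def label_pages(pages, label=page_name):
    """Join page texts, prefixing each with its label and 1-based number."""
    return "\n\n".join(
        f"{label} {number}\n{text}"
        for number, text in enumerate(pages, start=1)
    )
```

The page texts themselves would come from whatever PDF extraction library the script uses; that step is left out here because the README does not name one.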

The temperature is set to 0.1, as that is what is used in the OpenAI example for correcting grammar.
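To show where the temperature and the language setting would sit, here is a minimal sketch that only assembles the request parameters and does not call the API. The function build_request, the prompt wording, and the max_tokens value are assumptions for illustration; only the temperature of 0.1, the Swedish default, and the model name are taken from this README.

```python
def build_request(text, language="Swedish", model="text-davinci-003"):
    """Assemble completion parameters for proofreading one chunk of text."""
    # Hypothetical prompt wording; changing `language` here is how the
    # proofreading language would be switched.
    prompt = f"Correct this to standard {language}:\n\n{text}"
    return {
        "model": model,
        "prompt": prompt,
        "temperature": 0.1,   # low value keeps the output close to the input
        "max_tokens": 1024,   # assumed limit, not stated in the README
    }
```

A low temperature makes the model's output more deterministic, which suits correction tasks where the text should change as little as possible.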

You need an OpenAI API key set as an environment variable. Remember that OpenAI charges for the processing. The cost can be reduced substantially by using the gpt-3.5-turbo model instead and adapting the code accordingly: the price would then drop from $0.12 / 1K tokens to $0.002 / 1K tokens, a factor of about 60.
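The price difference can be made concrete with a small calculation. The sketch below uses the two per-1K-token prices quoted in this README; the function estimate_cost and the one-million-token workload are hypothetical, and actual prices may have changed since this was written.

```python
# Prices per 1K tokens as quoted in this README (USD).
PRICE_CURRENT = 0.12
PRICE_GPT_35_TURBO = 0.002


def estimate_cost(tokens, price_per_1k):
    """Return the estimated cost in USD for processing `tokens` tokens."""
    return tokens / 1000 * price_per_1k


# Hypothetical workload: one million tokens of extracted text.
tokens = 1_000_000
current = estimate_cost(tokens, PRICE_CURRENT)
cheaper = estimate_cost(tokens, PRICE_GPT_35_TURBO)
print(f"current model: ${current:.2f}")
print(f"gpt-3.5-turbo: ${cheaper:.2f}")
print(f"ratio:         {current / cheaper:.0f}x")
```

Note that tokens are counted for both the prompt and the returned text, so proofreading a page costs roughly twice its own token count.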

Unfortunately, the results are disappointing even with the best text completion model (text-davinci-003).
