proofreadTextFromPDF

Proofreads text extracted from PDF files. The path to the files is specified in the PDFpath variable. By default the proofreading is done in Swedish; this can be changed in the prompt sent to OpenAI. You can also change what "Page" is called in your language through the variable page_name.
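
As a rough illustration of the two language settings, the sketch below builds a proofreading prompt and a page heading. The function names, the prompt wording, and the default value "Sida" (Swedish for "page") are assumptions made for this example; the actual script may phrase its prompt and use page_name differently.

```python
def build_prompt(text: str, language: str = "Swedish") -> str:
    """Return a proofreading instruction followed by the text to correct.

    The instruction wording is hypothetical, not the script's actual prompt.
    """
    return (
        f"Correct the spelling and grammar of the following {language} text. "
        "Keep the meaning and the layout unchanged.\n\n"
        f"{text}\n"
    )


def label_page(page_number: int, page_name: str = "Sida") -> str:
    """Return a page heading in the chosen language, e.g. page_name + number."""
    return f"{page_name} {page_number}"


if __name__ == "__main__":
    print(label_page(3))
    print(build_prompt("Detta är en exmpel.", language="Swedish"))
```

Switching to another language would then only require passing a different language and page_name, for example `build_prompt(text, language="English")` and `label_page(3, page_name="Page")`.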

The original idea was to increase the likelihood of finding what you search for in a large library of PDF files where the OCR process has been less than perfect.

Put all PDF files in a folder named PDF, or specify the path in the variable PDFpath.
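
A minimal sketch of how the PDF files in that folder could be collected, assuming the folder is given as a plain path string. The helper name find_pdfs is hypothetical and is not taken from the script.

```python
from __future__ import annotations

from pathlib import Path


def find_pdfs(pdf_path: str = "PDF") -> list[Path]:
    """Return the PDF files directly inside pdf_path, sorted by path."""
    folder = Path(pdf_path)
    if not folder.is_dir():
        raise FileNotFoundError(f"PDF folder not found: {folder}")
    # Match the extension case-insensitively so "scan.PDF" is included too.
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    for pdf in find_pdfs("PDF"):
        print(pdf.name)
```

Subfolders are not searched in this sketch; `folder.rglob("*")` could replace `folder.iterdir()` if a nested library needs to be covered.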

The temperature is set to 0.1, as that is what is used in the OpenAI example for correcting grammar.
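
The request settings might be gathered in one place along these lines. This is only a sketch: the helper name completion_params and the max_tokens default are assumptions, and the dictionary is built without calling any API so that the example stays self-contained.

```python
def completion_params(
    prompt: str,
    model: str = "text-davinci-003",
    temperature: float = 0.1,
    max_tokens: int = 1024,  # assumed value, not taken from the script
) -> dict:
    """Collect the settings for one proofreading request."""
    # The OpenAI API accepts temperatures from 0 to 2; low values give
    # more deterministic output, which suits grammar correction.
    if not 0.0 <= temperature <= 2.0:
        raise ValueError("temperature must be between 0 and 2")
    return {
        "model": model,
        "prompt": prompt,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


if __name__ == "__main__":
    print(completion_params("Correct this text: ..."))
```

Keeping the temperature as a parameter makes it easy to experiment with other values while leaving 0.1 as the default.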

You need an OpenAI API key stored in an environment variable. Remember that OpenAI charges for the processing. The cost can be reduced by orders of magnitude by using the gpt-3.5-turbo model instead and adapting the code accordingly. The cost would then go down to $0.002 / 1K tokens instead of $0.12 / 1K tokens.
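
To make the key handling and the cost comparison concrete, here is a small sketch. It assumes the key is stored in OPENAI_API_KEY, the name the OpenAI Python library reads by default; the helper names and the token count in the usage example are made up for illustration, and the prices are the ones quoted above.

```python
import os


def get_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Read the API key from the environment, failing early if it is missing."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def estimate_cost(tokens: int, usd_per_1k_tokens: float) -> float:
    """Estimate the cost in USD for processing the given number of tokens."""
    return tokens / 1000 * usd_per_1k_tokens


if __name__ == "__main__":
    tokens = 500_000  # hypothetical size of a PDF library, in tokens
    print("At $0.12 per 1K tokens:", estimate_cost(tokens, 0.12))
    print("At $0.002 per 1K tokens:", estimate_cost(tokens, 0.002))
```

The estimate only counts tokens once; in practice both the prompt and the corrected output are billed, so the real cost for proofreading is roughly double the input size.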

Unfortunately, the results are disappointing using the best text completion model (text-davinci-003).

