PDF OCR Inspector

This one-off script scans the .pdf files in a directory and looks for bad OCR text in them.

Every scanned file is then listed, along with a few metrics, in an output report.

Installation

Python 3.8 or later is required to run this script. It has not been tested with older versions.

Once Python is installed, download the master.zip archive and extract its contents into a working directory.

Open a Terminal or PowerShell window in the pdf_ocr_inspector/ directory and run the following command to install the script's dependencies:

$ pip install -r requirements.txt

To make sure everything is installed properly, run the following command. You should see this output:

$ python inspector.py -V
> PDF OCR Inspector version 0.3

Usage

To run this script, enter the following command in a Terminal or PowerShell window opened in the pdf_ocr_inspector/ directory:

$ python inspector.py path_to_directory [-v]

path_to_directory must be a valid path enclosed in quotes;
-v is an optional parameter to increase verbosity in the Terminal output.
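
As an illustration of the command line described above, here is a minimal argparse sketch. It is not taken from inspector.py: the function name build_parser, the dest name verbose, and the example path are assumptions made for this example. Only the positional path_to_directory, the -v flag, and the -V version output come from this README.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented command line (illustrative only)."""
    parser = argparse.ArgumentParser(prog="inspector.py")
    # Positional argument: the directory whose .pdf files are scanned.
    parser.add_argument("path_to_directory")
    # Optional flag: more detailed Terminal output. The dest name is an assumption.
    parser.add_argument("-v", action="store_true", dest="verbose")
    # -V prints the version string and exits.
    parser.add_argument("-V", action="version", version="PDF OCR Inspector version 0.3")
    return parser


# The shell strips the quotes, so a path containing spaces arrives as one argument.
args = build_parser().parse_args(["C:/scans/my pdfs", "-v"])
print(args.path_to_directory, args.verbose)
```

Enclosing the path in quotes matters mostly when it contains spaces: without them, the shell would split it into several arguments and the parser would reject the extra ones.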

The script goes through each .pdf file, extracts its text, and looks for a particular pattern in that text. Characters matching the pattern are considered "badly encoded". The following metrics are computed from each file's OCR text:

the total number of characters;
the total number of bad characters;
the percentage of bad characters.
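
The three metrics can be sketched as follows. This README does not show the actual pattern used by inspector.py, so the sketch assumes, purely for illustration, that a "bad" character is the Unicode replacement character U+FFFD. The names BAD_CHAR_PATTERN and ocr_metrics, the dictionary keys, and the choice to count whitespace in the total are likewise assumptions.

```python
import re

# Assumption: the real pattern is defined in inspector.py and is not shown here.
# The Unicode replacement character U+FFFD stands in for a "bad" character.
BAD_CHAR_PATTERN = re.compile("\ufffd")


def ocr_metrics(text: str) -> dict:
    """Return the three metrics for one file's extracted text."""
    total = len(text)  # counts every character, whitespace included
    bad = len(BAD_CHAR_PATTERN.findall(text))
    # Avoid dividing by zero when no text could be extracted.
    percentage = 100 * bad / total if total else 0.0
    return {"total_chars": total, "bad_chars": bad, "bad_percentage": percentage}


print(ocr_metrics("ab\ufffdd"))
# {'total_chars': 4, 'bad_chars': 1, 'bad_percentage': 25.0}
```

A file with no extractable text reports 0.0 percent here rather than raising an error. Whether the script treats that case as a scanning failure instead is not stated in this README.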

When the script finishes, it creates an Excel file containing all these metrics. It also creates a log file listing every file for which scanning failed.

To-do list:

Add concurrency to speed up the scanning process;
Create a logger to list all errors encountered in files.
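
One possible shape for both to-do items, offered as a sketch rather than a plan for the script: the standard library's concurrent.futures can scan files in parallel, and logging can record each failure. The names scan_all and scan_file, the logger name, and the non-recursive "*.pdf" glob are all hypothetical. scan_file stands for whatever function extracts and measures one file.

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path

logger = logging.getLogger("pdf_ocr_inspector")  # hypothetical logger name


def scan_all(directory, scan_file, max_workers=4):
    """Run scan_file on every .pdf in directory; return (results, failed paths)."""
    results, failures = {}, []
    # Assumption: only the top level of the directory is scanned.
    pdf_paths = sorted(Path(directory).glob("*.pdf"))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_file, path): path for path in pdf_paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception:
                # Record the failure with its traceback and keep going.
                logger.exception("Scanning failed for %s", path)
                failures.append(path)
    return results, failures
```

Threads are used here for simplicity. If text extraction turns out to be CPU-bound, a ProcessPoolExecutor may give a better speed-up, provided scan_file is defined at module level so it can be sent to worker processes.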
