A tool to help (semi-)automatically find typos. By default, it uses Wikipedia as its source of likely typos.
For now, the code is run via cloning from GitHub (a PR to make this pip-installable would be welcomed).
git clone https://github.com/bwignall/typochecker.git
cd typochecker
The code currently requires access to certain UK and US dictionaries.
sudo apt-get install wamerican-huge wbritish-huge
python -m typochecker.corrector --dir BASE_DIRECTORY
Example: python corrector.py -d path/to/dest/folder
(This assumes that `corrector.py` is in your Python path.)
This will traverse `BASE_DIRECTORY` and its subdirectories.
make -f path/to/typochecker/Makefile
Example (explicit): cd path/to/my/folder ; git ls-files | python path/to/my/git/src/typochecker/corrector.py
For either method, this will iterate through the files found, cross-reference the list of likely typos, and prompt the user for how they would like to handle each potential typo. Files will be modified, but no Git commits happen automatically.
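As a rough illustration of the two ways files can be supplied (a directory to traverse, or a list of paths piped in from `git ls-files`), here is a minimal sketch. The function names `files_under` and `files_from_stream` are hypothetical and are not taken from the typochecker source; the actual tool may gather files differently.

```python
import os
from typing import Iterator, TextIO


def files_under(base_dir: str) -> Iterator[str]:
    """Yield every file path under base_dir, including subdirectories."""
    for root, _dirs, names in os.walk(base_dir):
        for name in sorted(names):
            yield os.path.join(root, name)


def files_from_stream(stream: TextIO) -> Iterator[str]:
    """Yield one path per non-empty line, e.g. the output of `git ls-files`."""
    for line in stream:
        path = line.strip()
        if path:
            yield path
```

Passing `sys.stdin` to `files_from_stream` would correspond to the piped invocation, while `files_under("BASE_DIRECTORY")` would correspond to the `--dir` form.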
- To accept the suggestion, enter `/`.
- To ignore the suggestion and keep the existing text, press `Enter`.
- To ignore the "typo" for the remainder of the session, enter `!i`.
- For help, enter `!h`.
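The responses listed above could be dispatched along the following lines. This is only a sketch: `apply_response` is a hypothetical name, and treating any other input as a literal replacement is an assumption rather than documented behavior of the tool.

```python
from typing import Optional, Set


def apply_response(word: str, suggestion: str, response: str,
                   ignored: Set[str]) -> Optional[str]:
    """Return the replacement text for word, or None to leave it unchanged."""
    if response == "/":
        return suggestion          # accept the suggested correction
    if response == "":
        return None                # bare Enter: keep the existing text
    if response == "!i":
        ignored.add(word)          # skip this word for the rest of the session
        return None
    if response == "!h":
        # A real prompt loop would presumably show help and ask again.
        print("/ = accept, Enter = keep, !i = ignore for session, !h = help")
        return None
    # Assumption: any other input is used as a literal replacement.
    return response
```

Keeping the ignored words in a set that the caller owns means the "ignore for the session" state lasts exactly as long as the calling loop does.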
Not all nominal typos are genuine typos. For example, your domain may use terminology that has spelling similar to some non-technical words. To have those not flagged as typos, you can specify whitelisted words. These may be specified directly on the command invocation, via a file, or both.
Examples, direct:
python -m typochecker.corrector --dir BASE_DIRECTORY -w exampleone
python -m typochecker.corrector --dir BASE_DIRECTORY -w exampleone -w exampletwo
Examples, whitelist file:
python -m typochecker.corrector --dir BASE_DIRECTORY -W fileone
python -m typochecker.corrector --dir BASE_DIRECTORY -W fileone -W filetwo
Examples, mixed:
python -m typochecker.corrector --dir BASE_DIRECTORY -w wordone -W fileone
python -m typochecker.corrector --dir BASE_DIRECTORY -w wordone -w wordtwo -W fileone
python -m typochecker.corrector --dir BASE_DIRECTORY -w wordone -W fileone -W filetwo
Note that words are listed explicitly via `-w` (i.e., lowercase) and files via `-W` (i.e., uppercase); long-form options exist for both; add `-h`/`--help` for details.
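To show how repeated `-w` and `-W` options might be combined into a single whitelist, here is a small `argparse` sketch. Only the short flags come from this README; the `dest` names, the one-word-per-line file format, and the lowercasing are assumptions made for illustration, and the long-form option names are deliberately left out because they are not given here.

```python
import argparse
from typing import Iterable, Set


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # Each repetition of a flag appends to its list.
    # The dest names are hypothetical.
    parser.add_argument("-w", dest="words", action="append", default=[])
    parser.add_argument("-W", dest="word_files", action="append", default=[])
    return parser


def collect_whitelist(words: Iterable[str], word_files: Iterable[str]) -> Set[str]:
    """Merge directly listed words with words read from whitelist files."""
    allowed = {w.lower() for w in words}
    for path in word_files:
        # Assumption: whitelist files hold one word per line.
        with open(path, encoding="utf-8") as handle:
            allowed.update(line.strip().lower() for line in handle if line.strip())
    return allowed
```

For example, `build_parser().parse_args(["-w", "wordone", "-W", "fileone"])` yields `words == ["wordone"]` and `word_files == ["fileone"]`, which can then be passed to `collect_whitelist`.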
The tool splits on non-alphabetical characters, so a typo like `doens't` is matched only as `doens` and should be corrected by entering `doesn`. The trailing `'t` is not part of the match and so is not replaced; entering `doesn't` would produce `doesn't't` in the resulting text.
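The effect of splitting on non-alphabetical characters can be reproduced with a short regular-expression sketch. This is not the tool's actual implementation; `WORD` and `replace_word` are hypothetical names, and the pattern assumes that "alphabetical" means ASCII letters.

```python
import re

# Assumption: a "word" is a maximal run of ASCII letters.
WORD = re.compile(r"[A-Za-z]+")


def replace_word(text: str, typo: str, replacement: str) -> str:
    """Replace alphabetic runs equal to typo; all other text is left untouched."""
    return WORD.sub(
        lambda m: replacement if m.group(0) == typo else m.group(0),
        text,
    )


line = "It doens't work."
print(WORD.findall(line))                      # ['It', 'doens', 't', 'work']
print(replace_word(line, "doens", "doesn"))    # It doesn't work.
print(replace_word(line, "doens", "doesn't"))  # It doesn't't work.
```

Because the apostrophe ends the match, the `'t` that follows `doens` survives any replacement, which is why the correction should stop at `doesn`.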
The original list of typos was based on a general-purpose list from Wikipedia. The typochecker codebase contains another list of typos, based on scanning some large and well-used codebases. If your own work or codebase differs from those, it may not be well represented by the data used to generate these lists. In that case, you can run a provided script that applies heuristics to sniff out possible typos in your work.
# Minimal invocation:
python -m typochecker.levenshtein_corrector BASE_DIRECTORY
# Invocation to remove often-unhelpful suggestions
python -m typochecker.levenshtein_corrector --ignore-appends --ignore-prepends BASE_DIRECTORY
This will generate a file, which then needs to be folded into a list of typos known to the program:
make data/extra_endings.txt
The `corrector` script may then be run as usual, as described above.
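The module name suggests that the heuristics rest on Levenshtein (edit) distance. The sketch below shows one way such a heuristic could work: pair each rare word with any common word within a small edit distance. All function names are hypothetical, and reading `--ignore-appends` and `--ignore-prepends` as "skip pairs that differ only by extra trailing or leading characters" is an assumption, not something this README confirms.

```python
from typing import Iterable, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def is_append(x: str, y: str) -> bool:
    """True when one word is the other plus extra trailing characters."""
    return x != y and (x.startswith(y) or y.startswith(x))


def is_prepend(x: str, y: str) -> bool:
    """True when one word is the other plus extra leading characters."""
    return x != y and (x.endswith(y) or y.endswith(x))


def candidate_typos(rare_words: Iterable[str], common_words: Iterable[str],
                    max_distance: int = 1, ignore_appends: bool = False,
                    ignore_prepends: bool = False) -> List[Tuple[str, str]]:
    """Pair each rare word with the common words it closely resembles."""
    common = list(common_words)
    pairs = []
    for rare in rare_words:
        for word in common:
            if rare == word or levenshtein(rare, word) > max_distance:
                continue
            if ignore_appends and is_append(rare, word):
                continue
            if ignore_prepends and is_prepend(rare, word):
                continue
            pairs.append((rare, word))
    return pairs
```

Under this reading, a pair such as `tests`/`test` would be dropped with `ignore_appends=True`, since plurals and similar suffixes are rarely genuine typos, while `functon`/`function` would still be reported.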
The tool uses information from Wikipedia as a source of useful typos to check for. Note that some typos (e.g., "wich") have multiple potential corrections.
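A typo with several candidate corrections fits naturally into a mapping from each typo to a list of suggestions. The sketch below assumes a `typo->correction1, correction2` line format; that format, the function name `parse_typo_list`, and the sample entries are illustrative assumptions rather than the tool's actual data or parser.

```python
from typing import Dict, Iterable, List


def parse_typo_list(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each typo to its list of candidate corrections."""
    table: Dict[str, List[str]] = {}
    for line in lines:
        typo, sep, fixes = line.strip().partition("->")
        if not sep:
            continue  # skip lines without the assumed "->" separator
        table[typo] = [fix.strip() for fix in fixes.split(",") if fix.strip()]
    return table


# Illustrative sample data, not copied from the tool's typo lists.
sample = ["wich->which, witch", "teh->the"]
table = parse_typo_list(sample)
print(table["wich"])  # ['which', 'witch']
```

Keeping every candidate in a list lets the prompt show all of them instead of silently picking one.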