The “dictionary” is collected from the texts of multiple novels, like the Harry Potter series, The Hunger Games, and The Chronicles of Narnia, which you can find in the fileDict directory (I’m a big book nerd!).
NOTE: This is still a beta version (there might be some hiccups; you obviously need a dictionary of thousands of documents to make this more accurate)!
Storage of Corpus and Execution Time are the main pain points.
So, I used a caching decorator (i.e., @cache, imported with from functools import cache) for functions that are called multiple times.
I want to attempt using Redis and SQLAlchemy to cache data.
Caching is a powerful technique to improve the performance of applications by storing frequently accessed data in a fast, in-memory storage system. Redis, a popular in-memory data store, provides excellent support for caching in Python applications.
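As a quick sketch of the functools approach described above: the decorated function below is a hypothetical stand-in for any expensive function in the project that gets called repeatedly with the same arguments — the first call computes, later calls return instantly from the cache.

```python
from functools import cache

# @cache memoizes results keyed by the arguments, so repeated calls
# with the same input skip the expensive recomputation entirely.
@cache
def fib(n: int) -> int:
    """Classic demo: each value is computed exactly once."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)

print(fib(30))  # fast, because intermediate results are cached
```

The same decorator works for any pure function whose arguments are hashable.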
Download the GitHub zip file.
Input this command in the terminal inside the spellChecker directory:
python spellCheck.py [txt file that you want spellchecked]
For a word that is incorrectly spelled in the txt file, it will print something like this into the terminal:
Instead of [incorrectly spelled word], did you mean?
First choice for correct spelling
Second choice
Third choice
…
In this model, the goal is to find the intended word given a word where the letters have been scrambled in some manner.
The correct spelling of a word will be the one with the highest probability under the Noisy Channel Model.
I went through every document and word in the corpus and created an inverted index.
I removed the documents' unnecessary whitespace and the leading/trailing funky characters like "!@#$%^&".
This was the structure of the index:
{
word: {file #: [indexes of positions where word is found]}
}
Example:
{
"this": {4: [3, 5, 70], 6: [4, 45]},
"tomorrow": {41: [3, 54, 70], 62: [42, 45]}
}
I encoded this inverted index using delta encoding to reduce space.
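Delta encoding stores each posting list as its first position followed by the gaps between consecutive positions; since gaps are small numbers, they compress well. A minimal sketch (helper names are mine, not necessarily the project's):

```python
def delta_encode(positions):
    """[3, 5, 70] -> [3, 2, 65]: first value, then gaps."""
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

def delta_decode(deltas):
    """Rebuild the original positions by running-summing the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out

# Using the posting list for "this" in file 4 from the example above:
print(delta_encode([3, 5, 70]))  # [3, 2, 65]
```

Decoding is just a cumulative sum, so lookups stay cheap.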
A dictionary that holds the times each document was last edited. If a document was last edited very recently, there might be a higher chance that this document is more popular and more reliable for correct spelling than a document that was last edited months ago.
Thus, if a word comes from many recently edited documents, we want to place a higher weight on it than a word that does not.
The lastEdit values are used when determining popularity.
The structure of lastEdit:
{file #: current time - last edited time, file #: current time - last edited time}
This dictionary is sorted by ascending value.
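One way to build such a dictionary is from filesystem modification times — a sketch under the assumption that documents are identified by their index in a file list (the function name is illustrative, not the project's):

```python
import os
import time

def build_last_edit(file_paths):
    """Map file # -> seconds since last edit, sorted ascending so the
    most recently edited (and so most heavily weighted) files come first."""
    now = time.time()
    ages = {i: now - os.path.getmtime(p) for i, p in enumerate(file_paths)}
    return dict(sorted(ages.items(), key=lambda kv: kv[1]))
```

Since Python 3.7 dicts preserve insertion order, so the sorted ordering survives the `dict(...)` conversion.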
I’m creating a dictionary where words with the same phonetic code are grouped together.
The phonetic code of a word is calculated with the Metaphone algorithm, using this Python library.
For example, the dictionary might look like this:
{"PJTR": ["bajador", "begetter", "budgetary"]}
Structure: {3: ["cat", "bat"], 5: ["hello", "pinky"]}
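Both groupings (by phonetic code and by length) are instances of the same pattern: bucket words under a key function. A sketch with a pluggable key — in the project the key would be the Metaphone code from the phonetic library, which I stand in for here with a trivial placeholder:

```python
from collections import defaultdict

def group_by(words, key):
    """Bucket words under key(word); works for phonetic codes, lengths, etc."""
    groups = defaultdict(list)
    for word in words:
        groups[key(word)].append(word)
    return dict(groups)

# Grouping by length, matching the Structure example above:
print(group_by(["cat", "bat", "hello", "pinky"], len))
# {3: ['cat', 'bat'], 5: ['hello', 'pinky']}
```

Swapping `len` for a Metaphone function yields the phonetic-code dictionary instead.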
The popularity of a word from the inverted dictionary
Words that start with the same letter as the incorrectly spelled word (with a length difference of three at most)
I read every word from the txt file the user inputted and stripped any leading/trailing funky characters like “!#@$%^&”.
I created an array called iDict that will hold all the correct spellings of the incorrectly spelled word in a ranked order.
This is what I’m returning back to the user.
Structure of iDict:
[[word_1, Levenshtein distance], [word_2, Levenshtein distance]]
I find the phonetic code of the word, and if that code doesn’t exist in the phonetic code dictionary I made, the Python library I’m using provides a secondary code. If the secondary code doesn’t exist, then I skip this step.
Find the Levenshtein distance between the incorrectly spelled word and all words with the same phonetic code.
Sort the distances in iDict in ascending order.
Find the Levenshtein distance between the incorrectly spelled word and all words with the same starting letter.
Sort the distances in iDict in ascending order.
Find the Levenshtein distance between the incorrectly spelled word and all words with the same length.
Sort the distances in iDict in ascending order.
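The baseline distance used in these steps is the classic Levenshtein edit distance. A compact dynamic-programming sketch (the project additionally weights costs by keyboard distance, as described later):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```

Candidates are then sorted by this distance, smallest first.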
Filter the iDict array: whenever two adjacent elements differ by less than .01 in Levenshtein distance,
- compare the popularity of both
- compare the Levenshtein distance of both (with a more detailed algorithm)
Let's say we have an iDict like this:
[[A, 1.0], [B, 1.0], [C, 2.0]]
Since A and B are less than .01 away from each other, one must go. If the Levenshtein distance of B to the incorrectly spelled word is less than the distance of A to the incorrectly spelled word, then the resulting iDict will look like this:
[[B, 1.0], [C, 2.0]]
But let's say that A is an insanely popular word (i.e., popularity(A) * 0.1 > popularity(B)), so there is a chance that the user meant to type A even though it might be farther from the misspelled word than B is.
Thus, in that case, the resulting iDict will look like this:
[[A, 1.0], [C, 2.0]]
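The filtering rule above can be sketched as a single pass over the sorted list. This is a simplification: it keeps the later of two near-tied candidates unless the earlier one is far more popular (the real filter also re-compares them with the more detailed keyboard-aware distance described next). The function name and the `popularity` mapping are illustrative, and the 0.1 factor mirrors the rule stated above:

```python
def filter_close_pairs(idict, popularity, eps=0.01):
    """Collapse adjacent candidates whose distances differ by < eps.

    idict: list of [word, distance] pairs sorted by distance.
    popularity: word -> popularity score from the inverted index.
    """
    result = []
    for word, dist in idict:
        if result and abs(dist - result[-1][1]) < eps:
            prev_word, _ = result[-1]
            # Keep the earlier word only if it is overwhelmingly popular.
            if popularity.get(prev_word, 0) * 0.1 > popularity.get(word, 0):
                continue
            result[-1] = [word, dist]
        else:
            result.append([word, dist])
    return result
```

With popularity {A: 100, B: 5} this keeps A; with {A: 5, B: 100} it keeps B, matching the two scenarios above.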
You made it this far? Congrats.
Whatever is left in the iDict is then displayed to the user. This is repeated for every word in the txt file.
Levenshtein Algorithm More Detailed
Let's say we have an incorrect word, "wdsh", and two options for the potentially correct spelling, "wash" and "wish".
There is a higher chance that the user meant to type in "wash" because "a" is closer to "d" than it is to "i".
So, let's think about the distance between characters on a keyboard when figuring out the cost of operations in this algorithm.
In the traditional Levenshtein algorithm, the costs of insertion, replacement, and deletion are all the same.
But suppose we had an incorrect word, "lon", and two options for the correct spelling, "lion" and "lan".
The cost of inserting an "i" should be less than the cost of replacing "o" with an "a" ("i" is closer to "o" on the keyboard, while "a" is on the other side of the keyboard).
As seen here, I'm calculating the distance between two keys(only focusing on the alphabet keys) on the keyboard.
To do this, I'm turning the alphabet keys into an x and y coordinate graph.
For example, the distance between "f" (0.9, 1) and "z" (0, 2) is ((2-1)^2 + (0-0.9)^2)^0.5.
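A sketch of the keyboard-coordinate idea: assign each letter an (x, y) position by QWERTY row and column, then take the Euclidean distance. The row offsets below (0.0, 0.4, 0.9) are my illustrative values and do not exactly reproduce the project's coordinates quoted above:

```python
import math

# Approximate QWERTY layout: each row sits slightly right of the one
# above it. Offsets here are illustrative assumptions.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
OFFSETS = [0.0, 0.4, 0.9]

KEY_POS = {
    ch: (OFFSETS[row] + col, row)
    for row, keys in enumerate(ROWS)
    for col, ch in enumerate(keys)
}

def key_distance(a: str, b: str) -> float:
    """Euclidean distance between two alphabet keys on the layout above."""
    (x1, y1), (x2, y2) = KEY_POS[a], KEY_POS[b]
    return math.hypot(x2 - x1, y2 - y1)

print(key_distance("f", "g"))  # 1.0 -- adjacent keys on the home row
```

This per-key distance can then replace the flat unit cost for substitutions in the Levenshtein recurrence, so typing a neighboring key is penalized less than reaching across the keyboard.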