Similar

A canonicalizer attempts to convert low quality database entries with known canonical target names into those target names.

When data entry personnel enter data they will contract, abbreviate, acronym, misspell, and make mistakes such as transposing, skipping, or adding letters, or having their fingers over the wrong keys.

A canonicalizer works with a limited set of target names and computes the least statistical distance between the entered data and a specific canonical target name.

This canonicalizer uses multiple algorithms to compute multiple statistical distances and uses a least product of these distances to choose a best fit to a target name.

Before this code was deployed the conversion rate for target names was about 17%. After this code was deployed the conversion rate rose to above 80%. The remainder of the failures was submitted for hand-canonicalization off-shore.

It uses (or has hooks for) the following algorithms:

acronym: "Massachusetts Institute of Technology" intended for "MIT".
Levenshtein: "Massachusetts" intended by "Masachusets", "Masssachusetts", or "Massahcusetts".
fat finger: a finger strikes a key adjacent to an intended key by accident.
contraction: "Massachusetts Institute of Technology" for "Mass Inst Tech", or "International" for "Int'l."
soundex: two words are pronounced the same, 1st algorithm.
metaphone: two words are pronounced the same, 2nd algorithm.
NYSSIS: two words are pronounced the same, 3rd algorithm.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
Makefile		Makefile
README.md		README.md
Similar.py		Similar.py
boards.csv		boards.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Similar

About

Releases

Packages

Languages

jlettvin/Similar

Folders and files

Latest commit

History

Repository files navigation

Similar

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages