This repository has been archived by the owner on Nov 2, 2021. It is now read-only.

Core: Match fields that are not the name. #5

Open
jeremyjbowers opened this issue Jun 19, 2017 · 1 comment

Comments

@jeremyjbowers
Contributor

Perhaps add a match_field that gets matched against instead of the name? It could include the name but also other things, e.g., address. We might also consider splitting attributes like address, or at least city / state and other common attributes, into their own fields to make matching cleaner.

@SHewitt95
Contributor

The route taken here seems to depend on whether we use FuzzyWuzzy or Dedupe for the entity service. FuzzyWuzzy compares individual strings to compute a score, while Dedupe can compare any number of fields across records, compute scores, and cluster similar entities based on training data. The difference between the two libraries points to a bigger question: how will the incoming data be formatted?

The FuzzyWuzzy route could combine all the fields of an object into a single large string and compare that to a similar blob built from each database entry. Alternatively, it could match field by field and derive an overall score from the results (maybe an average of all the field scores). A new database entry is created only when the incoming JSON data scores as different enough from every existing entry.
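To make the two FuzzyWuzzy options concrete, here is a minimal sketch of both scoring strategies. It uses the standard library's difflib.SequenceMatcher as a stand-in so that it runs without extra dependencies; FuzzyWuzzy's fuzz.ratio could be swapped in for the similarity helper. The function names and the 85-point threshold are hypothetical and not something we have agreed on.

```python
from difflib import SequenceMatcher

THRESHOLD = 85.0  # hypothetical cutoff on a 0-100 scale; would need tuning


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score for two strings, ignoring case."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blob_score(incoming: dict, existing: dict) -> float:
    """Option 1: join every field into one string and compare the blobs."""
    fields = sorted(set(incoming) | set(existing))
    blob_a = " ".join(str(incoming.get(f, "")) for f in fields)
    blob_b = " ".join(str(existing.get(f, "")) for f in fields)
    return similarity(blob_a, blob_b)


def field_average_score(incoming: dict, existing: dict) -> float:
    """Option 2: score each shared field separately and average the scores."""
    shared = [f for f in incoming if f in existing]
    if not shared:
        return 0.0
    scores = [similarity(str(incoming[f]), str(existing[f])) for f in shared]
    return sum(scores) / len(scores)


def is_duplicate(incoming: dict, existing: dict, threshold: float = THRESHOLD) -> bool:
    """Treat the incoming entity as a duplicate when the average score is high enough."""
    return field_average_score(incoming, existing) >= threshold
```

The blob approach is simpler, but a long field such as an address can swamp a short one such as a name. Averaging per-field scores weights each field equally, which may or may not be what we want.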

The Dedupe route would have us venture into machine learning, where the algorithm acts on training data that we provide. The database would store every incoming entity, and the algorithm would cluster those it deems similar. As more new entries come in, more training will have to be done (do these two entries refer to the same thing, yes or no?).

Ultimately, the entity service needs to compare an incoming entity with an existing database entity and determine whether they are the same. If so, the service returns the UUID of the existing entity. If not, a new entity is added to the database, and the new UUID is returned. For example, if I send in {name="Shannon Smith", address=""}, and later {name="Shannon Smith", address="123 Broadway"}, the service should see that the two are the same.
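A rough sketch of that lookup-or-create flow, using an in-memory dict in place of the database. Everything here is an assumption for illustration: the resolve name, the 0.85 threshold, the use of difflib instead of FuzzyWuzzy or Dedupe, the choice to skip fields that are empty on either side, and the choice to fill in missing fields on the stored entity when a match is found.

```python
import uuid
from difflib import SequenceMatcher

THRESHOLD = 0.85  # hypothetical cutoff on a 0-1 scale


def _score(incoming: dict, existing: dict) -> float:
    """Average similarity over fields that are non-empty on both sides.

    Assumes all field values are strings.
    """
    scores = []
    for field, value in incoming.items():
        other = existing.get(field, "")
        if value and other:
            scores.append(SequenceMatcher(None, value.lower(), other.lower()).ratio())
    return sum(scores) / len(scores) if scores else 0.0


def resolve(incoming: dict, store: dict) -> str:
    """Return the UUID of a matching stored entity, or store a new one."""
    for entity_id, existing in store.items():
        if _score(incoming, existing) >= THRESHOLD:
            # Assumption: enrich the stored entity with newly supplied fields.
            for field, value in incoming.items():
                if value and not existing.get(field):
                    existing[field] = value
            return entity_id
    new_id = str(uuid.uuid4())
    store[new_id] = dict(incoming)
    return new_id
```

With this sketch, sending the "Shannon Smith" entity first with an empty address and then with "123 Broadway" returns the same UUID both times, because the empty address is ignored when scoring.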

How the difference in information is handled depends on the library we choose. Dedupe would add the incoming entity to the database anyway and group it with the existing entity. How FuzzyWuzzy would handle the difference in information for multi-field objects hasn't been discussed yet. For single-field objects, however, it's just a matter of comparing the single fields. With FuzzyWuzzy, no new entity is created if the incoming entity is similar to an existing database entity.

Regarding the formatting of data, two options have been discussed, but a solid choice has not been made yet: {name="Shannon Smith", address="123 Broadway"} and {name="Shannon Smith", match_fields=[{"address": "123 Broadway"}]}. The first option lends itself to Dedupe, because the library asks which fields to examine and what type each one is. The second option, with match_fields, is geared more toward having a single field identify an entity (in this example, name), while the data in match_fields is used to decide whether the entity as a whole is unique or a duplicate of an existing database entity.
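The two shapes carry the same information, so the service could accept either one and normalize internally. A small sketch of converting between them; the to_flat and to_match_fields names are hypothetical, and "name" as the identifying field is taken from the example above.

```python
def to_flat(payload: dict) -> dict:
    """Convert the match_fields shape into the flat shape.

    A payload that is already flat passes through unchanged.
    """
    flat = {k: v for k, v in payload.items() if k != "match_fields"}
    for entry in payload.get("match_fields", []):
        flat.update(entry)
    return flat


def to_match_fields(payload: dict, identifier: str = "name") -> dict:
    """Convert a flat payload into the match_fields shape."""
    return {
        identifier: payload[identifier],
        "match_fields": [{k: v} for k, v in payload.items() if k != identifier],
    }
```

For example, to_flat({"name": "Shannon Smith", "match_fields": [{"address": "123 Broadway"}]}) gives {"name": "Shannon Smith", "address": "123 Broadway"}, and to_match_fields reverses it.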

2 participants