Speed up using trigrams #4

Open
msuchane opened this issue Aug 10, 2022 · 2 comments

Comments

@msuchane
Owner

Before comparing file contents using the Levenshtein or Jaro distance, first compare the two files using word-level trigrams to get a general sense of their similarity. Then, run the distance metric only on pairs of files that the trigram comparison rates as relatively similar.
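To make the proposal concrete, here is a minimal sketch of the two-stage check. It is not the tool's actual code: the function name `two_stage_similarity`, the `threshold` parameter, and the convention that both scores fall between 0.0 and 1.0 are all assumptions made for illustration. The cheap estimate and the expensive metric are passed in as closures, so any trigram comparison and any distance metric could be plugged in.

```rust
/// Hypothetical two-stage comparison (illustrative, not the tool's API).
/// Runs the cheap estimate first and only pays for the expensive metric
/// when the estimate reaches `threshold`.
/// Assumption: both closures return a similarity score in 0.0..=1.0.
pub fn two_stage_similarity<C, E>(
    a: &str,
    b: &str,
    threshold: f64,
    cheap: C,
    expensive: E,
) -> Option<f64>
where
    C: Fn(&str, &str) -> f64,
    E: Fn(&str, &str) -> f64,
{
    if cheap(a, b) < threshold {
        // Too dissimilar by the cheap estimate: skip the expensive metric.
        return None;
    }
    // Similar enough: compute the precise score.
    Some(expensive(a, b))
}
```

A pair that returns `None` is discarded without ever reaching the distance metric, which is where the time savings would come from.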

Resources:

@msuchane
Owner Author

msuchane commented Aug 10, 2022

With version 0.5.0, the tool now pre-selects using character-level trigrams. As a result, the search is about 10 times faster.
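For readers unfamiliar with the technique, the following is a rough, standard-library-only sketch of a character-level trigram comparison. This thread does not say which library or scoring formula the tool uses, so both the function names and the choice of Jaccard similarity (shared trigrams divided by all distinct trigrams) are assumptions.

```rust
use std::collections::HashSet;

/// Collect the set of distinct character trigrams in `text`.
/// Texts shorter than three characters produce an empty set.
fn char_trigrams(text: &str) -> HashSet<[char; 3]> {
    let chars: Vec<char> = text.chars().collect();
    chars.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}

/// Jaccard similarity of the two trigram sets, from 0.0 to 1.0.
/// Assumption: Jaccard is used here only as a simple, common choice.
fn char_trigram_similarity(a: &str, b: &str) -> f64 {
    let (ta, tb) = (char_trigrams(a), char_trigrams(b));
    let union = ta.union(&tb).count();
    if union == 0 {
        // Neither text has any trigrams; treat them as not comparable.
        return 0.0;
    }
    ta.intersection(&tb).count() as f64 / union as f64
}
```

Building and intersecting the sets is roughly linear in the length of the texts, whereas Levenshtein distance is quadratic, which fits the reported speedup from pre-selecting.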

Word-level trigrams could produce more accurate results and might even be faster, but no library can currently calculate them.

I'm leaving this open to consider word-level trigrams in the future.

@msuchane
Owner Author

msuchane commented Aug 30, 2022

The `slice::windows` method would be quite useful when implementing word trigrams.
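A small sketch of that idea, assuming that words are simply whitespace-separated tokens (the thread does not define tokenization, and `word_trigrams` is a hypothetical name):

```rust
/// Split `text` on whitespace and return every run of three consecutive words.
/// Texts with fewer than three words produce an empty vector.
fn word_trigrams(text: &str) -> Vec<[&str; 3]> {
    let words: Vec<&str> = text.split_whitespace().collect();
    // `windows(3)` yields overlapping sub-slices of exactly three words.
    words.windows(3).map(|w| [w[0], w[1], w[2]]).collect()
}
```

For the input `"the quick brown fox"`, this yields the two trigrams `["the", "quick", "brown"]` and `["quick", "brown", "fox"]`. The trigrams borrow from the input string, so no word is copied.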
