Question: Why choose Symspell over Tantivy's built-in fuzzy search? #90
Comments
So we actually support both; it's just that I generally recommend trying fast-fuzzy first, as it brings something new to the table and has some nice quality-of-life features.

The good bits

The reason I generally push for fast-fuzzy/SymSpell is that it gives you considerably more performance (and more stable performance) by drastically cutting down the number of terms that actually need to be considered in the dataset; after that it becomes a standard full-text search, which is very fast.

On top of that, it also lets us do some things that we couldn't reasonably do without it, like word segmentation and compound-aware corrections. For example, we can correct a query where two words have been typed without a space between them, or where a typo spans what should be two separate terms.

What you lose

Probably the biggest issue right now with the system is that the relevancy behaves differently from how it would with traditional fuzzy searching. For example, as of right now we don't have prefix searching; however, there is an experimental branch that adds it by using the lookup table to create possible prefixes.

Another thing we currently lose is the ability to score based on edit distance (right now your score comes from the BM25 score of the text search with the corrected words). We could add it, but we would have to go with a term-by-term system, which would, for the most part, lose the ability to do our compound-aware corrections. For me that's something I don't want to give up, as it's a very useful addition. We could modify the algorithm slightly to give us a rough guide to the per-term distances, but for now, that's just how it is.
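To make the delete-based trick concrete, here is a minimal, self-contained Rust sketch of the general SymSpell idea, not lnx's actual implementation: pre-compute delete variants of every term already in the index, then resolve a misspelled query term with a handful of hash probes before handing the corrected terms to an ordinary full-text query. The `deletes` and `DeleteDictionary` names are made up for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// All variants of `word` with up to `max_distance` characters deleted.
fn deletes(word: &str, max_distance: usize) -> HashSet<String> {
    let mut out = HashSet::new();
    out.insert(word.to_string());
    let mut frontier = vec![word.to_string()];
    for _ in 0..max_distance {
        let mut next = Vec::new();
        for w in &frontier {
            let chars: Vec<char> = w.chars().collect();
            for i in 0..chars.len() {
                let mut shorter: String = chars[..i].iter().collect();
                shorter.extend(&chars[i + 1..]);
                if out.insert(shorter.clone()) {
                    next.push(shorter);
                }
            }
        }
        frontier = next;
    }
    out
}

/// Hypothetical lookup table: delete variant -> index terms that produce it.
struct DeleteDictionary {
    table: HashMap<String, Vec<String>>,
    max_distance: usize,
}

impl DeleteDictionary {
    /// Built once from the terms already present in the index.
    fn build(terms: &[&str], max_distance: usize) -> Self {
        let mut table: HashMap<String, Vec<String>> = HashMap::new();
        for term in terms {
            for variant in deletes(term, max_distance) {
                table.entry(variant).or_default().push(term.to_string());
            }
        }
        Self { table, max_distance }
    }

    /// Correcting a query term is a handful of hash probes (one per delete
    /// variant of the query) instead of a walk over the whole term dictionary.
    fn candidates(&self, query: &str) -> Vec<String> {
        let mut found: Vec<String> = deletes(query, self.max_distance)
            .into_iter()
            .filter_map(|variant| self.table.get(&variant))
            .flatten()
            .cloned()
            .collect();
        found.sort();
        found.dedup();
        found
    }
}

fn main() {
    let dict = DeleteDictionary::build(&["hello", "help", "world"], 1);
    // The corrected terms then feed an ordinary (fast) full-text query.
    println!("{:?}", dict.candidates("helo")); // ["hello", "help"]
}
```

A real implementation also verifies the true edit distance of each candidate and uses term frequencies to rank suggestions, but the lookup cost stays proportional to the number of delete variants of the query rather than to the size of the term dictionary.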
Some other points

This is true: essentially we generate a large lookup table, but it traditionally scales logarithmically with your dataset (we don't use the algorithm exactly as it was intended; we build the lookup dictionary from the data in your index rather than from a set of static dictionaries). For example, a dataset of 20k documents uses maybe 50MB of memory, probably less. A dataset of 5 million documents with about 2.4 million unique words/tokens uses about 1.8-2GB of memory. A bigger version of that dataset, with 27 million documents and about 2.6 million unique words, uses about 2GB of memory. Provided your data isn't massively random, the size of the computed dictionary levels off as the dataset gets bigger.

Normal fuzzy search (note: the vertical axis range is not the same as fast-fuzzy)

Fast-fuzzy (note: the vertical axis range is not the same as fuzzy)

Side by side
Thanks, this is a terrific answer. Why is it that you gain the ability to do word-segmentation search with SymSpell? I'm thinking that the method is to try all of the split points to see what works: hellothere -> [h, ellothere], [he, llothere], etc., and then look up each fragment in the dictionary. That's a lot of lookups. How does fast-fuzzy enable that? Under the hood, you're storing the dictionary (and the SymSpell-added terms) in the default Tantivy FST structure, right?
That's probably best explained by Wolf Garbe's blog posts, but yes, it can potentially be hundreds of lookups. A hundred lookups is still far less than the thousands or millions of lookups you'd need to do on your inverted index for each word with your DFA.
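As a rough illustration of why those lookups are tolerable, here is a naive segmentation sketch (not the refined algorithm from those blog posts): try split points recursively against an in-memory dictionary, memoising sub-problems, where every probe is a cheap hash lookup rather than a pass over the inverted index. The dictionary contents and function names are hypothetical.

```rust
use std::collections::{HashMap, HashSet};

/// Naive segmentation: try every split point and recurse on the remainder.
/// Memoisation avoids re-solving the same suffix; each `dict.contains` call
/// is an in-memory hash probe, not a scan of the index.
fn segment<'a>(
    input: &'a str,
    dict: &HashSet<&str>,
    memo: &mut HashMap<&'a str, Option<Vec<String>>>,
) -> Option<Vec<String>> {
    if input.is_empty() {
        return Some(Vec::new());
    }
    if let Some(cached) = memo.get(input) {
        return cached.clone();
    }
    let mut result = None;
    for split in 1..=input.len() {
        if !input.is_char_boundary(split) {
            continue;
        }
        let (head, tail) = input.split_at(split);
        if dict.contains(head) {
            if let Some(mut rest) = segment(tail, dict, memo) {
                let mut words = vec![head.to_string()];
                words.append(&mut rest);
                result = Some(words);
                break;
            }
        }
    }
    memo.insert(input, result.clone());
    result
}

fn main() {
    let dict: HashSet<&str> = ["hello", "there", "hell"].into_iter().collect();
    let mut memo = HashMap::new();
    // => Some(["hello", "there"])
    println!("{:?}", segment("hellothere", &dict, &mut memo));
}
```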
We actually don't. The lookup table is kept purely in memory, with just an excessive amount of de-duplication, i.e. there is only ever one copy of each unique delete or word, and then 4-byte pointers to those words. This reduces memory usage by a reasonable amount. To be exact, we build the dictionaries after a commit, based on the segment readers' term dictionaries. So the SymSpell structure isn't a persisted structure in itself; it just isn't worth it.
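A toy sketch of that de-duplication idea, assuming nothing about lnx's real data layout: each unique word is interned once, and delete variants hold 4-byte `u32` indices into the word list instead of owning their own copies of the strings.

```rust
use std::collections::HashMap;

/// Illustration only: every unique word is stored once, and delete variants
/// refer back to it via a 4-byte index rather than an owned string.
#[derive(Default)]
struct CompactLookup {
    words: Vec<String>,                 // each unique index term, stored once
    word_ids: HashMap<String, u32>,     // word -> its index in `words`
    deletes: HashMap<String, Vec<u32>>, // delete variant -> 4-byte pointers
}

impl CompactLookup {
    /// Return the existing id for `word`, or store it once and mint a new id.
    fn intern(&mut self, word: &str) -> u32 {
        if let Some(&id) = self.word_ids.get(word) {
            return id;
        }
        let id = self.words.len() as u32;
        self.words.push(word.to_string());
        self.word_ids.insert(word.to_string(), id);
        id
    }

    fn add_delete(&mut self, variant: String, word: &str) {
        let id = self.intern(word);
        let ids = self.deletes.entry(variant).or_default();
        if !ids.contains(&id) {
            ids.push(id);
        }
    }

    fn lookup(&self, variant: &str) -> Vec<&str> {
        self.deletes
            .get(variant)
            .map(|ids| {
                ids.iter()
                    .map(|&id| self.words[id as usize].as_str())
                    .collect()
            })
            .unwrap_or_default()
    }
}

fn main() {
    let mut table = CompactLookup::default();
    table.add_delete("helo".to_string(), "hello");
    table.add_delete("hell".to_string(), "hello");
    table.add_delete("hel".to_string(), "help");
    // "hello" is stored once; both of its delete variants point at the same u32.
    println!("{:?}", table.lookup("helo")); // ["hello"]
}
```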
Just a question, not a bug. Why did you choose to implement SymSpell instead of just using Tantivy's built-in Levenshtein automata for fuzzy queries?
I'm just getting into this, but it strikes me that SymSpell is going to generate much larger dictionaries, whereas the Levenshtein code should just traverse the existing FST. Did you run into any performance issues with the existing fuzzy search? Or functionality issues?
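For reference, the built-in path being asked about looks roughly like this; a minimal sketch assuming a recent tantivy release (exact signatures have shifted between versions, so treat it as an approximation rather than the project's code). A `FuzzyTermQuery` builds a Levenshtein automaton that is intersected with the term dictionary's FST at query time.

```rust
use tantivy::collector::TopDocs;
use tantivy::query::FuzzyTermQuery;
use tantivy::schema::{Schema, TEXT};
use tantivy::{doc, Index, Term};

fn main() -> tantivy::Result<()> {
    // Minimal in-memory index with a single text field.
    let mut schema_builder = Schema::builder();
    let title = schema_builder.add_text_field("title", TEXT);
    let schema = schema_builder.build();
    let index = Index::create_in_ram(schema);

    let mut writer = index.writer(50_000_000)?;
    writer.add_document(doc!(title => "hello world"))?;
    writer.commit()?;

    // The built-in fuzzy query: a Levenshtein automaton (edit distance 1 here)
    // is walked against the term dictionary's FST for every query term.
    let term = Term::from_field_text(title, "helo");
    let query = FuzzyTermQuery::new(term, 1, true);

    let searcher = index.reader()?.searcher();
    let top_docs = searcher.search(&query, &TopDocs::with_limit(10))?;
    println!("{} matching documents", top_docs.len());
    Ok(())
}
```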