Initial implementation of manual ngram-based search in MongoDB #993

Open: wants to merge 5 commits into base: main

ml-evs (Member) commented Nov 6, 2024

Closes #679. MongoDB text indexes tokenize on whitespace and punctuation only, so a search matches whole tokens but not substrings. This PR investigates whether we can build a manual ngram index so that a search for part of a refcode, e.g. ABC or BCD, still returns the item with refcode ABCDEF.

This is done with a separate collection, item_fts, that is used only for this kind of search. It stores the immutable_id, type and ngrams of every item, with an index over ngrams. A lookup splits the query string into ngrams, matches them against the stored ngrams array, and orders the results by the number of matching ngrams.
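As a rough illustration of the idea (not the code in this PR), the ngram generation and ranked lookup could be sketched as below. The function names, the indexed fields `refcode` and `name`, the lower-casing, the default N=3 and the exact pipeline stages are all assumptions made for the example; only `item_fts`, `immutable_id`, `type` and `ngrams` come from the description above.

```python
def ngrams(text: str, n: int = 3) -> list[str]:
    """Unique character n-grams of `text`, in order of first appearance.

    Lower-casing is an assumption made here to get case-insensitive matching.
    """
    text = text.lower()
    return list(dict.fromkeys(text[i:i + n] for i in range(len(text) - n + 1)))


def fts_document(item: dict, fields=("refcode", "name"), n: int = 3) -> dict:
    """Build a hypothetical `item_fts` entry for one item.

    `fields` is an illustrative choice of which string fields to index.
    """
    grams: dict[str, None] = {}
    for field in fields:
        value = item.get(field)
        if isinstance(value, str):
            grams.update(dict.fromkeys(ngrams(value, n)))
    return {
        "immutable_id": item["immutable_id"],
        "type": item["type"],
        "ngrams": list(grams),
    }


def search_pipeline(query: str, n: int = 3, limit: int = 10) -> list[dict]:
    """One possible aggregation pipeline: match any query ngram, then
    score each hit by how many query ngrams it shares."""
    query_ngrams = ngrams(query, n)
    return [
        {"$match": {"ngrams": {"$in": query_ngrams}}},
        {"$addFields": {
            "score": {"$size": {"$setIntersection": ["$ngrams", query_ngrams]}}
        }},
        {"$sort": {"score": -1}},
        {"$limit": limit},
    ]


def rank(query: str, docs: list[dict], n: int = 3) -> list[str]:
    """In-memory equivalent of the pipeline, for experimenting without a database."""
    query_ngrams = set(ngrams(query, n))
    scored = [(len(query_ngrams & set(d["ngrams"])), d["immutable_id"]) for d in docs]
    # sorted() is stable, so ties keep their input order.
    return [item_id for score, item_id in sorted(scored, key=lambda s: -s[0]) if score > 0]
```

With these definitions, `ngrams("ABCDEF")` gives `['abc', 'bcd', 'cde', 'def']`, so a query for BCD shares one ngram with that item. Note that in this sketch a query shorter than N produces no ngrams and therefore no matches, which is one reason the choice of N matters.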

Will have to fiddle around to see:

  • what the optimal value of N is, and whether we need to index all n-grams of length N, N+1, ... up to a fixed maximum
  • how expensive this is for realistic deployment sizes
  • whether it might be better to try an edit-distance based approach for some fields
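To get a feel for the first two questions, here is a small sketch of what indexing a range of n-gram lengths would involve and how the number of stored n-grams grows. The helper names and the example range are hypothetical, and the count is only an upper bound per string because repeated n-grams are stored once.

```python
def ngram_range(text: str, n_min: int = 3, n_max: int = 5) -> list[str]:
    """Unique character n-grams for every length from n_min to n_max inclusive."""
    text = text.lower()  # assumption: case-insensitive matching is wanted
    grams = dict.fromkeys(
        text[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(text) - n + 1)
    )
    return list(grams)


def max_ngram_count(length: int, n_min: int, n_max: int) -> int:
    """Upper bound on the n-grams stored for a string of `length` characters.

    A string of length L has L - n + 1 n-grams of length n (or none if n > L).
    """
    return sum(max(length - n + 1, 0) for n in range(n_min, n_max + 1))
```

For a 6-character refcode, N=3 alone stores at most 4 n-grams, while lengths 3 to 6 store at most 4 + 3 + 2 + 1 = 10, so the array size per field grows roughly linearly with the number of lengths indexed.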

@ml-evs ml-evs added the server label Nov 6, 2024
@ml-evs ml-evs changed the title [WIP] Initial noodling with manual ngram index in MongoDB [WIP] Initial noodling with manual ngram-based search in MongoDB Nov 6, 2024

cypress bot commented Nov 6, 2024

datalab Run #2758

Run Properties:
Project: datalab
Branch Review: ml-evs/mongo-fts-ngram
Run status: Passed #2758
Run duration: 06m 27s
Commit: b49b764aa7 (Merge d4969ae68b137119b765c9651c1e892c747cb8a9 into 700dc5e9fd490ddbe40c0520e42b...)
Committer: Matthew Evans

Test results:
Failures: 0
Flaky: 0
Pending (annotated with .skip): 0
Skipped (failure in a mocha hook): 0
Passing: 405


codecov bot commented Nov 6, 2024

Codecov Report

Attention: Patch coverage is 93.42105% with 5 lines in your changes missing coverage. Please review.

Project coverage is 68.92%. Comparing base (700dc5e) to head (d4969ae).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pydatalab/src/pydatalab/mongo.py | 91.66% | 4 Missing ⚠️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | 96.29% | 1 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #993      +/-   ##
==========================================
+ Coverage   68.49%   68.92%   +0.43%     
==========================================
  Files          62       62              
  Lines        3955     4026      +71     
==========================================
+ Hits         2709     2775      +66     
- Misses       1246     1251       +5     

| Files with missing lines | Coverage Δ |
|---|---|
| pydatalab/src/pydatalab/main.py | 64.82% <100.00%> (+0.24%) ⬆️ |
| pydatalab/src/pydatalab/routes/v0_1/items.py | 83.61% <96.29%> (+0.96%) ⬆️ |
| pydatalab/src/pydatalab/mongo.py | 84.34% <91.66%> (+5.24%) ⬆️ |

@ml-evs ml-evs changed the title [WIP] Initial noodling with manual ngram-based search in MongoDB Initial implementation of manual ngram-based search in MongoDB Nov 10, 2024
ml-evs (Member, Author) commented Nov 10, 2024

I think this is ready for review now, though we should not make it the default until we can test how it scales and the quality of the search results.

@ml-evs ml-evs added the enhancement New feature or request label Nov 10, 2024

Successfully merging this pull request may close these issues.

Free text search could be improved