Rank documents with more matches higher than those with fewer #777

LukasKalbertodt · 2024-09-06T08:04:27Z

I am building a service that searches through videos. Captions/subtitles of the video are also in the index and searchable. That makes it possible for users to find videos where some word is said at some point, even if it's not mentioned in the metadata (like the description or title). Captions are usually quite a bit more text than title, description and other metadata.

I noticed that videos that contain the query term many many times in the captions are sometimes ranked below videos that only mention it once. Intuitively, a video where the query term is spoken all the time is a lot more relevant than one where it is only mentioned once.

I read the ranking rules documentation again and conducted some tests and yes, it seems to me that Meili does not consider the number of matches at all during ranking. Consider these two documents:

{
  "id": 1,
  "title": "Foo",
  "captions": "Quick mention of banana, but otherwise talking about Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."
}
{
  "id": 2,
  "title": "Bar",
  "captions": "I do need a bit of time to get to my point, and I am very sorry about that, please hold on a little longer, but I like banana. Banana is best. Everything banana. Lorem banana ipsum banana dolor banana sit banana amet, banana consetetur banana sadipscing banana elitr, banana sed banana diam banana nonumy banana eirmod banana tempor banana invidunt banana ut banana labore banana et banana dolore banana magna banana aliquyam banana erat, banana sed banana diam banana voluptua. banana At banana vero banana eos banana et banana accusam banana et banana justo banana duo banana dolores banana et banana ea banana rebum. banana Stet banana clita banana kasd banana gubergren, banana no banana sea banana takimata banana sanctus banana est banana Lorem banana ipsum banana dolor banana sit banana amet."
}

Searching for banana results in:

Score of "Foo" is: 0.626
Score of "Bar" is: 0.55

"Foo" has a higher score since the query term appears earlier in the attribute. But for "Bar", almost every second word is "banana", so I would expect that to be sorted higher.

curl commands to reproduce the above test

curl \
  -X POST 'http://localhost:7700/indexes' \
  -H 'Content-Type: application/json' \
  --data-binary '{
    "uid": "test",
    "primaryKey": "id"
  }'


curl \
  -X POST 'http://localhost:7700/indexes/test/documents' \
  -H 'Content-Type: application/json' \
  --data-binary '[
    {
      "id": 1,
      "title": "Foo",
      "captions": "Quick mention of banana, but otherwise talking about Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet. Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua. At vero eos et accusam et justo duo dolores et ea rebum. Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."
    }
  ]'


curl \
  -X POST 'http://localhost:7700/indexes/test/documents' \
  -H 'Content-Type: application/json' \
  --data-binary '[
    {
      "id": 2,
      "title": "Bar",
      "captions": "I do need a bit of time to get to my point, and I am very sorry about that, please hold on a little longer, but I like banana. Banana is best. Everything banana. Lorem banana ipsum banana dolor banana sit banana amet, banana consetetur banana sadipscing banana elitr, banana sed banana diam banana nonumy banana eirmod banana tempor banana invidunt banana ut banana labore banana et banana dolore banana magna banana aliquyam banana erat, banana sed banana diam banana voluptua. banana At banana vero banana eos banana et banana accusam banana et banana justo banana duo banana dolores banana et banana ea banana rebum. banana Stet banana clita banana kasd banana gubergren, banana no banana sea banana takimata banana sanctus banana est banana Lorem banana ipsum banana dolor banana sit banana amet."
    }
  ]'


curl 'localhost:7700/indexes/test/search?q=banana&attributesToRetrieve=title&showRankingScore=true'

So I wonder: why is this? I'm sure this was considered at some point?

Unfortunately, I don't see a way to configure Meili to consider the number of matches. None of the built-in ranking rules seem to care about it (so reordering them changes nothing), and the custom ranking rules are just "sort by attribute" as far as I can see.

What's the best thing I can do in this situation?

ManyTheFish (Collaborator) · 2024-09-09T08:27:38Z

Hello @LukasKalbertodt,
I understand what you want to achieve but I am not convinced by the relevancy of such a feature. Indeed, this feature of counting matches is well known and was used a long time ago by search engines, but it has big drawbacks:

It favors long, general documents over short, specialized ones.
It encourages content creators to add lots of word repetition and to create irrelevant footers stuffed with repeated buzzwords.

If I understand your intent, you want to favor documents that are mainly about the subject of the query over documents that merely contain a reference or a quote without focusing on that subject.
To do this, you may prefer adding a semantic layer to Meilisearch using hybrid search. This contextualizes your documents and your queries in addition to matching word sequences in the documents, and gives a better ranking order.
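
For readers who want to try this suggestion, here is a minimal sketch of a hybrid search request against the `test` index from the reproduction above. It assumes an embedder has already been configured in the index settings; the embedder name `default` and the helper names `hybrid_body` and `hybrid_search` are hypothetical. Depending on the Meilisearch version, vector search may also need to be enabled as an experimental feature first, so check the documentation for the version you run.

```shell
# Hypothetical embedder name; it must match an embedder configured in the index settings.
EMBEDDER="default"

# Build the JSON search body.
# $1 = query, $2 = semanticRatio (0 = keyword only, 1 = semantic only).
hybrid_body() {
  printf '{"q": "%s", "hybrid": {"semanticRatio": %s, "embedder": "%s"}, "showRankingScore": true}' \
    "$1" "$2" "$EMBEDDER"
}

# Send the request to a local Meilisearch instance (assumed to listen on port 7700).
hybrid_search() {
  curl \
    -X POST 'http://localhost:7700/indexes/test/search' \
    -H 'Content-Type: application/json' \
    --data-binary "$(hybrid_body "$1" "$2")"
}
```

Calling `hybrid_search banana 0.5` would weight keyword and semantic results equally. Moving the ratio toward 1 gives the semantic side more influence, which is the part expected to help with "what is this video mostly about" queries.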

2 replies

LukasKalbertodt Sep 9, 2024
Author

Thanks for your answer!

I see, so "counting" is probably a bad idea for the reasons you mentioned. But comparing the "density" of matches should circumvent both of those problems, right? I.e. comparing what percentage of the attribute's text consists of query matches. Or am I missing something else here?
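
To make the "density" idea concrete, here is a rough sketch of how it could be computed outside Meilisearch, for example to re-rank a handful of returned documents on the client. It is not a Meilisearch feature; the function name `density` is hypothetical, matching is exact and case-insensitive (no typo tolerance or stemming), and word splitting assumes plain ASCII text.

```shell
# Hypothetical helper, not part of Meilisearch: prints the percentage
# (integer, rounded down) of words in TEXT that equal TERM, ignoring case.
# Usage: density TERM TEXT
density() {
  term=$1
  text=$2
  # Split on anything that is not a letter or digit: one word per line.
  words=$(printf '%s\n' "$text" | tr -cs '[:alnum:]' '\n' | grep . || true)
  # Count all words, then the words that match the term exactly.
  total=$(printf '%s\n' "$words" | grep -c . || true)
  hits=$(printf '%s\n' "$words" | grep -ciFx -- "$term" || true)
  if [ "$total" -eq 0 ]; then
    echo 0
  else
    echo $((100 * hits / total))
  fi
}

density banana "Quick mention of banana, but otherwise talking about Lorem ipsum"  # prints 10
density banana "I like banana. Banana is best. Everything banana."                 # prints 37
```

With a measure like this, a long text that mentions the term once scores low, and a repeated buzzword footer only helps if it makes up a large share of the whole attribute.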

I will take a look at semantic search, but I feel that already the standard search/ranking system should somehow be able to address this use case.

Thanks for your work on Meili!

ManyTheFish Sep 9, 2024
Collaborator

> But comparing the "density" of matches should circumvent both of those problems, right? I.e. comparing what percentage of the attribute's text consists of query matches. Or am I missing something else here?

Indeed, it would be better. However, this implies counting every matching word in each document and comparing it with the total number of words in that document. This data would be difficult to compute during indexing without taking a lot of disk space, and computing it dynamically at search time would not scale well with the number of documents. And in a way, we would be reinventing vector search with poor performance.
To be honest, keyword search is really good at finding quotes, names, products, etc. But if you want to take a step back and look at the context of a document or a query, vector search is one of the best approaches.

> I will take a look at semantic search, but I feel that already the standard search/ranking system should somehow be able to address this use case.

Note that you may not need an ultra-complex model to add a bit of context around your searches. I suggest starting with the least time-consuming models to run your tests, then seeing whether you need more.

> Thanks for your work on Meili!

Thank you for the kind words, and sorry if I didn't provide the solution you were looking for ☺️
