Experimental feature: Hybrid Search and Vector Store #677

Kerollmops · 2023-06-26T15:25:31Z

Kerollmops
Jun 26, 2023
Maintainer

Feature name	How to enable	Description	Missing for stabilization	Expected stabilization date/version
vector store	`curl -X PATCH 'http://localhost:7700/experimental-features/' -H 'Content-Type: application/json' --data-binary '{ "vectorStore": true }'`	Enables generating, storing and searching by using semantic vectors and performing hybrid search between keyword and semantic search	Confidence in the speed of the indexation, search of the vectors, and API surface	N/A

Meilisearch v1.3 (released July 31, 2023) introduces a Vector Store feature:

Stores embeddings (semantic vectors) associated with the documents using a new reserved _vectors field.
Returns the documents that are the nearest neighbors of the vector passed in the newly introduced vector search parameter. A _semanticScore field is added to each returned document; it is the dot product between the document's nearest vector and the vector from the search query.

Meilisearch v1.6 (released January 15, 2024) improves on the Vector Store feature:

Configure multiple embedder configurations per index
Automatically generate embeddings at indexing time
Perform hybrid search to merge the results of keyword and semantic search according to their relevancy
Store the vectors in a new backend that scales better

Meilisearch v1.7 (released March 11, 2024) improves on the Vector Store feature:

Supports new embedding models from OpenAI
Supports a compilation flag that enables CUDA for the Hugging Face embedder. Refer to the relevant section of the public API page for more information.

Keywords: Semantic Search, Vector Search, Embeddings Search, Hybrid Search.

Experimental feature abstract

Creating one or multiple embedders for an index triggers a new step in the indexing process where embeddings are generated for each indexed document.

Passing "hybrid": {} in a search request to the /indexes/{:indexUid}/search or /multi-search routes performs both a keyword search and a vector search. If no "vector" is provided, it is generated from the "q" field.
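
As a rough illustration of the request described above, here is a Python sketch that builds a hybrid search body. The semanticRatio and embedder keys are the ones used in the search examples further down this thread; the embedder name "default" and the helper hybrid_search_body are hypothetical and only serve the example.

```python
import json


def hybrid_search_body(q, embedder="default", semantic_ratio=0.5, vector=None):
    """Build the JSON body for a hybrid search request.

    `embedder` must name an embedder configured on the index; "default" is
    only a placeholder used for this sketch.
    """
    body = {
        "q": q,
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": embedder},
    }
    # Only send "vector" when the caller computed one; otherwise the query
    # vector is generated from "q", as described above.
    if vector is not None:
        body["vector"] = vector
    return body


body = hybrid_search_body("space exploration movies")
print(json.dumps(body, indent=2))
```

The body would then be sent as a POST to /indexes/{:indexUid}/search with a Content-Type of application/json.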

How to use the feature?

Please refer to the public API page

What is an experimental feature

By enabling this feature via the /experimental-features route, you opt into the following:

The API and the behavior of the way of sending vectors can break between two minor versions of Meilisearch.
The embedders setting can change in a breaking way between two minor versions of Meilisearch.

You can use this feature in production but be prepared to update your code from one version to the next.

Why is this feature not stable yet?

Storing the vectors is currently very expensive, and retrieving them is too. We hope to make progress on that. We are unsure of the API surface we want to expose, even if the current one seems correct.

🗣️ You are welcome to give feedback about the feature or ask any question on its usage; we are eager to collect it

When will the feature potentially be stable?

[Updated] Due to the large increase in API surface and our having missed previous estimates, we cannot provide an estimate for the time being.

⚠️ Disabling the feature

To fully disable the feature, you need to delete the embedders setting from any index using the feature by calling DELETE http://localhost:7700/indexes/<index_using_the_feature>/settings/embedders before calling PATCH 'http://localhost:7700/experimental-features/' with vectorStore set to false.

⚠️ Failure to clear the settings of the indexes will result in embeddings still being generated at indexing time, even after disabling the feature through the experimental-features route ⚠️
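
The order of operations matters here, so the following Python sketch spells it out. It only builds the requests and does not send them; the index names "movies" and "books" and the helper disable_vector_store_requests are hypothetical. Since settings changes are processed as asynchronous tasks, you would presumably also wait for each settings task to finish before sending the final PATCH.

```python
import json
import urllib.request


def disable_vector_store_requests(base_url, index_uids):
    """Return the requests needed to fully disable the feature, in order."""
    requests = []
    # Step 1: remove the embedders setting from every index that uses it.
    for uid in index_uids:
        requests.append(
            urllib.request.Request(
                f"{base_url}/indexes/{uid}/settings/embedders", method="DELETE"
            )
        )
    # Step 2: only then turn the experimental feature off.
    requests.append(
        urllib.request.Request(
            f"{base_url}/experimental-features/",
            data=json.dumps({"vectorStore": False}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PATCH",
        )
    )
    return requests


plan = disable_vector_store_requests("http://localhost:7700", ["movies", "books"])
for req in plan:
    print(req.get_method(), req.full_url)
# To run the plan against a live instance, pass each request to
# urllib.request.urlopen(req), adding an Authorization header if a key is set.
```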

geminigeek · 2023-07-05T22:07:44Z

geminigeek
Jul 5, 2023

hi,

An excellent feature! Can we use filters with vector search?

1 reply

dureuill Jul 6, 2023
Collaborator

Hello 👋

Yes! Just use the filters as usual in your vector search request

taehun007 · 2023-07-24T08:32:27Z

taehun007
Jul 24, 2023

Hi, if I use OpenAI to create the vectors stored in the DB but use Hugging Face vectors when I search, will it work and return the right results?

3 replies

Kerollmops Jul 24, 2023
Maintainer Author

Hey @taehun007 👋

No, sorry. You must use the same source of vectors. So, either generate the vectors from Hugging Face or OpenAI for both endpoints.

RaflyLesmana3003 Aug 10, 2023

So can we use OpenAI embeddings to create vectors and store them in Meilisearch?

AJV009 Aug 10, 2023

Yes. If you are comfortable experimenting, you can try rust-bert too; it helps get a more instant-search-like feel, though I'm not sure about the accuracy.

noangel2014 · 2023-07-27T07:22:57Z

noangel2014
Jul 27, 2023

When I tested the meilisearch:v1.3.0-rc.3 vector search, I got results, but didn't see _semanticSimilarity. Strange

1 reply

Kerollmops Jul 27, 2023
Maintainer Author

Hey @noangel2014 👋 It's because it has been renamed _semanticScore.

doutatsu · 2023-08-03T19:48:02Z

doutatsu
Aug 3, 2023

I've been experimenting with Vector Search with the release in 1.3, but I am having two issues, which I am not sure how to address

Some of the documents won't have the vector data (e.g. not all movies have descriptions), but I still want to index them for the regular search through titles. But it seems the indexing just ignores those documents and doesn't index them.
The majority of the data doesn't get indexed. With 27k documents, 20k have vector data, yet only 2k get indexed ultimately... I even tried on a smaller subset: for example, out of 10 documents, only 4 would get indexed. All I get is a 202 status in the logs, so I have no idea what is going on.

I know it's experimental, but this seems to be a very drastic issue, so maybe I am not doing it right?

Update:
So I tried a couple of things, and indexing works correctly if I only index documents that have a vector. If I index a mix of documents with and without vectors, it starts to produce these weird results, where almost nothing gets indexed (including the documents that do have a vector) 🤔

9 replies

doutatsu Aug 14, 2023

Here is my Rails configuration for this specific index:

  meilisearch enqueue: :reindex_search do
    attribute :id, :title, :alt_titles, :content_rating, :content_type
    attribute :available_chapters_count, :users_count
    attribute :has_active_sources do
      manga_sources.any? { |s| s.deprecated_at.nil? && !s.pending }
    end
    attribute :manga_source_site_ids do
      manga_sources.pluck(:manga_source_site_id)
    end
    attribute :created_at_unix do
      created_at.to_i
    end
    attribute :classifications do
      series_classifications.as_json.map { |c| c.slice('category', 'name') }
    end
    attribute :_vectors do
      EmbeddingRetriever.calculate_vector(description)&.first || []
    end
  end

Kerollmops Aug 14, 2023
Maintainer Author

Thank you very much @doutatsu for this clear report of the tasks. I now understand the issue: we must accept null as a valid _vectors value in the documents.

In the meantime, could you try not setting the _vectors field at all instead of setting it to null or []?

Kerollmops Aug 14, 2023
Maintainer Author

For your information, I just fixed the issue in this PR, and it should be merged and released in v1.3.2. You'll now be able to send _vectors fields set to the JSON null value when you don't want to set the _vectors vector. Previously, you were forced to not send the _vectors field in your documents.
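
For anyone preparing documents on the client side, the two behaviors described above could be sketched in Python like this. The helper to_document and its null_supported flag are hypothetical; the flag stands for "the instance already includes the fix".

```python
import json


def to_document(record, embedding=None, null_supported=True):
    """Attach `_vectors` to a document only when it is meaningful."""
    doc = dict(record)
    if embedding:
        doc["_vectors"] = embedding
    elif null_supported:
        # Explicit JSON null, accepted once the fix mentioned above is released.
        doc["_vectors"] = None
    # Otherwise leave `_vectors` out entirely (the earlier workaround).
    return doc


print(json.dumps(to_document({"id": 1, "title": "With description"}, [0.1, 0.2])))
print(json.dumps(to_document({"id": 2, "title": "No description"})))
print(json.dumps(to_document({"id": 3, "title": "Old instance"}, null_supported=False)))
```

Note that an empty list is treated like a missing embedding here, so it is never sent as [].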

doutatsu Aug 14, 2023

Was about to say that setting nothing does work. As well as sending only documents with vectors also works.

Look forward to giving a go when the new version comes out, to see if it's all fixed on my side as well. Thanks for a quick fix!

drewbietron Oct 11, 2023

I was also having this issue and found this helpful.

I found something new that I thought was interesting as well. I'm testing this on 629 results and only one has a vector value. It's a 384-dimension vector coming from a Postgres database where the row data type is a vector from the pgvector extension. The client I am using returns it from the database as a string, not as an array. When I try to index all 629 documents with either that one's vector value or null, it will index 540 documents. If I parse the string so that it sends an array when it's indexed (via the Meili JS client), it'll index all 629 results: one with a vector and the rest with null values for _vectors.

 _vectors: product.embedding ? JSON.parse(product.embedding) : null,
// all 629 records are indexed

 _vectors: product.embedding ? product.embedding : null,
// 540 of the 629 records are indexed

product.embedding in this case is a string. It will also show as a string in the Meilisearch Dashboard. Saving it as an array will show the array dropdown as expected. I mainly just thought it was strange that it indexed 540 when a string was passed in. I also noticed that the document with the vector isn't included in the 540.
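
A Python equivalent of the JSON.parse fix above might look like the sketch below. It assumes the database client returns the vector in a bracketed text form such as "[0.1,0.2,0.3]", which parses as JSON; normalize_vectors is a hypothetical helper name.

```python
import json


def normalize_vectors(raw):
    """Return a list of floats (or None), whatever form the database returned."""
    if raw is None or raw == "":
        return None
    if isinstance(raw, str):
        # Text form such as "[0.1,0.2,0.3]" is valid JSON.
        raw = json.loads(raw)
    return [float(x) for x in raw]


rows = [
    {"id": 1, "embedding": "[0.1,0.2,0.3]"},  # string from the database
    {"id": 2, "embedding": None},             # no embedding for this row
]
documents = [
    {"id": row["id"], "_vectors": normalize_vectors(row["embedding"])}
    for row in rows
]
print(json.dumps(documents))
```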

AJV009 · 2023-08-06T20:17:26Z

AJV009
Aug 6, 2023

Hybrid search is not yet available as an experimental feature, right? I assume we are looking into it before the 1.4 release.
I am excited to make integrations for Drupal. (If hybrid search reaches at least a usable state, I am sure many folks working on the Drupal open source CMS will build modules on top of these hybrid search tools.)

I could set up and use the current vector search right away. I also gave a session at "Drupal Camp Pune", where I did a small introduction to vectors and hybrid search and put it behind a small demo for users to interact with. (We are trying to use it behind a QnA bot.)

Let me know how I can help with this hybrid search thing. Even though I am not a Rust expert, I can give it a try :)
Love you meilisearch team for such an easy-to-use experience.

0 replies

sanjay920 · 2023-08-15T00:06:42Z

sanjay920
Aug 15, 2023

Is it possible to have varying lengths of vectors across documents in an index?
e.g.

documents = [
    { 'id': 1, 'Text': 'I like to eat broccoli and bananas.', '_vectors': embedding_service(["I like to eat broccoli", "I like to eat bananas", "I like to eat broccoli and bananas."]) },
    { 'id': 2, 'Text': 'The Packers won the 2011 NFL Super Bowl with Aaron Rodgers', '_vectors': embedding_service(["The Packers won in 2011", "The Packers won the 2011 NFL Super Bowl with Aaron Rodgers"]) },
]

would fail with this message:

Invalid vector dimensions: expected: `1152`, found: `768`.

What I'm looking for is to have multiple vectors belong to a document and, upon vector search, to compute the dot product with all vectors of all documents.

3 replies

sanjay920 Aug 15, 2023

To add some more detail:

ID 1 has 3 vectors, each with length 384
ID 2 has 2 vectors, each with length 384
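
A quick check of the arithmetic: the two numbers in the error message match the total number of floats per document. This suggests (an inference from the numbers, not something confirmed in this thread) that each document's list of vectors was being treated as one flattened vector.

```python
DIMENSIONS = 384  # length of each individual embedding, as stated above


def flattened_length(vector_count, dimensions=DIMENSIONS):
    """Total number of floats when a document's vectors are concatenated."""
    return vector_count * dimensions


print(flattened_length(3))  # document 1: 1152, the "expected" value
print(flattened_length(2))  # document 2: 768, the "found" value
```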

sanjay920 Aug 15, 2023

Just checked the example under "Send Vectorized Documents" in #621 (comment) and it fails as well. Similar approach to what I posted above

irevoire Aug 16, 2023
Collaborator

Hey @sanjay920.

Yes, you should be able to send multiple (any number) vectors per document to an index.
The only condition is that your vectors have the same number of dimensions (or length).

Invalid vector dimensions: expected: 1152, found: 768.

Here, as you can see, Meilisearch reports that your vectors didn't all have the same size.
If you can reproduce the issue with a minimal example that we can copy-paste, please open an issue instead of answering here. It'll help us schedule a bug fix!
Thanks for trying out this new feature
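
To catch this kind of mismatch before sending documents, a client-side check along these lines could help. It is a sketch, not part of Meilisearch; check_dimensions is a hypothetical helper and the expected dimension is whatever your embedding model outputs.

```python
def check_dimensions(documents, expected):
    """Return (document id, vector index, found length) for every mismatch."""
    problems = []
    for doc in documents:
        vectors = doc.get("_vectors") or []
        # Accept a single vector as well as a list of vectors.
        if vectors and not isinstance(vectors[0], (list, tuple)):
            vectors = [vectors]
        for i, vec in enumerate(vectors):
            if len(vec) != expected:
                problems.append((doc["id"], i, len(vec)))
    return problems


docs = [
    {"id": 1, "_vectors": [[0.0] * 384, [0.0] * 384, [0.0] * 384]},
    {"id": 2, "_vectors": [[0.0] * 384, [0.0] * 768]},  # second vector is wrong
    {"id": 3, "_vectors": None},
]
print(check_dimensions(docs, expected=384))
```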

nlgtuankiet · 2023-09-02T10:24:28Z

nlgtuankiet
Sep 2, 2023

I have some questions regarding this feature:

1. Will the index parameters (m, ef_construction) be configurable at index creation time?
Many sources [1] suggest that those affect the trade-off between index build time, recall, and query speed. It looks like Meilisearch always uses m=32 and ef_construction=100, but people use Meilisearch for different use cases. For example, some may have a very large number of documents, and some may need to update documents frequently.
2. Related to question 1: if the index parameters are configurable, is there any guideline on how to configure them properly for a given set of requirements?
For example: the trade-offs in question 1, input dimension, number of documents.
3. Will vector search support the distinct attribute and filtering?

[1]
https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md
https://www.pinecone.io/learn/series/faiss/hnsw/
https://jkatz05.com/post/postgres/pgvector-hnsw-performance/

I am new to the whole vector search so please correct me if I mention something wrong.
Thank you!

1 reply

macraig Oct 19, 2023
Maintainer

Hi @nlgtuankiet , apologies for the late reply!

Will the index parameters (m, ef_construction) be configurable at index creation time?
Related to question 1: if the index parameters are configurable, is there any guideline on how to configure them properly for a given set of requirements?

It is technically possible to change the parameters at index creation time. However, we don't have any guidelines to offer, as we use instant-distance for HNSW. It might be worth asking in their repo :)

Will vector search support distinct attribute and filtering?

Yes, it should work with both of them!

abdirahmn1 · 2023-09-30T08:55:42Z

abdirahmn1
Sep 30, 2023

When will the feature potentially be stable?
The target version is Meilisearch v1.4 (released September 25).

Hi everyone, it's September 30 today. How is the feature going? What other date can we expect for it to be fully stable?

Thanks for considering this feature 👍

6 replies

ajohnclark Nov 2, 2023

I absolutely love this software. I am self-hosting it, and it's just amazing how fast it is and how well it 'just works' compared to Typesense, while you can still customize it quite nicely.

To clarify, is v1.6 where vectors will be a stable feature? I used it experimentally with 60-70k records for product deals and it was highly relevant, it was great and didn't have any issues indexing thankfully but I am still edgy because it's alpha before I deploy to live site.

Again, great work on this all around.

drewbietron Nov 2, 2023

+1!!

How did you create the embeddings for your product deals?

ajohnclark Nov 2, 2023

I used the OpenAI API for embeddings. I was looking at some options on Hugging Face too, but OpenAI was inexpensive enough to just use it.

drewbietron Nov 2, 2023

Do you mind sharing what input you used specifically for the product embedding and also what input you used for account/search side embedding? We've been experimenting as well but haven't found a good combo of model and input to have expected outputs.

ajohnclark Nov 2, 2023

I trashed my code for it/overwrote my notebook I was using for it, I was just looking to share with ya, only have Typesense testing code left. I believe I just used title and possibly description or a category but don't remember as it was a few months ago. Sorry can't help more.

geminigeek · 2023-12-08T11:29:23Z

geminigeek
Dec 8, 2023

hi,

I just set up a collection with 16 million 3-dimensional vectors (basically all possible RGB values, normalized). As I inserted them, the first 300k records were added fine; after that, it's painfully slow, and the tasks are still running after 12 hours.

Can anyone suggest a limit on the number of vectors we can add? I'm looking for a solution for 40M records with 1024 dimensions.

3 replies

Kerollmops Dec 9, 2023
Maintainer Author

Hey @geminigeek 👋

We are working hard on making it possible with Meilisearch v1.6, which will be officially released on January 15, and the first release candidates will be available on December 18. This version will feature the new Arroy library, which will support many more vectors with high dimensionality than Meilisearch can support now. You can read more about it in this article and the soon-to-be-released others.

clxyder Dec 29, 2023

Great read! Has @irevoire released his blog on incremental indexing?

irevoire Dec 29, 2023
Collaborator

Hey @clxyder !
Not yet, I’m still working on implementing the feature first and then I’ll talk about it 😄

Avey777 · 2023-12-29T05:48:48Z

Avey777
Dec 29, 2023

With vector search, is it possible to search for images by image?

3 replies

dureuill Jan 3, 2024
Collaborator

Hello,

to perform a search for images by image, you need the following in the upcoming v1.6.0:

Enable the vectorStore experimental feature

curl \
  -X PATCH 'http://localhost:7700/experimental-features/' \
  -H 'Content-Type: application/json' -H 'Authorization: Bearer foo' \
--data-binary '{ "vectorStore": true }'

Send a list of embedders to the settings route containing your image embedder:

curl \
-X PATCH 'http://localhost:7700/indexes/your_index/settings/embedders' \
-H 'Content-Type: application/json' --data-binary \
'{ "image": { "source": "userProvided", "dimensions": 512 } }'

Provide a _vectors.image field representing the embedding of your image in each of your documents (if the document contains multiple images, provide an array of vectors in this field. If the document contains no image, provide an empty array)

curl \
-X POST 'http://localhost:7700/indexes/your_index/documents' \
-H 'Content-Type: application/json' \
--data-binary '{"id": 0, "title": "Other fields of your documents as normal", "_vectors": {"image": [0.00397, 0.553, ..., 0.0] } }'

Lastly, perform a vector search on your image embedder:

curl \
-X POST 'http://localhost:7700/indexes/your_index/search' \
-H 'Content-Type: application/json' \
--data-binary '{"vector": [0.3967, 0.333, ..., 0.1], "hybrid": {"semanticRatio": 1.0, "embedder": "image"} }'

As Meilisearch does not provide embedding generation for images at the moment, you will have to provide the vectors corresponding to the images in your documents and to the image in your query yourself, as demonstrated above.
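
For readers scripting these steps, the request bodies above could be assembled in Python roughly as follows. The JSON keys are copied from the curl examples; the helper names and the all-zero fake_embedding (a stand-in for the output of a real image model) are hypothetical.

```python
import json

DIMENSIONS = 512  # taken from the embedder settings above


def embedder_settings(name="image", dimensions=DIMENSIONS):
    """Body for PATCH /indexes/{index}/settings/embedders."""
    return {name: {"source": "userProvided", "dimensions": dimensions}}


def document_with_images(fields, image_vectors, name="image", dimensions=DIMENSIONS):
    """Attach zero or more image embeddings to a document."""
    for vec in image_vectors:
        if len(vec) != dimensions:
            raise ValueError(f"expected {dimensions} dimensions, found {len(vec)}")
    return {**fields, "_vectors": {name: list(image_vectors)}}


def image_search_body(query_vector, name="image"):
    """Body for POST /indexes/{index}/search: pure semantic search."""
    return {
        "vector": query_vector,
        "hybrid": {"semanticRatio": 1.0, "embedder": name},
    }


fake_embedding = [0.0] * DIMENSIONS  # stand-in for a real model's output
doc = document_with_images({"id": 0, "title": "A cat"}, [fake_embedding])
no_image_doc = document_with_images({"id": 1, "title": "No picture"}, [])
print(json.dumps(embedder_settings()))
print(json.dumps(no_image_doc))
```

The query vector passed to image_search_body has to come from the same image model as the document vectors.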

Avey777 Jan 11, 2024

As far as I understand hugging face can only calculate text vectors, openai is not very clear yet.

Does Meilisearch have plans to develop its own vector calculation engine? It is best to support vector calculations of text, images, audio and video

macraig Jan 11, 2024
Maintainer

Hi @Avey777 , we're not planning to develop our own vector embedding engine. Our goal is to provide the best search experience regardless of your chosen model. You can provide vectors generated by your preferred embedder to get image search like @dureuill mentioned above.

macraig · 2024-01-15T19:48:29Z

macraig
Jan 15, 2024
Maintainer

Hey folks 👋

v1.6 has been released! 🦊 Hybrid search and auto-embedding features are now available for your use ✨

Check out our documentation to learn how to use them. We're looking forward to your feedback!

4 replies

abdirahmn1 Jan 16, 2024

is this stable and ready for prod? 😃

ajohnclark Jan 16, 2024

The example at https://www.meilisearch.com/docs/learn/experimental/vector_search#generate-auto-embeddings-with-openai causes an error: the source for OpenAI should be openAi.

curquiza Jan 16, 2024
Maintainer

Thanks for the feedback, just opened meilisearch/documentation#2687

ajohnclark Jan 16, 2024

Thanks for awesome search tool!

nlgtuankiet · 2024-01-16T21:23:01Z

nlgtuankiet
Jan 16, 2024

Hi @macraig, is it possible to run a hybrid search with user-provided embeddings? I read through the documentation, but it seems like they are mutually exclusive?

6 replies

nlgtuankiet Jan 16, 2024

Hi @dureuill thank you very much for the guide 😍 could you please put this in the documentation, I have no idea how to do it 🙌

dureuill Jan 17, 2024
Collaborator

Hello,

it is given as a usage example in the page describing the feature on Notion

We can ask @guimachiavelli if he would be OK to add this use case, but I think the idea was to be minimalistic on the documentation for this feature since it is experimental, and the use case is advanced already.

For future reference you can contribute to our documentation by following our contributing documentation 👍

nlgtuankiet Jan 20, 2024

Hi @dureuill I tried the hybrid search

{
  "q": "...",
  "offset": 0,
  "limit": 20,
  "attributesToRetrieve": [...],
  "attributesToHighlight": [
    "description"
  ],
  "highlightPreTag": "<em>",
  "highlightPostTag": "</em>",
  "hybrid": {
    "semanticRatio": 0.5,
    "embedder": "openai-text-embedding-ada-002"
  },
  "vector": [...]
}

response:

{
  "hits": [
    {
      "description": "...",
      "id": "...",
      "groupId": "...",
      "_formatted": {
        "description": "...",
        "id": "...",
        "groupId": "..."
      }
    },
    {
      "description": "...",
      "id": "...",
      "groupId": "...",
      "_formatted": {
        "description": "...",
        "id": "...",
        "groupId": "..."
      },
      "_semanticScore": 0.9361339
    }
  ],
  "query": "...",
  "vector": [...],
  "processingTimeMs": 24,
  "limit": 20,
  "offset": 0,
  "estimatedTotalHits": 2
}

In the response, hits contains 2 results: 1 from the normal search (without the _semanticScore field) and 1 from the semantic search (with "_semanticScore": 0.9361339). Both have the _formatted field, but the keywords in _formatted.description aren't highlighted for the normal search result entry.
Is this a limitation of hybrid search?
I understand that the semantic search entry can't be highlighted, but what about the normal search entry?

dureuill Jan 23, 2024
Collaborator

Hello @nlgtuankiet

I reproduced the issue and opened a new issue about it; I think I have a fix.

ghost Mar 6, 2024

Hi,
Thank you for your help !
I agree, you should put this use case in the documentation, I thought it was not possible until I read your answer.

doutatsu · 2024-01-17T04:36:50Z

doutatsu
Jan 17, 2024

I've wanted to add some more recent and well-performing embedders, like E5 models or even mpnet model, which is the 3rd most popular Sentence Similarity model on Hugging Face.

But as I encountered errors and asked in support, I was told only BERT models for autoembedding with HuggingFace are supported at the moment, which makes it impractical for me to use built-in Meilisearch capabilities for this and I have to use a separate service, where I can use better models.

Here's a good link discussing differences and why many mainstream BERT embedders are not as good as new alternatives: https://blog.metarank.ai/from-zero-to-semantic-search-embedding-model-592e16d94b61

So it would be great if support for more models could be added.

2 replies

irevoire Jan 24, 2024
Collaborator

Hey, even if you can't use the hybrid search directly, can't you use the vector store embedded in Meilisearch to get better performance?
If your model returns a vector, it should work 🤔

doutatsu Feb 23, 2024

@irevoire I meant that it doesn't let me index anything, due to the model I am trying to use. Unless I am misunderstanding what you meant

carlosbaraza · 2024-01-24T18:21:00Z

carlosbaraza
Jan 24, 2024

Does anyone know how to monitor the state of the auto-embedding? I configured the embedder using the BAAI/bge-base-en-v1.5 model:

embedders: {
    default: {
      source: 'huggingFace',
      model: 'BAAI/bge-base-en-v1.5',
      documentTemplate:
        "Title: {{doc.title}}. {% if doc.subheading %}Subheading: {{doc.subheading}}. {% endif %}Content: {{doc.content}}."
    }
}

I have 100k documents in the index, and it has been running for a while already on my local machine, so I would like to find out how long the full auto-embedding will take.

Is there an endpoint that could give me information of the current embedding?

3 replies

dureuill Jan 25, 2024
Collaborator

Hello 👋,

We don't have this at this point. You could restart the indexing after setting Meilisearch's --log-level to trace (assuming self-hosted); then you'd have an idea of progress, but the output is going to be super noisy (and also, the indexing is going to be slower 😓).

Looking at your documentTemplate, it looks to me like the rendered template might contain too many words for long documents. I advise truncating the contents because embedders really don't like long texts. The quality of the embedding also decreases with the length of the text, because the quantity of information in an embedding is constant regardless of the length of the input. So maybe:

embedders: {
    default: {
      source: 'huggingFace',
      model: 'BAAI/bge-base-en-v1.5',
      documentTemplate:
        "Title: {{doc.title}}. {% if doc.subheading %}Subheading: {{doc.subheading}}. {% endif %}Content: {{doc.content|truncatewords: 50}}."
    }
}

As a point of reference, to embed 30k documents with a rendered prompt of about 50 words per document I needed about 30 mins on a mac M1. The rest of the indexing process was negligible (about 3s)
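
To preview what the truncated template would send to the embedder, the rendering can be approximated in Python. This is only a sketch that mimics the template above (Liquid's truncatewords keeps the first N words and appends an ellipsis when it cuts); truncate_words and render_prompt are hypothetical helpers, not Meilisearch code.

```python
def truncate_words(text, max_words=50, ellipsis="..."):
    """Keep the first `max_words` words; append an ellipsis if text was cut."""
    words = text.split()
    if len(words) <= max_words:
        return text
    return " ".join(words[:max_words]) + ellipsis


def render_prompt(doc, max_words=50):
    """Approximate the rendered documentTemplate for one document."""
    parts = [f"Title: {doc['title']}."]
    if doc.get("subheading"):
        parts.append(f"Subheading: {doc['subheading']}.")
    parts.append(f"Content: {truncate_words(doc['content'], max_words)}.")
    return " ".join(parts)


long_doc = {"title": "Example", "content": "word " * 500}
prompt = render_prompt(long_doc)
print(len(prompt.split()), "words in the rendered prompt")
```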

carlosbaraza Jan 25, 2024

Good to have a reference like that. That amounts to around 0.0003 s/character in M1. My M2 should take maybe 0.00025 s/character. That should have taken around 45 minutes for my dataset, and it was running for 12h, so there is definitely something off.

Did you run meilisearch on Docker or natively @dureuill?

dureuill Jan 29, 2024
Collaborator

Hello. I'm running natively. I'm not certain the time is going to be linear.
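
The back-of-envelope estimate discussed above can be written down as a simple linear extrapolation from the reference point (30k documents, about 50 words each, about 30 minutes). Bear in mind the caveat just given that the time may not be linear; the 50 words per document used for the 100k-document index is a hypothetical input, and estimate_embedding_seconds is a hypothetical helper.

```python
def estimate_embedding_seconds(
    num_docs,
    words_per_doc,
    ref_docs=30_000,
    ref_words_per_doc=50,
    ref_seconds=30 * 60,
):
    """Scale the reference timing linearly with the total number of words."""
    total_words = num_docs * words_per_doc
    ref_total_words = ref_docs * ref_words_per_doc
    return ref_seconds * total_words / ref_total_words


minutes = estimate_embedding_seconds(100_000, 50) / 60
print(f"about {minutes:.0f} minutes")  # about 100 minutes
```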

Gosti · 2024-01-30T14:20:45Z

Gosti
Jan 30, 2024

Since January 25, OpenAI has offered two new embedding models:

text-embedding-3-small
text-embedding-3-large

They also added an optional field that can limit the number of dimensions for both new models.

As of right now only openai-text-embedding-ada-002 is supported, without the optional parameter.

userProvided embeddings work fine, but as OpenAI is already integrated, wouldn't it be better to extend the existing feature with the latest models and the new parameter?

see: OpenAI blog post

1 reply

macraig Feb 29, 2024
Maintainer

Hi @Gosti , the new OpenAI models and the optional dimension parameter will be available in our upcoming v1.7 release, which will be out on March 11th.

doutatsu · 2024-05-15T07:48:51Z

doutatsu
May 15, 2024

Are there any updates on the stabilisation of this feature? Really want to start using it in production, but waiting till it's not an experimental feature

2 replies

macraig May 15, 2024
Maintainer

Hi @doutatsu , we're still iterating on the feature and improving some performance aspects of it before we consider it stable. To give you some context, for v1.9 (which will be out July 1st) we're planning to move all vector storage to the vector store and avoid regenerating embeddings when you import a dump. We don't have a fixed stabilization date, but it won't be sooner than v1.10 (out on August 26th). In the meantime, please let us know if you have any further feedback on the feature.

doutatsu May 15, 2024

No worries, just wanted to understand the timeline a bit better. This gives me some clarity, thanks

abdullah-alnahas · 2024-05-15T13:10:39Z

abdullah-alnahas
May 15, 2024

Why is this feature not stable yet?
Storing the vectors is currently very expensive, and retrieving them is too. We hope to make progress on that. We are unsure of the API surface we want to expose, even if the current one seems correct.

Regarding the storage part, an excellent solution would be binary quantization of vectors. It maintains 95% to 99% of the retrieval performance with significantly less storage because the vectors are binary instead of floating-point numbers.
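
To make the storage argument concrete, here is a minimal sketch of sign-based binary quantization in plain Python. It is a generic illustration of the technique, not how Meilisearch or any particular library implements it, and the sample vector values are made up.

```python
import struct


def binary_quantize(vector):
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    packed = bytearray((len(vector) + 7) // 8)
    for i, value in enumerate(vector):
        if value > 0:
            packed[i // 8] |= 1 << (i % 8)
    return bytes(packed)


def hamming_distance(a, b):
    """Number of differing bits; a cheap stand-in for distance on binary codes."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


vector = [0.12, -0.5, 0.33, -0.01] * 192  # 768 dimensions, made-up values
float_bytes = len(struct.pack(f"{len(vector)}f", *vector))  # 4 bytes per float32
binary_bytes = len(binary_quantize(vector))                 # 1 bit per dimension
print(float_bytes, binary_bytes, float_bytes // binary_bytes)  # 3072 96 32
```

Going from 32 bits to 1 bit per dimension is where the storage saving comes from; candidates found with the binary codes are typically re-scored with the original vectors to recover accuracy.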

Regarding the speed, I think the necessary algorithms and data structures are already available. There are many established and emerging players in the vector search field that claim impressive speeds. For example, a new library claims "0.1 milliseconds query latency on million-scale vector datasets." Therefore, achieving this should be feasible.

All that being said, considering the current vector search landscape, natively implementing Colbert ranking is indispensable for a top-choice, go-to database solution.

2 replies

Kerollmops May 15, 2024
Maintainer Author

Hey @abdullah-alnahas 👋

Thank you for taking the time to look at this feature. We already plan to work on binary quantization on our internal vector store. It would be awesome to reduce the storage size by around 32x 🚀

Regarding the speed, we are pretty happy with it, but we still want to measure and compare it with other solutions. For now, Meilisearch/arroy is very pleasant to work with, and the quality and speed of this library are very good!

I will probably need more information about Colbert's ranking. Do you have any links or something?

abdullah-alnahas May 15, 2024

Hey @Kerollmops! 👋

I'm thrilled about the prospect of getting BQ upstreamed to MS!

Here are some resources that you might find useful about ColBERT and its late interaction algorithm

Original implementation repo: ColBERT GitHub
This blog post provides an excellent summary of the work: What is ColBERT and Late Interaction and Why They Matter in Search

In my opinion, the key innovation of ColBERT is not just the embedding model they train/finetune, but rather the late interaction algorithm. The algorithm calculates the relevance scores of documents to a given query by first embedding each token of the query and each token of each document. It then computes the dot product between the query embeddings and document embeddings, applies max-pooling to the dot product matrix to extract the most relevant information, and sums up the resulting scores to obtain a final relevance score for each document. Finally, the algorithm sorts the documents based on their relevance scores and returns the top-k documents.
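
The scoring step described above can be sketched in a few lines of plain Python. The token embeddings below are tiny made-up vectors, and late_interaction_score and rank are hypothetical helper names; a real setup would get one embedding per token from a ColBERT-style model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def late_interaction_score(query_tokens, doc_tokens):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


def rank(query_tokens, docs, k=10):
    """Return the top-k (doc id, score) pairs, best score first."""
    scored = [
        (late_interaction_score(query_tokens, tokens), doc_id)
        for doc_id, tokens in docs.items()
    ]
    scored.sort(reverse=True)
    return [(doc_id, score) for score, doc_id in scored[:k]]


query = [[1.0, 0.0], [0.0, 1.0]]  # two query token embeddings
docs = {
    "a": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "b": [[0.5, 0.5]],              # partially matches both
    "c": [[0.0, 0.8]],              # matches only the second token
}
print(rank(query, docs))
```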

doutatsu · 2024-05-26T15:33:27Z

doutatsu
May 26, 2024

Is there a known memory leak issue on version 1.8.1 with the vector store added? As can be seen, prior to adding vector info to the index (26k documents with 768-dimensional vectors), I had stable RAM consumption just below 1GB. But after adding the vector store, even when not in use, there seems to be a memory leak.

29 replies

doutatsu Jun 12, 2024

Gotcha - I'll be waiting for the bug fix then, thanks for the update.

dureuill Jun 12, 2024
Collaborator

you're welcome, thanks for your investigation. Sorry about the bug, we'll keep you posted 👍

dureuill Jun 19, 2024
Collaborator

Hello, we found a fix, we'll be releasing a v1.8.3 soon. More details in the PR: meilisearch/meilisearch#4707

doutatsu Jun 19, 2024

Fantastic news, thank you very much @dureuill!

doutatsu Jun 21, 2024

Just wanted to come back and confirm your fix worked like a charm. Thanks again!

macraig · 2024-06-25T19:37:56Z

macraig
Jun 25, 2024
Maintainer

Adding a feature request from rustyx for non-BERT models described in meilisearch/meilisearch#4718

0 replies

alimoezzi · 2024-06-26T13:08:52Z

alimoezzi
Jun 26, 2024

Hi everyone 👋

v1.8 has been released! 🪼

We've added a new REST embedder source and a distribution shift setting to fine-tune your hybrid ranking scores. There are also some breaking changes in the search response to take into account, so make sure to check the release notes before updating.

As usual, let us know what you think!

A null value is a valid value for _vectors. Can the REST embedder also respond with null at the correct JSON path to set the field to null?

9 replies

dureuill Jun 26, 2024
Collaborator

text sent to an image embedding very much looks like an error to me?

alimoezzi Jun 26, 2024

With the rest embedder, all the fields are passed to the embedder. One can retrieve the image from the id field.
But in a collection, there might be documents that aren't associated with an image or that, in general, need to be opted out.

alimoezzi Jun 26, 2024

You are forgetting that rest is an autoembedder.

text sent to an image embedding very much looks like an error

The use case is to opt out of autoembedder programmatically with existing null feature.

dureuill Jun 26, 2024
Collaborator

OK I think I have a clearer view of the use case, thank you

dureuill Jun 26, 2024
Collaborator

Do note that, absent this feature and as a workaround, you can manually set the key corresponding to that embedder to null in the _vectors field of documents that should not have a vector for that embedder.

This also works for autoembedder
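As an illustration of that workaround, the sketch below builds a document payload in which one document opts out of an embedder by setting that embedder's key to null under _vectors. The embedder name "image", the document fields, and the helper name are hypothetical.

```python
import json


def with_opt_out(document: dict, embedder: str) -> dict:
    """Return a copy of `document` whose `_vectors` entry for `embedder` is null."""
    vectors = dict(document.get("_vectors") or {})
    vectors[embedder] = None  # serialized as JSON null
    return {**document, "_vectors": vectors}


docs = [
    # This document has no _vectors field, so it is left to the embedder as usual.
    {"id": 1, "title": "Product with an image"},
    # This document opts out of the hypothetical "image" embedder.
    with_opt_out({"id": 2, "title": "Product without an image"}, "image"),
]

# JSON body that would be sent to the documents route of the index.
payload = json.dumps(docs)
```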

macraig · 2024-07-01T20:02:32Z

macraig
Jul 1, 2024
Maintainer

Hey folks 👋

🦎 v1.9 has been released and it includes multiple updates and some ⚠️ breaking changes ⚠️ to the hybrid search feature. Please check the release notes for all the update details.

Looking forward to your feedback!

0 replies

ifsheldon · 2024-07-03T13:03:29Z

ifsheldon
Jul 3, 2024

Would love to see configuration on API base for OpenAI embedder get supported.

Currently it's hardcoded
https://github.com/meilisearch/meilisearch/blob/809e742253511dae3adab970592f5a76f8c7d182/milli/src/vector/openai.rs#L145

but in an enterprise environment, we may need to access the service through a different URL or a proxy, which is done by changing the URL base. The official openai Python library supports this.

4 replies

macraig Jul 3, 2024
Maintainer

@ifsheldon have you considered using the REST embedder option?

ifsheldon Jul 4, 2024

Yes, I've read the code, but it seems I need to duplicate a lot of configurations and code to get a URL replaced.

macraig Jul 4, 2024
Maintainer

Got it, thanks for the feedback! We will consider adding the option to switch the URL in the OpenAI embedder and keep you updated

dureuill Jul 16, 2024
Collaborator

Hello @ifsheldon 👋

URL override for OpenAI embedder is implemented as part of meilisearch/meilisearch#4801

macraig · 2024-07-17T14:21:00Z

macraig
Jul 17, 2024
Maintainer

Hi everyone,

We are preparing to stabilize the feature and would appreciate your feedback to improve it. We noticed that most users opt for the userProvided embedder option. If you are using this option, could you share why you chose it over the other available options?

Your input is invaluable in helping us refine and enhance the feature.

Thank you!

0 replies

oliver-kriska · 2024-07-25T14:09:28Z

oliver-kriska
Jul 25, 2024

Hi,
we are considering using Meilisearch for our marketplace. Currently we use PostgreSQL for our hybrid search, with OpenAI embeddings. The "semantic part" uses a simple inner product distance. For full-text search we use the similarity function from PostgreSQL, with boosting for entries that contain a specific value. Both searches return rankings that are merged with very simple Reciprocal Rank Fusion logic. Is it possible to use this kind of logic with your hybrid search as well?
That's our solution to the problem; yours may be different, so here is the problem:
We have products in Norwegian and Swedish. A customer searches in only one language per request. These languages contain compound words formed by merging multiple words, so regular full-text search doesn't work the way we would like. That's why we use the similarity function. But we also want to boost results that are an exact match. Not sure if this describes our problem well enough.
Our current solution is good, but one query takes about 60-200ms, and that's a lot for our most-used entry-point query.
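For readers unfamiliar with the merging step mentioned above, here is a minimal sketch of Reciprocal Rank Fusion over two ranked lists of ids. The constant k=60 is the commonly used default and an assumption here, as are the function name and the sample ids; this illustrates the technique as described in this post, not how Meilisearch merges keyword and semantic results.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked id lists: each list contributes 1 / (k + rank)
    to an id's score, and ids are returned by descending total score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical rankings returned by the full-text and the semantic search.
keyword_ranking = ["p1", "p2", "p3"]
semantic_ranking = ["p2", "p4", "p1"]
fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
```

Ids that appear near the top of both lists ("p2", then "p1") end up ahead of ids that appear in only one list.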

6 replies

oliver-kriska Jul 29, 2024

Hi, thanks for reply.
We have a product name (avg 20 chars) and a description (avg 600 chars). We store embeddings per product like you do. But we also store search term embeddings as a kind of cache, because fetching an embedding vector every time a customer hits search is expensive. I just checked: we have about 10k products in search at the moment. Do you have any idea how long a search for one query term takes with 10k entries? Currently my only "selling point" for using Meilisearch is that it will be much faster, but I'm only guessing. If you call the OpenAI API for an embedding every time we call the search function, even with the same query term, it can be very slow.

dureuill Jul 29, 2024
Collaborator

description(avg 600 chars)

It sounds just a bit long for comfort. I try to count in words, and I find 30 words to be a good length, but YMMV. Using the documentTemplate, you can specify with liquid filters that you want to truncate the field to some length, e.g. {{doc.name}} is {{doc.description|truncatewords:30}}.
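A sketch of what such an embedder settings payload could look like, built in Python so it can be printed and inspected. The embedder name "default" and the model name are assumptions; the documentTemplate is the one from the example above. The resulting JSON would be sent as the body of a settings update for the index.

```python
import json

embedder_settings = {
    "embedders": {
        "default": {  # embedder name: assumption
            "source": "openAi",
            "model": "text-embedding-3-small",  # model choice: assumption
            # Only the first 30 words of the description reach the embedder;
            # the stored document itself is left untouched.
            "documentTemplate": "{{doc.name}} is {{doc.description|truncatewords:30}}",
        }
    }
}

print(json.dumps(embedder_settings, indent=2))
```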

We don't provide a cache for embedded queries in Meilisearch currently. We might consider this in a future version.

oliver-kriska Jul 29, 2024

thanks, good point about the description. We will look into it as well. Currently we take the whole description without technical chars (tiptap js editor), so maybe the real avg is much smaller, but I got the point.

dureuill Jul 29, 2024
Collaborator

Note that you don't need to change the description inside of the document to truncate it on your end, you can do it while specifying the documentTemplate parameter in the embedder settings in Meilisearch (as I did above in my example)

oliver-kriska Jul 29, 2024

yes I know, but it's also about getting the proper context. When you just truncate to the first 30 words, you can lose some important points. So for us it would be much better to take (after first generating) 30 tags per product and use them together with the name to create proper vectors.

andreieuganox · 2024-08-16T08:34:58Z

andreieuganox
Aug 16, 2024

Is there any way to check the indexing status? Is there any way to check for which documents the vectors are generated?
The problem at hand is:

We enabled the experimental feature;
Added embedder to the index.

And nothing happens.
Checking OpenAI - only a few requests for embedding were executed, but nothing after.
Checking search works - no cosines.
We indexed and re-indexed all documents 10 times - nothing.
Update documents - nothing.

So, the real question at hand is - how to debug what is going on? Please advise.

2 replies

curquiza Aug 19, 2024
Maintainer

Hello @andreieuganox
You can follow the advancement of tasks with the /tasks route: https://www.meilisearch.com/docs/reference/api/tasks#get-tasks

dureuill Aug 26, 2024
Collaborator

Hello @andreieuganox

As @curquiza said, the first step is to check that the embedder addition task completed successfully via the /tasks route. Once this has been done, you can check that your documents have vectors by fetching them (POST /indexes/{:indexUid}/documents/fetch) with the retrieveVectors parameter set to true in the fetch request.
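The two checks above can be sketched with the Python standard library as follows. The base URL, the index uid "products", and the helper names are assumptions; an Authorization header is omitted and would be needed on an instance protected by a master key. The requests are only built here, and `send` performs the actual call when you run it against a live instance.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7700"  # assumption: local instance
INDEX_UID = "products"              # hypothetical index uid


def tasks_request(index_uid: str) -> urllib.request.Request:
    """List settings-update tasks for the index, to confirm the embedder task succeeded."""
    url = f"{BASE_URL}/tasks?indexUids={index_uid}&types=settingsUpdate"
    return urllib.request.Request(url, method="GET")


def fetch_with_vectors_request(index_uid: str, limit: int = 5) -> urllib.request.Request:
    """Fetch a few documents together with their stored vectors."""
    body = json.dumps({"retrieveVectors": True, "limit": limit}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/indexes/{index_uid}/documents/fetch",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send a prepared request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


tasks_req = tasks_request(INDEX_UID)
fetch_req = fetch_with_vectors_request(INDEX_UID)
```

Calling `send(tasks_req)` shows whether the settings task succeeded or failed (a failed task carries an error), and `send(fetch_req)` shows whether the returned documents contain vectors for your embedder.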

macraig · 2024-08-29T18:58:33Z

macraig
Aug 29, 2024
Maintainer

Hello again everyone 👋

🦩 v1.10 has been released and it includes multiple updates and some ⚠️ breaking changes ⚠️ to the hybrid search feature. Please check the release notes before upgrading.

Looking forward to your feedback!

1 reply

CommanderStorm Aug 29, 2024

The Rest embedder looks muuuch cleaner. Honestly, was scratching my head about that API-design a lot during meilisearch/meilisearch-rust#554

cgtobi · 2024-09-10T08:05:29Z

cgtobi
Sep 10, 2024

Hi, I am experimenting with local embedders. In the FAQ for Multilingual-E5-large they state that it is required to prefix input texts with "query: " and "passage: ". But to my understanding this would then interfere with the normal keyword search. Can anyone point me to how this would/should be used with Meilisearch?

Thanks in advance
Tobi

2 replies

dureuill Sep 10, 2024
Collaborator

Hello,

Meilisearch does not currently support different embedding variants at search and indexing time.
Now, in this case it can be implemented in the following way:

Put passage: at the start of your documentTemplate, so that passage: becomes part of the embedding text for documents at indexing time.
In your frontend, prepend query: to the user requests before sending them to Meilisearch.

That being said, looking at the model you linked, it does not appear to be a BERT model, so it won't work with a huggingFace embedder (we only support BERT models at this time). You'll have to use a rest or a userProvided embedder.
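A small sketch of the two steps, assuming an embedder named "default" and a document field called title (both hypothetical). The hybrid parameters mirror the search requests shown elsewhere in this thread. Note that the prefix is also part of q, so any keyword part of the search sees it too; a semanticRatio of 1.0 is used here to keep the example purely semantic.

```python
DOCUMENT_PREFIX = "passage: "
QUERY_PREFIX = "query: "


def document_template(fields_template: str) -> str:
    """Prefix the documentTemplate so indexed text is embedded as a passage."""
    return DOCUMENT_PREFIX + fields_template


def search_body(user_query: str, embedder: str = "default", semantic_ratio: float = 1.0) -> dict:
    """Build a search request body whose query text carries the query prefix."""
    return {
        "q": QUERY_PREFIX + user_query,
        "hybrid": {"semanticRatio": semantic_ratio, "embedder": embedder},
    }


template = document_template("{{doc.title}}")  # hypothetical field
body = search_body("turntable")
```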

cgtobi Sep 10, 2024

Thank you very much for the insights @dureuill

tlindener · 2024-09-10T15:44:18Z

tlindener
Sep 10, 2024

Do you guys have any experience/recommendations for multilanguage models? We're trying to get meilisearch running on english/german use cases and the search results have been quite odd.

One such example (english) without semantic search:

{
  "q": "turntable",
  "hybrid": {
    "semanticRatio": 0.0,
    "embedder": "default"
  },
  "showRankingScore": true,
  "rankingScoreThreshold": 0.94,
  "showRankingScoreDetails": true
}

First hit: "title": "TEAC - Turntable - Natural Wood"

With semantic search

{
  "q": "turntable",
  "hybrid": {
    "semanticRatio": 0.7,
    "embedder": "default"
  },
  "showRankingScore": true,
  "rankingScoreThreshold": 0.94,
  "showRankingScoreDetails": true
}

"title": "Timex - Ladies' Health Tracker Watch - Blue"

1 reply

macraig Sep 11, 2024
Maintainer

Hi @tlindener,

We don’t formally recommend any specific models, but I can share an anecdote that might help. While working on playground.meilisearch.com and our blog post on choosing the best model for semantic search, we noticed a significant difference when using multi-modal models. One that stood out for us is the cohere-multilingual model (Cohere docs), which performed better than voyage-multilingual in our experience. It could be worth exploring for your English/German use case.

underthesand · 2024-09-12T08:37:31Z

underthesand
Sep 12, 2024

Hello, how can I update the REST embedding server URL without recomputing all the embeddings, please?

1 reply

ManyTheFish Sep 16, 2024
Collaborator

Hello @underthesand,
for now it's not possible to do it, Meilisearch will automatically recompute the embeddings for all documents. 😞
Meilisearch can't be sure that the generated vectors coming from the new API are the same.

note for the @meilisearch/product-team, we could possibly add a new setting in the embedder API forcing Meilisearch to keep the old vectors, something like unsafe-keep-current-vectors

macraig · 2024-10-29T15:29:32Z

macraig
Oct 29, 2024
Maintainer

Hello once again 👋

🐿️ v1.11 has been released and it includes multiple updates and some ⚠️ breaking changes ⚠️ to the hybrid search feature. Please check the release notes before upgrading.

Looking forward to your feedback!

0 replies
