
Add support for asymmetric embedding models #710

Open
wants to merge 5 commits into main

Conversation

@br3no commented Apr 25, 2024

Description

This PR adds support for asymmetric embedding models such as https://huggingface.co/intfloat/multilingual-e5-small to the neural-search plugin.

It builds on the work done in opensearch-project/ml-commons#1799.

Asymmetric embedding models behave differently when embedding passages and queries. To that end, the model must "know" at inference time what kind of data it is embedding.

The changes are:

1. src/main/java/org/opensearch/neuralsearch/processor/TextEmbeddingProcessor.java

The processor signals that it is embedding passages by passing the new AsymmetricTextEmbeddingParameters with the content type EmbeddingContentType.PASSAGE.

2. src/main/java/org/opensearch/neuralsearch/query/NeuralQueryBuilder.java

Analogously, the query builder uses EmbeddingContentType.QUERY.

3. src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java

This is where most of the work was done. The class has been extended in a backwards-compatible way with inference methods that allow callers to pass MLAlgoParams objects. Passing AsymmetricTextEmbeddingParameters (which implements MLAlgoParams) is mandatory for asymmetric models, while symmetric models do not accept these parameters at all.

The only way to know whether a model is asymmetric or symmetric is to read its model configuration: if the configuration contains a passage_prefix and/or a query_prefix, the model is asymmetric; otherwise it is symmetric.

The src/main/java/org/opensearch/neuralsearch/ml/MLCommonsClientAccessor.java class deals with this, keeping the complexity in one place and not requiring any API change to the neural-search plugin (as proposed in #620). When calling the inference methods, clients (such as the TextEmbeddingProcessor) may pass the AsymmetricTextEmbeddingParameters object without caring whether the model they are using is symmetric or asymmetric. The accessor class first reads the model's configuration (by calling the getModel API of the mlClient) and then handles the parameters appropriately.

To avoid adding this extra round trip to every inference call, the asymmetry information is kept in an in-memory cache.
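A minimal sketch of this flow (illustrative only; helper names such as hasPrefixes, the exact signatures, and the import paths are simplifications, not the literal code in the diff):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

import org.opensearch.core.action.ActionListener;

// Inside MLCommonsClientAccessor (sketch):
private final Map<String, Boolean> modelAsymmetryCache = new ConcurrentHashMap<>();

private void resolveModelAsymmetry(String modelId, ActionListener<Boolean> listener) {
    Boolean cached = modelAsymmetryCache.get(modelId);
    if (cached != null) {
        // Cache hit: no extra getModel round trip after the first inference call for this model.
        listener.onResponse(cached);
        return;
    }
    mlClient.getModel(modelId, ActionListener.wrap(mlModel -> {
        // A model counts as asymmetric when its configuration defines a passage_prefix
        // and/or a query_prefix; hasPrefixes(...) is a placeholder for that check.
        boolean isAsymmetric = hasPrefixes(mlModel.getModelConfig());
        modelAsymmetryCache.putIfAbsent(modelId, isAsymmetric);
        listener.onResponse(isAsymmetric);
    }, listener::onFailure));
}
```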

Issues Resolved

#620

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@navneet1v (Collaborator)

@br3no can you add an entry in the changelog?

@navneet1v (Collaborator) commented Apr 26, 2024

@br3no Thanks for raising the PR. I am wondering whether we require this change. In the ML Commons repository a generic MLInference processor is being launched, which is supposed to handle inference for any kind of model during both ingestion and search. RFC: opensearch-project/ml-commons#2173

That capability is being built right now. Do you think we still need this feature then?

@br3no (Author) commented Apr 26, 2024

@navneet1v I have been loosely following the discussions in the mentioned RFC. It's a large change that I don't expect to be stable soon – the PR is very much in flux. Also, I don't see the use-case of asymmetric embedding models being addressed.

This PR here is much smaller in comparison and does not in any way conflict with the RFC work. If, once the work on the ML Inference Processors is finished, this use-case is addressed there as well, we can deprecate and eventually remove the functionality again.

Until then, this PR offers users the chance to use more modern local embeddings. I'm eager to give this a spin, tbh.

@navneet1v (Collaborator)

Also, I don't see the use-case of asymmetric embedding models being addressed.

If that is the case I would recommend posting the same on the RFC to ensure that your use case is handled.

On the other hand, I do agree this is an interesting feature. I would like to get some eyes on this change, mainly on whether it should be added or not, given that a more generic processor is around the corner. As far as my opinion is concerned, the main reason for the generic processor was to avoid creating new processors or updating existing ones to support new model types, which is what is happening in this PR.

Thoughts? @jmazanec15, @martin-gaievski, @vamshin, @vibrantvarun.

Let me also add some PMs from the OpenSearch project to get their thoughts. @dylan-tong-aws

codecov bot commented Apr 26, 2024

Codecov Report

Attention: Patch coverage is 87.12871% with 13 lines in your changes missing coverage. Please review.

Project coverage is 84.41%. Comparing base (7c54c86) to head (44f14ec).
Report is 12 commits behind head on main.

Current head 44f14ec differs from pull request most recent head 6d3dba6

Please upload reports for the commit 6d3dba6 to get more accurate results.

Files | Patch % | Lines
...earch/neuralsearch/ml/MLCommonsClientAccessor.java | 85.22% | 9 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main     #710      +/-   ##
============================================
- Coverage     85.02%   84.41%   -0.61%     
+ Complexity      790      785       -5     
============================================
  Files            60       59       -1     
  Lines          2430     2464      +34     
  Branches        410      409       -1     
============================================
+ Hits           2066     2080      +14     
- Misses          202      215      +13     
- Partials        162      169       +7     

☔ View full report in Codecov by Sentry.

@br3no (Author) commented Apr 26, 2024

@navneet1v I have added a comment earlier today to the RFC (cf. opensearch-project/ml-commons#2173 (comment)).

Sure, let's open the discussion and get some PMs into it.

I really don't mind leaving this out if the support is introduced in another PR in 2.14. I'm concerned opensearch-project/ml-commons#2173 is a much larger effort, that won't be ready that quickly...

It's not about my contribution – I need the feature. 🙃

@navneet1v (Collaborator)

I really don't mind leaving this out if the support is introduced in another PR in 2.14. I'm concerned opensearch-project/ml-commons#2173 is a much larger effort, that won't be ready that quickly...

I can see the feature is marked for the 2.14 release of OpenSearch. Let me add maintainers from the ML team too. @mingshl, @ylwu-amzn

Signed-off-by: br3no <[email protected]>
@br3no (Author) commented Apr 29, 2024

@mingshl @ylwu-amzn, I'd really like to have this feature in 2.14.

Do you think this use-case will be fully supported with opensearch-project/ml-commons#2173? Cf. opensearch-project/ml-commons#2173 (comment)

If not, I'd be happy to help this PR get merged as an interim solution! Let me know what you think!

@mingshl commented Apr 29, 2024

@br3no The ML inference processor is targeting support for remote models only at first. How do you usually connect this model? Is it local or remote?

If remote, can you please provide a SageMaker deployment code snippet so I can quickly test it in a 2.14 test cluster? Thanks.

@br3no (Author) commented May 13, 2024

@mingshl sorry for taking so long to answer!

The use-case for now is to use a local, asymmetric model such as https://huggingface.co/intfloat/multilingual-e5-small.

This PR here is the last puzzle piece to allow one to use these kinds of models and should in principle also work with remote models. It makes sure that the neural-search plugin uses the correct inference parameters when embedding passages and queries with asymmetric models. Regardless of whether the model is local or remote, if you are using asymmetric models, you will need to provide this information anyway.

The thing is that asymmetric models need to know at inference time what exactly they are embedding. OpenSearch currently treats embedding models as symmetric, meaning that regardless of whether the text being embedded is a query or a passage, the embedding will always be the same. Asymmetric models require content "hints" for the text being embedded; the model mentioned above uses the string prefixes passage: and query:. These models perform better than similarly sized symmetric models.

In opensearch-project/ml-commons#1799 we added the concept of asymmetric models to ml-commons, introducing the AsymmetricTextEmbeddingParameters class, which is used at inference time to signal whether the text being embedded is a query or a passage. This PR simply uses that new infrastructure.
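To illustrate the two call sites, a sketch (the package paths and builder field name are my best recollection of the ml-commons layout and may differ slightly; treat this as an illustration, not the literal diff):

```java
import org.opensearch.ml.common.input.parameter.MLAlgoParams;
import org.opensearch.ml.common.input.parameter.textembedding.AsymmetricTextEmbeddingParameters;
import org.opensearch.ml.common.input.parameter.textembedding.AsymmetricTextEmbeddingParameters.EmbeddingContentType;

// At ingestion time (TextEmbeddingProcessor): the text is embedded as a passage.
MLAlgoParams passageParams = AsymmetricTextEmbeddingParameters.builder()
    .embeddingContentType(EmbeddingContentType.PASSAGE)
    .build();

// At search time (NeuralQueryBuilder): the text is embedded as a query.
MLAlgoParams queryParams = AsymmetricTextEmbeddingParameters.builder()
    .embeddingContentType(EmbeddingContentType.QUERY)
    .build();
```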

I would really be happy to get this merged as an interim solution until the ml inference processor fully supports this use-case.

@reuschling

I also vote for this PR, as I am in need of this functionality.

@navneet1v (Collaborator)

@br3no would it be possible for you to contribute local model support back to the MLInference processor? Is that even an option?

@br3no (Author) commented May 15, 2024

@navneet1v you mean making sure this works there as well? Sure, I can commit to that. I'd propose then to merge this PR now and then start the work to eventually replace this once the MLInference processor supports this use case...

@navneet1v (Collaborator)

I'd propose then to merge this PR now and then start the work to eventually replace this once the MLInference processor supports this use case...

The problem is that once this is released, it cannot be deprecated until a major version release. Hence I am a bit hesitant to have this feature in the neural-search plugin.

@br3no (Author) commented May 16, 2024

This PR has a very small surface. It doesn't change any APIs, so I believe this is not something to worry about, actually. Once the MLInference processor supports asymmetric models, the neural-search plugin can be changed to use it instead of what I built here. That would not be a breaking change, only an internal implementation detail.

@ylwu-amzn (Collaborator)

This PR has a very small surface. It doesn't change any APIs, so I believe this is not something to worry about, actually. Once the MLInference processor supports asymmetric models, the neural-search plugin can be changed to use it instead of what I built here. That would not be a breaking change, only an internal implementation detail.

Thanks @br3no. Can you show some examples of a neural search query with an asymmetric embedding model? The ML inference processor can support asymmetric embedding models. We are working on unifying the interface so we can also support local models in the ML inference processor. I think @navneet1v's concern is valid; we should consider the deprecation effort. Let's check what the user experience looks like with this PR.

@br3no (Author) commented May 17, 2024

@ylwu-amzn as I said in the comment above, there is no API change in this PR. You would use the neural-search plugin in the same way it is used today.

What is implemented here?
The main work is done in MLCommonsClientAccessor. Whenever neural-search embeds text (either in the TextEmbeddingProcessor or in the NeuralQueryBuilder), the plugin ensures that, if the embedding model is asymmetric, the parameters are passed on correctly. More details can be found in the description and code.
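To make that concrete, the accessor-side dispatch looks roughly like this (a sketch with simplified signatures; resolveModelAsymmetry and predict stand in for the actual private helpers and are not the literal method names in the diff):

```java
// Sketch of the dispatch inside MLCommonsClientAccessor (simplified):
public void inferenceSentences(
    String modelId,
    List<String> inputText,
    MLAlgoParams mlAlgoParams,
    ActionListener<List<List<Float>>> listener
) {
    resolveModelAsymmetry(modelId, ActionListener.wrap(isAsymmetric -> {
        // AsymmetricTextEmbeddingParameters are only forwarded for asymmetric models;
        // symmetric models keep the exact pre-PR behavior, so there is no API or BWC change.
        MLAlgoParams effectiveParams = isAsymmetric ? mlAlgoParams : null;
        predict(modelId, inputText, effectiveParams, listener);
    }, listener::onFailure));
}
```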

@ylwu-amzn (Collaborator)

@br3no, sorry that I didn't have enough time to read the details. I took a quick look; it seems the code specifies the "QUERY" type when running a query and sets the "PASSAGE" type when ingesting. That means no API change: customers can use the same ingest pipeline and neural search, and switching to an asymmetric model is seamless. I think this is a good design, as it supports BWC and it seems it can be deprecated/migrated together with the current text embedding APIs. I'm good with merging this.

@br3no (Author) commented May 18, 2024

Exactly! Great!

@br3no (Author) commented May 22, 2024

@ylwu-amzn could you drive the review process forward then?

@ylwu-amzn (Collaborator) commented Jun 10, 2024

@ylwu-amzn could you drive the review process forward then?

Sorry, I missed your comment. Asking the neural-search maintainers if they have other concerns.

Update: Pinged neural-search plugin owner SDM @vamshin; he will ask the team to help review.

Mockito.verifyNoMoreInteractions(singleSentenceResultListener);
}

public void testInferenceSentences_whenGetModelException_thenFailure() {
Member

We need to test the scenario where we're retrying 1-2 times. I only see the scenario where the first request fails with an error that isn't retryable; this isn't full coverage.

@br3no (Author)

@martin-gaievski thanks for pointing this out.

I'm wondering, though, whether we should make this retryable at all. Let me elaborate:

In my understanding, inference requests are retried because they tend to fail more often than regular operations in OpenSearch. I don't know the history and complete reasoning behind this, so I speculate it has to do with the fact that the inference is done natively and that many things can go wrong there.

With my change, if fetching the model information fails (mlClient.getModel(modelId, ...)), there is no retry. Model information is fetched the first time inference is requested with a particular model. After that, the result is cached and the method behaves exactly as before the PR.

So my argument is: should we really add retry logic to this relatively simple operation? If getModel fails, it is most likely to fail again, so retrying wouldn't make sense. And if we did add it, one could argue that all operations in OpenSearch should be wrapped in retry logic.

* @param mlAlgoParams {@link MLAlgoParams} which will be used to run the inference
* @param listener {@link ActionListener} which will be called when prediction is completed or errored out
*/
public void inferenceSentence(
Member

We need to refactor this method to accept a single POJO object that can be built with the builder pattern, similar to the ml-commons class with algorithm params. Reason: we cannot add a new method every time we need to add a new parameter to the inference ML client.
I'm fine if we do it in a separate PR; please create a GH issue and post the link here.

@br3no (Author)

Okay. Cf. #790

@@ -40,6 +48,7 @@
public class MLCommonsClientAccessor {
private static final List<String> TARGET_RESPONSE_FILTERS = List.of("sentence_embedding");
private final MachineLearningNodeClient mlClient;
private final Map<String, Boolean> modelAsymmetryCache = new ConcurrentHashMap<>();
Member

Please add a comment about this cache's behavior and usage. You can start from the following:

  • it's local to the data node
  • how we invalidate and evict entries, if we're ever going to do that
  • on a cache miss, what the latency is to retrieve the value via an API call and put it into the cache
  • how big a single object is; that also implies some eviction strategy, as we cannot grow indefinitely
  • what the behavior is in case of a node restart/drop or the model getting redeployed

@br3no (Author)

  • Yes, it is local to the data node.
  • We are never evicting entries.
  • The latency is that of fetching the model configuration and parsing it.
  • The cache maps modelId to Boolean; the size requirement is roughly 20 chars plus one Boolean per entry.
  • If the node restarts, the cache will be empty; the first inference request will populate the cache entry again.
  • If the node drops out of the cluster, the inference request will fail. The cache will not be emptied.
  • If the model gets redeployed, the user will need to request inference with a new modelId, which will lead to a new cache entry. The old one will remain.

The only scenario I can think of where this design could be problematic is if a malicious actor floods the cluster with billions of inference requests for nonexistent models. This would lead to an increase in heap usage that would never be GC'd.

Let me know if you think this should be changed.

@martin-gaievski added the Enhancements label (Increases software capabilities beyond original client specifications) on Aug 7, 2024