
[FEATURE INTAKE] Improving Search relevancy through Reranker interfaces #542

Closed
4 of 5 tasks
martin-gaievski opened this issue Jan 17, 2024 · 9 comments
Labels
Features Introduces a new unit of functionality that satisfies a requirement

Comments

@martin-gaievski
Member

martin-gaievski commented Jan 17, 2024

This document captures the activities that need to be performed in order to prepare the Re-ranking Feature #485 for release.

Release Activities

Below are the release activities that need to be completed to ensure that the Re-ranking feature can be merged into the 2.12 release of OpenSearch.
Code Freeze Date: Feb 6, 2024
Release calendar: https://opensearch.org/releases.html

  • PR Merge
  • Application Security Approval
  • Benchmarking
  • Documentation
  • Feature Demo

PR Merge

Once the PR is approved, it can be merged into the feature branch. The change will be moved to the main and 2.x branches once the security review and benchmarking are done; we can wait for the documentation to be completed.

Status: Completed

Application Security Approval

Status: In progress

Benchmarking

To ensure that this feature is fully tested and that we are aware of its latency and search relevancy impact, the team needs to run the benchmarks.

Status: Not started

Benchmarking Details

Benchmarking tool: https://github.com/martin-gaievski/info-retrieval-test/tree/score-normalization-combination-testing/beir/retrieval

Cluster Configuration
| Config Key | Value |
|---|---|
| Data nodes | 3 |
| Data node type | r5.8xlarge |
| Master node | 1 |
| Master node type | c4.2xlarge |
| ML nodes | Use data nodes as ML nodes |
| ML node type | NA |
| ML model link | https://huggingface.co/sentence-transformers/msmarco-distilbert-base-tas-b |
| Heap size | 32 GB |
| Number of shards | 12 |
| Number of replicas | 1 |
| Number of segments | No force merge is required. |
| Refresh interval | default |
| Bulk size | 200 |
| Bulk clients | 1 |
| Search clients | 1 |
| k | 100 |
| k-NN algorithm | hnsw |
| size | 100 |
| k-NN engine | nmslib |
| Space type | inner product |
| Dimensions | 768 |
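
For context, a k-NN index created with the parameters above might look roughly like the sketch below. The index and field names (beir-test-index, passage_text, passage_embedding) are illustrative placeholders and not taken from this issue:

PUT /beir-test-index
{
  "settings": {
    "index": {
      "number_of_shards": 12,
      "number_of_replicas": 1,
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "passage_text": { "type": "text" },
      "passage_embedding": {
        "type": "knn_vector",
        "dimension": 768,
        "method": {
          "name": "hnsw",
          "space_type": "innerproduct",
          "engine": "nmslib"
        }
      }
    }
  }
}

This mirrors the shard count, replica count, engine, space type, and dimension values listed in the table.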
Data sets
| Data set | Link to download data set | Model zip file name | Model file link |
|---|---|---|---|
| NFCorpus | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | nfcorpus_traced.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| Trec-Covid | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | trec_covid_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| Scidocs | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | scidocs_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| Quora | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | quora_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| Amazon ESCI | https://github.com/amazon-science/esci-data?tab=readme-ov-file#usage | amazon_traced.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| DBPedia | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | dbpedia_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
| FiQA | https://github.com/martin-gaievski/info-retrieval-test/blob/score-normalization-combination-testing/README.md#beers-available-datasets | fiqa_tuned.zip | https://huggingface.co/navneet1v/finetunedmodels/tree/main |
Re-ranking Model

Use models that are open source so that the results can be reproduced by other users (see the example registration sketch below).

| Model name | Link to download model |
|---|---|
| | |

Once you have the results, please paste the result table in the RFC. The benchmarking results will be reviewed by the maintainers of the Neural Search plugin to understand the trade-offs. As general guidance, you should be able to justify the latency trade-offs (if any) with improved relevancy metrics.
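
For the registration sketch mentioned above: registering an open-source re-ranking (cross-encoder) model locally through ML Commons could look roughly like the request below. The model name and version are placeholders; check the ML Commons pretrained model list for exact values, or register a custom model instead:

POST /_plugins/_ml/models/_register?deploy=true
{
  "name": "huggingface/cross-encoders/ms-marco-MiniLM-L-6-v2",
  "version": "1.0.2",
  "model_format": "TORCH_SCRIPT"
}

Once deployed, the resulting model ID is what the rerank search pipeline references.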

Documentation

As this is a new feature, the feature owner needs to start writing the documentation for it. This task has not been started yet. We can add the new section under the Search Relevance table of contents: https://opensearch.org/docs/latest/search-plugins/search-relevance/index/.

Status: Not started

Expectations from the documentation:

  1. A working example should be provided to outline how to use the feature.
  2. Examples need to be provided on how to upload a local re-ranking model.
  3. An example ML Commons blueprint needs to be added and linked in this documentation, showing how to use a remote re-ranking model such as Cohere.
  4. Examples and details need to be provided on how the query and the processor can be configured. All the different permutations and combinations need to be covered (see the pipeline sketch after this list).
  5. Add a section on limitations, if any.
  6. If a blog post on this is not planned, add the benchmarking details to the documentation.
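
For item 4, here is a sketch of what a rerank search pipeline definition could look like, based on the released 2.12 documentation. The model_id value is a placeholder, and text_representation is just an example document field:

PUT /_search/pipeline/rerank_pipeline
{
  "response_processors": [
    {
      "rerank": {
        "ml_opensearch": {
          "model_id": "<re-ranking model ID>"
        },
        "context": {
          "document_fields": ["text_representation"]
        }
      }
    }
  ]
}

The queries later in this thread show the matching ext.rerank section that points the processor at the query text.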

Feature Demo

As this is a new feature, we need a feature demo for it. The Aryn team can provide a demo video.

Status: Not started

@martin-gaievski added the Features label on Jan 17, 2024
@HenryL27
Contributor

@martin-gaievski is there any chance you have a CloudFormation script or something similar that sets up the benchmarking cluster, or am I on my own?

@martin-gaievski
Member Author

If you are OK with hosting the cluster in AWS, you can use this tool: https://github.com/opensearch-project/opensearch-cluster-cdk. Since your code isn't part of the official build, you'll need to replace the neural-search artifacts on the data nodes before running benchmarks. You could try to build a complete deployable tarball using https://github.com/opensearch-project/opensearch-build/, but the setup may be a bit complex, so for a one-time test like this I suggest using cluster-cdk. You can take this latest 2.12 distribution build as a basis:

https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/x64/tar/dist/opensearch/opensearch-2.12.0-linux-x64.tar.gz
https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/arm64/tar/dist/opensearch/opensearch-2.12.0-linux-arm64.tar.gz

@HenryL27
Contributor

HenryL27 commented Feb 5, 2024

Performance benchmarks:

First a couple notes:

  • Cluster configuration: 3 r5.8xlarge, 1 c4.2xlarge
  • I did not load dbpedia because I ran out of time (or overloaded the cluster with too many requests; both happened)
  • I did not test trec-covid as I ingested it with a different tool that didn't play well with the benchmarking script (mappings were misaligned)
  • I didn't even try Amazon ESCI
  • The benchmarking script appears to use the "took" field of search responses to measure time. This makes sense. Unfortunately, it appears that this field is not updated by search pipelines, so it's basically useless for the purposes of this experiment. Instead, I measured total time from request to response directly in the benchmarking script. So all measurements include the network call timing (ssh from my laptop to AWS (IAD) cluster). I wouldn't be surprised if that contributes in large part to some of the baseline measurements.
  • I only ran 300 queries from each dataset. This was essential for something like quora, which has 10,000 queries, and since reranking is rather resource intensive, that adds up quickly (I needed to finish this month)
  • I did not run these on a GPU instance. I was going for fidelity to the cluster specification listed above, but given my other issues I'm not sure that was worth it. But that would probably speed up the model inferences significantly.
  • All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches
  • I used size=50 for all tests. Reranking latency scales linearly with size (except maybe if you used a remote API like cohere. But these are locally hosted model tests)
OK, here is the table:
| dataset | model | p50 (ms) | p90 (ms) | p99 (ms) | ndcg@10 |
|---|---|---|---|---|---|
| fiqa | bm25 | 156.0 | 182.0 | 268.03 | 0.2175 |
| fiqa | neural, no reranking | 506.0 | 974.4 | 1,335.25 | 0.3859 |
| fiqa | MiniLM-L-6-v2 | 2,177.5 | 2,372.2 | 2,558.25 | 0.3620 |
| fiqa | bge-rerank-base | 12,542.5 | 13,436.7 | 14,142.18 | |
| fiqa | bge-rerank-base (quantized) | 2,780.0 | 3,776.7 | 4,604.32 | 0.3217 |
| nfcorpus | bm25 | 155.0 | 179.1 | 229.23 | 0.3018 |
| nfcorpus | neural, no reranking | 446.0 | 948.2 | 1,326.14 | 0.3140 |
| nfcorpus | MiniLM-L-6-v2 | 5,500.0 | 6,203.6 | 6,692.94 | 0.3352 |
| nfcorpus | bge-rerank-base | 13,018.0 | 13,742.8 | 14,221.22 | |
| nfcorpus | bge-rerank-base (quantized) | 4,438.5 | 5,391.5 | 6,115.67 | 0.2987 |
| quora | bm25 | 157.0 | 182.0 | 262.13 | 0.7230 |
| quora | neural, no reranking | 506.0 | 952.7 | 1,307.02 | 0.8920 |
| quora | MiniLM-L-6-v2 | 993.0 | 1,207.2 | 1,541.17 | 0.8475 |
| quora | bge-rerank-base | 5,497.5 | 6,361.3 | 7,446.49 | 0.6711 |
| quora | bge-rerank-base (quantized) | 644.5 | 1,024.3 | 1,334.68 | 0.7074 |
| scidocs | bm25 | 156.0 | 182.0 | 249.17 | 0.1461 |
| scidocs | neural, no reranking | 509.0 | 961.4 | 1,302.03 | 0.2180 |
| scidocs | MiniLM-L-6-v2 | 2,150.0 | 2,348.7 | 2,705.5 | 0.1696 |
| scidocs | bge-rerank-base | 12,998.5 | 13,967.9 | 15,089.01 | |
| scidocs | bge-rerank-base (quantized) | 3,171.5 | 3,955.2 | 4,911.07 | 0.1477 |

If these seem like rather lackluster results, that's because they kinda are... I think I may have prepared bge wrong?

@martin-gaievski
Member Author

@HenryL27 thank you for sharing these results. For the review we also need results for search relevancy, mainly nDCG@10. The format can be a simplified version of the one shared in the blog post for hybrid query search.

@navneet1v
Collaborator

@HenryL27 I am closing this issue, as the feature is released.

@amitgalitz
Member

Hey @HenryL27, I am currently trying to replicate these results for reranking, and I wanted to ask what you meant by this statement:

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

For neural search benchmarking I am following the example given by Martin to test search latency with a neural query. However, for reranking, based on the docs I see for the feature, the query isn't a neural query. What do you mean by "all reranking experiments were done over neural searches"?

@HenryL27
Contributor

reranking is an extension of a query - you run a query and then you rerank the results from that query.

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

means that the base query I used (and subsequently extended with reranking) was a neural query with minilm.

@amitgalitz
Member

reranking is an extension of a query - you run a query and then you rerank the results from that query.

All embeddings for neural search were created with "sentence-transformers/all-MiniLM-L6-v2", and all reranking experiments were done over neural searches

means that the base query I used (and subsequently extended with reranking) was a neural query with minilm.

I see, so the index you queried against was a k-NN index (or at least had embeddings stored in one of its fields)? On the docs website I saw:

POST /_search?search_pipeline=rerank_pipeline
{
  "query": {
    "match": {
      "text_representation": {
        "query": "Where is Albuquerque?"
      }
    }
  },
  "ext": {
    "rerank": {
      "query_context": {
        "query_text_path": "query.match.text_representation.query"
      }
    }
  }
}

Is it more relevant to benchmark with a query that looks like this:

{
    "query": {
        "neural": {
          "passage_embedding": {
            "query_text": "Hi world",
            "k": 100
          }
        }
      },
  "ext": {
    "rerank": {
      "query_context": {
        "query_text_path": "query.neural.passage_embedding.query_text"
      }
    }
  }
}

My goal is to get a good baseline for p50, p90, and p99 query latency for the reranked neural clause, for comparison.

@HenryL27
Contributor

Yeah, the second is essentially what I did in benchmarking. In general, the reranking latency is going to be substantially higher than neural or bm25, just by the nature of the computations that are going on.
