[BUG] Hybrid search error with field of type nested on the index #466

Closed
tiagoshin opened this issue Oct 18, 2023 · 18 comments
Labels
bug Something isn't working v2.12.0 Issues targeting release v2.12.0

Comments

@tiagoshin

What is the bug?

I identified a bug in hybrid search on release 2.10. The same happens on release 2.11.

When any field in the index mapping properties has type nested, normalization and weighted combination are not applied. Instead, the scores are simply summed, the same way OpenSearch behaved before the hybrid search feature existed.
To identify it, I created an unused field with type nested in the index mapping properties and checked the scores returned by the hybrid search. For comparison, I did the same with the field declared as type text and checked the results.
The behavior is the same whether or not the nested field is actually used.

How can one reproduce the bug?

Before running these steps, create a model and use its model_id.

PUT {{host}}/_ingest/pipeline/pipeline-test
{
"description": "An NLP ingest pipeline",
"processors": [
{
"text_embedding": {
"model_id": "{{model_id}}",
"field_map": {
"name": "passage_embedding"
}
}
}
]
}

PUT {{host}}/index-test
{
"settings": {
"index.knn": true,
"default_pipeline": "pipeline-test"
},
"mappings": {
"properties": {
"id": {
"type": "text"
},
"passage_embedding": {
"type": "knn_vector",
"dimension": 384,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "lucene",
"parameters": {
"ef_construction": 512,
"m": 8
}
}
},
"name": {
"type": "text"
},
"passage_text": {
"type": "text"
},
"test": {
"type": "nested"
}
}
}
}

PUT {{host}}/index-test/_doc/1
{
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
PUT {{host}}/index-test/_doc/2
{
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
PUT {{host}}/index-test/_doc/3
{
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
PUT {{host}}/index-test/_doc/4
{
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
PUT {{host}}/index-test/_doc/5
{
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}

PUT {{host}}/_search/pipeline/nlp-search-pipeline
{
"description": "Post processor for hybrid search",
"phase_results_processors": [
{
"normalization-processor": {
"normalization": {
"technique": "min_max"
},
"combination": {
"technique": "arithmetic_mean",
"parameters": {
"weights": [
0.7,
0.3
]
}
}
}
}
]
}

Lexical search query

GET {{host}}/index-test/_search
{
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"match": {
"name": {
"query": "wild west"
}
}
}
}

Results:
{
"took": 10,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 4,
"relation": "eq"
},
"max_score": 1.7878418,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 1.7878418,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 0.58093566,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.55228686,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.53899646,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}

Semantic search query

GET {{host}}/index-test/_search
{
"_source": {
"excludes": [
"passage_embedding"
]
},
"query": {
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": "{{model_id}}",
"k": 20
}
}
}
}

Response:
{
"took": 47,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 0.65891314,
"hits": [
{
"_index": "index-test",
"_id": "2",
"_score": 0.65891314,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "1",
"_score": 0.6278618,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.62723345,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.6229783,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.5791679,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}

Hybrid search

GET {{host}}/index-test/_search?search_pipeline=nlp-search-pipeline
{
"_source": {
"exclude": [
"passage_embedding"
]
},
"query": {
"hybrid": {
"queries": [
{
"match": {
"name": {
"query": "wild west"
}
}
},
{
"neural": {
"passage_embedding": {
"query_text": "wild west",
"model_id": "{{model_id}}",
"k": 20
}
}
}
]
}
}
}

Response:
{
"took": 60,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 2.4157035,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 2.4157035,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 1.2398489,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 1.1795204,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 1.1181643,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.6229783,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
}
]
}
}

What is the expected behavior?

Note that on hybrid search steps, the score is higher than 1, which means that the normalization was not applied.
The expected result is what we get when the "test" field on the index is defined with type "text":
{
"took": 87,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 5,
"relation": "eq"
},
"max_score": 0.88318545,
"hits": [
{
"_index": "index-test",
"_id": "1",
"_score": 0.88318545,
"_source": {
"name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
"id": "4319130149.jpg"
}
},
{
"_index": "index-test",
"_id": "2",
"_score": 0.32350767,
"_source": {
"name": "A wild animal races across an uncut field with a minimal amount of trees .",
"id": "1775029934.jpg"
}
},
{
"_index": "index-test",
"_id": "5",
"_score": 0.18827114,
"_source": {
"name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
"id": "2691147709.jpg"
}
},
{
"_index": "index-test",
"_id": "3",
"_score": 0.16481397,
"_source": {
"name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
"id": "2664027527.jpg"
}
},
{
"_index": "index-test",
"_id": "4",
"_score": 0.001,
"_source": {
"name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
"id": "4427058951.jpg"
}
}
]
}
}
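
As a quick sanity check on the two hybrid responses, the sketch below flags any hit whose score exceeds what min-max normalization plus a weighted arithmetic mean should allow. The helper name `exceeds_normalized_range` and the trimmed response dictionaries are illustrative assumptions; only the `_score` values are taken from the responses above.

```python
def exceeds_normalized_range(response, upper_bound=1.0):
    """Return (doc_id, score) pairs whose _score is above upper_bound.

    Assumes min-max normalized sub-scores in [0, 1] and combination
    weights that sum to 1, so no combined score should exceed 1.
    """
    return [
        (hit["_id"], hit["_score"])
        for hit in response["hits"]["hits"]
        if hit["_score"] > upper_bound
    ]


# Trimmed copies of the two hybrid responses above (ids and scores only).
nested_response = {"hits": {"hits": [
    {"_id": "1", "_score": 2.4157035},
    {"_id": "2", "_score": 1.2398489},
    {"_id": "5", "_score": 1.1795204},
    {"_id": "4", "_score": 1.1181643},
    {"_id": "3", "_score": 0.6229783},
]}}
text_response = {"hits": {"hits": [
    {"_id": "1", "_score": 0.88318545},
    {"_id": "2", "_score": 0.32350767},
    {"_id": "5", "_score": 0.18827114},
    {"_id": "3", "_score": 0.16481397},
    {"_id": "4", "_score": 0.001},
]}}

print("nested:", exceeds_normalized_range(nested_response))
print("text:  ", exceeds_normalized_range(text_response))
```

With the nested mapping, four of the five hits fall outside the expected range; with the text mapping, none do.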

What is your host/environment?

I ran it on Docker on Mac M2

@tiagoshin tiagoshin added bug Something isn't working untriaged labels Oct 18, 2023
@navneet1v
Collaborator

Note that on hybrid search steps, the score is higher than 1, which means that the normalization was not applied.
The expected result is what we get when the "test" field on the index is defined with type "text":

Hi @tiagoshin,
A score above 1 doesn't by itself mean that normalization was not applied. After the scores are normalized, the normalized scores from the different queries are combined, so the final score can also be greater than 1.

I hope this clarifies.

@navneet1v navneet1v added question Further information is requested and removed bug Something isn't working labels Oct 19, 2023
@tiagoshin
Author

tiagoshin commented Oct 19, 2023

Hi @navneet1v, I know that a score above 1 doesn't necessarily mean that normalization wasn't applied, because the sum of the weights could be higher than 1, but that's not the case here. Please take a look at the creation of the post-processor for hybrid search: we use a combination based on arithmetic mean, with the weights summing to 1.
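
To spell out the bound behind this argument, here is a short derivation. It assumes min-max normalization maps each sub-query score into [0, 1] and that the combination is a weighted arithmetic mean; the symbols $n_i$ (normalized score of sub-query $i$) and $w_i$ (its weight) are introduced here only for illustration.

```latex
s = \frac{\sum_i w_i \, n_i}{\sum_i w_i},
\qquad 0 \le n_i \le 1,\quad w_i \ge 0
\;\Longrightarrow\; 0 \le s \le 1 .
```

With the weights 0.7 and 0.3 from the search pipeline, $s = 0.7\,n_{\text{lexical}} + 0.3\,n_{\text{semantic}} \le 1$, so a score such as 2.4157035 is outside the range this configuration should be able to produce.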
To leave no doubt about this bug, I calculated the expected results in a spreadsheet and compared them with the actual results, with and without the nested type:

| Document ID | lexical results | normalized lexical (calc) | semantic results | normalized semantic (calc) | hybrid with type text results | hybrid calculated (arithm 0.7, 0.3) (calc) | hybrid with type nested results | simple sum of lexical and semantic (calc) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 1.7878418 | 1 | 0.6278618 | 0.6106182639 | 0.88318545 | 0.8831854792 | 2.4157035 | 2.4157036 |
| 2 | 0.58093566 | 0.03358238099 | 0.65891314 | 1 | 0.32350767 | 0.3235076667 | 1.2398489 | 1.2398488 |
| 3 | | 0 | 0.6229783 | 0.549379499 | 0.16481397 | 0.1648138497 | 0.6229783 | 0.6229783 |
| 4 | 0.53899646 | 0 | 0.5791679 | 0 | 0.001 | 0 | 1.1181643 | 1.11816436 |
| 5 | 0.55228686 | 0.01064215045 | 0.62723345 | 0.6027387967 | 0.18827114 | 0.1882711443 | 1.1795204 | 1.17952031 |

Please notice that with a nested field on the index, the hybrid search results are just the sum of the lexical and semantic scores. This is the same behavior we had before release 2.10, with no normalization or combination techniques applied.
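
The spreadsheet calculation can be reproduced with a short Python sketch. It assumes that a document missing from a sub-query contributes 0 to the combination and that the weighted sum is divided by the sum of the weights; both assumptions are consistent with the calculated columns in the table but are not taken from the plugin source. The function names are illustrative.

```python
# Raw sub-query scores from the lexical and semantic responses.
lexical = {"1": 1.7878418, "2": 0.58093566, "4": 0.53899646, "5": 0.55228686}
semantic = {"1": 0.6278618, "2": 0.65891314, "3": 0.6229783,
            "4": 0.5791679, "5": 0.62723345}
weights = (0.7, 0.3)  # from the normalization-processor definition


def min_max(scores):
    """Min-max normalize a {doc_id: score} mapping into [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # Assumption: identical scores all normalize to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


def weighted_mean(normalized_sets, weights):
    """Weighted arithmetic mean; a missing document counts as 0 (assumption)."""
    docs = set().union(*normalized_sets)
    total = sum(weights)
    return {
        doc: sum(w * n.get(doc, 0.0) for n, w in zip(normalized_sets, weights)) / total
        for doc in docs
    }


expected = weighted_mean([min_max(lexical), min_max(semantic)], weights)
plain_sum = {doc: lexical.get(doc, 0.0) + semantic.get(doc, 0.0) for doc in semantic}

for doc in sorted(expected):
    print(doc, round(expected[doc], 7), round(plain_sum[doc], 7))
```

The `expected` values correspond to the "hybrid calculated" column and `plain_sum` to the "simple sum" column. Document 4 comes out as 0 here, whereas the response with type text reports 0.001.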

@navneet1v
Collaborator

@martin-gaievski can you try to reproduce this with the steps provided? They are pretty detailed.

@martin-gaievski
Member

martin-gaievski commented Oct 19, 2023

I reproduced the issue using the provided steps. The problem seems to be with the nested type. We're doing a deeper dive to figure out the root cause; for now I can see that some elements of the hybrid query are skipped during execution, and that breaks the designed logic.

@martin-gaievski martin-gaievski added bug Something isn't working and removed question Further information is requested labels Oct 20, 2023
@dagneyb

dagneyb commented Oct 31, 2023

@martin-gaievski what is the LoE and ETA for the deep dive on root cause?

@dagneyb

dagneyb commented Nov 7, 2023

Additional context gathered: whenever there is a nested field in the index, it impacts the results, despite that field being included. Additionally, we need to ensure users can filter by nested fields.

@navneet1v
Collaborator

Additionally, we need to ensure users can filter by nested fields.

@dagneyb can you explain a bit more on this?

@dagneyb

dagneyb commented Nov 14, 2023

@navneet1v are you looking for more context on my comment or on the overall issue?

@navneet1v
Collaborator

@navneet1v are you looking for more context on my comment or on the overall issue?

yes

@dagneyb

dagneyb commented Nov 15, 2023

@navneet1v I think the overall summary provided does a good job of this: When we have any field on the index mapping properties with type nested, it doesn't apply normalization and weighted combination. Instead, it just sums up the values, the same way that Opensearch did before having Hybrid search feature.

If you have a specific question, let me know and I can reach out to the impacted user directly.

@tiagoshin
Author

tiagoshin commented Nov 16, 2023

@navneet1v The context of this comment

Additionally, we need to ensure users can filter by nested fields.

is that we need to make sure it's possible to declare an index with nested fields and also to filter by them in the search query.

@martin-gaievski
Member

We've pushed a code change that fixes this issue; it's part of the main and 2.x branches.
For the 2.x codeline it will be part of the upcoming 2.12 release. The team has tested it internally using a recent 2.12 release candidate build, and that build can be used to verify the fix before the official 2.12 release. Below are links to the distribution build tarball artifacts for x64 and arm64:

https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/x64/tar/dist/opensearch/opensearch-2.12.0-linux-x64.tar.gz
https://ci.opensearch.org/ci/dbc/distribution-build-opensearch/2.12.0/8999/linux/arm64/tar/dist/opensearch/opensearch-2.12.0-linux-arm64.tar.gz

We cannot backport it to 2.11, as that release only accepts critical security fixes.

@navneet1v
Collaborator

@tiagoshin can you use the links provided by @martin-gaievski to test and validate the fix? Feel free to share your feedback.

@vamshin vamshin added the v2.12.0 Issues targeting release v2.12.0 label Dec 14, 2023
@martin-gaievski
Member

@tiagoshin we ran your initial scenario on a 2.12 RC build. The only unknown piece was the model; for our testing we used huggingface/sentence-transformers/all-MiniLM-L12-v2 from the list of supported pre-trained models (https://opensearch.org/docs/latest/ml-commons-plugin/pretrained-models/#supported-pretrained-models), which matches your mapping configuration for a vector with 384 dimensions.
Below are our results:

hybrid search query

{
    "_source": {
        "exclude": [
            "passage_embedding"
        ]
    },
    "query": {
        "hybrid": {
            "queries": [
                {
                    "match": {
                        "name": {
                            "query": "wild west"
                        }
                    }
                },
                {
                    "neural": {
                        "passage_embedding": {
                            "query_text": "wild west",
                            "model_id": "{{model_id}}",
                            "k": 20
                        }
                    }
                }
            ]
        }
    }
}

{
    "took": 28,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 1.0,
        "hits": [
            {
                "_index": "index-test",
                "_id": "1",
                "_score": 1.0,
                "_source": {
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "id": "4319130149.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "2",
                "_score": 0.17402677,
                "_source": {
                    "name": "A wild animal races across an uncut field with a minimal amount of trees .",
                    "id": "1775029934.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "3",
                "_score": 0.07514995,
                "_source": {
                    "name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
                    "id": "2664027527.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "5",
                "_score": 0.072944336,
                "_source": {
                    "name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
                    "id": "2691147709.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "4",
                "_score": 0.001,
                "_source": {
                    "name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
                    "id": "4427058951.jpg"
                }
            }
        ]
    }
}

Below are the responses for the sub-queries when we run them as independent queries.

bm25 query

{
    "_source": {
        "excludes": [
            "passage_embedding"
        ]
    },
    "query": {
        "match": {
            "name": {
                "query": "wild west"
            }
        }
    }
}

{
    "took": 3,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 4,
            "relation": "eq"
        },
        "max_score": 1.7878418,
        "hits": [
            {
                "_index": "index-test",
                "_id": "1",
                "_score": 1.7878418,
                "_source": {
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "id": "4319130149.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "2",
                "_score": 0.58093566,
                "_source": {
                    "name": "A wild animal races across an uncut field with a minimal amount of trees .",
                    "id": "1775029934.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "5",
                "_score": 0.55228686,
                "_source": {
                    "name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
                    "id": "2691147709.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "4",
                "_score": 0.53899646,
                "_source": {
                    "name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
                    "id": "4427058951.jpg"
                }
            }
        ]
    }
}

neural search query

{
    "_source": {
        "excludes": [
            "passage_embedding"
        ]
    },
    "query": {
        "neural": {
            "passage_embedding": {
                "query_text": "wild west",
                "model_id": "{{model_id}}",
                "k": 20
            }
        }
    }
}

{
    "took": 2713,
    "timed_out": false,
    "_shards": {
        "total": 1,
        "successful": 1,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": {
            "value": 5,
            "relation": "eq"
        },
        "max_score": 0.64152133,
        "hits": [
            {
                "_index": "index-test",
                "_id": "1",
                "_score": 0.64152133,
                "_source": {
                    "name": "A West Virginia university women 's basketball team , officials , and a small gathering of fans are in a West Virginia arena .",
                    "id": "4319130149.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "2",
                "_score": 0.5972555,
                "_source": {
                    "name": "A wild animal races across an uncut field with a minimal amount of trees .",
                    "id": "1775029934.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "3",
                "_score": 0.57493645,
                "_source": {
                    "name": "People line the stands which advertise Freemont 's orthopedics , a cowboy rides a light brown bucking bronco .",
                    "id": "2664027527.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "5",
                "_score": 0.5720773,
                "_source": {
                    "name": "A rodeo cowboy , wearing a cowboy hat , is being thrown off of a wild white horse .",
                    "id": "2691147709.jpg"
                }
            },
            {
                "_index": "index-test",
                "_id": "4",
                "_score": 0.5526823,
                "_source": {
                    "name": "A man who is riding a wild horse in the rodeo is very near to falling off .",
                    "id": "4427058951.jpg"
                }
            }
        ]
    }
}
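
The RC build scores can be cross-checked against the configured combination with a short Python sketch. It assumes min-max normalization, weights of 0.7 and 0.3 in sub-query order, a missing document counting as 0, and a minimum reported score of 0.001 (the lowest value seen in the hybrid responses); these are assumptions for illustration, not statements about the plugin's implementation.

```python
def min_max(scores):
    """Min-max normalize a {doc_id: score} mapping into [0, 1]."""
    low, high = min(scores.values()), max(scores.values())
    if high == low:
        # Assumption: identical scores all normalize to 1.0.
        return {doc: 1.0 for doc in scores}
    return {doc: (s - low) / (high - low) for doc, s in scores.items()}


# Scores copied from the three RC build responses above.
bm25 = {"1": 1.7878418, "2": 0.58093566, "5": 0.55228686, "4": 0.53899646}
neural = {"1": 0.64152133, "2": 0.5972555, "3": 0.57493645,
          "5": 0.5720773, "4": 0.5526823}
reported = {"1": 1.0, "2": 0.17402677, "3": 0.07514995,
            "5": 0.072944336, "4": 0.001}

FLOOR = 0.001  # assumption: lowest score observed in the hybrid responses

norm_bm25, norm_neural = min_max(bm25), min_max(neural)
expected = {
    doc: max(0.7 * norm_bm25.get(doc, 0.0) + 0.3 * norm_neural.get(doc, 0.0), FLOOR)
    for doc in neural
}

# Collect documents whose reported score deviates from the expected one.
mismatches = {
    doc: (expected[doc], reported[doc])
    for doc in reported
    if abs(expected[doc] - reported[doc]) > 1e-5
}
print(mismatches or "all reported scores match the expected combination")
```

Under these assumptions every reported hybrid score agrees with the weighted combination of the normalized sub-query scores, which is the behavior expected from the pipeline definition.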

@ryanbogan ryanbogan added v2.13.0 and removed v2.12.0 Issues targeting release v2.12.0 labels Feb 22, 2024
@jared-rheaply

Does the re-tagging suggest this didn't make it into v2.12.0?

@navneet1v
Collaborator

Does the re-tagging suggest this didn't make it into v2.12.0?

@martin-gaievski can you answer this question?

@jared-rheaply

@martin-gaievski Just following up to see if you have an update on this? Nested types in indexes feel extremely common, so this really blocks a lot of Hybrid Search usage. Given it looks like the fix is complete, and how limiting this makes Hybrid Search, any way we can get this patched in soon?

@martin-gaievski
Member

@jared-rheaply the original problem reported in this issue has been fixed, and the fix is part of 2.12. Please see the corresponding PRs tagged in this issue (#490 and #498) and one more related PR, #524. This issue was marked as 2.13 due to some internal release procedures. I'm closing this issue now.
