Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Upgrade from OpenSearch 2.11 -> 2.13 breaks working boolean+hybrid query, hybrid+request_processor combinations #903

Closed
sutsr opened this issue Sep 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@sutsr
Copy link

sutsr commented Sep 11, 2024

What is the bug?

I have two query types that ran fine on OpenSearch 2.11, but after upgrading to 2.13, error with illegal argument exceptions of hybrid query must be a top level query and cannot be wrapped into other queries.

Request processor used to combine a criterion with a hybrid search

POST /hybrid_search_index1/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "family"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "family",
              "model_id": “model_id_here“,
              "k": 5
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "request_processors": [
      {
        "filter_query": {
          "query": {
            "range": {
              "publish_date": {
                "lte": "2024-01-01”
              }
            }
          }
        }
      }
    ],
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  }
}

Top level boolean used to combine an exclusion criterion with the results of a hybrid search

POST /hybrid_search_index1/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "term": {
            "is_flagged": true
          }
        }
      ],
      "must": [
        {
          "hybrid": {
            "queries": [
              {
                "match": {
                  "text": {
                    "query": “family
                  }
                }
              },
              {
                "neural": {
                  "embedding": {
                    "query_text": “family”,
                    "model_id": “model_id_here”,
                    "k": 5
                  }
                }
              }
            ]
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  }
}

How can one reproduce the bug?

Attempt to combine a hybrid query with a search pipeline request processor or use it as part of a boolean query in the fashion shown by the example queries.

The index used for these queries has the following fields:

  • text - search_as_you_type
  • embedding - 1536D knn_vector
  • publish_date - date type, format of strict_year_month_day
  • is_flagged - boolean

What is the expected behavior?

In the first instance, I expect the hybrid query to run, only returning results that meet the additional criterion from the request processor.

In the second instance I expect the query to run and return results that meet both conditions.

What is your host/environment?

AWS Managed OpenSearch

Do you have any screenshots?

Not applicable

Do you have any additional context?

The model used for neural search is an external model calling out to OpenAI's ada-002 embedding API configured as per the blueprint

@sutsr sutsr added bug Something isn't working untriaged labels Sep 11, 2024
@yuye-aws
Copy link
Member

I think @martin-gaievski has more context knowledge on the hybrid query.

@martin-gaievski
Copy link
Member

@sutsr the behavior you're seeing in 2.13 is expected, we have closed the possibility of having hybrid query nested/wrapped into some other query. The change has been introduced in 2.12, #498.
Main reason for that was the current approach for running hybrid query, it does the normalization in form of post processing of all shard level results. Regular query does it at the shard level, and those scores will be changed by the hybrid query anyway.

@sutsr
Copy link
Author

sutsr commented Sep 12, 2024

Thanks for the clarification and explanation @martin-gaievski. Can you offer some guidance on how best to rework queries along the lines of my example queries?

Should a boolean query inside post_filter be used to replace any queries previously combined as in my examples (using bool wrapping hybrid or queries inside request_processors)? Essentially I am trying to provide result ranking from the hybrid query with knock-out exclusion on non-ranking-related fields.

e.g.

POST /hybrid_search_index1/_search
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "family"
            }
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "family",
              "model_id": “model_id_here“,
              "k": 5
            }
          }
        }
      ]
    }
  },
  "search_pipeline": {
    "phase_results_processors": [
      {
        "normalization-processor": {
          "normalization": {
            "technique": "l2"
          },
          "combination": {
            "technique": "arithmetic_mean",
            "parameters": {
              "weights": [
                0.7,
                0.3
              ]
            }
          }
        }
      }
    ]
  },
  "post_filter": {
    "bool": {
      "filter": [
        {
          "range": {
            "publish_date": {
              "lte": "2024-01-1"
            }
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "is_flagged": true
          }
        }
      ]
    }
  },
}

@martin-gaievski
Copy link
Member

@sutsr right, post filter should solve your use case, it does not considered to be a wrapper for the hybrid query. As long you don't need exact number of results and ok with post processing of the main query results post filters is a good choice.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants