Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multi indices search to elasticsearch + quickwit search endpoints. #3734

Merged
merged 2 commits into from
Aug 17, 2023

Conversation

fmassot
Copy link
Contributor

@fmassot fmassot commented Aug 10, 2023

Fix #2770.

This is a partial implementation of the multi-index syntax of elasticsearch, which is defined here: only comma-separated and wildcard syntax are supported in this PR.

In multi-target syntax, you can use a comma-separated list to run a request on multiple resources, such as data streams, indices, or aliases: test1,test2,test3. You can also use [glob-like](https://en.wikipedia.org/wiki/Glob_(programming)) wildcard (*) expressions to target resources that match a pattern: test* or *test or te*t or *test*.

Examples:

  • /index-*/search
  • /index-1,index-2/search
  • /_elastic/*/_search

Limitations

NB: those limitations can be lifted, but it adds complexity due to managing multiple QueryAST, TagFilterAst, and timestamp_field.

My opinion on this is that the limitations are reasonable from the user's point of view.

  1. Limitation on queries with different resolved query AST

Queries on multiple indexes that lead to different QueryAst per index are not supported.

Example:

  • index 1 has two fields, field1 and field2, with the default field on field1.
  • index 2 has two fields, field1 and field2, with the default field on field2.

A query like test will be resolved as:

  • field1:test on index 1.
  • field2:test on index 2.
  1. Limitation on queries with different timestamp fields

Queries on multiple indexes with different timestamp fields per index are not supported. Note that if one index has no timestamp field, but the others have, this does not raise any issue.

Supporting different timestamp fields will require to:

  • call a modified version of refine_start_end_timestamp_from_ast for each timestamp field of each index and extract the start/end timestamp from the query
  • merge the start/end timestamp correctly: take the max of start timestamps, min of end timestamps and if None values are present ignore the refinements.

Also, it can lead to side effects in query performance on corner cases (start and end timestamps can be very narrowed on all indexes but one, the last one has no refinements on start/end timestamps, and we end up with no refinement. So adding just one index to the query will lead to a very different performance).

I implemented a version of it, but I decided not to add it to this PR. I would rather add it if users need this.

Follow-up in next PRs

  • Resolve with the default fields in the delete task planner.
  • Add docs.
  • Listing splits into several indexes introduced a discrepancy between our file-backed metastore and PostgreSQL metastore. In the file-backed metastore, if one index ID among all the requested index IDs does not exist, it will return an IndexesDoNotExistError. In the PostgreSQL metastore, we only check the existence of index IDs if no splits are returned. I suggest not checking for index existence in the list_split method and fixing that in a follow-up PR.

TODO

  • Remove list_all_indexes_metadatas as it is redundant with list_indexes_metadatas
  • Clean up the code.
  • Add more tests.

@fmassot fmassot force-pushed the fmassot/multi-indices-search branch 15 times, most recently from 0dd67c2 to f18e542 Compare August 13, 2023 17:58
@fmassot fmassot marked this pull request as ready for review August 13, 2023 17:58
@fmassot fmassot force-pushed the fmassot/multi-indices-search branch 3 times, most recently from 96d79cf to 4ea81ea Compare August 13, 2023 19:55
@fulmicoton
Copy link
Contributor

@fmassot is it still draft ?

quickwit/quickwit-config/src/lib.rs Outdated Show resolved Hide resolved
quickwit/quickwit-config/src/lib.rs Outdated Show resolved Hide resolved
quickwit/quickwit-config/src/lib.rs Outdated Show resolved Hide resolved
// If no splits were returned, maybe the index does not exist in the first place?
if pg_splits.is_empty()
&& index_opt_for_uid(&self.connection_pool, query.index_uid.clone())
// If no splits were returned, maybe some indexes do not exist in the first place?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, this check is a pain in the butt to implement, especially in the multi-index case, but there are some places (REST API, Janitor) where it could prove helpful. @fulmicoton, thoughts on dropping it?

quickwit/quickwit-query/src/query_ast/mod.rs Outdated Show resolved Hide resolved
quickwit/quickwit-search/src/lib.rs Outdated Show resolved Hide resolved
// We convert the error to return a 400 to the user (and not a 500).
.map_err(|err| SearchError::InvalidQuery(err.to_string()))?;

// Validate uniqueness of resolved query AST.
Copy link
Contributor

@fulmicoton fulmicoton Aug 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had three possible specs here:
a) pick the query_ast_resolved of a given index.. Maybe the freshest one, or the first defined in the list.
b) use a different query_ast_resolved for each indexed.
c) return an error when facing an inconsistency (that's what you picked)

Obviously this is a trade-off, but I strongly prefer b).
I don't think there is a strong case to be defensive here. This is actually an excellent use case for the default field.

I have several indexes for a security use case. One containing logs, one containing transactions, etc. I would like to search into all of them to gather all docs containing a given reference number. The default_fields in the different index makes the search experience possible.

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, I think we should merge this BUT I think we should have gone with one query ast resolved per index.

This also had some wrangling on the models of the LeafRequest.
A dead simple implementation could have been something:
phase1: keep the leaf request unchanged (as in like it is in main) and do emit several gRPC
to each node, apply the existing merging logic on the root side.
phase2: add a batch leaf_request GRPC endpoint, that computes the different leaf request and pre-merge their results before sending the result back to the root.

If you agree, can we open a ticket to revisit this?

@fmassot fmassot force-pushed the fmassot/multi-indices-search branch 3 times, most recently from 8354b2b to 47f12e7 Compare August 17, 2023 12:48
@fmassot fmassot force-pushed the fmassot/multi-indices-search branch from 47f12e7 to afd797c Compare August 17, 2023 14:41
@fmassot fmassot merged commit 297ce3a into main Aug 17, 2023
3 checks passed
@fmassot fmassot deleted the fmassot/multi-indices-search branch August 17, 2023 15:01
ListIndexesQuery::All => build_regex_set_from_patterns(vec!["*".to_string()]),
};
let index_matcher =
index_matcher_result.map_err(|error| MetastoreError::InternalError {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I worked on retries yesterday, I noticed we retry on internal errors. However, we often use internal errors for invalid argument errors that shouldn't be retried. We should introduce a MetastoreError::InvalidArgument that is not retryable. I can take care of that as part of #3752 where I've already started refactoring metastore errors.

indexes_metadatas: &[IndexMetadata],
search_request: &SearchRequest,
) -> crate::Result<(TimestampFieldOpt, QueryAst, IndexesMetasForLeafSearch)> {
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::new();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::new();
let mut metadatas_for_leaf: HashMap<IndexUid, IndexMetasForLeafSearch> = HashMap::with_capacity( indexes_metadatas.len());

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Target multiple indexes during a search
3 participants