Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add request_timeout_secs config to searcher config #5402

Merged
merged 5 commits into from
Sep 17, 2024
Merged

Conversation

PSeitz
Copy link
Contributor

@PSeitz PSeitz commented Sep 9, 2024

All GRPC get a 30s timeout, but the search RPCs, that have their own configurable timeout.

Current behavior before this PR

We have two fixed timeouts of when sending search requests between nodes, 30s on the channel and 60s on the search client. On very large datasets these fixed timeouts are too low for some queries. Local search does not have any timeout, which leads to inconsistent behavior. When converting between the errors, timeouts are converted to internal errors, loosing the semantics. The retry logic does not take timeout error into account, which could further overload the node.
This PR addresses these issues as outlined below.

Related: #5241

Configurable Timeout

This PR adds a new setting:

  • request_timeout_secs on the searcher config (Defaults to 30sec)

The timeout on the channel in the cluster is removed and we introduce a new tower timeout layer. That layer is similar to the timeout layer provided by tower, but requires to have a conversion from the TimeoutExceeded error to the services error. This allows us to forward the timeout semantics.

Cancel Search

This PR will add some logic to cancel the search on the leaf node if it exceeds the configured timeout. Previously it would have needlessly continued to consume resources.
It also removes the retry if we timeout, to not further overload the node.

How was this PR tested?

Manually tested the node config and the timeout on a distributed setting and singel node search.

Future work

Convert timeout errors to Quickwit errors for the search case

@PSeitz PSeitz force-pushed the config_timeout branch 3 times, most recently from 69e0ff0 to 2dad8d1 Compare September 9, 2024 08:55
Copy link

github-actions bot commented Sep 9, 2024

On SSD:

Average search latency is 1.0x that of the reference (lower is better).
Ref run id: 3431, ref commit: 2d4f5a6
Link

On GCS:

Average search latency is 1.0x that of the reference (lower is better).
Ref run id: 3432, ref commit: 2d4f5a6
Link

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the request timeout on search request only, and put that conf on the searcher config?

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put the request timeout on search request only, and put that conf on the searcher config?

@PSeitz
Copy link
Contributor Author

PSeitz commented Sep 11, 2024

Can we put the request timeout on search request only, and put that conf on the searcher config?

It's what I started with, but the Channel in make_channel, which is used in quickwit-cluster, is used for multiple services e.g. search and indexing. So we can't have a search specific timeout there.

We could remove the timeout there and try to move to other layers, or reopen another channel without the timeout in the search case.

@PSeitz PSeitz force-pushed the config_timeout branch 4 times, most recently from 2c9ab45 to 19c3a92 Compare September 12, 2024 03:09
@PSeitz
Copy link
Contributor Author

PSeitz commented Sep 12, 2024

As discussed offline, the generic channel timeout is moved to a tower timeout

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -1226,6 +1263,7 @@ pub async fn leaf_search(
let incremental_merge_collector = Arc::new(Mutex::new(incremental_merge_collector));

for (split, mut request) in split_with_req {
timeout.err_on_timeout()?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you timeout on the server side using a tower layer, it will cancel the future and do what you want without the TimeoutHandle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tested the current behavior and the client side was unaffected from the server timeouts and kept running. There's no special logic in the tower timeout layer to forward the timeout to the client side, so not sure how that could be the case.

But there's another problem: If we dispatch locally, it doesn't use the grpc stuff, so we need to have some timeout for that case.

I replaced TimeoutHandle with a tokio timeout, that should be much nicer overall.

@PSeitz
Copy link
Contributor Author

PSeitz commented Sep 16, 2024

Why not use? https://tower-rs.github.io/tower/tower/timeout/struct.TimeoutLayer.html

It's the same, but the main difference is:

/// The error of the service must implement `From<TimeoutExceeded>`.

So we keep the semantics of the timeout, this let's us e.g. easily disable retries.
The tower timeout just uses a Box<Error>.

Copy link
Contributor

@fulmicoton fulmicoton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see minor comments

On very large datasets the fixed timeouts are too low for some queries.
This PR adds a setting to configure the timeout.

Two settings are introduced:
- `request_timeout` on the node config
- `QW_REQUEST_TIMEOUT` env parameter

Currently there are two timeouts when doing a distributed search request, one from quickwit cluster when opening a
channel and one from the search client.
The timeout is applied to both (That means all cluster connections have
the same request_timeout applied, not only search nodes)

Related: #5241
@PSeitz PSeitz changed the title add request_timeout config add request_timeout_secs config to searcher requests Sep 17, 2024
@PSeitz PSeitz changed the title add request_timeout_secs config to searcher requests add request_timeout_secs config to searcher config Sep 17, 2024
@PSeitz PSeitz merged commit 7d357fa into main Sep 17, 2024
5 checks passed
@PSeitz PSeitz deleted the config_timeout branch September 17, 2024 06:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants