add request_timeout_secs config to searcher config #5402

PSeitz · 2024-09-09T08:52:15Z

All GRPC get a 30s timeout, but the search RPCs, that have their own configurable timeout.

Current behavior before this PR

We have two fixed timeouts of when sending search requests between nodes, 30s on the channel and 60s on the search client. On very large datasets these fixed timeouts are too low for some queries. Local search does not have any timeout, which leads to inconsistent behavior. When converting between the errors, timeouts are converted to internal errors, loosing the semantics. The retry logic does not take timeout error into account, which could further overload the node.
This PR addresses these issues as outlined below.

Related: #5241

Configurable Timeout

This PR adds a new setting:

request_timeout_secs on the searcher config (Defaults to 30sec)

The timeout on the channel in the cluster is removed and we introduce a new tower timeout layer. That layer is similar to the timeout layer provided by tower, but requires to have a conversion from the TimeoutExceeded error to the services error. This allows us to forward the timeout semantics.

Cancel Search

This PR will add some logic to cancel the search on the leaf node if it exceeds the configured timeout. Previously it would have needlessly continued to consume resources.
It also removes the retry if we timeout, to not further overload the node.

How was this PR tested?

Manually tested the node config and the timeout on a distributed setting and singel node search.

Future work

Convert timeout errors to Quickwit errors for the search case

github-actions · 2024-09-09T09:30:59Z

On SSD:

Average search latency is 1.0x that of the reference (lower is better).
Ref run id: 3431, ref commit: 2d4f5a6
Link

On GCS:

Average search latency is 1.0x that of the reference (lower is better).
Ref run id: 3432, ref commit: 2d4f5a6
Link

fulmicoton

Can we put the request timeout on search request only, and put that conf on the searcher config?

fulmicoton

Can we put the request timeout on search request only, and put that conf on the searcher config?

PSeitz · 2024-09-11T03:56:35Z

Can we put the request timeout on search request only, and put that conf on the searcher config?

It's what I started with, but the Channel in make_channel, which is used in quickwit-cluster, is used for multiple services e.g. search and indexing. So we can't have a search specific timeout there.

We could remove the timeout there and try to move to other layers, or reopen another channel without the timeout in the search case.

PSeitz · 2024-09-12T03:55:48Z

As discussed offline, the generic channel timeout is moved to a tower timeout

fulmicoton

Why not use?
https://tower-rs.github.io/tower/tower/timeout/struct.TimeoutLayer.html

quickwit/quickwit-serve/src/lib.rs

quickwit/quickwit-search/src/leaf.rs

fulmicoton · 2024-09-13T13:51:18Z

quickwit/quickwit-search/src/leaf.rs

@@ -1226,6 +1263,7 @@ pub async fn leaf_search(
    let incremental_merge_collector = Arc::new(Mutex::new(incremental_merge_collector));

    for (split, mut request) in split_with_req {
+        timeout.err_on_timeout()?;


I think if you timeout on the server side using a tower layer, it will cancel the future and do what you want without the TimeoutHandle.

I tested the current behavior and the client side was unaffected from the server timeouts and kept running. There's no special logic in the tower timeout layer to forward the timeout to the client side, so not sure how that could be the case.

But there's another problem: If we dispatch locally, it doesn't use the grpc stuff, so we need to have some timeout for that case.

I replaced TimeoutHandle with a tokio timeout, that should be much nicer overall.

PSeitz · 2024-09-16T01:03:49Z

Why not use? https://tower-rs.github.io/tower/tower/timeout/struct.TimeoutLayer.html

It's the same, but the main difference is:

/// The error of the service must implement `From<TimeoutExceeded>`.

So we keep the semantics of the timeout, this let's us e.g. easily disable retries.
The tower timeout just uses a Box<Error>.

quickwit/quickwit-serve/src/lib.rs

quickwit/quickwit-common/src/tower/timeout.rs

quickwit/quickwit-serve/src/lib.rs

fulmicoton

see minor comments

On very large datasets the fixed timeouts are too low for some queries. This PR adds a setting to configure the timeout. Two settings are introduced: - `request_timeout` on the node config - `QW_REQUEST_TIMEOUT` env parameter Currently there are two timeouts when doing a distributed search request, one from quickwit cluster when opening a channel and one from the search client. The timeout is applied to both (That means all cluster connections have the same request_timeout applied, not only search nodes) Related: #5241

PSeitz force-pushed the config_timeout branch 3 times, most recently from 69e0ff0 to 2dad8d1 Compare September 9, 2024 08:55

PSeitz requested a review from fulmicoton September 9, 2024 11:16

fulmicoton requested changes Sep 11, 2024

View reviewed changes

PSeitz force-pushed the config_timeout branch 4 times, most recently from 2c9ab45 to 19c3a92 Compare September 12, 2024 03:09

PSeitz requested a review from fulmicoton September 12, 2024 03:33

PSeitz force-pushed the config_timeout branch from 19c3a92 to c8a5035 Compare September 13, 2024 02:21

fulmicoton requested changes Sep 13, 2024

View reviewed changes

fulmicoton reviewed Sep 13, 2024

View reviewed changes

quickwit/quickwit-serve/src/lib.rs Outdated Show resolved Hide resolved

fulmicoton reviewed Sep 13, 2024

View reviewed changes

quickwit/quickwit-search/src/leaf.rs Outdated Show resolved Hide resolved

fulmicoton reviewed Sep 13, 2024

View reviewed changes

PSeitz force-pushed the config_timeout branch from a1410cd to 0bb0c61 Compare September 16, 2024 01:06

PSeitz requested a review from fulmicoton September 16, 2024 02:51

fulmicoton reviewed Sep 17, 2024

View reviewed changes

quickwit/quickwit-serve/src/lib.rs Outdated Show resolved Hide resolved

fulmicoton reviewed Sep 17, 2024

View reviewed changes

quickwit/quickwit-common/src/tower/timeout.rs Show resolved Hide resolved

fulmicoton reviewed Sep 17, 2024

View reviewed changes

quickwit/quickwit-serve/src/lib.rs Outdated Show resolved Hide resolved

fulmicoton approved these changes Sep 17, 2024

View reviewed changes

PSeitz added 4 commits September 17, 2024 13:31

move timeout to search config, add timeout tower layer

3c6f9d4

cancel search after timeout

5feed38

use tokio::timeout

eb70e6b

PSeitz changed the title ~~add request_timeout config~~ add request_timeout_secs config to searcher requests Sep 17, 2024

PSeitz changed the title ~~add request_timeout_secs config to searcher requests~~ add request_timeout_secs config to searcher config Sep 17, 2024

PSeitz force-pushed the config_timeout branch from 0bb0c61 to 0f8c35d Compare September 17, 2024 05:39

use global timeoutlayer

778e36b

PSeitz force-pushed the config_timeout branch from 0f8c35d to 778e36b Compare September 17, 2024 05:42

PSeitz merged commit 7d357fa into main Sep 17, 2024
5 checks passed

PSeitz deleted the config_timeout branch September 17, 2024 06:25

PSeitz mentioned this pull request Sep 19, 2024

Abort/timeout long running requests #5241

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add request_timeout_secs config to searcher config #5402

add request_timeout_secs config to searcher config #5402

PSeitz commented Sep 9, 2024 •

edited

Loading

github-actions bot commented Sep 9, 2024 •

edited

Loading

fulmicoton left a comment

fulmicoton left a comment

PSeitz commented Sep 11, 2024

PSeitz commented Sep 12, 2024

fulmicoton left a comment

fulmicoton Sep 13, 2024

PSeitz Sep 16, 2024

PSeitz commented Sep 16, 2024

fulmicoton left a comment

add request_timeout_secs config to searcher config #5402

add request_timeout_secs config to searcher config #5402

Conversation

PSeitz commented Sep 9, 2024 • edited Loading

Current behavior before this PR

Configurable Timeout

Cancel Search

How was this PR tested?

Future work

github-actions bot commented Sep 9, 2024 • edited Loading

On SSD:

On GCS:

fulmicoton left a comment

Choose a reason for hiding this comment

fulmicoton left a comment

Choose a reason for hiding this comment

PSeitz commented Sep 11, 2024

PSeitz commented Sep 12, 2024

fulmicoton left a comment

Choose a reason for hiding this comment

fulmicoton Sep 13, 2024

Choose a reason for hiding this comment

PSeitz Sep 16, 2024

Choose a reason for hiding this comment

PSeitz commented Sep 16, 2024

fulmicoton left a comment

Choose a reason for hiding this comment

PSeitz commented Sep 9, 2024 •

edited

Loading

github-actions bot commented Sep 9, 2024 •

edited

Loading