-
Notifications
You must be signed in to change notification settings - Fork 330
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add request_timeout_secs config to searcher config #5402
Conversation
69e0ff0
to
2dad8d1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the request timeout on search request only, and put that conf on the searcher config?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we put the request timeout on search request only, and put that conf on the searcher config?
It's what I started with, but the We could remove the timeout there and try to move to other layers, or reopen another channel without the timeout in the search case. |
2c9ab45
to
19c3a92
Compare
As discussed offline, the generic channel timeout is moved to a tower timeout |
19c3a92
to
c8a5035
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
quickwit/quickwit-search/src/leaf.rs
Outdated
@@ -1226,6 +1263,7 @@ pub async fn leaf_search( | |||
let incremental_merge_collector = Arc::new(Mutex::new(incremental_merge_collector)); | |||
|
|||
for (split, mut request) in split_with_req { | |||
timeout.err_on_timeout()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you timeout on the server side using a tower layer, it will cancel the future and do what you want without the TimeoutHandle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested the current behavior and the client side was unaffected from the server timeouts and kept running. There's no special logic in the tower timeout layer to forward the timeout to the client side, so not sure how that could be the case.
But there's another problem: If we dispatch locally, it doesn't use the grpc stuff, so we need to have some timeout for that case.
I replaced TimeoutHandle
with a tokio timeout, that should be much nicer overall.
It's the same, but the main difference is:
So we keep the semantics of the timeout, this let's us e.g. easily disable retries. |
a1410cd
to
0bb0c61
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see minor comments
On very large datasets the fixed timeouts are too low for some queries. This PR adds a setting to configure the timeout. Two settings are introduced: - `request_timeout` on the node config - `QW_REQUEST_TIMEOUT` env parameter Currently there are two timeouts when doing a distributed search request, one from quickwit cluster when opening a channel and one from the search client. The timeout is applied to both (That means all cluster connections have the same request_timeout applied, not only search nodes) Related: #5241
0bb0c61
to
0f8c35d
Compare
0f8c35d
to
778e36b
Compare
All GRPC get a 30s timeout, but the search RPCs, that have their own configurable timeout.
Current behavior before this PR
We have two fixed timeouts of when sending search requests between nodes, 30s on the channel and 60s on the search client. On very large datasets these fixed timeouts are too low for some queries. Local search does not have any timeout, which leads to inconsistent behavior. When converting between the errors, timeouts are converted to internal errors, loosing the semantics. The retry logic does not take timeout error into account, which could further overload the node.
This PR addresses these issues as outlined below.
Related: #5241
Configurable Timeout
This PR adds a new setting:
request_timeout_secs
on the searcher config (Defaults to 30sec)The timeout on the channel in the cluster is removed and we introduce a new tower timeout layer. That layer is similar to the timeout layer provided by tower, but requires to have a conversion from the
TimeoutExceeded
error to the services error. This allows us to forward the timeout semantics.Cancel Search
This PR will add some logic to cancel the search on the leaf node if it exceeds the configured timeout. Previously it would have needlessly continued to consume resources.
It also removes the retry if we timeout, to not further overload the node.
How was this PR tested?
Manually tested the node config and the timeout on a distributed setting and singel node search.
Future work
Convert timeout errors to Quickwit errors for the search case