Abort/timeout long running requests #5241

Closed
PSeitz opened this issue Jul 21, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@PSeitz
Contributor

PSeitz commented Jul 21, 2024

Currently, long-running requests may continue to consume a large amount of resources even after the original request has already timed out in an HTTP layer above. We should add a configurable timeout to abort such requests in Quickwit.

There are two different timeouts configured:

Tonic Timeout

Tonic timeouts return errors like:

{
  "message": "internal error: `Timeout expired`"
}

This timeout originates here:

let timeout_channel = Timeout::new(node.channel(), Duration::from_secs(30));
let search_client = create_search_client_from_channel(
    grpc_addr,
    timeout_channel,
    max_message_size,
);
Some(Change::Insert(grpc_addr, search_client))
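
The 30 s value here is hard-coded. A minimal sketch of making it configurable, assuming an override is read from the `QW_REQUEST_TIMEOUT` environment variable introduced in the follow-up PR referenced below (the parsing logic and the 30 s fallback are illustrative assumptions, not Quickwit's actual implementation):

use std::time::Duration;

// Resolve the request timeout from QW_REQUEST_TIMEOUT (in seconds), falling
// back to the previous hard-coded 30 s default.
fn request_timeout() -> Duration {
    std::env::var("QW_REQUEST_TIMEOUT")
        .ok()
        .and_then(|secs| secs.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or_else(|| Duration::from_secs(30))
}

fn main() {
    println!("request timeout: {:?}", request_timeout());
}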

Tower Timeout

There is another, similar timeout from tower:

{
  "message": "internal error: `request timed out`"
}

The tower timeout is defined here:

/// Creates a channel from a socket address.
///
/// The function is marked as `async` because it requires an executor (`connect_lazy`).
pub async fn make_channel(socket_addr: SocketAddr) -> Channel {
    let uri = Uri::builder()
        .scheme("http")
        .authority(socket_addr.to_string())
        .path_and_query("/")
        .build()
        .expect("provided arguments should be valid");
    Endpoint::from(uri)
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(30))
        .connect_lazy()
}

The timeout behavior currently also differs depending on the dispatch type: local dispatch never times out.
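
One way to give local dispatch the same limit would be to wrap the in-process call in `tokio::time::timeout`. Below is a minimal sketch under that assumption; `run_local_leaf_search` is a hypothetical stand-in, not Quickwit's actual leaf-search entry point:

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical stand-in for the local (in-process) leaf search.
async fn run_local_leaf_search() -> Result<String, String> {
    sleep(Duration::from_millis(10)).await;
    Ok("leaf response".to_string())
}

#[tokio::main]
async fn main() {
    let request_timeout = Duration::from_secs(30);
    // `timeout` drops the future once the deadline elapses and returns `Err(Elapsed)`.
    match timeout(request_timeout, run_local_leaf_search()).await {
        Ok(result) => println!("local dispatch finished: {result:?}"),
        Err(_elapsed) => eprintln!("local dispatch timed out after {request_timeout:?}"),
    }
}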

Retries

The current behavior is to retry directly after a timeout (cluster_client.rs::leaf_search). This could add additional load on an already overloaded node.

The first step would be to properly transform the error type so that the timeout semantics are preserved.
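
A minimal sketch of what keeping those semantics could look like, assuming a simplified error enum (not Quickwit's actual SearchError type): once timeouts are a distinct variant, the retry policy can skip the immediate retry for them.

// Simplified, hypothetical error type for illustration only.
#[derive(Debug)]
enum SearchError {
    Timeout(String),
    Internal(String),
}

// Retrying right after a timeout would add more load to an already overloaded
// node, so timeouts are excluded from retries here.
fn should_retry(error: &SearchError) -> bool {
    !matches!(error, SearchError::Timeout(_))
}

fn main() {
    let timeout = SearchError::Timeout("leaf_search".to_string());
    let internal = SearchError::Internal("storage error".to_string());
    println!("retry {timeout:?}: {}", should_retry(&timeout));
    println!("retry {internal:?}: {}", should_retry(&internal));
}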

@PSeitz PSeitz added the bug Something isn't working label Jul 21, 2024
PSeitz added a commit that referenced this issue Sep 17, 2024
* add request_timeout config

On very large datasets the fixed timeouts are too low for some queries.
This PR adds a setting to configure the timeout.

Two settings are introduced:
- `request_timeout` on the node config
- `QW_REQUEST_TIMEOUT` env parameter

Currently there are two timeouts when doing a distributed search request: one from the Quickwit cluster when opening a channel and one from the search client.
The timeout is applied to both (that means all cluster connections have the same request_timeout applied, not only search nodes).

Related: #5241

* move timeout to search config, add timeout tower layer

* cancel search after timeout

* use tokio::timeout

* use global timeoutlayer
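
The last commit above mentions a global timeout layer. A minimal sketch of applying tower's `TimeoutLayer` to a service stack, assuming the `tower` crate with the `timeout` and `util` features; the inner `service_fn` and the 30 s value are illustrative stand-ins, not Quickwit's actual service stack:

use std::convert::Infallible;
use std::time::Duration;

use tower::timeout::TimeoutLayer;
use tower::{service_fn, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    let request_timeout = Duration::from_secs(30);
    // Every request going through this stack is aborted once the timeout elapses.
    let service = ServiceBuilder::new()
        .layer(TimeoutLayer::new(request_timeout))
        .service(service_fn(|request: String| async move {
            Ok::<_, Infallible>(format!("handled {request}"))
        }));

    let response = service.oneshot("leaf_search".to_string()).await;
    println!("{response:?}");
}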
@PSeitz
Contributor Author

PSeitz commented Sep 19, 2024

Fixed by #5402

@PSeitz PSeitz closed this as completed Sep 19, 2024