Abort/timeout long running requests #5241

Closed
PSeitz opened this issue Jul 21, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@PSeitz
Contributor

PSeitz commented Jul 21, 2024

Currently, long-running requests may continue to consume a large amount of resources even after the original request has already timed out in an HTTP layer above. We should add a configurable timeout to abort such requests in Quickwit.

There are two different timeouts configured:

Tonic Timeout

Tonic timeouts return errors like:

{
  "message": "internal error: `Timeout expired`"
}

This timeout originates here:

let timeout_channel = Timeout::new(node.channel(), Duration::from_secs(30));
let search_client = create_search_client_from_channel(
    grpc_addr,
    timeout_channel,
    max_message_size,
);
Some(Change::Insert(grpc_addr, search_client))
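
The 30 s value here is hard-coded. A minimal sketch of making it configurable, assuming an override is read from the `QW_REQUEST_TIMEOUT` environment variable introduced in the follow-up PR referenced below (the parsing logic and the 30 s fallback are illustrative assumptions, not Quickwit's actual implementation):

use std::time::Duration;

// Resolve the request timeout from QW_REQUEST_TIMEOUT (in seconds), falling
// back to the previous hard-coded 30 s default.
fn request_timeout() -> Duration {
    std::env::var("QW_REQUEST_TIMEOUT")
        .ok()
        .and_then(|secs| secs.parse::<u64>().ok())
        .map(Duration::from_secs)
        .unwrap_or_else(|| Duration::from_secs(30))
}

fn main() {
    println!("request timeout: {:?}", request_timeout());
}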

Tower Timeout

There is another, similar timeout from tower:

{
  "message": "internal error: `request timed out`"
}

The tower timeout is defined here:

/// Creates a channel from a socket address.
///
/// The function is marked as `async` because it requires an executor (`connect_lazy`).
pub async fn make_channel(socket_addr: SocketAddr) -> Channel {
    let uri = Uri::builder()
        .scheme("http")
        .authority(socket_addr.to_string())
        .path_and_query("/")
        .build()
        .expect("provided arguments should be valid");
    Endpoint::from(uri)
        .connect_timeout(Duration::from_secs(5))
        .timeout(Duration::from_secs(30))
        .connect_lazy()
}

The timeout behavior currently also differs depending on the dispatch type: local dispatch never times out.
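
One way to give local dispatch the same limit would be to wrap the in-process call in `tokio::time::timeout`. Below is a minimal sketch under that assumption; `run_local_leaf_search` is a hypothetical stand-in, not Quickwit's actual leaf-search entry point:

use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical stand-in for the local (in-process) leaf search.
async fn run_local_leaf_search() -> Result<String, String> {
    sleep(Duration::from_millis(10)).await;
    Ok("leaf response".to_string())
}

#[tokio::main]
async fn main() {
    let request_timeout = Duration::from_secs(30);
    // `timeout` drops the future once the deadline elapses and returns `Err(Elapsed)`.
    match timeout(request_timeout, run_local_leaf_search()).await {
        Ok(result) => println!("local dispatch finished: {result:?}"),
        Err(_elapsed) => eprintln!("local dispatch timed out after {request_timeout:?}"),
    }
}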

Retries

The current behavior is to retry directly after a timeout (cluster_client.rs::leaf_search). This could add additional load on an already overloaded node.

The first step would be to properly transform the error type so that the timeout semantics are preserved.
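
A minimal sketch of what keeping those semantics could look like, assuming a simplified error enum (not Quickwit's actual SearchError type): once timeouts are a distinct variant, the retry policy can skip the immediate retry for them.

// Simplified, hypothetical error type for illustration only.
#[derive(Debug)]
enum SearchError {
    Timeout(String),
    Internal(String),
}

// Retrying right after a timeout would add more load to an already overloaded
// node, so timeouts are excluded from retries here.
fn should_retry(error: &SearchError) -> bool {
    !matches!(error, SearchError::Timeout(_))
}

fn main() {
    let timeout = SearchError::Timeout("leaf_search".to_string());
    let internal = SearchError::Internal("storage error".to_string());
    println!("retry {timeout:?}: {}", should_retry(&timeout));
    println!("retry {internal:?}: {}", should_retry(&internal));
}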

@PSeitz PSeitz added the bug Something isn't working label Jul 21, 2024
PSeitz added a commit that referenced this issue Sep 17, 2024
* add request_timeout config

On very large datasets the fixed timeouts are too low for some queries.
This PR adds a setting to configure the timeout.

Two settings are introduced:
- `request_timeout` on the node config
- `QW_REQUEST_TIMEOUT` env parameter

Currently there are two timeouts when doing a distributed search request: one from the Quickwit cluster when opening a channel and one from the search client.
The timeout is applied to both (that means all cluster connections have the same request_timeout applied, not only search nodes).

Related: #5241

* move timeout to search config, add timeout tower layer

* cancel search after timeout

* use tokio::timeout

* use global timeoutlayer
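
The last commit above mentions a global timeout layer. A minimal sketch of applying tower's `TimeoutLayer` to a service stack, assuming the `tower` crate with the `timeout` and `util` features; the inner `service_fn` and the 30 s value are illustrative stand-ins, not Quickwit's actual service stack:

use std::convert::Infallible;
use std::time::Duration;

use tower::timeout::TimeoutLayer;
use tower::{service_fn, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() {
    let request_timeout = Duration::from_secs(30);
    // Every request going through this stack is aborted once the timeout elapses.
    let service = ServiceBuilder::new()
        .layer(TimeoutLayer::new(request_timeout))
        .service(service_fn(|request: String| async move {
            Ok::<_, Infallible>(format!("handled {request}"))
        }));

    let response = service.oneshot("leaf_search".to_string()).await;
    println!("{response:?}");
}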
@PSeitz
Contributor Author

PSeitz commented Sep 19, 2024

Fixed by #5402

@PSeitz PSeitz closed this as completed Sep 19, 2024