[FEATURE] Add support to recommend threshold tuning for heap based task cancellations by SearchBackpressureService #455

kaushalmahi12 · 2023-07-14T17:04:17Z

Is your feature request related to a problem?

OpenSearch recently introduced a feature called search backpressure to make the service more resilient to node drops and performance degradation. It works by cancelling resource-guzzling search queries at the shard level and the coordinator node level. To do this, it uses various settings that decide when to cancel a search query, based on the resource the query is using heavily. As part of this feature, we will add support for recommending threshold tuning for the settings that govern heap-based query cancellation at the shard and coordinator levels.

What solution would you like?

Each resource-based cancellation has multiple settings. Recommending a value for every setting individually would complicate the solution and increase the number of RCAs we need to create, so we will recommend only a single value (a multiplier) by which the thresholds for a resource (in this case heap) should increase or decrease. We will emit separate actions for the search task (coordinator) level and the shard level.

Logic to mark the RCA unhealthy to increase the thresholds (Node level)

If the max heap used by the OpenSearch process is below 85% for a minute. Since the RCA runs at a 5-second interval, we will keep a sliding window of heapUsed values covering one minute.
And the heap-based task cancellations are more than 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)
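The one-minute window and the "increase" condition could be sketched roughly as follows. This is only an illustration under stated assumptions: `HeapUsageWindow` and `should_increase_thresholds` are hypothetical names, heap usage is assumed to arrive as a percentage once per RCA tick, and the 3% figure is assumed to be the share of tasks cancelled for heap usage.

```python
from collections import deque

RCA_INTERVAL_SECONDS = 5   # RCA evaluation period, from the proposal
WINDOW_SECONDS = 60        # one minute of history, from the proposal
WINDOW_SIZE = WINDOW_SECONDS // RCA_INTERVAL_SECONDS  # number of samples kept


class HeapUsageWindow:
    """Hypothetical helper: keeps the most recent minute of heap-used percentages."""

    def __init__(self, size=WINDOW_SIZE):
        # A bounded deque drops the oldest sample automatically once full.
        self._samples = deque(maxlen=size)

    def add(self, heap_used_percent):
        self._samples.append(heap_used_percent)

    def is_full(self):
        return len(self._samples) == self._samples.maxlen

    def max_used(self):
        return max(self._samples)


def should_increase_thresholds(window, cancellation_percent,
                               heap_limit=85.0, cancellation_limit=3.0):
    """True when heap stayed below the limit for the whole window while
    heap-based cancellations exceeded the cancellation limit."""
    return (window.is_full()
            and window.max_used() < heap_limit
            and cancellation_percent > cancellation_limit)


# Usage: feed one sample per RCA tick, then evaluate.
window = HeapUsageWindow()
for _ in range(WINDOW_SIZE):
    window.add(70.0)
increase = should_increase_thresholds(window, cancellation_percent=4.5)
```

Requiring a full window before evaluating is a design choice of this sketch: it avoids flagging a node on only a few seconds of data after a restart.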

Logic to mark the RCA unhealthy to decrease the thresholds (Node level)

If the max heap used by the OpenSearch process is above 90% for a minute. Since the RCA runs at a 5-second interval, we will keep a sliding window of heapUsed values covering one minute.
And the heap-based task cancellations are less than 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)
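Taken together, the two node-level rules amount to a three-way decision. The sketch below is one possible reading, not the proposed implementation: `Recommendation` and `classify` are hypothetical names, and "below 85% for a minute" and "above 90% for a minute" are assumed to mean that every sample in the window satisfies the bound.

```python
from enum import Enum


class Recommendation(Enum):
    INCREASE = "increase"   # thresholds look too strict
    DECREASE = "decrease"   # thresholds look too lenient
    NONE = "none"           # node-level RCA stays healthy


def classify(heap_samples, cancellation_percent):
    """heap_samples: heap-used percentages covering the last minute.
    cancellation_percent: share of tasks cancelled for heap usage (assumed)."""
    if not heap_samples:
        return Recommendation.NONE
    # Heap stayed low, yet many tasks were cancelled: raise the thresholds.
    if max(heap_samples) < 85.0 and cancellation_percent > 3.0:
        return Recommendation.INCREASE
    # Heap stayed high, yet few tasks were cancelled: lower the thresholds.
    if min(heap_samples) > 90.0 and cancellation_percent < 3.0:
        return Recommendation.DECREASE
    return Recommendation.NONE
```

Under this reading, heap usage between 85% and 90%, or a cancellation rate of exactly 3%, produces no recommendation in either direction.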

Marking the cluster level RCAs unhealthy

We will mark the cluster-level RCA as unhealthy if any node in the cluster has had an unhealthy node-level RCA for an hour, with a cool-off period of one day.
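A minimal sketch of this rule, under two assumptions that the proposal does not spell out: "for an hour" means continuously unhealthy for an hour, and the cool-off means the cluster-level RCA is not flagged again for a day after it was last flagged. `ClusterRcaTracker` is a hypothetical name, and timestamps are plain seconds.

```python
ONE_HOUR = 60 * 60
ONE_DAY = 24 * ONE_HOUR


class ClusterRcaTracker:
    """Hypothetical tracker for the cluster-level RCA state."""

    def __init__(self):
        self._unhealthy_since = {}    # node id -> time the node turned unhealthy
        self._last_flagged_at = None  # last time the cluster RCA was flagged

    def update(self, node_id, node_unhealthy, now):
        if node_unhealthy:
            # Keep the earliest timestamp of the current unhealthy streak.
            self._unhealthy_since.setdefault(node_id, now)
        else:
            # A healthy report ends the streak.
            self._unhealthy_since.pop(node_id, None)

    def is_cluster_unhealthy(self, now):
        in_cool_off = (self._last_flagged_at is not None
                       and now - self._last_flagged_at < ONE_DAY)
        if in_cool_off:
            return False
        sustained = any(now - since >= ONE_HOUR
                        for since in self._unhealthy_since.values())
        if sustained:
            self._last_flagged_at = now
        return sustained
```

Resetting a node's streak on any healthy report is the strictest interpretation; a tolerance for brief healthy readings would be a reasonable variation.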

Adjusted SBP Settings

search_backpressure.search_task.total_heap_percent_threshold
search_backpressure.search_task.heap_percent_threshold
search_backpressure.search_task.heap_variance
search_backpressure.search_task.heap_moving_average_window_size
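To illustrate the single-multiplier idea, the sketch below scales a settings map by one recommended value. It rests on assumptions that are not part of the proposal: only the two `*_percent_threshold` settings are scaled, they are expressed as fractions capped at 1.0, and the values shown are placeholders rather than actual defaults. `apply_multiplier` is a hypothetical helper.

```python
def apply_multiplier(settings, multiplier):
    """Return a copy of settings with the heap percent thresholds scaled."""
    scaled = {}
    for name, value in settings.items():
        if name.endswith("_percent_threshold"):
            # Assumption: thresholds are fractions, so cap them at 1.0.
            scaled[name] = min(1.0, value * multiplier)
        else:
            # Assumption: variance and window size are left untouched.
            scaled[name] = value
    return scaled


# Placeholder values for illustration only.
current = {
    "search_backpressure.search_task.total_heap_percent_threshold": 0.05,
    "search_backpressure.search_task.heap_percent_threshold": 0.02,
    "search_backpressure.search_task.heap_variance": 2.0,
    "search_backpressure.search_task.heap_moving_average_window_size": 10,
}
recommended = apply_multiplier(current, multiplier=1.5)
```

Whether the variance and window-size settings should also respond to the multiplier is left open here.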

What alternatives have you considered?
The RCA framework is already in place; it runs as a sidecar and does not share the OpenSearch process's resources. The alternative would have been to place this logic in OpenSearch itself, but that could cause resource scarcity and performance degradation when the OpenSearch process is already under duress.



dblock · 2024-06-06T15:23:43Z

[Triage -- attendees 1, 2, 3, 4, 5, 6, 7]

Looks like a legit feature request, thanks for opening it.

kaushalmahi12 added the enhancement (New feature or request) and untriaged labels Jul 14, 2023

kaushalmahi12 changed the title ~~[FEATURE] Add autotune feature for heap based task cancellations by SearchBackpressureService~~ [FEATURE] Add support to recommend threshold tuning for heap based task cancellations by SearchBackpressureService Jul 14, 2023

dblock removed the untriaged label Jun 6, 2024
