Is your feature request related to a problem?
OpenSearch recently introduced a feature called search backpressure to make the service more resilient to node drops and performance degradation. It does so by cancelling resource-intensive search queries at the shard level and at the coordinator node level. To decide which queries to cancel, it relies on a set of threshold settings, grouped by the resource a query is using heavily. This feature request proposes adding support for recommending threshold tuning for the heap-based cancellation settings at both the shard and coordinator levels.
What solution would you like?
Each resource-based cancellation is governed by multiple settings. Recommending a value for every setting individually would complicate the solution and increase the number of RCAs we need to create, so we will recommend only a single value (a multiplier) by which the thresholds for a resource (heap, in this case) should increase or decrease. We will emit separate actions for the search task (coordinator) level and the shard level.
Logic to mark the RCA unhealthy to increase the thresholds (Node level)
If the maximum heap used by the OpenSearch process stays below 85% for one minute. Because the RCA runs at a 5-second interval, we will keep a one-minute sliding window of heapUsed values.
And heap-based task cancellations are above 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)
Logic to mark the RCA unhealthy to decrease the thresholds (Node level)
If the maximum heap used by the OpenSearch process stays above 90% for one minute. Because the RCA runs at a 5-second interval, we will keep a one-minute sliding window of heapUsed values.
And heap-based task cancellations are below 3%. (Rate limiters cap the number of cancellations: no more than 10% of all successful tasks can be cancelled, at both the shard level and the coordinator level.)
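The two node-level checks above could be sketched roughly as follows. This is an illustration only, not the RCA framework's actual API: the class and method names are hypothetical, and it assumes that "below 85% for a minute" means every sample in the window is below 85% (and likewise every sample above 90% for the decrease case). With a 5-second interval, a one-minute window holds 12 samples.

```python
from collections import deque
from enum import Enum


class Recommendation(Enum):
    NONE = "none"
    INCREASE = "increase"   # raise the heap thresholds
    DECREASE = "decrease"   # lower the heap thresholds


class HeapCancellationNodeRca:
    """Hypothetical node-level check; names are illustrative, not the RCA API."""

    def __init__(self, interval_s=5, window_s=60,
                 low_heap_pct=85.0, high_heap_pct=90.0, cancel_pct=3.0):
        # 60 s window / 5 s interval = 12 heapUsed samples.
        self.samples = deque(maxlen=window_s // interval_s)
        self.low_heap_pct = low_heap_pct
        self.high_heap_pct = high_heap_pct
        self.cancel_pct = cancel_pct

    def record(self, heap_used_pct):
        """Store one heapUsed sample (as a percentage of max heap)."""
        self.samples.append(heap_used_pct)

    def evaluate(self, cancellation_pct):
        """Return a recommendation once a full minute of samples is available."""
        if len(self.samples) < self.samples.maxlen:
            return Recommendation.NONE
        # Assumption: the condition must hold for every sample in the window.
        if max(self.samples) < self.low_heap_pct and cancellation_pct > self.cancel_pct:
            return Recommendation.INCREASE
        if min(self.samples) > self.high_heap_pct and cancellation_pct < self.cancel_pct:
            return Recommendation.DECREASE
        return Recommendation.NONE
```

Heap usage between 85% and 90% falls into neither branch, so the sketch makes no recommendation there.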
Marking the cluster level RCAs unhealthy
We will mark the cluster-level RCA as unhealthy if any node in the cluster has had an unhealthy node-level RCA for an hour, with a cool-off period of one day.
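A rough sketch of this cluster-level rule is given below. It rests on two assumptions the text above does not spell out: "unhealthy for an hour" is read as continuously unhealthy for 3600 seconds, and the cool-off is read as "after the cluster RCA is flagged, do not flag it again for one day". The class and method names are hypothetical, and timestamps are passed in explicitly as seconds.

```python
class ClusterRcaTracker:
    """Hypothetical cluster-level aggregation of node-level RCA states."""

    def __init__(self, unhealthy_for_s=3600, cool_off_s=86400):
        self.unhealthy_for_s = unhealthy_for_s
        self.cool_off_s = cool_off_s
        self.unhealthy_since = {}    # node_id -> time the node became unhealthy
        self.last_flagged_at = None  # time the cluster RCA was last flagged

    def update(self, node_id, node_unhealthy, now_s):
        """Record the latest node-level RCA state for one node."""
        if node_unhealthy:
            # Keep the earliest timestamp of the current unhealthy streak.
            self.unhealthy_since.setdefault(node_id, now_s)
        else:
            # A healthy report resets the streak for that node.
            self.unhealthy_since.pop(node_id, None)

    def is_cluster_unhealthy(self, now_s):
        """Flag the cluster if any node has been unhealthy for an hour."""
        in_cool_off = (self.last_flagged_at is not None
                       and now_s - self.last_flagged_at < self.cool_off_s)
        if in_cool_off:
            return False
        for since in self.unhealthy_since.values():
            if now_s - since >= self.unhealthy_for_s:
                self.last_flagged_at = now_s
                return True
        return False
```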
What alternatives have you considered?
The RCA framework is already in place; it runs as a sidecar and does not share the OpenSearch process's resources. The alternative would have been to place this logic inside OpenSearch itself, but that could cause resource scarcity and further degrade the performance of an OpenSearch process that is already under duress.
Do you have any additional context?
Add any other context or screenshots about the feature request here.
kaushalmahi12 changed the title from "[FEATURE] Add autotune feature for heap based task cancellations by SearchBackpressureService" to "[FEATURE] Add support to recommend threshold tuning for heap based task cancellations by SearchBackpressureService" on Jul 14, 2023
Adjusted SBP Settings
search_backpressure.search_task.total_heap_percent_threshold
search_backpressure.search_task.heap_percent_threshold
search_backpressure.search_task.heap_variance
search_backpressure.search_task.heap_moving_average_window_size
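As an illustration of how a single recommended multiplier might be applied to the settings listed above, here is a minimal sketch. The issue does not say which of the four settings the multiplier scales, so the sketch assumes only the two percent thresholds are scaled and that `heap_variance` and `heap_moving_average_window_size` are left untouched. It also assumes the percent thresholds are expressed as fractions between 0 and 1. The function name is hypothetical.

```python
# Assumption: only these two settings are scaled by the multiplier.
PERCENT_THRESHOLD_KEYS = (
    "search_backpressure.search_task.total_heap_percent_threshold",
    "search_backpressure.search_task.heap_percent_threshold",
)


def scale_thresholds(current, multiplier):
    """Return a copy of `current` with the percent thresholds scaled.

    `current` maps setting names to their present values. The input
    mapping is not modified.
    """
    if multiplier <= 0:
        raise ValueError("multiplier must be positive")
    scaled = dict(current)
    for key in PERCENT_THRESHOLD_KEYS:
        if key in scaled:
            # Assumption: thresholds are fractions, so clamp at 1.0.
            scaled[key] = min(1.0, scaled[key] * multiplier)
    return scaled
```

A multiplier above 1 loosens the thresholds (fewer cancellations) and a multiplier below 1 tightens them, which matches the increase and decrease recommendations described earlier.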