Add searchbp metrics to Performance Analyzer #5390

Merged 5 commits on Feb 13, 2024.
Changes from 1 commit
184 changes: 165 additions & 19 deletions _monitoring-your-cluster/pa/reference.md
@@ -821,27 +821,173 @@
</tbody>
</table>

## Relevant dimensions: `NodeID`, `searchbp_mode`

<table>
<thead style="text-align: left">
<tr>
<th>Metric</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>searchbp_shard_stats_cancellationCount
</td>
<td>The number of tasks marked for cancellation on the shard task.
Review comment (Collaborator): "Shard task" or just "shard"?

Reply: This has been recently renamed to shard_task. opensearch-project/performance-analyzer@81eae43
</td>
</tr>
<tr>
<td>searchbp_shard_stats_limitReachedCount
</td>
<td>The number of times when the cancellable task total exceeded the set cancellation threshold on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_heap_usage_cancellationCount
</td>
<td>The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_heap_usage_currentMax
</td>
<td>The maximum heap usage for tasks currently running on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_heap_usage_rollingAvg
</td>
<td> The rolling average heap usage for the _n_ most recent tasks on the shard task. The default value for _n_ is 100.

Check failure on line 861 in _monitoring-your-cluster/pa/reference.md (GitHub Actions / vale, reported at columns 50 and 113):

[OpenSearch.Spelling] Error: _n_. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_cpu_usage_cancellationCount
Review comment: @kaushalmahi12: We follow the CamelCase naming convention for metric names. Can you provide the same for these metrics?

Reply (Contributor Author): These look right to me. Can you please indicate which ones are incorrect?

Review comment: For reference, `CPU_Utilization`: the metric name begins with an uppercase letter and uses `_` as a separator. Following the same pattern, `searchbp_shard_stats_resource_cpu_usage_cancellationCount` becomes `SearchBP_Shard_Stats_Resource_CPU_Usage_CancellationCount`.

Reply (Contributor Author): Thanks, @khushbr! @Naarcha-AWS will do the doc review and merge.
</td>
<td>The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the shard task.
Suggested change (Collaborator): The number of tasks marked for cancellation because of excessive CPU usage since the last restart of the node containing the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_cpu_usage_currentMax
</td>
<td>The maximum CPU time for all tasks currently running on the node on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_cpu_usage_currentAvg
</td>
<td>The average CPU time for all tasks currently running on the node on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_elaspedtime_usage_cancellationCount
</td>
<td>The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_elaspedtime_usage_currentMax
</td>
<td>The maximum elapsed time for all tasks currently running on the node on the shard task.
</td>
</tr>
<tr>
<td>searchbp_shard_stats_resource_elaspedtime_usage_currentAvg
</td>
<td>The average elapsed time for all tasks currently running on the node on the shard task.
Suggested change (Collaborator): The average time elapsed for all tasks currently running on the node containing the shard task.
</td>
</tr>
<tr>
<td>searchbp_task_stats_cancellationCount
</td>
<td>The number of tasks marked for cancellation on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_limitReachedCount
</td>
<td>The number of times when the cancellable task total exceeded the set cancellation threshold on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_heap_usage_cancellationCount
</td>
<td>The number of tasks marked for cancellation because of excessive heap usage since the node last restarted on the search task level.
Suggested change (Collaborator): The number of tasks marked for cancellation because of excessive heap usage since the last restart of the node containing the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_heap_usage_currentMax
</td>
<td>The maximum heap usage for tasks currently running on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_heap_usage_rollingAvg
</td>
<td> The rolling average heap usage for the _n_ most recent tasks on the search task level. The default value for _n_ is 10.

Check failure on line 927 in _monitoring-your-cluster/pa/reference.md (GitHub Actions / vale, reported at columns 50 and 120):

[OpenSearch.Spelling] Error: _n_. If you are referencing a setting, variable, format, function, or repository, surround it with tic marks.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_cpu_usage_cancellationCount
</td>
<td>The number of tasks marked for cancellation because of excessive CPU usage since the node last restarted on the search task level.
Suggested change (Collaborator): The number of tasks marked for cancellation because of excessive CPU usage since the last restart of the node containing the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_cpu_usage_currentMax
</td>
<td>The maximum CPU time for all tasks currently running on the node on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_cpu_usage_currentAvg
</td>
<td>The average CPU time for all tasks currently running on the node on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_elaspedtime_usage_cancellationCount
</td>
<td>The number of tasks marked for cancellation because of excessive elapsed time since the node last restarted on the search task level.
Suggested change (Collaborator): The number of tasks marked for cancellation because of excessive elapsed time since the last restart of the node containing the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_elaspedtime_usage_currentMax
</td>
<td>The maximum elapsed time for all tasks currently running on the node on the search task level.
</td>
</tr>
<tr>
<td>searchbp_task_stats_resource_elaspedtime_usage_currentAvg
</td>
<td>The average elapsed time for all tasks currently running on the node on the search task level.
</td>
</tr>
</tbody>
</table>
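
The `rollingAvg` metrics in the table above describe an average over the `n` most recent tasks. The following is a minimal sketch of that idea using a fixed-size window. It is not the Performance Analyzer or OpenSearch implementation; the class name `RollingAverage` and the sample values are hypothetical.

```python
from collections import deque


class RollingAverage:
    """Average of the `window` most recent observations (hypothetical helper)."""

    def __init__(self, window: int) -> None:
        if window <= 0:
            raise ValueError("window must be positive")
        # A deque with maxlen drops its oldest value automatically on append.
        self._values = deque(maxlen=window)
        self._total = 0.0

    def record(self, value: float) -> float:
        """Add one observation and return the current rolling average."""
        # If the window is full, subtract the value that is about to be evicted.
        if len(self._values) == self._values.maxlen:
            self._total -= self._values[0]
        self._values.append(value)
        self._total += value
        return self._total / len(self._values)


# Illustrative heap usage samples with a window of 3 tasks.
avg = RollingAverage(window=3)
for sample in (10, 20, 30, 40):
    latest = avg.record(sample)
print(latest)
```

With a window of 3, the fourth sample evicts the first one, so the final average covers only the last three values.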


## Dimensions reference

| Dimension | Return values |
|----------------------|-------------------------------------------------|
| `ShardID` | The ID of the shard, for example, `1`. |
| `IndexName` | The name of the index, for example, `my-index`. |
| `Operation` | The type of operation, for example, `shardbulk`. |
| `ShardRole` | The shard role, for example, `primary` or `replica`. |
| `Exception` | OpenSearch exceptions, for example, `org.opensearch.index_not_found_exception`. |
| `Indices` | The list of indexes in the request URL. |
| `HTTPRespCode` | The response code from OpenSearch, for example, `200`. |
| `MemType` | The memory type, for example, `totYoungGC`, `totFullGC`, `Survivor`, `PermGen`, `OldGen`, `Eden`, `NonHeap`, or `Heap`. |
| `DiskName` | The name of the disk, for example, `sda1`. |
| `DestAddr` | The destination address, for example, `010015AC`. |
| `Direction` | The direction, for example, `in` or `out`. |
| `ThreadPoolType` | The OpenSearch thread pools, for example, `index`, `search`, or `snapshot`. |
| `CBType` | The circuit breaker type, for example, `accounting`, `fielddata`, `in_flight_requests`, `parent`, or `request`. |
| `ClusterManagerTaskInsertOrder`| The order in which the task was inserted, for example, `3691`. |
| `ClusterManagerTaskPriority` | The priority of the task, for example, `URGENT`. OpenSearch executes higher-priority tasks before lower-priority ones, regardless of `insert_order`. |
| `ClusterManagerTaskType` | The task type, for example, `shard-started`, `create-index`, `delete-index`, `refresh-mapping`, `put-mapping`, `CleanupSnapshotRestoreState`, or `Update snapshot state`. |
| `ClusterManagerTaskMetadata` | The metadata for the task (if any). |
| `CacheType` | The cache type, for example, `Field_Data_Cache`, `Shard_Request_Cache`, or `Node_Query_Cache`. |
| `NodeID` | The ID of the node. |
| `searchbp_mode` | The search backpressure mode, for example, `monitor_only` (default), `enforced`, or `disabled`. |
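
As a hedged illustration of how these metrics and dimensions might be combined, the sketch below builds a request URL for the Performance Analyzer metrics endpoint. It assumes the endpoint is served on port 9600 at `/_plugins/_performanceanalyzer/metrics` and accepts the `metrics`, `agg`, `dim`, and `nodes` query parameters. The host, the `sum` aggregation, and the helper name `build_pa_metrics_url` are hypothetical, and the dimension names follow the "Relevant dimensions" heading above, so verify their exact casing against your cluster.

```python
from urllib.parse import urlencode


def build_pa_metrics_url(host, metrics, aggs, dims, port=9600, all_nodes=True):
    """Build a Performance Analyzer metrics URL (hypothetical helper)."""
    params = {
        "metrics": ",".join(metrics),
        "agg": ",".join(aggs),
        "dim": ",".join(dims),
    }
    if all_nodes:
        params["nodes"] = "all"
    # safe="," keeps the comma separators readable instead of encoding them as %2C.
    query = urlencode(params, safe=",")
    return f"http://{host}:{port}/_plugins/_performanceanalyzer/metrics?{query}"


url = build_pa_metrics_url(
    "localhost",  # assumed host
    metrics=["searchbp_shard_stats_cancellationCount"],
    aggs=["sum"],  # assumed aggregation
    dims=["NodeID", "searchbp_mode"],
)
print(url)
```

The resulting URL can be requested with any HTTP client. Keeping URL construction separate from the request makes it easy to check the query string before sending it to a cluster.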