Additional recommended alerts #1135

ravindk89 · 2024-02-20T20:14:51Z

Summary

From an internal discussion, we should expand the alerting page to include the following list of recommended metrics:

| Metric | Description |
|---|---|
| `minio_node_drive_free_bytes` | Total storage available on a drive. |
| `minio_node_drive_free_inodes` | Total free inodes. |
| `minio_node_drive_latency_us` | Average last-minute latency in µs for drive API storage operations. |
| `minio_node_drive_offline_total` | Total drives offline in this node. |
| `minio_node_drive_online_total` | Total drives online in this node. |
| `minio_node_drive_total` | Total drives in this node. |
| `minio_node_drive_total_bytes` | Total storage on a drive. |
| `minio_node_drive_used_bytes` | Total storage used on a drive. |
| `minio_node_drive_errors_timeout` | Total number of drive timeout errors since server start. |
| `minio_node_drive_errors_availability` | Total number of drive I/O errors, permission denied errors, and timeouts since server start. |
| `minio_node_drive_io_waiting` | Total number of I/O operations waiting on a drive. |

There are a lot of metrics here and the page already has some examples, so I'm thinking we can use a tab setup, something like

| Example Alerts | Recommended Alerts |

to help constrain the default length of the procedure.

Goals

Add alert examples matching the metrics above
Possibly tab out or otherwise organize page for readability
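
As a rough illustration of the kind of condition a recommended alert on these metrics might encode, here is a minimal Python sketch of a low-free-space check built from `minio_node_drive_free_bytes` and `minio_node_drive_total_bytes`. The sample values are copied from the node endpoint output later in this thread; the 10% threshold and the function name `low_space_drives` are assumptions for illustration only, not recommended values or part of MinIO.

```python
# Per-drive samples, copied from the node endpoint output in this thread.
samples = {
    "/disk1/data": {"free_bytes": 3.9700221952e10, "total_bytes": 4.292870144e10},
    "/disk2/data": {"free_bytes": 4.0129953792e10, "total_bytes": 4.292870144e10},
}

# Hypothetical threshold: flag a drive when less than 10% of it is free.
MIN_FREE_RATIO = 0.10


def low_space_drives(samples, min_free_ratio=MIN_FREE_RATIO):
    """Return (drive, free_ratio) pairs for drives below the threshold."""
    flagged = []
    for drive, metrics in sorted(samples.items()):
        total = metrics["total_bytes"]
        if total <= 0:
            # Skip drives that report no capacity to avoid dividing by zero.
            continue
        ratio = metrics["free_bytes"] / total
        if ratio < min_free_ratio:
            flagged.append((drive, ratio))
    return flagged


# Both sample drives are roughly 92% free, so nothing is flagged.
print(low_space_drives(samples))
```

An actual alert would express the same comparison as a Prometheus alerting rule rather than in Python; the sketch only shows the arithmetic behind it.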

Non-Goals

Extensive testing of Prometheus + Alert Manager w/ the above metrics

Closes #1135

ravindk89 · 2024-03-08T16:28:13Z

@kannappanr some assistance:

curl --retry 10 -L -X GET https://play.min.io/minio/v2/metrics/cluster | grep -E '^minio_[\s a-z _]*_drive'
  
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
minio_cluster_health_erasure_set_healing_drives{pool="0",server="play.min.io:9000",set="0"} 0
minio_cluster_health_erasure_set_online_drives{pool="0",server="play.min.io:9000",set="0"} 4

Most of the recommended list as discussed does not appear in cluster metrics.

They do appear for the node endpoint:

minio_node_drive_errors_availability{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_availability{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_errors_timeout{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_free_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.0129953792e+10
minio_node_drive_free_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.0129642496e+10
minio_node_drive_free_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.013072384e+10
minio_node_drive_free_inodes{drive="/disk1/data",server="play.min.io:9000"} 2.0950584e+07
minio_node_drive_free_inodes{drive="/disk2/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk3/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_free_inodes{drive="/disk4/data",server="play.min.io:9000"} 2.0950777e+07
minio_node_drive_io_waiting{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk2/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk3/data",server="play.min.io:9000"} 0
minio_node_drive_io_waiting{drive="/disk4/data",server="play.min.io:9000"} 0
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk1/data",server="play.min.io:9000"} 3600
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk2/data",server="play.min.io:9000"} 3868
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk3/data",server="play.min.io:9000"} 3454
minio_node_drive_latency_us{api="storage.CreateFile",drive="/disk4/data",server="play.min.io:9000"} 4263
minio_node_drive_latency_us{api="storage.Delete",drive="/disk1/data",server="play.min.io:9000"} 35
minio_node_drive_latency_us{api="storage.Delete",drive="/disk2/data",server="play.min.io:9000"} 34
minio_node_drive_latency_us{api="storage.Delete",drive="/disk3/data",server="play.min.io:9000"} 32
minio_node_drive_latency_us{api="storage.Delete",drive="/disk4/data",server="play.min.io:9000"} 45
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk1/data",server="play.min.io:9000"} 30
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk2/data",server="play.min.io:9000"} 38
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk3/data",server="play.min.io:9000"} 25
minio_node_drive_latency_us{api="storage.DiskInfo",drive="/disk4/data",server="play.min.io:9000"} 39
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk1/data",server="play.min.io:9000"} 1000
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk2/data",server="play.min.io:9000"} 615
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk3/data",server="play.min.io:9000"} 643
minio_node_drive_latency_us{api="storage.ListVols",drive="/disk4/data",server="play.min.io:9000"} 2280
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk2/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadFileStream",drive="/disk3/data",server="play.min.io:9000"} 64
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk1/data",server="play.min.io:9000"} 58
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk3/data",server="play.min.io:9000"} 49
minio_node_drive_latency_us{api="storage.ReadXL",drive="/disk4/data",server="play.min.io:9000"} 71
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk1/data",server="play.min.io:9000"} 802
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk2/data",server="play.min.io:9000"} 1039
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk3/data",server="play.min.io:9000"} 868
minio_node_drive_latency_us{api="storage.RenameData",drive="/disk4/data",server="play.min.io:9000"} 1075
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk1/data",server="play.min.io:9000"} 41
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk2/data",server="play.min.io:9000"} 60
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk3/data",server="play.min.io:9000"} 20
minio_node_drive_latency_us{api="storage.StatVol",drive="/disk4/data",server="play.min.io:9000"} 33
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk1/data",server="play.min.io:9000"} 234
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk2/data",server="play.min.io:9000"} 329
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk3/data",server="play.min.io:9000"} 465
minio_node_drive_latency_us{api="storage.WalkDir",drive="/disk4/data",server="play.min.io:9000"} 632
minio_node_drive_offline_total{server="play.min.io:9000"} 0
minio_node_drive_online_total{server="play.min.io:9000"} 4
minio_node_drive_total{server="play.min.io:9000"} 4
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09
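
The comparison above can be sketched in a few lines of Python: extract the metric names from a scrape and report which of the recommended names are absent. This is a hedged sketch, not a tool from the MinIO repository. The helper names `metric_names` and `missing_from` are made up for illustration, and the node text is a three-line excerpt of the output above rather than the full scrape.

```python
import re

# The recommended metric names from the issue summary.
RECOMMENDED = [
    "minio_node_drive_free_bytes",
    "minio_node_drive_free_inodes",
    "minio_node_drive_latency_us",
    "minio_node_drive_offline_total",
    "minio_node_drive_online_total",
    "minio_node_drive_total",
    "minio_node_drive_total_bytes",
    "minio_node_drive_used_bytes",
    "minio_node_drive_errors_timeout",
    "minio_node_drive_errors_availability",
    "minio_node_drive_io_waiting",
]

# Drive-related lines returned by the cluster endpoint, copied from above.
CLUSTER_SCRAPE = """\
minio_cluster_drive_offline_total{server="play.min.io:9000"} 0
minio_cluster_drive_online_total{server="play.min.io:9000"} 4
minio_cluster_drive_total{server="play.min.io:9000"} 4
minio_cluster_health_erasure_set_healing_drives{pool="0",server="play.min.io:9000",set="0"} 0
minio_cluster_health_erasure_set_online_drives{pool="0",server="play.min.io:9000",set="0"} 4
"""

# A short excerpt of the node endpoint output (not the full scrape).
NODE_EXCERPT = """\
minio_node_drive_errors_availability{drive="/disk1/data",server="play.min.io:9000"} 0
minio_node_drive_free_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.9700221952e+10
minio_node_drive_offline_total{server="play.min.io:9000"} 0
"""


def metric_names(text):
    """Collect the metric names found in Prometheus text-format output."""
    names = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = re.match(r"[a-zA-Z_:][a-zA-Z0-9_:]*", line)
        if match:
            names.add(match.group(0))
    return names


def missing_from(scrape_text, wanted):
    """Return the wanted metric names that do not appear in the scrape."""
    present = metric_names(scrape_text)
    return [name for name in wanted if name not in present]


# Every recommended name is node-scoped, so none appear in the cluster scrape.
print(missing_from(CLUSTER_SCRAPE, RECOMMENDED))
print(sorted(metric_names(NODE_EXCERPT)))
```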

We had previously discussed de-emphasizing the node-level metrics because they should be included in the cluster endpoint as a rollup - is this a bug? cc/ @donatello @shtripat as I think you both have some experience here

ravindk89 · 2024-03-08T16:29:51Z

https://github.com/minio/minio/blob/master/docs/metrics/prometheus/list.md#drive-metrics

basically very few of these seem to roll up properly

Partially addresses #1135. To consider: I added the tabs as part of step 3 of the procedure, but we might want a recommended alerts section separate from the procedure, perhaps above the "Dashboards" heading. Let me know your thoughts.

bh4t · 2024-04-09T17:08:38Z

@kannappanr can you please assist here?

ravindk89 · 2024-04-09T17:33:43Z

This might be somewhat resolved with metrics v3, but until we've had enough time for customers to roll past that, we will need to maintain both:

Recommended alerts for metrics v2
Recommended alerts for metrics v3

And then fixups to ensure that node-level metrics are rolled up appropriately

allanrogerr · 2024-04-23T19:01:35Z

On metrics v3:
These node metrics do not roll up to any cluster metrics:

Total used inodes on a drive
Total free inodes on a drive
Total inodes available on a drive
Average last minute latency in µs for drive API storage operations
Total timeout errors on a drive
Total availability errors (I/O errors, timeouts) on a drive
Total waiting I/O operations on a drive

Node metric "Total storage available on a drive in bytes" rolls up to cluster metrics:

	Total cluster usable storage capacity in bytes
	Total cluster raw storage capacity in bytes

Node metric "Total storage free on a drive in bytes" rolls up to cluster metrics:

	Total cluster usable storage free in bytes
	Total cluster raw storage free in bytes

Node metric "Total storage used on a drive in bytes" rolls up to cluster metric:

	Total cluster usage in bytes

Node metric "Count of offline drives" rolls up to cluster metric:

	Count of offline drives in the cluster

Node metric "Count of online drives" rolls up to cluster metric:

	Count of online drives in the cluster

Node metric "Count of all drives" rolls up to cluster metric:

	Count of all drives in the cluster
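
To make "roll up" concrete, here is a minimal Python sketch that sums per-drive node samples into a single total per metric name. It uses the v2 sample lines posted earlier in this thread, and the plain sum is an assumption for illustration: it corresponds at best to a raw total, and the thread does not say how MinIO derives its usable-capacity cluster metrics. The function name `rollup` is hypothetical.

```python
import re
from collections import defaultdict

# Per-drive samples copied from the node endpoint output earlier in the thread.
NODE_SCRAPE = """\
minio_node_drive_total_bytes{drive="/disk1/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk2/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk3/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_total_bytes{drive="/disk4/data",server="play.min.io:9000"} 4.292870144e+10
minio_node_drive_used_bytes{drive="/disk1/data",server="play.min.io:9000"} 3.228479488e+09
minio_node_drive_used_bytes{drive="/disk2/data",server="play.min.io:9000"} 2.798747648e+09
minio_node_drive_used_bytes{drive="/disk3/data",server="play.min.io:9000"} 2.799058944e+09
minio_node_drive_used_bytes{drive="/disk4/data",server="play.min.io:9000"} 2.7979776e+09
"""

# Matches "name{labels} value" or "name value" in Prometheus text format.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{[^}]*\})?\s+(?P<value>\S+)$'
)


def rollup(text):
    """Sum sample values per metric name, ignoring the per-drive labels."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = SAMPLE_LINE.match(line.strip())
        if match:
            totals[match.group("name")] += float(match.group("value"))
    return dict(totals)


totals = rollup(NODE_SCRAPE)
print(totals["minio_node_drive_used_bytes"])   # raw bytes used across 4 drives
print(totals["minio_node_drive_total_bytes"])  # raw capacity across 4 drives
```

Metrics such as per-API latency or error counts have no obvious single aggregate (sum, max, and average all answer different questions), which may be part of why they have no cluster-level counterpart.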

ravindk89 · 2024-05-06T18:16:27Z

@kannappanr @anjalshireesh was there still progress on addressing the metrics v2 rollups above, or should we just proceed with documenting the node-level ones for now?

Otherwise we can just focus on the cluster rollups that do work and drop the rest until v3 stabilizes.

feorlen · 2024-06-13T14:47:34Z

re: v2 rollup, customer reported these metrics were "missing" after upgrade because they are now found under minio/v2/metrics/node

minio_cluster_replication_link_offline_duration_seconds
minio_cluster_replication_link_online
minio_cluster_replication_current_active_workers
minio_cluster_replication_current_link_latency_ms
minio_cluster_replication_recent_backlog_count
minio_cluster_replication_last_minute_queued_count
minio_cluster_replication_credential_errors
minio_cluster_replication_current_transfer_rate
minio_cluster_replication_last_minute_queued_bytes
minio_cluster_replication_max_queued_count

ravindk89 · 2024-06-13T14:52:02Z

@kannappanr @anjalshireesh are we generally going to leave metrics v2 as-is for now then, and focus on metrics v3? Our attempt to document the recommended alerts gets flaky because we do not list the /node metrics at all, since historically those have not been recommended for use.

feorlen · 2024-06-14T15:17:35Z
