Add storage counters related to errors #1091

dplore · 2024-04-09T04:55:09Z

Change Scope

Add /components/component/storage/state/counters/ as a container to represent storage device counters.
Add a few select leaves to this container, based on commonly available SMART telemetry, that focus on detecting storage device errors.
Since this only adds leaves, the change is backwards compatible.

Operational use case

While rare, in a large population of devices, storage errors have led to devices becoming unhealthy, unable to accept software updates, or unable to boot due to non-volatile media (flash, SSD) errors. These counters measure the accumulation of storage errors as a statistic for storage component health.

Note: /components/component/healthz/state/status is also a useful data point, but as a boolean-only value it is very coarse. Storage counters can be used to predict that a storage device will fail in the future.

Tree View

module: openconfig-platform
  +--rw components
     +--rw component* [name]

[... snip ...]

         +--rw storage
         |  +--rw config
         |  +--ro state
+        |     +--ro oc-storage:counters
+        |        +--ro oc-storage:soft-read-error-rate?                  oc-yang:counter64
+        |        +--ro oc-storage:reallocated-sectors?                   oc-yang:counter64
+        |        +--ro oc-storage:end-to-end-error?                      oc-yang:counter64
+        |        +--ro oc-storage:offline-uncorrectable-sectors-count?   oc-yang:counter64
+        |        +--ro oc-storage:life-left?                             uint8
+        |        +--ro oc-storage:percentage-used?                       uint8
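
For illustration, the telemetry paths implied by the tree above could be built as in the following minimal Python sketch. The leaf names are copied from the tree view; the component name `Storage0` and the helper `counter_paths` are hypothetical, and the `[name=...]` key syntax is assumed to follow the usual gNMI string-path convention.

```python
# Leaf names copied from the tree view in this PR description.
COUNTER_LEAVES = [
    "soft-read-error-rate",
    "reallocated-sectors",
    "end-to-end-error",
    "offline-uncorrectable-sectors-count",
    "life-left",
    "percentage-used",
]


def counter_paths(component_name: str) -> list[str]:
    """Return one path string per storage counter leaf for a component.

    `component_name` is whatever key the device uses for its storage
    component; "Storage0" below is only an illustrative value.
    """
    base = f"/components/component[name={component_name}]/storage/state/counters"
    return [f"{base}/{leaf}" for leaf in COUNTER_LEAVES]


if __name__ == "__main__":
    for path in counter_paths("Storage0"):
        print(path)
```

A collector could subscribe to the container path itself rather than to each leaf; the per-leaf list is shown only to make the resulting paths explicit.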

Platform Implementations

Linux smartmontools - https://www.smartmontools.org/
Cisco IOS XR implements syslog messages for MEDIA errors
Arista EOS exposes storage related errors via dmesg and console
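
As a rough sketch of how a Linux implementation might source these leaves from smartmontools, the Python below parses text in the column layout printed by `smartctl -A` and maps SMART attribute IDs to leaf names. The ID-to-leaf mapping (5, 13, 184, 198) is an assumption based on commonly documented ATA attribute IDs, not something defined by this PR, and the sample text is made up for illustration; real drives vary in which attributes they report.

```python
import re

# Assumed mapping from ATA SMART attribute ID to the leaves in this PR.
# Attribute IDs are vendor-dependent; treat this table as illustrative.
ID_TO_LEAF = {
    5: "reallocated-sectors",
    13: "soft-read-error-rate",
    184: "end-to-end-error",
    198: "offline-uncorrectable-sectors-count",
}

# Hypothetical sample in the `smartctl -A` attribute table layout.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1234
184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       2
"""


def parse_smart_attributes(text: str) -> dict[str, int]:
    """Extract raw values for the mapped attribute IDs from table text."""
    counters: dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        leaf = ID_TO_LEAF.get(int(fields[0]))
        if leaf is None:
            continue
        # RAW_VALUE may carry trailing annotations; keep the leading integer.
        match = re.match(r"\d+", fields[9])
        if match:
            counters[leaf] = int(match.group())
    return counters


if __name__ == "__main__":
    print(parse_smart_attributes(SAMPLE))
```

Attributes the drive does not report simply never appear in the result, so the caller can leave the corresponding leaves unset.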

OpenConfigBot · 2024-04-09T04:56:37Z

No major YANG version changes in commit 3983f2b

sulrich · 2024-04-11T01:14:15Z

pyang tree output

     +--ro mount-points
     |  +--ro mount-point* [name]
     |     +--ro name     -> ../state/name
     |     +--ro state
     |        +--ro name?                string
     |        +--ro storage-component?   -> /oc-platform:components/component/name
     |        +--ro size?                uint64
     |        +--ro available?           uint64
     |        +--ro utilized?            uint64
     |        +--ro counters
     |           +--ro io-errors?   uint64

earies

Could you elaborate precisely what something like this error counter would map to for an underlying OS (e.g. Linux) implementation?

This also is attempting to categorize per mount point vs. device (block)

release/models/system/openconfig-system.yang

dplore · 2024-04-25T16:50:42Z

Addressed comments. This is now ready for review.

release/models/platform/openconfig-platform.yang

release/models/platform/openconfig-platform-storage.yang

dplore · 2024-05-08T00:21:21Z

This was reviewed in the Apr 9, 2024 OC Operators meeting without objection. I addressed the latest comments and am now placing this on last-call for merge on May 21, 2024.

LimeHat · 2024-05-08T00:57:49Z

  "This values increments when an I/O request completes with a
    failure.  This value corresponds to 'discard I/Os' on the linux
    kernel block layer statistics.

Is there a reference that can confirm that statement?

My understanding is that "discard i/o" is just another type of i/o operation, which is often used with SSD drives (see also fstrim man); and not an error.

dplore · 2024-05-08T01:07:06Z

>   "This values increments when an I/O request completes with a
>     failure.  This value corresponds to 'discard I/Os' on the linux
>     kernel block layer statistics.
>
> Is there a reference that can confirm that statement?
>
> My understanding is that "discard i/o" is just another type of i/o operation, which is often used with SSD drives (see also fstrim man); and not an error.

Here's the reference I found: https://www.kernel.org/doc/Documentation/block/stat.txt

LimeHat · 2024-05-08T01:10:35Z

I checked that link, yes, but there's no indication that discard i/o is related to errors in any way.

If anything, it confirms my understanding, since they describe the discard operations in the same way as read/write e.g.:

read sectors, write sectors, discard_sectors
============================================
These values count the number of sectors read from, written to, or
discarded from this block device.  The "sectors" in question are the
standard UNIX 512-byte sectors, not any device- or filesystem-specific
block size.  The counters are incremented when the I/O completes.
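
To make the point concrete, here is a small Python sketch that labels the fields of a block-device `stat` line using the field order documented in the kernel's block/stat.txt. The sample line is invented for illustration, and the snake_case field names are this sketch's own labels. Note that every field is an operation count, sector count, or time value; none of them is an error counter.

```python
# Field order as documented in the kernel's Documentation/block/stat.txt.
# Older kernels print fewer fields (no discard or flush columns).
STAT_FIELDS = [
    "read_ios", "read_merges", "read_sectors", "read_ticks",
    "write_ios", "write_merges", "write_sectors", "write_ticks",
    "in_flight", "io_ticks", "time_in_queue",
    "discard_ios", "discard_merges", "discard_sectors", "discard_ticks",
    "flush_ios", "flush_ticks",
]

# Hypothetical contents of /sys/block/<dev>/stat.
SAMPLE_STAT = "1520 30 48000 410 980 55 31000 620 0 700 1030 7 0 2048 3 12 9"


def parse_block_stat(line: str) -> dict[str, int]:
    """Pair each value with its documented field name.

    zip() stops at the shorter sequence, so lines from older kernels
    with fewer columns are handled without error.
    """
    return dict(zip(STAT_FIELDS, (int(v) for v in line.split())))


if __name__ == "__main__":
    stats = parse_block_stat(SAMPLE_STAT)
    # Sectors in this file are always 512 bytes, per the quoted text.
    print("discard I/Os:", stats["discard_ios"])
    print("discarded bytes:", stats["discard_sectors"] * 512)
```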

LimeHat · 2024-05-08T01:13:47Z

Another ref: blkdiscard
https://man7.org/linux/man-pages/man8/blkdiscard.8.html

earies · 2024-05-08T22:12:04Z

For the driving use-case (trying to understand when storage media is having issues), maybe it's best to narrow this in via SMART data vs. what is exposed in sysfs

@dplore - maybe best to align w/ what precisely is being monitored in your compute environment for such case?

LimeHat · 2024-05-31T21:34:35Z

I agree with the suggestion to use SMART.

dplore · 2024-06-05T18:29:28Z

> For the driving use-case (trying to understand when storage media is having issues), maybe it's best to narrow this in via SMART data vs. what is exposed in sysfs
>
> @dplore - maybe best to align w/ what precisely is being monitored in your compute environment for such case?

Ironically the current description is aligned with at least one use case for what is monitored in one of our network environments. I do agree that SMART is a better data set to base this on and will refactor for that.

dplore · 2024-08-27T01:27:12Z

Updated this PR to use a select few SMART counters. I appreciate any feedback on this approach.

Note, I used https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes as a reference for picking counters which are related to storage failures. There are 10 attributes noted for predicting/measuring failures. I modeled 4 so far. If we like this approach, I could add all 10 or some subset, based on feedback.

earies · 2024-09-05T22:52:01Z

Overall I think your latest patch is a better approach, but it should be noted that implementation of these attributes is up to the drive manufacturer, so support is going to vary within platforms of the same vendor and across vendors.

For example, just grabbing one variant of SSD we ship today, 2 of the 4 attributes listed here are supported. The model should note that these leaf nodes should be supported if the underlying hardware supports them, and are otherwise optional/excluded.
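
One way an implementation might honor this is to publish only the leaves the drive actually reports. The Python sketch below assumes a hypothetical `readings` dictionary in which `None` marks an attribute the hardware does not support; the helper name and sample values are illustrative only.

```python
from typing import Optional


def supported_counters(readings: dict[str, Optional[int]]) -> dict[str, int]:
    """Drop leaves the hardware does not report.

    A value of 0 is a valid reading and is kept; only None (attribute
    not supported by this drive) is excluded from the result.
    """
    return {leaf: value for leaf, value in readings.items() if value is not None}


if __name__ == "__main__":
    # Hypothetical drive that supports 2 of the 4 error attributes.
    readings = {
        "soft-read-error-rate": None,
        "reallocated-sectors": 3,
        "end-to-end-error": None,
        "offline-uncorrectable-sectors-count": 0,
    }
    print(supported_counters(readings))
```

Keeping zero-valued leaves while omitting unsupported ones lets a consumer distinguish "no errors observed" from "this drive cannot report the attribute".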

dplore · 2024-09-12T16:43:52Z

This was reviewed in the OC Community meeting on Sep 12, 2024 without objection. Setting last-call to Sep 19 to allow a little more time for public review.

@s19nal do you have any comments? Can you approve?

s19nal

LGTM - all comments addressed

add mount point io-errors

63595a7

dplore requested a review from a team as a code owner April 9, 2024 04:55

OpenConfigBot added the non-breaking label Apr 9, 2024

dplore assigned s19nal Apr 9, 2024

earies reviewed Apr 11, 2024

release/models/system/openconfig-system.yang (outdated, resolved)

dplore changed the title ~~Add mount point io-errors~~ Add storage counters io-errors Apr 24, 2024

dplore added 4 commits April 24, 2024 15:19

Merge branch 'master' into dplore/mount-errors

24b71df

move to platform model

f5aa390

version bump

1337de8

Merge branch 'master' into dplore/mount-errors

c6ec232

earies reviewed Apr 25, 2024

release/models/platform/openconfig-platform.yang (outdated, resolved)

earies reviewed Apr 25, 2024

release/models/platform/openconfig-platform-storage.yang (3 outdated, resolved threads)

dplore added 7 commits May 7, 2024 16:53

merge master

86bfe14

fix rev

e3e8b9c

Update io-error descriptions for non-linux use case

aad728b

add counters container

f020914

fix typo

5d60346

bump ver

cf8abdd

add container description

617b9b5

dplore added 2 commits August 26, 2024 18:16

refactor to base off SMART data

120301a

bump date

b4bf29e

dplore changed the title ~~Add storage counters io-errors~~ Add storage counters related to errors Aug 27, 2024

dplore added 3 commits August 26, 2024 18:19

fix syntax

bf772bd

fix ver

9d4a199

fix rev dates

7a76e88

dplore added 2 commits August 26, 2024 18:33

fix ws

3bd2b44

Merge branch 'master' into dplore/mount-errors

bcf4155

dplore requested a review from earies September 5, 2024 22:18

k-iakhontov approved these changes Sep 6, 2024

dplore added the last-call label (PR that is in final review before merging) Sep 12, 2024

Merge branch 'master' into dplore/mount-errors

38d9757

dplore added 6 commits September 12, 2024 10:22

update copyright and add storage life leaves

910f9e1

trim ws

c269cfe

Merge branch 'master' into dplore/mount-errors

be5f664

remove unused oc-types

d257d04

version fix

59e6d68

fix ws

3aa62b3

s19nal approved these changes Sep 24, 2024

Merge branch 'master' into dplore/mount-errors

3983f2b

dplore merged commit 960cfd9 into master Sep 24, 2024
14 checks passed

dplore deleted the dplore/mount-errors branch September 24, 2024 19:14
