Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add storage counters related to errors #1091

Merged
merged 27 commits into from
Sep 24, 2024
Merged

Add storage counters related to errors #1091

merged 27 commits into from
Sep 24, 2024

Conversation

dplore
Copy link
Member

@dplore dplore commented Apr 9, 2024

Change Scope

  • Add /components/component/storage/state/counters/ as a container to represent storage device counters
  • Add a few select leaves into this container based on common/often available SMART telemetry which are focused on detecting storage device errors.
  • Since this is adding leafs, this change is backwards compatible

Operational use case

While rare, in a large population of devices storage errors have led to a device becoming unhealthy, unable to accept software updates or unable to boot due to non-volatile media (flash, ssd media) errors. This is a counter to be able to measure the accumulation of storage areas as an statistic for storage component health.

Note, /components/component/healthz/state/status is also a useful data point, but as a boolean only value, it is very coarse. Storage counters can be used to predict a storage device will fail in the future.

Tree View

module: openconfig-platform
  +--rw components
     +--rw component* [name]

[... snip ...]

         +--rw storage
         |  +--rw config
         |  +--ro state
+        |     +--ro oc-storage:counters
+        |        +--ro oc-storage:soft-read-error-rate?                  oc-yang:counter64
+        |        +--ro oc-storage:reallocated-sectors?                   oc-yang:counter64
+        |        +--ro oc-storage:end-to-end-error?                      oc-yang:counter64
+        |        +--ro oc-storage:offline-uncorrectable-sectors-count?   oc-yang:counter64
+        |        +--ro oc-storage:life-left?                             uint8
+        |        +--ro oc-storage:percentage-used?                       uint8

Platform Implementations

@dplore dplore requested a review from a team as a code owner April 9, 2024 04:55
@OpenConfigBot
Copy link

OpenConfigBot commented Apr 9, 2024

No major YANG version changes in commit 3983f2b

@sulrich
Copy link
Contributor

sulrich commented Apr 11, 2024

pyang tree output

   +--ro mount-points
     |  +--ro mount-point* [name]
     |     +--ro name     -> ../state/name
     |     +--ro state
     |        +--ro name?                string
     |        +--ro storage-component?   -> /oc-platform:components/component/name
     |        +--ro size?                uint64
     |        +--ro available?           uint64
     |        +--ro utilized?            uint64
     |        +--ro counters
     |           +--ro io-errors?   uint64

Copy link
Contributor

@earies earies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you elaborate precisely what something like this error counter would map to for an underlying OS (e.g. Linux) implementation?

This also is attempting to categorize per mount point vs. device (block)

release/models/system/openconfig-system.yang Outdated Show resolved Hide resolved
@dplore dplore changed the title Add mount point io-errors Add storage counters io-errors Apr 24, 2024
@dplore
Copy link
Member Author

dplore commented Apr 25, 2024

Addressed comments. This is now ready for review.

@dplore
Copy link
Member Author

dplore commented May 8, 2024

This was reviewed in Apr 9, 2024 OC Operators meeting without objection. Addressed latest comments and now placing this on last-call for merge on May 21, 2024

@LimeHat
Copy link

LimeHat commented May 8, 2024

  "This values increments when an I/O request completes with a
    failure.  This value corresponds to 'discard I/Os' on the linux
    kernel block layer statistics.

Is there a reference that can confirm that statement?

My understanding is that "discard i/o" is just another type of i/o operation, which is often used with SSD drives (see also fstrim man); and not an error.

@dplore
Copy link
Member Author

dplore commented May 8, 2024

  "This values increments when an I/O request completes with a
    failure.  This value corresponds to 'discard I/Os' on the linux
    kernel block layer statistics.

Is there a reference that can confirm that statement?

My understanding is that "discard i/o" is just another type of i/o operation, which is often used with SSD drives (see also fstrim man); and not an error.

Here's the reference I found: https://www.kernel.org/doc/Documentation/block/stat.txt

@LimeHat
Copy link

LimeHat commented May 8, 2024

I checked that link, yes, but there's no indication that discard i/o is related to errors in any way.

If anything, it confirms my understanding, since they describe the discard operations in the same way as read/write e.g.:

read sectors, write sectors, discard_sectors
============================================
These values count the number of sectors read from, written to, or
discarded from this block device.  The "sectors" in question are the
standard UNIX 512-byte sectors, not any device- or filesystem-specific
block size.  The counters are incremented when the I/O completes.

@LimeHat
Copy link

LimeHat commented May 8, 2024

Another ref: blkdiscard
https://man7.org/linux/man-pages/man8/blkdiscard.8.html

@earies
Copy link
Contributor

earies commented May 8, 2024

For the driving use-case (trying to understand when storage media is having issues), maybe its best to narrow this in via SMART data vs. what is exposed in sysfs

@dplore - maybe best to align w/ what precisely is being monitored in your compute environment for such case?

@LimeHat
Copy link

LimeHat commented May 31, 2024

I agree with the suggestion to use SMART.

@dplore
Copy link
Member Author

dplore commented Jun 5, 2024

For the driving use-case (trying to understand when storage media is having issues), maybe its best to narrow this in via SMART data vs. what is exposed in sysfs

@dplore - maybe best to align w/ what precisely is being monitored in your compute environment for such case?

Ironically the current description is aligned with at least one use case for what is monitored in one of our network environments. I do agree that SMART is a better data set to base this on and will refactor for that.

@dplore dplore changed the title Add storage counters io-errors Add storage counters related to errors Aug 27, 2024
@dplore
Copy link
Member Author

dplore commented Aug 27, 2024

Updated this PR to use a select few SMART counters. I appreciate any feedback on this approach.

Note, I used https://en.wikipedia.org/wiki/Self-Monitoring,_Analysis_and_Reporting_Technology#Known_ATA_S.M.A.R.T._attributes as a reference for picking counters which are related to storage failures. There are 10 attributes noted for predicting/measuring failures. I modeled 4 so far. If we like this approach, I could add all 10 or so subset based on feedback.

@dplore dplore requested a review from earies September 5, 2024 22:18
@earies
Copy link
Contributor

earies commented Sep 5, 2024

Overall I think your latest patch is a better approach but it should be noted that this is up to the drive manufacturer as far as implementation which is going to vary within platforms of the same vendor and across vendors.

For example, just grabbing one variant of SSD we ship today, 2 of the 4 attributes listed here are supported. This structure should be noted that these leaf nodes should be supported if the underlying hardware supports, otherwise optional/excluded.

@dplore dplore added the last-call PR that is in final review before merging. label Sep 12, 2024
@dplore
Copy link
Member Author

dplore commented Sep 12, 2024

This was reviewed in the OC Community meeting on Sep 12, 2024 without objection. Setting last-call to Sep 19 to allow a little more time for public review.

@s19nal do you have any comments? Can you approve?

Copy link

@s19nal s19nal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - all comments addressed

@dplore dplore merged commit 960cfd9 into master Sep 24, 2024
14 checks passed
@dplore dplore deleted the dplore/mount-errors branch September 24, 2024 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
last-call PR that is in final review before merging. non-breaking
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

7 participants