Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: SMART support #19103

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

tomasmatus
Copy link
Member

@tomasmatus tomasmatus commented Jul 17, 2023

@tomasmatus
Copy link
Member Author

image

For now I added a new card to the disk detail which shows basic information and allows to run/abort self test.

Note that for now only (S)ATA devices are supported by Udisks, NVMe support will be added in the latest release https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.0

@garrett I believe you wanted to work on mockups for this, any chance you could look into proper design?

Initial SMART self test support
Comment on lines +82 to +84
<StackItem>
<SmartDetails smartInfo={drive_ata} />
</StackItem>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 added lines are not executed by any test. Details

Comment on lines +42 to +44
const SmartActions = ({ smartInfo }) => {
const [isKebabOpen, setKebabOpen] = useState(false);
const smartSelftestStatus = smartInfo.SmartSelftestStatus;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 3 added lines are not executed by any test. Details

Comment on lines +46 to +47
const runSmartTest = async (type) => {
await smartInfo.SmartSelftestStart(type, {});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 added lines are not executed by any test. Details

Comment on lines +50 to +51
const abortSmartTest = async () => {
await smartInfo.SmartSelftestAbort({});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 2 added lines are not executed by any test. Details

Comment on lines +54 to +69
const actions = [
<DropdownItem key="run-short-test"
isDisabled={smartSelftestStatus === "inprogress"}
onClick={() => { setKebabOpen(false); runSmartTest('short') }}>
{_("Run short test")}
</DropdownItem>,
<DropdownItem key="run-extended-test"
isDisabled={smartSelftestStatus === "inprogress"}
onClick={() => { setKebabOpen(false); runSmartTest('extended') }}>
{_("Run extended test")}
</DropdownItem>,
<DropdownItem key="run-conveyance-test"
isDisabled={smartSelftestStatus === "inprogress"}
onClick={() => { setKebabOpen(false); runSmartTest('conveyance') }}>
{_("Run conveyance test")}
</DropdownItem>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 16 added lines are not executed by any test. Details

Comment on lines +72 to +77
if (smartInfo.SmartSelftestStatus === "inprogress") {
actions.push(
<DropdownItem key="abort-smart-test"
onClick={() => { setKebabOpen(false); abortSmartTest('conveyance') }}>
{_("Abort test")}
</DropdownItem>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 6 added lines are not executed by any test. Details

Comment on lines +81 to +87
return (
<Dropdown toggle={<KebabToggle onToggle={(_, isOpen) => setKebabOpen(isOpen)} />}
isPlain
isOpen={isKebabOpen}
position="right"
id="smart-actions"
dropdownItems={actions} />
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 7 added lines are not executed by any test. Details

Comment on lines +91 to +94
export const SmartDetails = ({ smartInfo }) => {
const SmartDetailRow = ({ title, value }) => {
if (value === undefined)
return null;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 4 added lines are not executed by any test. Details

Comment on lines +96 to +100
return (
<DescriptionListGroup>
<DescriptionListTerm>{title}</DescriptionListTerm>
<DescriptionListDescription>{value}</DescriptionListDescription>
</DescriptionListGroup>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 5 added lines are not executed by any test. Details

Comment on lines +104 to +118
return (
<Card>
<CardHeader actions={{ actions: <SmartActions smartInfo={smartInfo} /> }}>
<CardTitle component="h2">{_("S.M.A.R.T")}</CardTitle>
</CardHeader>
<CardBody>
<DescriptionList isHorizontal horizontalTermWidthModifier={{ default: '20ch' }}>
<SmartDetailRow title={_("Power on hours")} value={cockpit.format(_("$0 hours"), Math.round(smartInfo.SmartPowerOnSeconds / 3600))} />
<SmartDetailRow title={_("Last updated")} value={timeformat.dateTime(new Date(smartInfo.SmartUpdated * 1000))} />
<SmartDetailRow title={_("Smart selftest status")} value={selftestStatusDescription[smartInfo.SmartSelftestStatus]} />
<SmartDetailRow title={_("Number of bad sectors")} value={smartInfo.SmartNumBadSectors} />
<SmartDetailRow title={_("Atributes failing")} value={smartInfo.SmartNumAttributesFailing} />
</DescriptionList>
</CardBody>
</Card>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These 15 added lines are not executed by any test. Details

Copy link
Member

@jelly jelly left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Creating tests for this would be great, maybe udisks already has a way to mock it?

/*
* This file is part of Cockpit.
*
* Copyright (C) 2017 Red Hat, Inc.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a new file, should be 2023.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, we live in ${current_year}

return (
<Card>
<CardHeader actions={{ actions: <SmartActions smartInfo={smartInfo} /> }}>
<CardTitle component="h2">{_("S.M.A.R.T")}</CardTitle>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's really something one can translate?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm force of habit

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in non-latin languages? Like chinese or russian or so?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, if in doubt, keep it translatable. I can't say I like the many dots, this looks a bit like a 90's superhero comic 😁 So please either "SMART", or at least make it consistent and add a period to the T as well.

@martinpitt martinpitt marked this pull request as draft July 24, 2023 08:55
@martinpitt
Copy link
Member

Creating tests for this would be great, maybe udisks already has a way to mock it?

Yes, it does. SmartUpdate() has an atasmart_blob attribute where you can load a previously skdump --save'd blob. Of course you can also edit this in between, to have a few scenarios (passing and failing).

@tomasmatus
Copy link
Member Author

tomasmatus commented Jul 24, 2023

I am not sure how to test this. In order to run SmartUpdate udisks needs to expose org.freedesktop.UDisks2.Drive.Ata which is only available when the drive is (S)ATA and is likely to support SMART.

@jelly
Copy link
Member

jelly commented Jul 24, 2023

I am not sure how to test this. In order to run SmartUpdate udisks needs to expose org.freedesktop.UDisks2.Drive.Ata which is only available when the drive is (S)ATA and is likely to support SMART.

Neither does udisks test it (only if they find SMART disks), maybe because of the same reason?

Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu
Hmm :)

-drive id=disk,file=IMAGE.img,if=none \
-device ahci,id=ahci \
-device ide-hd,drive=disk,bus=ahci.0

So my suggestion would be to yolo hack Virtmachine.add_disk in (bots), make sure your cockpit checkout uses the changed bots and try to get TEST_DISK_XML to use AHCI. If that even supports SMART faking, if not we can't do this I suppose :(.

@jelly
Copy link
Member

jelly commented Nov 17, 2023

FYI udiskctl has a smart simulate options:

[root@archlinux ~]# udisksctl --help
Unknown command `--help'
Usage:
  udisksctl COMMAND

Commands:
  help            Shows this information
  info            Shows information about an object
  dump            Shows information about all objects
  status          Shows high-level status
  monitor         Monitor changes to objects
  mount           Mount a filesystem
  unmount         Unmount a filesystem
  unlock          Unlock an encrypted device
  lock            Lock an encrypted device
  loop-setup      Set-up a loop device
  loop-delete     Delete a loop device
  power-off       Safely power off a drive
  smart-simulate  Set SMART data for a drive

@martinpitt
Copy link
Member

@tomasmatus Are you still interested in working on this? It's quite a nice feature.

@tomasmatus
Copy link
Member Author

Wow its been a year already... yes, this is something i should revive @martinpitt

@jelly
Copy link
Member

jelly commented Jul 5, 2024

If i recall correctly the issue was tests, so trying to experiment with udisksctl smart-simulate sounds like a good idea to move this forward.

@garrett
Copy link
Member

garrett commented Jul 8, 2024

So, I think having SMART info is good. I remember I suggested it a long time ago in some issue and/or mockup. I don't remember where or what though. Where would this go?

It's usually stylized as SMART, not S.M.A.R.T. (which is just awkward). At least in English, acronyms and initialisms usually do not include periods. https://en.wikipedia.org/wiki/Acronym — and acronyms that are proper names (including SMART) or commonly used are generally not supposed to even have periods... and they're usually (but not always) all in uppercase. It's kind of weird with all kinds of special-casing, however; Wikipedia's acronym page actually has a "nomenclature" section with some examples: https://en.wikipedia.org/wiki/Acronym#Nomenclature — and it even covers pronunciation a bit (when some are pronounced and some are spelled out... and even some that are a mixture, like "JPEG" and "MS-DOS").

As far as being detailed; I think we should show an overview and use a "progressive disclosure" concept. (That is, show the important, common stuff first and then click somewhere to see everything.)

Since it looks like you're picking this up, it would be great to talk about the design of it. 👍

I know we had the concept of SMART errors propagating all the way up to the health card on the overview too (which would then link to the details).

@SchoolGuy
Copy link

I am interested in this feature. Is there anything that can be done to progress the PR?

@garrett
Copy link
Member

garrett commented Jul 18, 2024

@SchoolGuy: A recent update related to this PR: @tomasmatus is back and is planning on picking this up (as you can see in an above comment). I will be working with him on the design. If you'd like to help out, that'd be fine as well.

In what ways are you interested in SMART? What data are you looking for, specifically? And when do you consult SMART data, usually? (From time to time or only when there are strange things going on as a diagnosis tool?)

@SchoolGuy
Copy link

@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...


I am interested in SMART in two ways:

  1. The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.
  2. The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.

The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.

I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.

I am a quite paranoid person (regarding my infrastructure) and as such I am checking my monitoring quite regularly but it is not a fixed schedule I have.

@hobbes1069
Copy link

I'm not a developer but I'm a Fedora packager and would be very interested in seeing this in Fedora/RHEL.

@garrett
Copy link
Member

garrett commented Jul 22, 2024

I can try to find some time to get back around to this to make some mockups. But the gist is that we need it in two places:

  1. Detailed view with just a few of the most important items shown by default, with a way to expand or view more to get to all of the stats.
  2. Current status in the health card on the overview page.

FWIW, here are the design ideas for disk issues (from a long time ago): #8787 (comment)

SMART is mentioned, with regard to showing it in the health card:

  • warning: SMART errors that don't stop functioning of disk
  • error: SMART found bad sectors (disk is dying; data loss may have happened

We probably want to have some kind of status API where the overview page can query the storage page to get the SMART info and show it on the health card if there is a problem (and only if there is a problem), similar to the updates being on the health card when there are updates. (The software updates page is queried for this.)

@SchoolGuy
Copy link

Perfect. This is already a great amount of detail.

@tomasmatus how to you want to organize so we don't run into conflicts working on the same files?

@jelly
Copy link
Member

jelly commented Jul 22, 2024

@garrett I am no designer by trade as such I am not sure on what part of the design I may be of assistance. I should be able to dig into the code and help out with implementing an idea that somebody else had. If that is not available I may be able to design things on my own but they may be hard to use...

I am interested in SMART in two ways:

1. The binary quick one: Is the drive currently "failed" due to one of the measurements failing a threshold or is everything fine.

That fails into the scope of Cockpit!

2. The longer one: Use the WebUI as a replacement for the CLI, aka look at the individual output of the metrics and decide for myself if I want to replace a drive due to me interpreting the values myself.

What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.

The idea for this specific project is to have Cockpit running as a WebUI for a single host with an mdadm array running and that resulting device is exposed as an NFS share that other hosts will use. In short: I want to use the host as a NAS.

I will grab the SMART values additionally with the smartd exporter and will let Prometheus grab that but since that will be hosted by one of the other hosts I want to be able to know if a failure there is related to the mdadm array being down or by a configuration error.

Cockpit will only support showing SMART values for a single host, the host you connect too. We don't want to support anything outside of the UDisks API

@garrett
Copy link
Member

garrett commented Jul 23, 2024

For reference, udisks reports only this information from SMART:

$ udisksctl dump | grep -i smart
    SmartCriticalWarning:               
    SmartPowerOnHours:                  71
    SmartSelftestPercentRemaining:      -1
    SmartSelftestStatus:                success
    SmartTemperature:                   319
    SmartUpdated:                       1721742479

It's not much, but it's probably enough in most cases? If this is all, then we can't do the expand idea, unless we get more information, like from smartctl.

@hobbes1069
Copy link

Is it worth looking to see how gnome-disk-utility looks up the extra data?
https://gitlab.gnome.org/GNOME/gnome-disk-utility

@jelly
Copy link
Member

jelly commented Jul 24, 2024

Is it worth looking to see how gnome-disk-utility looks up the extra data? https://gitlab.gnome.org/GNOME/gnome-disk-utility

See my comment:

What metrics are you interested in? UDisks provides some basic health metrics as seen in the screenshot on top. For other attributes we have to use SmartGetAttributes, so if you can give us a list of interesting attributes and maybe a busctl command with some data, that would be interesting and helpful.

@jelly
Copy link
Member

jelly commented Jul 26, 2024

I've looked into testing this in a test virtual machine, the main issue is that even if I add a sata disk, udisks does not seem to identify this as a ATA drive:

On my nas

/org/freedesktop/UDisks2/drives/$ID:
  org.freedesktop.UDisks2.Drive:
    CanPowerOff:                true
  <snip>
  org.freedesktop.UDisks2.Drive.Ata:
    AamEnabled:                                 false

Attaching a SATA disk does allow me to use smartctl on it.
image

@hobbes1069
Copy link

I like the data that is presented here, including the ability to start a SMART test:
image

@garrett
Copy link
Member

garrett commented Jul 29, 2024

the data that is presented here

Right, that's the extended data that we're definitely not going to show by default. It's way too verbose and misleading.

We could, in a third stage, add all the details similar to that, possibly in a modal dialog similar to that. But, we should handle SMART in these steps:

  1. Basic details
    • filtered list of what udisks gives us by default (mentioned above)... filtered to show only relevant information (for example: don't show progress of scan unless a scan is happening; don't show errors unless there are errors, etc.)
    • the basic list udisks gives us by default seems to be reasonable
    • this would likely be a key + value pair in a description list with a heading
  2. Overview integration
    • only show with warnings and errors, with a link back to the basic details
  3. A way to get detailed information
    • a "progressive disclosure" concept, where we show the important first (already implemented in step 1) and a way to see more information
    • examples: this could be something like an expander widget at the bottom of the list or a modal (we'll need to figure this out still, but it can be deferred until after step 1 and 2 land)

This first PR should implement step # 1. Then we need to do a follow-up for step # 2 (overview). Then eventually add a step # 3 for more details.

We should not do this all in one go, nor should we just show all the data points SMART can give us. (Many are very misleading, scary, or not relevant depending on the circumstance. And it varies per manufacturer and storage medium (spinning disks of different types, SSD, NVMe, etc.). It's a mess, really.)

(I think @tomasmatus and I will talk about this more either later this week or next week.)

@jelly
Copy link
Member

jelly commented Jul 29, 2024

For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801

@tbzatek
Copy link

tbzatek commented Oct 3, 2024

For testing SMART in CI, there is a udisks issue now https://issues.redhat.com/browse/STORAGECFG-801

Please check this release out: https://github.com/storaged-project/udisks/releases/tag/udisks-2.10.90

Random googling https://stackoverflow.com/questions/48351096/how-to-emulate-a-sata-disk-drive-in-qemu Hmm :)

-drive id=disk,file=IMAGE.img,if=none \
-device ahci,id=ahci \
-device ide-hd,drive=disk,bus=ahci.0

This works now in UDisks, however it's still an ide-hd device emulation and even though it's connected through an emulated AHCI controller, it still doesn't talk the SAT (SCSI / ATA Translation) protocol. This was the culprit and to some degree it still is as UDisks only supports SAT and nothing else (looking for an expert to make this happen!). Similarly, udev's ata_id also supports SAT only but has a bug in the code that makes this work.

However I've found this hidden workaround in libatasmart to make this work with qemu devices: https://github.com/storaged-project/libblockdev/blob/master/src/plugins/smart/smart-common.c#L131-L167

Let me know if it helps your testing use case.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants