$Id: disk-health,v 1.6 2021/11/23 02:53:38 nanons Exp $

Monitoring disk health
======================

OpenBSD can take advantage of SMART capabilities available on many disk
drives to automatically check for early warning signs of failure.

"sd0" will be used as the example disk here, though the same steps can
be repeated to monitor multiple disks.  To list attached disks:

	$ dmesg | egrep "^[sw]d[0-9]+ "

Identifiers like "sd0" may change when disks are attached or detached.
Replace the identifier with the disk's DUID throughout the rest of this
guide so the commands keep referring to the same disk.  To show the
DUID of "sd0":

	# disklabel sd0 | grep duid
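
When several disks are monitored, their DUIDs can be collected in one
pass.  The sketch below assumes the hw.disknames sysctl(8) value has
the form "sd0:3eb7f9da875cb9ee,cd0:,sd1:..."; "list_duids" is a
hypothetical helper name, not part of the base system.

```shell
# Print "name duid" for each entry of a hw.disknames-style list.
# Entries without a DUID (e.g. "cd0:") are skipped.
list_duids() {
	echo "$1" | tr ',' '\n' | while IFS=: read -r name duid; do
		if [ -n "$duid" ]; then
			echo "$name $duid"
		fi
	done
}
```

It could be called like this:

	$ list_duids "$(sysctl -n hw.disknames)"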

First, SMART must be enabled:

	# atactl sd0 smartenable

If this returns an error, the disk most likely doesn't support SMART
health monitoring.

Automatic health checks
=======================

To check the disk for early signs of failure every hour, add the
following line to the root user's crontab(5) (/var/cron/tabs/root),
for example by running "crontab -e" as root:

	0 * * * * /sbin/atactl sd0 smartstatus > /dev/null

Only standard output is discarded, so if the disk exceeds one of the
SMART thresholds, cron(8) will mail the "SMART threshold exceeded!"
error message to the root user.  This warning is a strong indication
of imminent disk failure.  Checking every hour is worthwhile because
in some cases the warning appears less than 24 hours before the disk
fails.
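
To check several disks from a single cron(8) job, the command can be
wrapped in a small function.  This is only a sketch: "smart_check_all"
and the ATACTL variable are hypothetical names, and it assumes that
atactl(8) exits with a non-zero status when a threshold is exceeded or
the command fails.

```shell
# ATACTL is a hypothetical override so the function can be tried
# without real disks; by default the real atactl(8) is used.
: "${ATACTL:=/sbin/atactl}"

# Run "smartstatus" on every disk given as an argument.  Normal output
# is discarded; anything written to standard error passes through so
# cron(8) can mail it.  Returns 1 if any check failed.
smart_check_all() {
	rc=0
	for disk in "$@"; do
		if ! "$ATACTL" "$disk" smartstatus > /dev/null; then
			echo "$disk: SMART check failed" >&2
			rc=1
		fi
	done
	return "$rc"
}
```

Saved in a script that ends with 'smart_check_all "$@"' (the path
below is hypothetical), the crontab(5) entry becomes:

	0 * * * * /usr/local/sbin/smart-check sd0 sd1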

For a more convenient way to be notified when this happens, see the
comment header in the etc/daily.local file from this repository.

To list SMART attributes and thresholds:

	# atactl sd0 readattr

Self-tests
==========

Disks can test their own electrical, mechanical and read performance
on demand.  Self-testing is required to update the disk's
performance-related SMART attributes and to catch more early warning
signs of failure.  This can be automated with cron(8) as above or by
making the disk self-test automatically; see the section below for the
latter.

a) To do a short self-test (usually under 10 minutes):

	# atactl sd0 smartoffline shortoffline

b) To do an extended self-test (tens of minutes to several hours):

	# atactl sd0 smartoffline extenoffline

To show the test results after the self-tests are finished:

	# atactl sd0 smartreadlog selftest

To abort unfinished self-tests:

	# atactl sd0 smartoffline abort
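
As a sketch of the cron(8) approach, the following crontab(5) entries
for the root user start a short self-test every night and an extended
one every Sunday.  The schedule is an assumption chosen for
illustration; pick times when the disk is otherwise idle.

```
# short self-test daily at 02:00, extended self-test Sundays at 03:00
0 2 * * * /sbin/atactl sd0 smartoffline shortoffline > /dev/null
0 3 * * 0 /sbin/atactl sd0 smartoffline extenoffline > /dev/null
```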

Fine-grained monitoring
=======================

The smartctl utility offers finer control over SMART.  To install it:

	# pkg_add smartmontools

To instruct the disk to automatically self-test every 4 hours (may not
be supported by all disks):

	# smartctl -o on /dev/sd0c

To show SMART information, including self-test progress:

	# smartctl -c /dev/sd0c

To list SMART attributes and thresholds (this may have more attributes
than atactl(8)'s "readattr"):

	# smartctl -A /dev/sd0c
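
smartctl also reports problems through its exit status, which
smartctl(8) documents as a bit mask.  The sketch below decodes that
mask into readable messages; "decode_smartctl_status" is a hypothetical
helper name, and the message wording paraphrases the man page.

```shell
# Print one line per bit set in a smartctl exit status, or "ok" if the
# status is 0.  Bit meanings follow the smartctl(8) man page.
decode_smartctl_status() {
	status=$1
	if [ "$status" -eq 0 ]; then
		echo "ok"
		return 0
	fi
	bit=1
	for msg in \
	    "command line did not parse" \
	    "device open failed" \
	    "a SMART or ATA command failed" \
	    "SMART status check returned DISK FAILING" \
	    "prefail attributes at or below threshold" \
	    "attributes were at or below threshold in the past" \
	    "device error log contains errors" \
	    "self-test log contains errors"
	do
		if [ $((status & bit)) -ne 0 ]; then
			echo "$msg"
		fi
		bit=$((bit * 2))
	done
	return 0
}
```

For example, after a health check with the -H flag:

	# smartctl -H /dev/sd0c > /dev/null; decode_smartctl_status $?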

See the smartctl(8) and atactl(8) man pages for more commands.
