Disk Health Monitoring with smartctl and smartd

Introduction

smartctl is a command-line utility from the smartmontools package for working with S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology), the self-diagnostics built into modern storage drives.

Use cases:

servers of hosting companies;
VPS/VDS nodes;
dedicated servers;
file servers;
backup systems;
corporate infrastructures.

What tasks it solves:

early detection of disk degradation;
prediction of drive failures;
reduction of data loss risk;
automation of HDD, SSD, NVMe health monitoring;
analysis of physical problems in the disk subsystem.

Info

Proactive SMART monitoring can reveal a problem before the disk actually fails, which is critical in production environments.

Requirements and prerequisites

Supported OS and versions:

Linux:

Debian 10+
Ubuntu 18.04+
RHEL / AlmaLinux / Rocky Linux 8+
CentOS 7 (supported, but outdated)
FreeBSD 12+

Windows (via smartmontools, limited use)

Examples below are provided for Linux.

Required software and packages:

smartmontools package
access to /dev/sdX, /dev/nvmeX
installed systemd (for smartd)

Access rights:

root privileges required
or access via sudo

Preliminary checks

Viewing the list of disks in the system:

lsblk -d -o NAME,MODEL
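
When the same check has to run on every drive, it helps to reduce the lsblk listing to device paths. A minimal sketch, assuming the NAME and TYPE columns of lsblk; the function name disk_paths is illustrative and not part of any package:

```shell
# Turn `lsblk -d -n -o NAME,TYPE` output (read on stdin) into /dev paths
# for whole disks, skipping loop devices, optical drives, and similar entries.
disk_paths() {
    awk '$2 == "disk" { print "/dev/" $1 }'
}

# Usage:
#   lsblk -d -n -o NAME,TYPE | disk_paths
```

The resulting paths can then be passed one by one to the smartctl commands shown later in this manual.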

Overview and basic concepts

Key terms

SMART – built-in disk self-diagnosis system
Attributes – per-drive health counters (Reallocated_Sector_Ct, Current_Pending_Sector, etc.)
Self-test – built-in drive tests
smartctl – CLI utility for SMART management
smartd – automatic monitoring daemon

How it works

The disk independently collects statistics
smartctl reads this data
smartd analyzes threshold values and events
Notifications are sent when problems occur

Workflow logic

Disk > SMART attributes > smartctl > smartd > log / email / monitoring

Basic setup and use of smartctl

Installing the smartmontools utility

Working with SMART requires the smartmontools package, which includes the smartctl utility (manual work with disks) and the smartd daemon (background monitoring).

Installation is performed using the standard package manager tools of the distribution.

Debian / Ubuntu

sudo apt update

sudo apt install smartmontools

RHEL / AlmaLinux / Rocky Linux

sudo dnf install smartmontools

After installation, the utilities become available in the system and are ready for use without additional initialization.

Checking the version and availability of smartctl

As a first step, it is recommended to verify that the utility is correctly installed and available in the system:

smartctl --version

The command outputs the package version and a list of supported technologies.

This allows you to:

ensure that an up-to-date version is being used;
check for support of NVMe, RAID, and other device types.

Checking SMART support on a specific disk

Next, it is necessary to check whether the drive itself supports SMART technology and whether it is enabled at the device level.

sudo smartctl -i /dev/sda

Example of correct output:

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

Available – the disk physically supports SMART;
Enabled – SMART data collection is turned on and available for reading.

If SMART is supported but disabled, this is often seen on new or previously unused disks. In such a case, it needs to be enabled manually:

sudo smartctl -s on /dev/sda

After this, it is recommended to re-run the smartctl -i command to ensure SMART is activated.
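
For scripted checks, the two "SMART support is:" lines can be reduced to a single word. This is a sketch that assumes the output format shown above; smart_state is a hypothetical helper name. NVMe devices do not print these lines, so they are reported as unknown.

```shell
# Classify SMART support from `smartctl -i` output read on stdin.
# Prints one of: enabled, disabled, unavailable, unknown.
smart_state() {
    awk '
        /^SMART support is:[ \t]*Enabled/     { state = "enabled" }
        /^SMART support is:[ \t]*Disabled/    { state = "disabled" }
        /^SMART support is:[ \t]*Unavailable/ { state = "unavailable" }
        END { print (state == "" ? "unknown" : state) }
    '
}

# Usage (requires root):
#   sudo smartctl -i /dev/sda | smart_state
```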

Viewing SMART attributes and initial health assessment

The main practical value of SMART lies in its attributes – numerical values reflecting the condition of the disk's surface, mechanics, and electronics.

To view the attributes, use the command:

sudo smartctl -A /dev/sda

The output is a table with one row per attribute, showing its current, worst, and threshold values.

Key fields in the output:

VALUE (Current Value): The normalized value of the attribute. Higher is better; most vendors start at 100 or 200 and count down as the attribute degrades. The attribute is considered failed if VALUE ≤ THRESH.
WORST (Worst Value): The lowest normalized value recorded during the disk's operation.
THRESH (Threshold): The minimum allowable value for VALUE, set by the vendor. VALUE falling to or below THRESH is a sign of a critical condition.
RAW_VALUE: The "raw", non-normalized value of the attribute. This is what needs to be analyzed to assess wear and count events.
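
The VALUE ≤ THRESH rule can be checked mechanically. The sketch below assumes the usual ten-column ATA attribute table (ID, name, flag, VALUE, WORST, THRESH, type, updated, when failed, raw value) and skips attributes whose THRESH is 0, since those have no failure threshold. The name failing_attrs is illustrative.

```shell
# List attributes whose normalized VALUE has reached THRESH.
# Reads `smartctl -A` output on stdin; column 4 is VALUE, column 6 is THRESH.
failing_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 && ($6 + 0) > 0 && ($4 + 0) <= ($6 + 0) {
        print $2, "VALUE=" $4, "THRESH=" $6
    }'
}

# Usage (requires root):
#   sudo smartctl -A /dev/sda | failing_attrs
```

Empty output means no attribute has crossed its threshold. It does not mean the disk is healthy, because raw counters can grow long before VALUE moves.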

Key attributes for HDD (traditional hard drives):

Reallocated_Sector_Ct: An increase indicates physical surface degradation.
Current_Pending_Sector (Sectors pending reallocation): Unstable sectors. Even a single non-zero value is a warning sign.
Offline_Uncorrectable (Uncorrectable errors): Sectors that could not be read.
Power_On_Hours: The total operating time of the disk.
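
For the three sector-related attributes, the raw value matters more than the normalized one. A small sketch, again assuming the ten-column attribute table; hdd_warnings is a hypothetical helper that prints a line for each of these counters that is non-zero:

```shell
# Warn about non-zero raw values of the key HDD sector attributes.
# Reads `smartctl -A` output on stdin; column 10 is RAW_VALUE.
hdd_warnings() {
    awk '
        $2 == "Reallocated_Sector_Ct" ||
        $2 == "Current_Pending_Sector" ||
        $2 == "Offline_Uncorrectable" {
            if (($10 + 0) > 0) print "WARNING", $2, "raw=" $10
        }
    '
}

# Usage (requires root):
#   sudo smartctl -A /dev/sda | hdd_warnings
```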

Key attributes for SSDs:

Retired_Block_Count: The equivalent of Reallocated_Sector_Ct for HDDs. Shows the number of blocks taken out of service. Even a low value with VALUE=100 can be normal.
Reallocated_Event_Count: The number of reallocation events.
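SSD_Life_Left, Media Wearout Indicator, or Percentage Used: Remaining life or consumed wear, depending on the vendor. For remaining-life attributes, a low value (e.g., <10%) is a sign of imminent failure; for Percentage Used, the warning sign is a value approaching 100%.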
Wear_Range_Delta: An indicator of the evenness of wear across memory cells.
Power_On_Hours_and_Msec: The total operating time.
Lifetime_Writes_GiB / Lifetime_Reads_GiB (Attributes 241, 242): The total volume of data written/read.

Key attributes for NVMe (via smartctl -a /dev/nvme0):

Percentage Used: The percentage of the write endurance consumed. The primary indicator of wear.
Media and Data Integrity Errors: Data integrity errors.
Critical Warning: Critical warning flags.
Temperature: Current temperature.
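
NVMe health data is printed as "Name: value" lines rather than as an attribute table. The sketch below pulls out three of the fields listed above, assuming that layout; nvme_summary and the key names it prints are illustrative.

```shell
# Extract key NVMe health fields from `smartctl -a /dev/nvme0` output on stdin.
# Splits each line at the first colon and the spaces that follow it.
nvme_summary() {
    awk -F': *' '
        $1 == "Critical Warning"                { print "critical_warning=" $2 }
        $1 == "Percentage Used"                 { sub(/%$/, "", $2); print "percentage_used=" $2 }
        $1 == "Media and Data Integrity Errors" { print "media_errors=" $2 }
    '
}

# Usage (requires root):
#   sudo smartctl -a /dev/nvme0 | nvme_summary
```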

At this stage, the administrator gains a general understanding of the disk's condition and can identify obvious signs of problems.

Advanced configuration and practical scenarios

SMART supports built-in self-tests, which are performed by the drive itself without operating system involvement.

The short test is designed for a quick check of key components:

sudo smartctl -t short /dev/sda

The long test performs a full surface scan and takes significantly more time:

sudo smartctl -t long /dev/sda

After the test is complete, the results need to be checked:

sudo smartctl -l selftest /dev/sda

The output indicates:

the type of test;
the completion status;
the presence or absence of errors.
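
In the ATA self-test log, the most recent test is entry "# 1". A minimal sketch of a pass/fail check on that entry, assuming the status text "Completed without error" for a successful run; last_selftest_ok is a hypothetical name, and any other status (or an empty log) is treated as not OK:

```shell
# Exit 0 if the newest self-test entry (# 1) completed without error, 1 otherwise.
# Reads `smartctl -l selftest` output on stdin.
last_selftest_ok() {
    awk '
        /^# *1 / { found = 1; ok = ($0 ~ /Completed without error/) }
        END { exit ((found && ok) ? 0 : 1) }
    '
}

# Usage (requires root):
#   sudo smartctl -l selftest /dev/sda | last_selftest_ok || echo "last self-test did not pass"
```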

An unsuccessful test is a direct reason to prepare for disk replacement.

Working with RAID controllers

Hardware RAID controllers often hide SMART data from the system. In such cases, the device type must be explicitly specified.

Example for an LSI controller:

sudo smartctl -a -d megaraid,0 /dev/sda

Where:

-a – option that outputs all available SMART information (attributes, logs, errors, overall health assessment).
-d – option that specifies the device type.
megaraid – tells smartctl that the disk is behind an LSI/Broadcom MegaRAID controller (commonly used in servers).
0 – the physical disk number (PD, Physical Drive) in the RAID array. This is not sda, but a unique ID assigned by the controller. It can be found using the controller management utility (e.g., storcli or MegaCLI).
/dev/sda – in this context, this is not the real disk, but a pseudo-device representing the RAID controller itself in the system. Typically, this is /dev/sgX (SCSI Generic) or simply /dev/sda if the controller has created a virtual disk.

Typical error when the device type is not specified:

SMART support is: Unavailable

This does not necessarily mean the drives lack SMART; smartctl is talking to the controller's virtual disk and cannot determine the path to a physical drive on its own. The solution is to specify the correct -d parameter.
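
With several drives behind one controller, the command has to be repeated per drive ID. The sketch below only prints the commands for a contiguous ID range so they can be reviewed first. megaraid_cmds is a hypothetical helper, and real IDs may not be contiguous, so they should be confirmed with the controller utility.

```shell
# Print one smartctl command per physical drive ID behind a MegaRAID controller.
# Arguments: device node, first ID, last ID (IDs as reported by storcli or MegaCLI).
megaraid_cmds() {
    dev=$1
    id=$2
    last=$3
    while [ "$id" -le "$last" ]; do
        printf 'smartctl -a -d megaraid,%s %s\n' "$id" "$dev"
        id=$((id + 1))
    done
}

# Usage:
#   megaraid_cmds /dev/sda 0 3
```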

Diagnostics and troubleshooting

Signs of possible malfunctions:

increase in the Reallocated_Sector_Ct value;
non-zero Current_Pending_Sector;
program/erase errors (Program_Fail_Count, Erase_Fail_Count);
self-test errors;
increase in I/O latency;
error messages in system logs.

Log analysis:

journalctl -u smartd

dmesg | grep -i error

Explanation:
Current_Pending_Sector > 0 – high risk of failure;
Reallocated_Sector_Ct is increasing – progressive degradation;
Self-test FAILED – the disk must be replaced.
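
The smartd journal can be long, so it is convenient to filter it for the messages that correspond to the conditions above. This is a sketch: the patterns are based on typical smartd log wording, which may differ between versions, and smartd_alerts is an illustrative name.

```shell
# Keep only smartd log lines that report sector problems or a failed health check.
# Reads log text on stdin; matching is case-insensitive.
smartd_alerts() {
    grep -E -i -e 'pending\) sectors|uncorrectable sectors|FAILED SMART self-check'
}

# Usage:
#   journalctl -u smartd --no-pager | smartd_alerts
```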

Identifying sources of problems

To rule out false positives, it is important to correlate SMART data with actual load.

iostat -x 1

iotop

Checking where the disk is mounted:

lsblk -o NAME,SERIAL,MOUNTPOINT

Identifying controllers:

lspci | grep -i raid

Additional metrics:

temperature above 50 °C;
increase in CRC errors;
unstable SMART values.
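
The temperature criterion can be checked from the attribute table as well. A minimal sketch, assuming the drive reports attribute Temperature_Celsius (some models use a different attribute name) and that the raw column starts with the current temperature; temp_over is a hypothetical helper:

```shell
# Report Temperature_Celsius when its raw value exceeds a limit in degrees Celsius.
# Reads `smartctl -A` output on stdin. The raw value may look like "36 (Min/Max 20/48)",
# so only the first number of the raw column (field 10) is used.
temp_over() {
    awk -v limit="$1" '
        $2 == "Temperature_Celsius" && ($10 + 0) > (limit + 0) {
            print "Temperature " ($10 + 0) " C exceeds " limit " C"
        }
    '
}

# Usage (requires root):
#   sudo smartctl -A /dev/sda | temp_over 50
```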

Configuring administrator notifications when SMART metrics approach threshold values

SMART data alone does not make the infrastructure safe. The key element of monitoring is notifying the administrator in time, when the disk's condition has begun to deteriorate but failure has not yet occurred.

The notification mechanism allows you to:

detect drive degradation at an early stage;
plan disk replacement in advance;
avoid emergency downtime and data loss;
operate within scheduled maintenance windows.

In smartmontools, the smartd daemon is responsible for sending notifications. It automatically tracks changes in SMART attributes and responds to deviations from the norm.

Operating principle of smartd notifications

The smartd daemon functions as a background service and performs the following tasks:

Periodically polls disk SMART attributes.
Compares current values with: factory thresholds, previous values (change dynamics).
Detects: growth of critical attributes, appearance of new errors, self-test failures.
Generates a notification and sends it to the administrator.
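
The idea of comparing a value with its previous reading can be illustrated with a few lines of shell. This is not how smartd is implemented; it is only a sketch of the principle. The function check_growth and the state directory it writes to are hypothetical.

```shell
# Compare a new raw value with the one stored at the previous poll.
# Arguments: attribute name, current raw value, state directory.
# Prints a message and returns 1 when the value has grown; always stores the new value.
check_growth() {
    name=$1
    current=$2
    file="$3/$name"
    previous=0
    if [ -f "$file" ]; then
        previous=$(cat "$file")
    fi
    printf '%s\n' "$current" > "$file"
    if [ "$current" -gt "$previous" ]; then
        printf '%s grew from %s to %s\n' "$name" "$previous" "$current"
        return 1
    fi
    return 0
}

# Usage:
#   check_growth Reallocated_Sector_Ct 4 /var/tmp/smart-state
```

Because a missing state file is treated as a previous value of 0, the first non-zero reading is reported too, which matches the "first appearance" case.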

Requirements for notifications to work

Before configuration, it is necessary to ensure the following:

an MTA (Postfix, Exim, Sendmail, ssmtp) is installed and correctly configured in the system;
the server is capable of sending outgoing mail;
the administrator's email address for receiving notifications is defined.

Example of configuring ssmtp – a lightweight and simple MTA for sending mail from the system.

Installation:

Debian/Ubuntu:

sudo apt update && sudo apt install ssmtp mailutils -y

RHEL / AlmaLinux / Rocky Linux (ssmtp is usually installed from the EPEL repository):

sudo dnf install ssmtp mailx

Create the configuration file:

sudo nano /etc/ssmtp/ssmtp.conf

and edit the content:

# Default sender address
root=[email protected]

# SMTP server and port of your email provider
mailhub=smtp.your-domain.com:587
# Alternative example:
# mailhub=smtp.gmail.com:587        # For Gmail

# Authentication credentials
AuthUser=[email protected]
AuthPass=your-password

# Encryption settings
UseSTARTTLS=YES    # Use STARTTLS
UseTLS=YES         # Use TLS
FromLineOverride=YES  # Allow overriding the sender address

# Hostname (specify your server's name)
hostname=server1.your-domain.com
# you can use hostname=localhost or specify the system's actual hostname

Save the file and configure access permissions:

sudo chmod 640 /etc/ssmtp/ssmtp.conf

sudo chown root:mail /etc/ssmtp/ssmtp.conf

Configure senders (revaliases):

sudo nano /etc/ssmtp/revaliases

root:[email protected]:smtp.your-domain.com:587
www-data:[email protected]:smtp.your-domain.com:587

For messages to be sent, the server must be able to make outbound connections to the SMTP port used in the configuration: 587 (submission with STARTTLS, used in this example), 465 (SMTP over implicit TLS), or 25 (standard SMTP).

Basic mail sending test:

echo "SMART test message" | mail -s "SMART notification test" [email protected]

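You can explicitly specify the sender. With GNU mailutils, -a appends a header; in some other mail implementations, -a attaches a file instead: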

echo "SMART test message" | mail -s "SMART notification test" -a "From: [email protected]" [email protected]

Or via ssmtp directly:

echo "SMART test message" | ssmtp [email protected]

[email protected] – the recipient address where the message will be sent.

Info

If the email is not delivered, further configuration of smartd is pointless until the mail delivery issues are resolved.

SMART notification configuration is done in the file:

/etc/smartd.conf

Example of a simple working configuration:

/dev/sda -a -o on -S on -m [email protected]

Parameters:

/dev/sda – the disk being monitored;
-a – full set of checks;
-S on – attribute saving between reboots is enabled;
-o on – automatic offline data collection is activated;
-m – notifications are sent to the specified email.

Once smartd has been restarted and has re-read /etc/smartd.conf, it monitors the disk state in the background.

Notifications when approaching threshold values

A key feature of smartd is that it monitors changes in attribute values, not just threshold crossings.

In practice, this means a notification can be sent:

upon the first appearance of Current_Pending_Sector;
upon an increase in Reallocated_Sector_Ct, even if the threshold has not yet been reached;
upon detection of self-test errors;
upon degradation of NVMe parameters.

The attributes most indicative of early failure:

Reallocated_Sector_Ct
Current_Pending_Sector
Offline_Uncorrectable
Media and Data Integrity Errors (NVMe)
Percentage Used (SSD/NVMe)

Even minimal changes in these parameters should be considered a reason for attention.

Using self-tests as a notification source

For better coverage, it is recommended to combine attribute monitoring with regular self-tests.

Example configuration with a schedule:

/dev/sda -a -o on -S on \
    -s (S/../.././02|L/../../6/03) \
    -m [email protected]

Logic of operation:

a short test is performed daily at 02:00;
a long (full) test is performed every Saturday at 03:00;
upon any test failure, the administrator receives a notification.
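
The -s argument is a regular expression that smartd matches against a string of the form T/MM/DD/d/HH: test type, month, day of month, day of week (1 = Monday to 7 = Sunday), and hour. The sketch below builds that string and tests it with grep. It uses a whole-string match, which is a simplification assumed here, and selftest_due is a hypothetical helper.

```shell
# Return 0 if the schedule regex selects the given test type at the given time.
# Arguments: regex, type (S or L), month MM, day DD, weekday d (1=Monday..7=Sunday), hour HH.
selftest_due() {
    printf '%s/%s/%s/%s/%s\n' "$2" "$3" "$4" "$5" "$6" | grep -E -x -q -e "$1"
}

# Usage: would a long test run on Saturday, March 16, at 03:00?
#   selftest_due '(S/../.././02|L/../../6/03)' L 03 16 6 03 && echo "long test due"
```

Trying a few dates this way before editing /etc/smartd.conf helps catch a schedule that never matches.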

Managing notification frequency and volume

To avoid excessive alerts, the -M once parameter is used:

/dev/sda -a -m [email protected] -M once

In this mode:

a notification is sent upon the first detection of each type of problem;
repeat messages for the same type of problem are not sent.

To test the notification system, add -M test to the device line alongside -m. smartd then sends a single test message when it starts, which verifies that it can send mail without waiting for an actual error.

Conclusion

This manual has covered the full cycle of deploying and operating smartctl and the smartd daemon for proactive disk health monitoring: the basic principles of SMART, practical attribute analysis, running and interpreting self-tests, the specifics of NVMe drives and RAID controllers, and methods for diagnosing problems and finding their root causes. Particular attention was given to notifications, which make it possible to detect drive degradation early, before a critical failure occurs.

Properly configured SMART monitoring is an integral part of a reliable server infrastructure and should be treated as a standard operational requirement. With smartctl and smartd, the administrator can move from reactive incident handling to planned maintenance of the disk subsystem, reducing the risk of downtime and data loss and laying the groundwork for further automation and integration with centralized monitoring systems.
