HLRL: Disk Storage and Failure Rates

From the HoffmanLabs Reading List…

Consumer-grade and enterprise-grade disk drives can have similar failure rates? Really?

Cooler temperatures don't significantly increase drive lifetimes? Up through 35°C or so, temperature doesn't matter to average drive lifetime? Ok, but what about the computers themselves?

With a few very specific exceptions, disk SMART monitoring isn't predictive? Do watch for scan errors in particular, and for reallocation count, offline reallocation, and probational count errors. See these? Swap the drive. Otherwise, SMART data doesn't correlate well with drive failures, and doesn't reliably predict them.
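The check described above can be sketched as a small predicate. This is a hypothetical illustration, not a tool: the attribute names follow common smartctl-style conventions and are assumptions here, and raw attribute values are vendor-specific in practice, so treat the "any nonzero count" threshold as illustrative only.

```python
# Hypothetical sketch: flag a drive for replacement based on the handful of
# SMART attributes the studies found actionable. Attribute names are assumed
# smartctl-style conventions; raw values vary by vendor, so the "any nonzero
# count means swap" rule below is illustrative, not authoritative.

CRITICAL_ATTRIBUTES = (
    "Raw_Read_Error_Rate",     # standing in for "scan errors" here (assumption)
    "Reallocated_Sector_Ct",   # reallocation count
    "Offline_Uncorrectable",   # offline reallocation errors
    "Current_Pending_Sector",  # "probational" (pending) sector count
)

def should_replace(raw_values: dict) -> bool:
    """Return True if any critical SMART raw count is nonzero."""
    return any(raw_values.get(name, 0) > 0 for name in CRITICAL_ATTRIBUTES)
```

For example, `should_replace({"Reallocated_Sector_Ct": 3})` returns `True`, while a drive reporting zeros for all four attributes would pass, even though, as noted below, a passing SMART report is no guarantee.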

Disks with just three years of use are already heading into increasing failure rates? Failure rates spike to roughly 6% annually, and high-use disks stay in that same 6% failure range, or even increase. The back side of the expected bathtub curve arrives as early as two years?

Observed disk failure rates are typically 3.4 times higher than vendor-published MTBF figures would imply, and sometimes as much as 15 times higher. For drives five to eight years old, observed failure rates run as much as thirty times the MTBF-derived rates.
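To see what those multipliers mean in practice, here's a back-of-the-envelope sketch converting a vendor MTBF figure into an implied annualized failure rate (AFR), then scaling it by the observed-to-published multipliers above. The 1,000,000-hour MTBF is an illustrative assumption, not a specific vendor's claim.

```python
# Back-of-the-envelope sketch: a vendor MTBF figure implies an annualized
# failure rate (AFR); the field studies above suggest multiplying that by
# 3.4x (typical) to 15x (worst case). The MTBF input is an example value.

HOURS_PER_YEAR = 8766  # 365.25 days

def mtbf_to_afr(mtbf_hours: float) -> float:
    """Annualized failure rate implied by an MTBF figure, as a fraction."""
    return HOURS_PER_YEAR / mtbf_hours

published = mtbf_to_afr(1_000_000)   # ~0.88% per year on the datasheet
typical   = published * 3.4          # ~3.0% per year observed
worst     = published * 15           # ~13% per year observed
```

A datasheet AFR under 1% turning into 3% or more in the field is the gap the studies keep measuring.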

If you should lose a drive in a RAID set, expect that you might well see multiple concurrent disk spindle failures. Yes, during the replacement and recovery. Within a RAID5 (RAID 5) set, the failure rate on the remaining drive spindles spikes to roughly 4x its average during the RAID recovery from a failed spindle, and RAID5 cannot recover from failures of multiple spindles; the recovery itself can encounter and can trigger additional spindle failures.
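That rebuild-window risk can be roughed out numerically. The sketch below assumes independent failures at 4x the baseline AFR during recovery, per the observation above; the drive count, baseline AFR, and rebuild duration are made-up example values, and real failures during rebuild are often correlated, so treat this as a lower bound on the risk.

```python
# Illustrative sketch: rough odds that at least one surviving spindle fails
# while a RAID 5 set rebuilds. Assumes independent failures at an elevated
# rate (4x baseline, per the observation above). The example drive count,
# baseline AFR, and rebuild window are assumptions, not measured values.

HOURS_PER_YEAR = 8766

def rebuild_loss_probability(surviving_drives: int,
                             baseline_afr: float,
                             rebuild_hours: float,
                             spike_factor: float = 4.0) -> float:
    """Probability of losing a second spindle during the rebuild window."""
    per_drive = baseline_afr * spike_factor * (rebuild_hours / HOURS_PER_YEAR)
    return 1.0 - (1.0 - per_drive) ** surviving_drives

# e.g. seven surviving drives, 6% baseline AFR, 24-hour rebuild:
risk = rebuild_loss_probability(7, 0.06, 24)
```

Even with optimistic independence assumptions, the second-failure odds during a day-long rebuild land well above what the baseline AFR alone would suggest, which is the point of the warning above.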

And here are the reading materials:

Updates

Most recent changes first.

Comments

Solid State Disk Error Study

From CMU and Facebook, A Large-Scale Study of Flash Memory Failures in the Field, looking at the failures and errors involved with SSD flash storage used in servers.

Available for download from here or here.

Backblaze Disk Failure Data

Empirical Backblaze disk failure rate data for Seagate, WD and Hitachi disk drives.

Backblaze September 2014 Update

Consumer-grade disk drives are doing as well as or better than enterprise-grade drives, and the Seagate and Western Digital drives are faring much worse than the Hitachi disk drives, per the Backblaze September 2014 data.

Microsoft Hardware Failure Data

From Microsoft Research, Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs.

An aside: one of the coauthors of that paper is Vince Orgovan, among those who moved from DEC to Microsoft.

The Failure Curve

Some offline discussions pointed to confusion around this article; the materials referenced here cover failure rates on newer and current-generation disk storage gear.

These are not the classic failure rates that most of us had assumed from the older disks; the failures and the failure rates are different, and the vendor-published MTBF values are (per the studies) optimistic. The failure-rate curve by age certainly isn't shaped as I had expected and assumed.

While replacing older drives is goodness, RAID, archival storage, and related mechanisms can also help protect data.

The above is for classic rotating-rust hard disk drives. If you've seen device failure rates or bit-error rates listed on solid-state disk (SSD) storage widgets, do please pass that along.

Inconsistently SMART, Too

And the specific organization and operation of SMART codes — the data points and sensors being monitored — vary from vendor to vendor.

SMART is not a silver bullet

As I discovered with the internal disk in my iBook, a SMART disk can fail without reporting any SMART errors. Even after total failure, the SMART reports produced nothing.

Fortunately, traditional symptoms in the form of nasty clicking sounds accompanied by system freezes prompted retirement of the disk in question, so no data was lost.

In other words, I don't entirely trust the technology.