Last year’s Google and CMU papers on disk failure rates (see Everything you know about disks is wrong and Google’s Disk Failure Experience) made two points: a) annual disk failure rates are significantly higher than manufacturers admit and b) enterprise drives aren’t more reliable than consumer drives.

Now, in An Analysis of Latent Sector Errors in Disk Drives, Lakshmi N. Bairavasundaram, Garth R. Goodson, Shankar Pasupathy and Jiri Schindler analyzed the error logs on over 50,000 arrays covering 1.53 million enterprise and consumer disks. It looks like the largest such study ever published.

Lakshmi was with the U of Wisconsin-Madison while the other three work at NetApp. They published at the Sigmetrics ’07 conference last June.

A different kind of latency
Unreported or latent disk errors are real. That’s why vendors have stopped recommending RAID 5 on SATA drives.

Disks have a lot of errors, most of them transient. This study focused on Latent Sector Errors (LSE), defined as:

. . . when a particular disk sector cannot be read or written, or when there is an uncorrectable ECC error. Any data previously stored in the sector is lost.

They don’t say so explicitly, but these are surely NetApp arrays. They also comment on the effectiveness of media and data scrubbing, a feature of high-end arrays.

Results

  • Yes, there are “bad” disks: 0.2% of the drives had more than 1000 errors.
  • 3.45% of the entire population had LSE over the 32-month study period.
  • 8.5% of the consumer disks had LSE.
  • 1.9% of the enterprise disks had LSE.
  • In their first 12 months, 3.15% of consumer disks and 1.46% of enterprise disks developed at least one LSE.
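
The overall, consumer and enterprise percentages above imply a rough population mix. Here is a back-of-the-envelope sketch, assuming all three figures cover the same 32-month window and that every disk is either consumer or enterprise. The variable names are mine, not the paper's.

```python
# Percentages quoted in the results list above.
overall_pct = 3.45      # all disks with at least one LSE
consumer_pct = 8.5      # consumer disks with at least one LSE
enterprise_pct = 1.9    # enterprise disks with at least one LSE

# If x is the consumer share of the population, then
#   consumer_pct * x + enterprise_pct * (1 - x) = overall_pct
# Solving for x:
consumer_share = (overall_pct - enterprise_pct) / (consumer_pct - enterprise_pct)
enterprise_share = 1 - consumer_share

print(f"implied consumer share:   {consumer_share:.1%}")
print(f"implied enterprise share: {enterprise_share:.1%}")
```

Under those assumptions, a bit under a quarter of the studied disks would be consumer class. That is an inference from the rounded figures, not a number taken from the paper.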

Causation
The team found several factors that contribute to LSE.

  • Size matters. As disk size increases, so does the fraction of disks with LSE.
  • Age matters. LSE rates climbed with age. For some – but not all – consumer disk models, 20% of the disks had LSE after 24 months. Rates climbed faster for consumer drives than for enterprise drives.
  • Vendor matters. They also found that some vendors had much higher LSE rates than others. Due to the industry omerta they don’t rat out the offenders.
  • Errors matter. A drive that develops one error is much more likely to develop a second. The second error is likely to be close to the first error. Once a drive has developed its first error, enterprise and consumer drives are equally likely to develop a second.

Annual sector error rates
This figure from the paper indicates the variability in age-related error rates.


The caption states:

For each disk model that has been in the field for at least two years, the first bar represents Year 1 and the second represents Year 2. The NL and ES bars represent weighted averages for nearline and enterprise class drives respectively.

Consumer/SOHO users with large, cheap, old disks will see LSE. Another reason Desktop RAID is a bad idea. Not many consumers replace their drives every 24 months.

File system implications
File systems rely on disk-based data structures to keep track of your stuff. One of the key findings of the team is that disk errors tend to congregate near each other, like congressmen and lobbyists.

Therefore, file systems that replicate critical data across the disk are much less likely to lose your data than those that, like ReiserFS, place critical structures in one contiguous area. Related issue: since disks virtualize the block structure, how do FS designers know where their data structures actually go on disk?
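
To see why clustering favors spread-out replicas, here is a toy sketch. It assumes a hypothetical 1,000-sector disk and a single run of 10 adjacent bad sectors, and it counts how many placements of that run wipe out every copy of a critical structure. The sizes and sector numbers are made up for illustration and do not come from the paper.

```python
DISK_SECTORS = 1000   # hypothetical tiny disk
CLUSTER_LEN = 10      # hypothetical run of adjacent bad sectors

def fatal_positions(replicas):
    """Count the cluster start positions that destroy every replica."""
    fatal = 0
    for start in range(DISK_SECTORS - CLUSTER_LEN + 1):
        bad = range(start, start + CLUSTER_LEN)
        # Data is lost only if all copies fall inside the bad run.
        if all(sector in bad for sector in replicas):
            fatal += 1
    return fatal

# Three copies packed together vs. three copies spread across the disk.
contiguous = fatal_positions([500, 501, 502])
spread = fatal_positions([0, 333, 666])

print("fatal placements, contiguous copies:", contiguous)
print("fatal placements, spread copies:    ", spread)
```

In this toy model, 8 of the 991 possible placements destroy all three contiguous copies, while no single run can reach all three spread copies. Real disks remap blocks, so the logical spacing a file system picks is only a proxy for physical distance.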

Media and data scrubbing
What’s the difference?

Media scrubs use a SCSI Verify command to validate a disk sector’s integrity. This command performs an ECC check of the sector’s content from within the disk without transferring data to the storage layer. On failure, the command returns a latent sector error.

While

A data scrub is primarily used to detect data corruption. This scrub issues read operations for each disk sector, computes a checksum over its data, compares the checksum to the on-disk 8-byte checksum, and reconstructs the sector from other disks in the RAID group if the checksum comparison fails. Latent sector errors discovered by data scrubs appear as read errors.

In the analyzed drives over 60% of LSE were found by scrubbing. Scrubbing is a high-end feature that works.
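
The data scrub described above can be sketched in a few lines. This is a minimal illustration, not NetApp's implementation. It assumes a single stripe with XOR parity, tiny 16-byte sectors, and an 8-byte BLAKE2b digest standing in for the on-disk checksum, whose real format the quoted passage doesn't describe. The function names are hypothetical.

```python
import hashlib
from functools import reduce

SECTOR_SIZE = 16  # tiny sectors keep the demo readable; real sectors are larger

def checksum(data: bytes) -> bytes:
    # Stand-in 8-byte checksum (assumption: the real on-disk format differs).
    return hashlib.blake2b(data, digest_size=8).digest()

def xor_blocks(blocks):
    # XOR equal-length blocks together, byte by byte.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def data_scrub(stripe, checksums):
    """Read each sector, verify its checksum, rebuild it from the others on mismatch.

    stripe: list of bytearray sectors, one per disk (data sectors plus XOR parity).
    checksums: the stored checksum for each sector.
    Returns the indices of the sectors that were reconstructed.
    """
    repaired = []
    for i, sector in enumerate(stripe):
        if checksum(bytes(sector)) != checksums[i]:
            others = [bytes(s) for j, s in enumerate(stripe) if j != i]
            stripe[i][:] = xor_blocks(others)  # XOR of the rest recovers the sector
            repaired.append(i)
    return repaired

# Build a three-data-disk stripe plus parity, then corrupt one sector.
data = [bytes([value]) * SECTOR_SIZE for value in (1, 2, 3)]
parity = xor_blocks(data)
stripe = [bytearray(block) for block in data + [parity]]
checksums = [checksum(bytes(sector)) for sector in stripe]

stripe[1][0] ^= 0xFF                      # simulate silent corruption on disk 1
repaired = data_scrub(stripe, checksums)
print("repaired sectors:", repaired)
print("disk 1 restored:", bytes(stripe[1]) == data[1])
```

With single parity, this only works when at most one sector per stripe is bad. That limit is why finding errors early, before a second one lands in the same stripe, matters so much.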

The StorageMojo take
The consistency of LSE as disk capacity increased suggests that there is a constant head/media issue. Since consumer drives are larger than enterprise drives, part of the higher LSE rate is explicable.

The higher LSE rate increase for aging consumer drives suggests that enterprise drives are higher quality. Or maybe their error correction is better.

Finally, drive vendors need to re-think their ECC strategies. As capacities increase so will LSE. Higher quality ECC comes at the cost of capacity. It is time to start paying that price.

Comments welcome, of course. Download the article pdf here.