
Bitrot and atomic COWs: Inside “next-gen” filesystems

We look at the amazing features in ZFS and btrfs—and why you need them.


Most people don't care much about their filesystems. But at the end of the day, the filesystem is probably the single most important part of an operating system. A kernel bug might mean the loss of whatever you're working on right now, but a filesystem bug could wipe out everything you've ever done... and it could do so in ways most people never imagine.

Sound too theoretical to make you care about filesystems? Let's talk about "bitrot," the silent corruption of data on disk or tape. One at a time, year by year, a random bit here or there gets flipped. If you have a malfunctioning drive or controller—or a loose/faulty cable—a lot of bits might get flipped. Bitrot is a real thing, and it affects you more than you probably realize. The JPEG that ended in blocky weirdness halfway down? Bitrot. The MP3 that startled you with a violent CHIRP!, and you wondered if it had always done that? No, it probably hadn't—blame bitrot. The video with a bright green block in one corner followed by several seconds of weird rainbowy blocky stuff before it cleared up again? Bitrot.

The worst thing is that backups won't save you from bitrot. The next backup will cheerfully back up the corrupted data, replacing your last good backup with the bad one. Before long, you'll have rotated through all of your backups (if you even have multiple backups), and the uncorrupted original is now gone for good.

Contrary to popular belief, conventional RAID won't help with bitrot, either. "But my raid5 array has parity and can reconstruct the missing data!" you might say. That only works if a drive completely and cleanly fails. If the drive instead starts spewing corrupted data, the array may or may not notice the corruption (most arrays don't check parity by default on every read). Even if it does notice... all the array knows is that something in the stripe is bad; it has no way of knowing which drive returned bad data—and therefore which one to rebuild from parity (or whether the parity block itself was corrupt).
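To make that concrete, here's a minimal sketch in Python (an in-memory toy "stripe" of byte strings, not a real array) of how RAID5-style XOR parity behaves when a single bit flips silently: the parity check reveals that the stripe is inconsistent, but nothing in the math identifies which block is lying.

```python
# Minimal illustration of RAID5-style XOR parity on a hypothetical in-memory
# "stripe": two data blocks plus one parity block.

def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data1 = bytes([0x41] * 8)          # block from drive 1
data2 = bytes([0x42] * 8)          # block from drive 2
parity = xor_blocks(data1, data2)  # parity block on drive 3

# Silent corruption: one bit flips in data2 on its way back from the disk.
corrupted = bytearray(data2)
corrupted[3] ^= 0x04               # flip a single bit
data2_read = bytes(corrupted)

# The array can tell the stripe is inconsistent...
print("stripe consistent?", xor_blocks(data1, data2_read) == parity)  # False

# ...but reconstruction from parity assumes you already know which block is bad.
# Rebuilding data2 from (data1, parity) gives the right answer; rebuilding data1
# from (data2_read, parity) gives garbage. The parity itself can't say which of
# those two repairs is the correct one.
print("rebuild data2:", xor_blocks(data1, parity) == data2)        # True
print("rebuild data1:", xor_blocks(data2_read, parity) == data1)   # False
```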

What might save your data, however, is a "next-gen" filesystem.

Let's look at a graphic demonstration. Here's a picture of my son Finn that I like to call "Genesis of a Supervillain." I like this picture a lot, and I'd hate to lose it, which is why I store it on a next-gen filesystem with redundancy. But what if I didn't do that?

As a test, I set up a virtual machine with six drives. One has the operating system on it, two are configured as a simple btrfs-raid1 mirror, and the remaining three are set up as a conventional raid5. I saved Finn's picture on both the btrfs-raid1 mirror and the conventional raid5 array, and then I took the whole system offline and flipped a single bit—yes, just a single bit from 0 to 1—in the JPG file saved on each array. Here's the result:
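To reproduce that kind of damage on a throwaway test system, the bit generally has to be flipped underneath the filesystem, with nothing mounted; a write made through a mounted btrfs volume would simply get a fresh, matching checksum. The sketch below is a rough Python illustration of flipping one bit at a byte offset in a file, assuming a placeholder path and offset; it is not the exact procedure used in the test above.

```python
# Flip one bit of one byte in a file, in place. Path and offset are placeholders;
# do this only to throwaway test data, with the target filesystem offline.

path = "/mnt/test/finn.jpg"   # hypothetical path
offset = 100_000              # hypothetical byte offset within the file

with open(path, "r+b") as f:
    f.seek(offset)
    original = f.read(1)[0]
    f.seek(offset)
    f.write(bytes([original ^ 0x01]))   # flip the lowest bit of that byte
```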

The raid5 array didn't notice or didn't care about the flipped bit in Finn's picture any more than a standard single disk would. The next-gen btrfs-raid1 system, however, immediately caught and corrected the problem. The results are pretty obvious. If you care about your data, you want a next-gen filesystem. Here, we'll examine two: the older ZFS and the more recent btrfs.

What is a “next-generation” filesystem, anyway?

"Next-generation" is a phrase that gets handed out like sales flyers in a mall parking lot. But in this case, it actually means something. I define a "generation" of filesystems as a group that uses a particular "killer feature"—or closely related set of them—that earlier filesystems don't but that later filesystems all do. Let's take a quick trip down memory lane and examine past and current generations:

Generation 0: No system at all. There was just an arbitrary stream of data. Think punchcards, data on audiocassette, Atari 2600 ROM carts.

Generation 1: Early random access. Here, there are multiple named files on one device with no folders or other metadata. Think Apple ][ DOS (but not ProDOS!) as one example.

Generation 2: Early organization (aka folders). When devices became capable of holding hundreds of files, better organization became necessary. We're referring to TRS-DOS, Apple //c ProDOS, MS-DOS FAT/FAT32, etc.

Generation 3: Metadata—ownership, permissions, etc. As the number of users on a single machine grew, the ability to restrict and control access became necessary. This includes AT&T UNIX, NetWare, early NTFS, etc.

Generation 4: Journaling! This is the killer feature defining all current, modern filesystems—ext4, modern NTFS, UFS2, you name it. Journaling keeps the filesystem from becoming inconsistent in the event of a crash, making it much less likely that you'll lose data, or even an entire disk, when the power goes off or the kernel crashes.
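To make the journaling idea concrete, here's a minimal write-ahead sketch in Python. It's a toy built on an assumed journal.log file and whole-file writes, nothing like a real filesystem journal (which logs block-level metadata, and sometimes data, inside the filesystem itself), but it shows the principle: durably record the intent, apply it, then mark it complete, so recovery after a crash can finish or discard the operation instead of trusting a half-applied one.

```python
# Toy write-ahead journal, just to show the idea behind journaling filesystems.
import json
import os

JOURNAL = "journal.log"   # hypothetical journal file

def journaled_write(path: str, data: bytes) -> None:
    # 1. Durably record what we are about to do, including the payload,
    #    before touching the real file.
    entry = {"op": "write", "path": path, "data": data.hex()}
    with open(JOURNAL, "a") as j:
        j.write(json.dumps(entry) + "\n")
        j.flush()
        os.fsync(j.fileno())
    # 2. Apply the change to the real file.
    with open(path, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())
    # 3. Mark the operation complete.
    with open(JOURNAL, "a") as j:
        j.write(json.dumps({"op": "commit", "path": path}) + "\n")
        j.flush()
        os.fsync(j.fileno())

def recover() -> None:
    # After a crash, replay any logged write that never reached its commit
    # record, so each file ends up fully old or fully new, never half-written.
    pending = {}
    with open(JOURNAL) as j:
        for line in j:
            entry = json.loads(line)
            if entry["op"] == "write":
                pending[entry["path"]] = bytes.fromhex(entry["data"])
            elif entry["op"] == "commit":
                pending.pop(entry["path"], None)
    for path, data in pending.items():
        with open(path, "wb") as f:
            f.write(data)
```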

So if you accept my definition, "next-generation" currently means "fifth generation." It's defined by an entire set of features: built-in volume management, per-block checksumming, self-healing redundant arrays, atomic COW snapshots, asynchronous replication, and far-future scalability.
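Two items on that list, per-block checksumming and self-healing redundancy, are what saved Finn's picture above, and they're easy to illustrate. The sketch below is a purely conceptual in-memory toy (the MirroredStore class and its layout are invented for this example; ZFS and btrfs use their own on-disk formats and stronger machinery): every block is stored with a checksum, every block lives on two "devices," and a read that finds one copy failing its checksum returns the good copy and rewrites the bad one.

```python
# Conceptual sketch of checksummed, self-healing mirrored reads.
# An in-memory toy, not how ZFS or btrfs are actually implemented on disk.
import hashlib

def checksum(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class MirroredStore:
    def __init__(self) -> None:
        self.copies = [{}, {}]   # two "devices", each mapping block id -> data
        self.checksums = {}      # block id -> expected checksum (stored as metadata)

    def write(self, block_id: int, data: bytes) -> None:
        self.checksums[block_id] = checksum(data)
        for device in self.copies:
            device[block_id] = data

    def read(self, block_id: int) -> bytes:
        expected = self.checksums[block_id]
        for device in self.copies:
            data = device[block_id]
            if checksum(data) == expected:
                # Found a good copy; heal any sibling copy that fails its checksum.
                for other in self.copies:
                    if checksum(other[block_id]) != expected:
                        other[block_id] = data
                return data
        raise IOError("all copies failed checksum; data is unrecoverable")

store = MirroredStore()
store.write(7, b"genesis of a supervillain")
store.copies[0][7] = b"genesis of a supervillaiN"   # silent corruption on device 0
print(store.read(7))        # returns the good copy from device 1...
print(store.copies[0][7])   # ...and device 0 has been quietly repaired
```

A plain RAID1 mirror, by contrast, has no checksum to consult: on a read it picks one copy or the other and has no way to know which one is telling the truth.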

That's quite a laundry list, and one or two individual features from it have shown up in some "current-gen" systems (Windows' Volume Shadow Copy roughly corresponds to snapshots, for example). But there's a strong case to be made for the entire list defining the next generation.

Justify your generation

The quickest objection you could make to defining "generations" like this would be to point at NTFS' Volume Snapshot Service (VSS) or at the Linux Logical Volume Manager (LVM), each of which can take snapshots of filesystems mounted beneath them. (FreeBSD's UFS2 also offers limited snapshot capability.) However, these snapshots can't be replicated incrementally, meaning that backing up 1TB of data requires groveling over the full 1TB every time you do it. Worse yet, you generally can't replicate them as snapshots with block references intact, so every replicated copy is a full copy rather than a set of deltas; remote storage requirements balloon with each backup you keep, and so does the difficulty of managing them. With ZFS or btrfs replicated snapshots, you can have a single, immediately browsable, fully functional filesystem with 1,000+ versions of the filesystem available simultaneously.

Using VSS with Windows Backup, you must use VHD files as a target. Among other limitations, VHD files are only supported up to 2TiB in size, making them useless for even a single backup of a large disk or array. They must also be mounted with special tools not available on all versions of Windows, which further limits them to specialist use.
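To make the "references intact" point concrete, here's a toy model in Python of reference-preserving snapshot replication. It is content-addressed by hash purely for illustration (real zfs send and btrfs send streams walk the filesystem's own block metadata rather than hashing everything), but the effect is the same: the receiving side keeps references to blocks it already holds, so an incremental send carries only the blocks that actually changed.

```python
# Toy model of reference-preserving snapshot replication; not the actual
# zfs send / btrfs send stream format.
import hashlib

def block_hash(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()

class SnapshotStore:
    def __init__(self) -> None:
        self.blocks = {}      # hash -> block data, shared by all snapshots
        self.snapshots = {}   # snapshot name -> list of block hashes

    def take_snapshot(self, name: str, file_blocks: list) -> None:
        hashes = []
        for block in file_blocks:
            h = block_hash(block)
            self.blocks.setdefault(h, block)   # unchanged blocks are stored once
            hashes.append(h)
        self.snapshots[name] = hashes

    def replicate(self, name: str, remote: "SnapshotStore") -> int:
        """Send a snapshot to `remote`; returns bytes actually transferred."""
        sent = 0
        for h in self.snapshots[name]:
            if h not in remote.blocks:         # remote already references old blocks
                remote.blocks[h] = self.blocks[h]
                sent += len(self.blocks[h])
        remote.snapshots[name] = list(self.snapshots[name])
        return sent

local, remote = SnapshotStore(), SnapshotStore()
data = [bytes([i]) * 4096 for i in range(256)]   # ~1 MiB of "file data"
local.take_snapshot("monday", data)
data[3] = bytes(range(16)) * 256                 # one 4KiB block changes overnight
local.take_snapshot("tuesday", data)

print(local.replicate("monday", remote))    # full send: 1,048,576 bytes
print(local.replicate("tuesday", remote))   # incremental send: 4,096 bytes
```

On the remote side, both "monday" and "tuesday" remain complete, browsable snapshots; only the changed block was transferred or stored twice.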

Finally, Microsoft's VSS typically depends on "writer" components that interface with applications (such as MS SQL Server), and those writers can themselves hang, making it difficult to create a VSS snapshot successfully in some cases. To be fair, when working properly, VSS writers offer something that simple snapshots don't: application-level consistency. But VSS writer bugs are a real problem, and I've encountered lots of Windows Servers that were quietly failing to create Shadow Copies. (VSS does not automatically fall back to a writer-less Shadow Copy if the operation times out; it just logs the failure and gives up.) I have yet to encounter a ZFS or btrfs filesystem or array that won't immediately create a valid snapshot.

At the end of the day, both LVM and VSS offer useful features that a lot of sysadmins do use, but they don't jump right out and demand your attention the way filenames, folders, metadata, or journaling did when they were introduced. Still, snapshots are only one feature out of the entire laundry list. You could make a case that snapshots alone made the fifth generation and that the other features in ZFS and btrfs make the sixth. But by the time you finish this article, you'll see it's hard to argue that btrfs and ZFS don't constitute a new generation, one easily distinguishable from everything that came before them.
