
The Tux3 filesystem returns

From:  Daniel Phillips <lkml-AT-phunq.net>
To:  linux-kernel-AT-vger.kernel.org
Subject:  Tux3 report: New news for the new year
Date:  Tue, 01 Jan 2013 03:52:43 -0800
Message-ID:  <2597526.usDRg4h3X1@mars>

Hi everybody,

The Tux3 project has some interesting news to report for the new year. In 
brief, the first time Hirofumi ever put together all the kernel pieces in his 
magical lab over in Tokyo, our Tux3 rocket took off and made it straight to 
orbit. Or in less metaphorical terms, our first meaningful benchmarks turned in 
numbers that meet or even slightly beat the illustrious incumbent, Ext4:

    fsstress -f dread=0 -f dwrite=0 -f fsync=0 -f fdatasync=0 \
          -s 1000 -l 200 -n 200 -p 3

    ext4

        time     cpu     wait
        46.338   1.244   5.096
        49.101   1.144   5.896
        49.838   1.152   5.776

    tux3

        time     cpu     wait
        46.684   0.592   1.860
        44.011   0.684   1.764
        43.773   0.556   1.888

Fsstress runs a mix of filesystem operations typical of a Linux system under 
heavy load. In this test, Tux3 spends less time waiting than Ext4, uses less 
CPU (see below) and finishes faster on average. This was exciting for us, 
though we must temper our enthusiasm by noting that these are still early 
results and several important bits of Tux3 are as yet unfinished. While we do 
not expect the current code to excel at extreme scales just yet, it seems we 
are already doing well at the scale that resembles computers you are running 
at this very moment.

About Tux3

Here is a short Tux3 primer. Tux3 is a general purpose Linux filesystem 
developed by a group of us mainly for the fun of it. Tux3 started in the summer of 
2008, as a container for a new storage versioning algorithm originally meant 
to serve as a new engine for the ddsnap volume snapshot virtual device:

    http://lwn.net/Articles/288896/
    "Versioned pointers: a new method of representing snapshots"

As design work proceeded on a suitably simple filesystem with modern features, 
the focus shifted from versioning to the filesystem itself, as the latter is a 
notoriously challenging and engaging project. Initial prototyping was done in 
user space by me and others, and later ran under Fuse, a spectacular drive-by 
contribution from one Tero Roponen. Hirofumi joined the team with an amazing 
utility that makes graphs of the disk structure of Tux3 volumes, and soon took 
charge of the kernel port. I stand in awe of Hirofumi's design sense, detail 
work and general developer prowess.

Like a German car, Tux3 is both old school and modern. Closer in spirit to 
Ext4 than Btrfs, Tux3 sports an inode table, allocates blocks with bitmaps, 
puts directories in files, and stores attributes in inodes. Like Ext4 and 
Btrfs, Tux3 uses extents indexed by btrees. Source file names are familiar: 
balloc.c, namei.c etc. But Tux3 has some new files like filemap.c and log.c that 
help make it fast, compact, and very ACID.

Unlike Ext4, Tux3 keeps inodes in a btree, inodes are variable length, and all 
inode attributes are variable length and optional. Also unlike Ext4, Tux3 
writes nondestructively and uses a write-anywhere log instead of a journal.
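
To make the variable-length, optional attribute idea concrete, here is a tiny 
userspace sketch of walking a packed attribute area. The record layout and 
names are invented for illustration and are not Tux3's actual on-disk format:

    /* Sketch of variable-length, optional inode attributes packed as
     * (tag, length, payload) records. Names and layout are illustrative,
     * not Tux3's actual encoding. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    enum { ATTR_MODE = 1, ATTR_SIZE = 2, ATTR_MTIME = 3 };

    struct attr_hdr {
        uint8_t tag;    /* which attribute this record holds */
        uint8_t len;    /* payload bytes that follow */
    };

    /* Walk the packed attribute area of one inode and report what is there. */
    static void walk_attrs(const uint8_t *area, size_t bytes)
    {
        size_t pos = 0;
        while (pos + sizeof(struct attr_hdr) <= bytes) {
            struct attr_hdr hdr;
            memcpy(&hdr, area + pos, sizeof hdr);
            pos += sizeof hdr;
            if (hdr.tag == 0 || pos + hdr.len > bytes)
                break;                  /* end of attributes or corrupt area */
            printf("attr tag %u, %u bytes\n", hdr.tag, hdr.len);
            pos += hdr.len;             /* an absent attribute simply is not here */
        }
    }

    int main(void)
    {
        /* One inode carrying only a mode and a size attribute; mtime omitted. */
        uint8_t area[] = { ATTR_MODE, 4, 0xed, 0x41, 0, 0,
                           ATTR_SIZE, 8, 0, 16, 0, 0, 0, 0, 0, 0 };
        walk_attrs(area, sizeof area);
        return 0;
    }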

Differences with Btrfs are larger. The code base is considerably smaller, 
though to be sure, some of that can be accounted for by incomplete features. 
The Tux3 filesystem tree is single-rooted, there is no forest of shared trees. 
There is no built-in volume manager. Names and inodes are stored separately. 
And so on. But our goal is the same: a modern, snapshotting, replicating 
general purpose filesystem, which I am happy to say, seems to have just gotten 
a lot closer.

Front/Back Separation

At the heart of Tux3's kernel implementation lies a technique we call 
"front/back separation", which partly accounts for the surprising kernel CPU 
advantage in the above benchmark results. Tux3 runs as two, loosely coupled 
pieces: the frontend, which handles Posix filesystem operations entirely in 
cache, and the backend, which does the brute work of preparing dirty cache for 
atomic transfer to media. The frontend shows up as kernel CPU accounted to the 
Fsstress task, while the backend is largely invisible, running on some 
otherwise idle CPU. We suspect that the total of frontend and backend CPU is 
less than Ext4's as well, but so far nobody has checked. What we do know is 
that filesystem operations tend to complete faster when they only need to deal 
with cache and not little details such as backing store.
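
As a rough illustration of the front/back split (not Tux3 code; the threading 
and names are invented), a frontend thread can complete operations against 
in-memory state while a backend thread drains the accumulated dirty work in 
batches:

    /* Minimal sketch of front/back separation: the frontend completes
     * operations entirely against in-memory state, while a separate backend
     * thread takes the accumulated dirty state and "commits" it. */
    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  kick = PTHREAD_COND_INITIALIZER;
    static int dirty_ops;          /* applied to cache, not yet on media */
    static int done;

    static void frontend_op(void)
    {
        pthread_mutex_lock(&lock);
        dirty_ops++;               /* "modify cache" and return immediately */
        pthread_cond_signal(&kick);
        pthread_mutex_unlock(&lock);
    }

    static void *backend(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (!dirty_ops && !done)
                pthread_cond_wait(&kick, &lock);
            if (!dirty_ops && done) {
                pthread_mutex_unlock(&lock);
                return NULL;
            }
            int batch = dirty_ops; /* take the whole dirty set as one unit */
            dirty_ops = 0;
            pthread_mutex_unlock(&lock);
            printf("backend: committing batch of %d ops\n", batch);
            usleep(1000);          /* stand-in for writing to media */
        }
    }

    int main(void)
    {
        pthread_t back;
        pthread_create(&back, NULL, backend, NULL);
        for (int i = 0; i < 1000; i++)
            frontend_op();         /* the frontend never waits for the disk */
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_signal(&kick);
        pthread_mutex_unlock(&lock);
        pthread_join(back, NULL);
        return 0;
    }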

Front/back separation is like taking delayed allocation to its logical 
conclusion: every kind of structural change is delayed, not just block 
allocation. I credit Matt Dillon of Dragonfly fame for this idea. He described 
the way he used it in Hammer as part of this dialog:

   http://kerneltrap.org/Linux/Comparing_HAMMER_And_Tux3
   "Comparing HAMMER And Tux3"

Hammer is a cluster filesystem, but front/back separation turns out to be 
equally effective on a single node. Of course, the tricky part is making the 
two pieces run asynchronously without stalling on each other. Which brings us 
to...

Block Forking

Block forking is an idea that has been part of Tux3 from the beginning, and 
roughly resembles the "stable pages" work now underway. Unlike stable pages, 
block forking does not reduce performance. Quite the contrary - block forking 
enables front/back separation, which boosted Tux3 Fsstress performance about 
40%. The basic idea of block forking is to never wait on pages under IO, but 
clone them instead. This protects in-flight pages from damage by VFS syscalls 
without forcing page cache updates to stall on writeback.
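
A toy sketch of the cloning idea follows. The structures are invented for 
illustration; the point is only that a block under IO is copied rather than 
waited on:

    /* Sketch of block forking: instead of waiting for a block that is under
     * IO, clone it and apply the modification to the clone, so the in-flight
     * image stays stable. Names are illustrative, not Tux3's. */
    #include <stdlib.h>
    #include <string.h>
    #include <stdio.h>

    #define BLOCK_SIZE 4096

    struct block {
        char data[BLOCK_SIZE];
        int  under_io;             /* set while the backend is writing it out */
    };

    /* Return a block the frontend may safely modify right now. */
    static struct block *writable_block(struct block **slot)
    {
        struct block *b = *slot;
        if (!b->under_io)
            return b;              /* not in flight: modify in place */
        struct block *copy = malloc(sizeof *copy);
        memcpy(copy->data, b->data, BLOCK_SIZE);
        copy->under_io = 0;
        *slot = copy;              /* cache now points at the fork... */
        return copy;               /* ...while the old image finishes its IO */
    }

    int main(void)
    {
        struct block *b = calloc(1, sizeof *b);
        struct block *cached = b;
        b->under_io = 1;           /* pretend the backend started writeback */
        struct block *w = writable_block(&cached);
        strcpy(w->data, "new contents");
        printf("in-flight image untouched: '%s'\n", b->data);
        printf("cache sees: '%s'\n", cached->data);
        if (cached != b)
            free(b);
        free(cached);
        return 0;
    }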

Implementing this simple idea is harder than it sounds. We need to deal with 
multiple blocks being accessed asynchronously on the same page, and we need to 
worry a lot about cache object lifetimes and locking. Especially in truncate, 
things can get pretty crazy. Hirofumi's work here can only be described by one 
word:  brilliant.

Deltas and Strong Consistency

Tux3 groups frontend update transactions into "deltas". According to some 
heuristic, one delta ends and the next one begins, such that all dirty cache 
objects affected by the operations belonging to a given delta may be 
transferred to media in a single atomic operation. In particular, we take care 
that directory updates always lie in the same delta as associated updates such 
as creating or deleting inode representations in the inode table.
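
For illustration only, here is a minimal sketch of tagging operations with a 
delta number and ending a delta on a simple heuristic. The operation-count 
threshold is made up and is not Tux3's actual heuristic:

    /* Sketch of grouping operations into deltas: every dirty object is tagged
     * with the delta number current when it was modified, and a heuristic
     * decides where one delta ends and the next begins. */
    #include <stdio.h>

    #define OPS_PER_DELTA 64       /* illustrative threshold only */

    static unsigned current_delta = 1;
    static unsigned ops_in_delta;

    /* Called by the frontend for each namespace or data operation. */
    static unsigned tag_operation(void)
    {
        unsigned delta = current_delta;
        if (++ops_in_delta >= OPS_PER_DELTA) {
            /* Delta transition: everything tagged 'delta' is committed to
             * media as one atomic unit; new work joins the next delta. */
            current_delta++;
            ops_in_delta = 0;
        }
        return delta;
    }

    int main(void)
    {
        for (int i = 0; i < 200; i++) {
            unsigned d = tag_operation();
            if (i % 50 == 0)
                printf("op %d belongs to delta %u\n", i, d);
        }
        return 0;
    }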

Tux3 always cleans dirty cache completely on each delta commit. This is not 
traditional behavior for Linux filesystems, which normally let the core VM 
memory flusher tell them which dirty pages of which inodes should be flushed to 
disk. We largely ignore the VM's opinion about that and flush everything, every 
delta. You might think this would hurt performance, but apparently it does 
not. It does allow us to implement stronger consistency guarantees than 
typical for Linux.

We provide two main guarantees:

   *  Atomicity: File data never appears on media in an intermediate state, 
      with the single exception of large file writes, which may be broken
      across multiple deltas, but with write ordering preserved.

   * Ordering: If one filesystem transaction ends before another transaction 
      begins, then the second transaction will never appear on durable media
      unless the first does too.

Our atomicity guarantee resembles Ext4's data=journal but performs more like 
data=ordered. This is interesting, considering that Tux3 always writes 
nondestructively. Finding a new, empty location for each block written and 
updating the associated metadata would seem to carry a fairly hefty cost, but 
apparently it does not.

Our ordering guarantee has not been seen on Linux before, as far as we know. 
We get it "for free" from Tux3's atomic update algorithm. This could possibly 
prove useful to developers of file-based databases, for example, mailers and 
MTAs. (Kmail devs, please take note!)
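
As a hypothetical example of what the ordering guarantee could buy such 
programs, consider maildir-style delivery that writes a message and then 
renames it into place. The paths and details below are invented; portable code 
would still fsync, this only illustrates what the guarantee provides:

    /* Hypothetical beneficiary of the ordering guarantee: write the message,
     * then rename it into place. With ordering, the rename can never reach
     * media before the data does, even without an fsync.
     * Assumes ./tmp and ./new directories exist (maildir-style). */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const char *msg = "From: example\n\nhello\n";
        int fd = open("tmp/msg.12345", O_CREAT | O_WRONLY | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }
        if (write(fd, msg, strlen(msg)) < 0) { perror("write"); return 1; }
        close(fd);                              /* transaction 1: the data */
        if (rename("tmp/msg.12345", "new/msg.12345") < 0) {   /* transaction 2 */
            perror("rename");
            return 1;
        }
        /* Ordering: if "new/msg.12345" is ever visible after a crash, its
         * contents are the fully written message, never an empty file. */
        return 0;
    }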

Logging and Rollup

Tux3 goes out of its way to avoid recursive copy on write, that is, the 
expensive behavior where a change to a data leaf must be propagated all the 
way up the filesystem tree to the root, to avoid altering data that belongs to 
a previously committed consistent filesystem image. (Btrfs extends this 
recursive copy on write idea to implement snapshots, but Tux3 does not.)

Instead of writing out changes to parents of altered blocks, Tux3 only changes 
the parents in cache, and writes a description of each change to a log on 
media. This prevents recursive copy-on-write. Tux3 will eventually write out 
such retained dirty metadata blocks in a process we call "rollup", which 
retires log blocks and writes out dirty metadata blocks in full. A delta 
containing a rollup also tidily avoids recursive copy on write: just like any 
other delta, changes to the parents of redirected blocks are made only in 
cache, and new log entries are generated.

Tux3 further employs logging to make the allocation bitmap overhead largely 
vanish. Tux3 retains dirty bitmaps in memory and writes a description of each 
allocate/free to the log. It is much cheaper to write out one log block than 
potentially many dirty bitmap blocks, each containing only a few changed bits.
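
Here is a small userspace sketch of that trade: the cached bitmap is updated 
in memory while only a compact record per allocation is queued for the log. 
The record format is invented for illustration:

    /* Sketch of logging allocations instead of flushing bitmap blocks: the
     * cached bitmap changes in memory and a small record describing the
     * change goes to the log. */
    #include <stdint.h>
    #include <stdio.h>

    enum { LOG_ALLOC = 1, LOG_FREE = 2 };

    struct log_rec {
        uint8_t  type;
        uint64_t block;            /* first block of the extent */
        uint32_t count;            /* number of blocks */
    };

    static uint8_t bitmap[1 << 16];          /* cached allocation bitmap */
    static struct log_rec logbuf[512];       /* records queued for next commit */
    static unsigned nlog;

    static void set_bits(uint64_t block, uint32_t count, int value)
    {
        for (uint32_t i = 0; i < count; i++) {
            uint64_t b = block + i;
            if (value)
                bitmap[b >> 3] |= 1 << (b & 7);
            else
                bitmap[b >> 3] &= ~(1 << (b & 7));
        }
    }

    /* Allocate an extent: dirty only the cached bitmap plus one tiny record. */
    static void balloc(uint64_t block, uint32_t count)
    {
        set_bits(block, count, 1);
        logbuf[nlog++] = (struct log_rec){ LOG_ALLOC, block, count };
    }

    int main(void)
    {
        balloc(1000, 256);                   /* a 1 MB extent at 4 KB blocks */
        balloc(5000, 1);
        printf("%u compact log records cover the bitmap changes\n", nlog);
        return 0;
    }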

Tux3's rollup not only avoids expensive recursive copy on write, it optimizes 
updating in at least three ways.

  * Multiple deltas may dirty the same metadata block multiple times but
     rollup only writes those blocks once.

  * Multiple metadata blocks may be written out in a single, linear pass
     across spinning media.

  * Backend structure changes are batched in a cache friendly way.

One curious side effect of Tux3's log+rollup strategy is that in normal 
operation, the image of a Tux3 filesystem is never entirely consistent if 
considered only as literal block images. Instead, the log must be replayed in 
order to reconstruct dirty cache, then the view of the filesystem tree from 
dirty cache is consistent.

This is more or less the inverse of the traditional view where a replay 
changes the media image. Tux3 replay is a true read-only operation that leaves 
media untouched and changes cache instead. In fact, this theme runs 
consistently through Tux3's entire design. As a filesystem, Tux3 cares about 
updating cache, moving data between cache and media, and little else.
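
A minimal sketch of such a read-only replay, with an invented log format, 
looks like this; the media image is never written, only cache state is 
rebuilt:

    /* Sketch of read-only replay: walk the on-media log chain and re-apply
     * each record to in-memory state; nothing is written back to the device.
     * Record types and layout are illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    struct log_block {
        int64_t next;              /* position of the next log block, -1 at end */
        int     nrecs;
        struct { int type; uint64_t target; } recs[64];
    };

    /* Pretend "media": a small chain of log blocks already read from disk. */
    static struct log_block media_log[3] = {
        { .next = 1,  .nrecs = 1, .recs = { { 1, 100 } } },
        { .next = 2,  .nrecs = 2, .recs = { { 1, 101 }, { 2, 200 } } },
        { .next = -1, .nrecs = 1, .recs = { { 2, 201 } } },
    };

    static void apply_to_cache(int type, uint64_t target)
    {
        /* In a real filesystem this would dirty a cached bitmap, btree node,
         * or inode; here we only report the reconstruction. */
        printf("cache: reapply type %d to block %llu\n",
               type, (unsigned long long)target);
    }

    int main(void)
    {
        int64_t pos = 0;                     /* head of the log chain */
        while (pos != -1) {
            struct log_block *lb = &media_log[pos];
            for (int i = 0; i < lb->nrecs; i++)
                apply_to_cache(lb->recs[i].type, lb->recs[i].target);
            pos = lb->next;
        }
        /* Media stays untouched: the consistent view now lives in cache. */
        return 0;
    }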

Tux3 does not normally update the media view of its filesystem tree even at 
unmount. Instead, it replays the log on each mount. One excellent reason for 
doing this is to exercise our replay code. (You surely would not want to 
discover replay flaws only on the rare occasions you crash.) Another reason is 
that we view sudden interruption as the normal way a filesystem should shut 
down. We uphold your right to hit the power switch on a computing device and 
expect to find nothing but consistent data when you turn it back on.

Fast Sync

Tux3 can sync a minimal file data change to disk by writing four blocks, or a 
minimal file create and write with seven blocks:

    http://phunq.net/pipermail/tux3/2012-December/000011.html
    "Full volume sync performance"

This is so fast that we are tempted to implement fsync as sync. However, we 
intend to resist that temptation in the long run, and implement an optimized 
fsync that "jumps the queue" of Tux3's delta update pipeline and completes 
without waiting for a potentially large amount of unrelated dirty cache to be 
flushed to media.

Still to do

There is a significant amount of work still needed to bring Tux3 to a 
production state. As of today, Tux3 does not have snapshots, in spite of that 
being the main motivation for starting on this in the first place. The new 
PHtree directory index is designed, not implemented. Freespace management 
needs acceleration before it will benchmark well at extreme scale. Block 
allocation needs to be much smarter before it will age well and resist read 
fragmentation. There are several major optimizations still left to implement. 
We need a good fsck that approaches the effectiveness of e2fsck. There is a 
long list of shiny features to add:  block migration, volume growing and 
shrinking, defragmentation, deduplication, replication, and so on.

We have made plausible plans for all of the above, but indeed the devil is in 
the doing. So we are considering the merits of invoking the "many hands make 
light work" principle. Tux3 is pretty well documented and the code base is, if 
not completely obvious, at least small and orthogonal. Tux3 runs in userspace 
in two different ways: the tux3 command and fuse. Prototyping in user space is 
a rare luxury that could almost make one lazy. Tux3 is an entirely grassroots 
effort driven by volunteers. Nonetheless, we would welcome offers of 
assistance from wherever they may come, especially testers.

Regards,

Daniel




The Tux3 filesystem returns

Posted Jan 1, 2013 19:04 UTC (Tue) by juliank (guest, #45896) [Link]

If everything is delayed, will this make it possible to add support to the file system that allows it to commit data only every 5 minutes or so? That would allow the disk to be powered down in between, and thus reduce power usage on laptops.

I don't know whether any file system supports something like this already, but I know that ext4 doesn't do it (even if you set commit=<time> and prevent fsync and stuff, it will still commit before <time> has passed).

The Tux3 filesystem returns

Posted Jan 1, 2013 21:01 UTC (Tue) by daniel (guest, #3181) [Link]

Tux3 takes complete control of when delta transitions take place. After a delta transition (end of one delta, beginning of another) the "marshal" step sets up and queues a delta commit. No delta transition, no marshal, no delta commit, no disk writes, simple. I'm sure other filesystems are capable of this level of control, but in Tux3 the entire commit pipeline is controlled by this one thing.

The Tux3 filesystem returns

Posted Jan 1, 2013 21:58 UTC (Tue) by Richard_J_Neill (guest, #23093) [Link]

Would 5-minute commits still make sense on SSD?
By the time this hits mainstream, will there be any laptops with rotating storage?

The Tux3 filesystem returns

Posted Jan 1, 2013 23:36 UTC (Tue) by dgm (subscriber, #49227) [Link]

Millions. Seriously.

Displacing HDs

Posted Jan 2, 2013 0:19 UTC (Wed) by man_ls (guest, #15091) [Link]

Even in the current generation of ultrabooks SSDs and rotating HDDs are split in similar proportions -- HDDs retain the majority in my latest completely unscientific survey. Perhaps in 4 years SSDs will displace HDDs in the high end, and in 8 years in the low end. Then give it some 4 more years to have a reasonable replacement of the existing base, and I will say that a considerable proportion of laptops will have rotating disks until at least 2025. I certainly hope that tux3 reaches the mainline before that.

This may look like a very long time, but please look at the facts. Standard SSDs are 128 GB, while standard HDDs are 500 GB. Apple e.g. overcharges €300 to have a 256 GB SSD, and €800 for 512 GB; street prices for the latter are at least €300~800. I assume the most expensive disks are worth their higher price in reliability, as in HDDs there is no appreciable price difference. People often need those 500 GB and will buy an external HDD if the internal SSD cannot have them, moving the spinning rust outside the main unit but not eliminating it.

Best case, Moore's Law would suggest a 1/2 price drop every 1.5 years, so it would take about 6 years for SSDs to reach current HDD prices. And HDDs are also getting better all the time, so in 6 years we would probably be looking at 1 TB HDDs as the standard. Say in 7.5 years SSDs reach price parity with HDDs; still it would take an upgrade cycle to remove all the legacy HDDs, which we can optimistically set to 4 years. Again, 12 years.
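
As a rough sanity check on that 6-year figure (the per-gigabyte prices are taken from a later comment in this thread and rounded, so treat them as assumptions): with SSDs near $1/GB, HDDs near $0.0625/GB, and a halving every 1.5 years,

    \[
      t = 1.5\ \text{years} \times \log_2\!\frac{1.00\ \$/\text{GB}}{0.0625\ \$/\text{GB}}
        = 1.5 \times \log_2 16 = 6\ \text{years}
    \]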

And this is just in the first world...

Displacing HDs

Posted Jan 2, 2013 4:50 UTC (Wed) by daniel (guest, #3181) [Link]

Absolutely agreed about the need to support spinning media well for the next long time. Workstations, personal mass storage, cloud, developing economies, this all adds up to 10 years more of spinning disk optimization. Fortunately, optimizing for spinning media usually also optimizes for flash, for example: keeping related blocks close together not only reduces seeks, but reduces false sharing on erase blocks, thus reducing write multiplication in the FTL.

-- Sent from my Desire Z, Powered by Linux

Displacing HDs

Posted Jan 2, 2013 21:31 UTC (Wed) by cmccabe (guest, #60281) [Link]

I don't think so.

http://arstechnica.com/gadgets/2012/12/ssd-prices-are-low...

Even in 2012, if a laptop didn't have an SSD, it wasn't high-end. Similarly for a database box.

Remember that most desktop users aren't running Linux with a minimal desktop environment. They're running some iteration of Windows. For a lot of these users, adding more cores won't help them go faster, since the giant precompiled binaries they run can't (yet) use all the cores. Adding an SSD always helps, though. That's why users are buying SSDs in droves.

Hard drive prices also never quite recovered from the flooding in Thailand. This shouldn't really be surprising: why should manufacturers plow more investment into a dying technology?

http://www.techspot.com/news/48930-hard-drive-prices-to-r...

Hard drives will remain relevant for a while for people who want to store huge amounts of data. But they'll be a specialty item, like liquid-cooled cases or CRT monitors (remember those?) In 5 to 10 years, talking about hard drives will immediately identify you as an old person, just like reminiscing about those blurry CRT monitors we stared at in our youth.

Displacing HDDs

Posted Jan 2, 2013 22:26 UTC (Wed) by man_ls (guest, #15091) [Link]

Note that I am not discussing the relative merits of SSDs and HDDs, just the hard realities at this point in time. I spent more than €100 on a 20 GB SLC drive because I also think that a good SSD is the best investment for a new computer; but they will not be a true replacement until prices are comparable. Just as when LCD monitors cost over €1000 (or $1000, prices are equivalent), they were not a replacement for CRTs but an expensive luxury.

I see nothing in your links that contradicts what I am saying. The first article acknowledges that SSD prices are about $1/GB, while HDDs are $0.05~0.10/GB -- about 10~20 times lower. The second link states that HDDs will not recover until next year, hardly the time frames we are talking about.

A low end laptop is about €500, and a cheap 500 GB SSD is bound to cost almost that much. Until a comparable SSD costs about 10% of the laptop price it will not be a replacement -- people will just have an external HDD for bulk data. As to the high end, you can redefine it to be "laptop with SSD", but if you look around you will see plenty of €1000+ models with an HDD -- starting with the lower MacBook Pro. There are also many models with a 128 GB SSD, but at 1/4th the capacity they are hardly a replacement.

We are just not there yet, sadly. A breakthrough may come and shorten the time frame, but I don't think that TLCs are it: they will just lower the reliability and may give the whole SSD category a bad name. Not even high-temperature NAND annealing will help as it is (from your first link) at least 10 years off. Perhaps we can agree on a time frame of about a decade until HDDs become a rarity.

Displacing HDDs

Posted Jan 2, 2013 23:02 UTC (Wed) by martinfick (subscriber, #4455) [Link]

But unlike CRTs, spinning disks will likely have a place alongside SSDs for a long time since they will likely always provide more storage than SSDs. And when does anyone ever say "I will never need more storage"? Our fish grow to fill our fishbowls with storage more and more. And since spinning disks are getting SSDs built into them, I suspect that lone SSDs might actually see a drop before spinning disks do. So while perhaps every laptop will soon have an SSD, those SSDs will likely be hybrids, so they will also have spinning disks.

Displacing HDDs

Posted Jan 3, 2013 3:32 UTC (Thu) by cmccabe (guest, #60281) [Link]

SSDs don't have to reach price parity with hard drives to displace them. They just have to be the best decision for someone building a new computer. For most users, it's better to redirect some of the dollars you could spend on a giant CPU towards an SSD because that will actually make a difference to performance. In the same way, if you are building a gaming box, you want to buy a reasonable graphics card, rather than a giant CPU and a crummy one. (Personally, I don't play games so much any more, but I know some people who do.)

There was a time when graphics accelerators were unheard-of and everyone just used onboard video. But graphics accelerators didn't have to reach price parity with onboard video to win. They just had to become the best choice for most users. (Nowadays, onboard video is making a comeback due to some excellent chipsets from Intel, but that's another story...)

To sum up: in laptops, phones, music players, and tablets, hard drives are either entirely phased out or nearly there. HDDs will be with us for a while on desktops and servers, but mostly as the new tape drive.

I wonder what kind of performance Tux3 will get on SSD. SSDs do benefit from larger contiguous writes as opposed to small, scattered writes. So to the extent Tux3 can provide that with its deferred I/O mechanism, SSDs will benefit. Of course, SSD firmwares also try to accomplish the same general goal of write coalescing. So there is a risk of burning CPU on something that firmware is already doing.

Displacing HDDs

Posted Jan 3, 2013 7:27 UTC (Thu) by daniel (guest, #3181) [Link]

Just for the record, we're serious about optimizing for SSD as well as HDD. A lot of optimizations apply to both, or at least, one doesn't hurt the other. Eventually there might be tuning parameters, if we hit any optimizations that are mutually incompatible. Optimizing for SSD is made a lot harder because everybody has their own FTL algorithm, we need to guess how they're trying to guess what we're trying to do, if you know what I mean. But writing less total data is always going to be a win on any device, and we think we're doing a pretty good job of that right now.

Displacing HDDs

Posted Jan 3, 2013 22:50 UTC (Thu) by bronson (subscriber, #4806) [Link]

> we need to guess how they're trying to guess what we're trying to do

A perfect description of the emerging FTL insanity. Great quote.

The Tux3 filesystem returns

Posted Jan 2, 2013 7:56 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]

Unfortunately, if you power down the drive, then you can no longer read from it. Considering that your RAM is usually smaller than the amount of data you need to access at once, this makes turning the drive off infeasible. The exception is storage drives, which usually don't have many continuous writes anyway.

The Tux3 filesystem returns

Posted Jan 2, 2013 8:43 UTC (Wed) by daniel (guest, #3181) [Link]

Maybe that could be fixed by having a way to preload vast amounts of disk volume to cache. Anyway, I agree, spinning down is useful mainly for volumes that are either completely offline or have long periods of inactivity. Like three of the four disks in my workstation right now, which spend most of their time spun down courtesy of noflushd. BTW, the fourth disk will be changed out for an SSD pretty soon. Not doing so is just cruel and unusual self-inflicted punishment.

The Tux3 filesystem returns

Posted Jan 2, 2013 8:47 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]

Unless 'cache' is an SSD (which is quite rare), you aren't going to be able to spin down a drive anyway (you also can't control the cache located in most hard drives, since it's for a different purpose). Personally I'm not a fan of this SSD cache phase; it's a waste of an SSD when you could dedicate it to your root OS.

The Tux3 filesystem returns

Posted Jan 2, 2013 21:40 UTC (Wed) by daniel (guest, #3181) [Link]

Note that I'm already getting very effective results out of spinning down my disks. I just take care which data I put on which disk. Spinning down the root volume effectively would be trickier, but doable.

By aggressive caching I meant preemptively reading big linear chunks of volume into host memory, a behaviour we plan to support after more pressing issues are under control. Spindown isn't our immediate priority by any means, but we can probably help make it practical in more situations.

The Tux3 filesystem returns

Posted Jan 2, 2013 11:01 UTC (Wed) by juliank (guest, #45896) [Link]

Consider browsers like Chrome that will write to your disk while you are browsing in order to store the current session. This will in normal use only result in writes to the disk, but no reads at all, allowing you to spin down the drive.

The Tux3 filesystem returns

Posted Jan 2, 2013 11:22 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]

The session data involves a few things, including a cache. The on-disk cache is read many times, especially when you go to another web-page on a site you are visiting. You are also forgetting all the other things that need to read your disk, like Chrome loading all the dynamic libraries (you may not have everything loaded into RAM at that time), and the rest of your OS.

For safety, I doubt the kernel would allow a device to spin down while there are dirty pages for that disk anyway.

The Tux3 filesystem returns

Posted Jan 2, 2013 0:03 UTC (Wed) by Tet (subscriber, #5433) [Link]

I have a lot of time for Daniel and have followed the development of Tux3 with interest. I was sad to see it fade from the radar and I'm glad it's resurfaced again. It's a clever design, and one that IMHO has a lot of potential.

The Tux3 filesystem returns

Posted Jan 2, 2013 4:30 UTC (Wed) by rsidd (subscriber, #2582) [Link]

It looked very promising back in 2008, but the Linux focus was on btrfs. The problem, I assume, is the available spare time of the handful of developers. Now it's 2013, the Linux focus is still on btrfs, which is still not ready, and we still don't have a modern filesystem (it is true that Windows and Mac don't, either). I'm more and more tempted to switch my work machine to Dragonfly (Hammer, which Daniel references) or FreeBSD (ZFS) -- but even better would be if one of the big companies would throw some resources at Daniel and friends to help finish the job!

The Tux3 filesystem returns

Posted Jan 2, 2013 5:13 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

Btrfs is now more 'ready' than ZFS on FreeBSD. It just takes a lot of time to create a good filesystem - about 5 years. So btrfs is actually pretty much on schedule here.

The Tux3 filesystem returns

Posted Jan 3, 2013 17:21 UTC (Thu) by nye (guest, #51576) [Link]

>we still don't have a modern filesystem (it is true that Windows and Mac don't, either)

Windows now has ReFS, which at first appearances looks pretty good - though it also has a rather 1.0 feel to it, as you might expect. I've not used it myself though as it's only available in Server 2012 (and IIRC Windows 8 can be persuaded to use it, but it's not supported; I may be mistaken there)

To be honest, ZFS isn't exactly a magic bullet either; it's an 'enterprise' filesystem, in both the good and bad senses of the word. Although I do think ZFS is probably the best filesystem currently available, it feels like it was never quite finished[0] and has a very uncertain future.

I was following Tux3 with interest the first time it came around, and only just discovered thanks to this article that the restarted efforts are now on a different mailing list; the website still points to the old one that hasn't had anything posted in ages.

[0] Perhaps that's a little unfair; if I were feeling more generous I might s/finished/polished/. Most notably there are a number of things you might expect to be able to do that require you to destroy your pool and restore the data from backup, which for most people outside of the enterprise environment is likely to be a time-consuming and stressful experience with a lot of downtime.

The Tux3 filesystem returns

Posted Jan 2, 2013 4:48 UTC (Wed) by butlerm (subscriber, #13312) [Link]

> "Instead of writing out changes to parents of altered blocks, Tux3 only changes the parents in cache, and writes a description of each change to a log on media. This prevents recursive copy-on-write. Tux3 will eventually write out such retained dirty metadata blocks in a process we call 'rollup', which retires log blocks and writes out dirty metadata blocks in full"

I am glad to hear that this technique has been taken up by filesystem developers, and that it is performing well so far. Block modification journaling is relatively common (if not par for the course) in database implementation, for exactly the same reason. The redo log can be written to and committed in a hurry, and the modified blocks forced to disk at any convenient time.

Most databases I am familiar with only force versions of all blocks to disk in a checkpoint process once every thirty minutes or so, because the blocks can be reconstructed from a clean backup by applying the redo log entries. Those reconstituted blocks do not need to be forced to disk then and there either.

Of course the trick with a filesystem is that one is typically journaling only metadata, not data updates, and in a conventional non-copy-on-write filesystem that causes interesting consistency issues that have to be dealt with on recovery.

I am curious, however, how Tux3 deals with recovery of a trashed metadata block without a clean prior image to apply journaled block changes to. Some filesystems journal entire block images for that reason, with the obvious downside of substantially increased log traffic. Is Tux3 using some sort of copy-on-write scheme for the meta-data blocks themselves so that clean prior versions can be obtained for the deltas to be applied to?

The Tux3 filesystem returns

Posted Jan 2, 2013 8:32 UTC (Wed) by daniel (guest, #3181) [Link]

Your clueful comments from a database perspective are greatly appreciated. I am relieved to hear that I did reinvent the wheel instead of going out on an untried limb. For the record, no study of database techniques was involved, it was just obviously a fast way to get changes onto disk.

We are thinking in terms of far more frequent rollups than every thirty minutes. More like every two or three seconds. We need to experiment and learn where the knee of the efficiency curve lies. This can easily be a mount time tunable. Real time is not the only criterion, rate of writing and amount of cache and cache pressure from other tasks also matter.

Tux3 is definitely not journalling only metadata (or the logging equivalent) and we have no plans to even provide an option for that. I wouldn't even know how to do that within the Tux3 model. It seems that we are able to provide full data consistency, nearly for free.

One surprise was, we have a compile option to overwrite file data instead of redirect writes nondestructively, and that failed to affect performance detectably in either fsx or fsstress. Perhaps we did not measure hard enough, but for now we seem to get full data consistency more or less for free. We still plan to offer a per-file overwrite vs redirect flag because somebody might want that for reasons other than performance, or maybe there are cases where it really does improve performance.

To clarify, Tux3 always maintains a clean prior image in case the latest commit fails. If the log gets trashed then for sure we have a difficult recovery task ahead of us, which is harder than Ext4 because we don't have fixed positions for different metadata bytes. So we will try to compensate in various ways, with the goal of being able to fsck approximately as reliably. Though to get there we might need to bring in the big guns and convince Ted or Andreas to help out. We should have a nice, upgraded directory index by that time to offer in trade :)

The Tux3 filesystem returns

Posted Jan 2, 2013 20:41 UTC (Wed) by jeltz (guest, #88600) [Link]

If the journal of tux3 behaves like the ones used by databases, some of the same optimizations could work there too. Common ideas are to have the journal (for databases often called the REDO log or WAL) on a separate, faster storage medium to reduce commit latency and increase throughput, and compression of the journal (either using smart encoding or a fast compression algorithm). It is also often an important source of lock contention under write loads.

The Tux3 filesystem returns

Posted Jan 3, 2013 7:33 UTC (Thu) by daniel (guest, #3181) [Link]

We don't journal, we write a chained log. We do moderate compression on the log just by keeping the records small. If log transfer rate proves to move the needle on performance we can compress more aggressively. My guess is, there will be bigger gains to get elsewhere for some time to come. We have zero lock contention on log blocks because only the backend writes them and we never read them except for mount or recovery.

The Tux3 filesystem returns

Posted Jan 6, 2013 3:19 UTC (Sun) by jeltz (guest, #88600) [Link]

Thanks for correcting me. Logs and journals are different beasts. And nice solution to use the frontend/backend split to solve the contention problem. I guess there must be some trade off here though since PostgreSQL normally has the connection processes writing to the log causing lock contention. Maybe it is to support some feature which a SQL database must support while a FS does not.

The Tux3 filesystem returns

Posted Jan 6, 2013 13:47 UTC (Sun) by andresfreund (subscriber, #69562) [Link]

> And nice solution to use the frontend/backend split to solve the contention problem. I guess there must be some trade off here though since PostgreSQL normally has the connection processes writing to the log causing lock contention.

Postgres actually uses a similar split - the individual backends insert log entries into memory and the "wal writer" writes them to disk in the background. Only when an individual backend requires some WAL records to be written/fsynced (e.g. because it wants the commit record to be safely on disk) and that portion of the WAL has not yet been written/fsynced will the individual backend do so.
There are two separate areas of contention here:
- synchronous disk writes/syncs. That can be eased by stuff like batching the required syncs for several backends/transactions into one sync and by lowering consistency requirements a bit (like setting synchronous_commit=off).
- contention around the WAL data structures. For relatively obvious reasons only one backend can insert into the WAL (in memory!) at the same time, so there can be rather heavy contention around the locks protecting it. There's some work going on to make locking more fine-grained, but it's complicated stuff and the price of screwing up would be way too high, so it might take some time.

I don't think there is some fundamental difference here. PG will always have a higher overhead than a filesystem because the guarantees it gives (by default at least) are far stricter and, more essentially, because it's layered *above* a filesystem, so it shares that overhead in the first place.

.oO(Don't mention postgres if you don't want to be talked to death :P)

The Tux3 filesystem returns

Posted Jan 6, 2013 21:29 UTC (Sun) by raven667 (subscriber, #5198) [Link]

> because it's layered *above* a filesystem, so it shares that overhead in the first place.

Only in a few places, when new files are created or the metadata for existing files is changed, by changing the file size for example. Files that are preallocated have very little overhead to (re)write, except for maybe mtime updates.

The Tux3 filesystem returns

Posted Jan 2, 2013 21:46 UTC (Wed) by kjp (guest, #39639) [Link]

This was a nice read... anything involving ACID makes me happy (er, that sounds bad), but I missed a key point somehow. Are you writing data blocks once or twice (once to the redo log, then much later checkpointed somewhere else)?

It does seem an awful lot like a RDBMS. But even postgres has some quirks, like indexes never shrinking, vacuum sometimes always running in the background and thus taking forever to free up space on large tables... so your blurb about 'defragmentation' made me realize that for open source projects, zero-admin-maintenance is certainly a very high goal. Cheers.

The Tux3 filesystem returns

Posted Jan 2, 2013 23:11 UTC (Wed) by moltonel (guest, #45207) [Link]

It took postgres a lot of time, but vacuum-related annoyances get fixed one by one with each release, to the point that it is now pretty much troublefree. Hopefully tux3 can progress in a similar fashion. Good luck !

The Tux3 filesystem returns

Posted Jan 3, 2013 0:52 UTC (Thu) by butlerm (subscriber, #13312) [Link]

Most databases have to deal with multiple independent transactions, where most filesystems, thankfully, do not. A couple of decades ago almost every database out there would acquire page locks that would block all other page access until the locking transaction was committed.

You can just imagine if access to a directory stalled because another pending transaction had an uncommitted modification to the same directory. Or if pending additions to a directory had to be invisible to other readers, data writes had to be invisible to other readers until committed and so on. There are ways to solve that problem in database system design, but fortunately filesystems generally speaking don't have to deal with it.

The Tux3 filesystem returns

Posted Jan 3, 2013 8:11 UTC (Thu) by daniel (guest, #3181) [Link]

Filesystems actually do need to deal with multiple independent transactions because of shared resources like bitmaps and the inode table. We need to ensure that for each delta the bitmaps exactly match the file data and directory entries exactly match the inode table, and do that without holding long duration, performance killing locks.

The Tux3 filesystem returns

Posted Jan 3, 2013 16:33 UTC (Thu) by butlerm (subscriber, #13312) [Link]

A modern database deals with multiple long lived independent transactions that can independently be committed or rolled back, involve independent, isolated views of what the contents of the database look like that evolve as each transaction progresses, row level locking so that no more than one uncommitted version of a row exists at any given time, and multiple version concurrency control so that all other readers only see the previously committed versions of rows.

POSIX filesystem semantics don't have anything like it - in fact they actually forbid it. There is only one view of the filesystem to all user processes, and everything that happens takes effect instantaneously. No isolation, no process specific view of the contents of files or directories, no (user visible) transaction commit or rollback.

Everything a strict POSIX filesystem does with transactions is for recovery purposes, in an attempt to recreate something resembling a consistent snapshot of the system at the time it failed, or at least a working inconsistent one. And certainly one would want to pipeline the assembly and retiring of those transactions, so that the next one can be constructed while previous ones are being retired. I am glad to hear that Tux3 does just that.

The Tux3 filesystem returns

Posted Jan 3, 2013 8:46 UTC (Thu) by daniel (guest, #3181) [Link]

We write blocks once. The only duplication is, sometimes we will write one or more log records to avoid writing a full block that has only a few bytes changed, and flush the retained dirty block much later. This is a small multiplier in terms of total blocks written, so maybe our average write multiplication factor is 1.01. It is possible to construct cases where it is higher. We will only ever know the actual multiplier by measuring it under typical loads, which we have not done yet. But rounding down, the answer to your question is "once".

For example we might write ten megabyte data extents, plus four blocks to complete an atomic commit, including one log block, giving a write multiplier of 1.0016. Note that this multiplier includes all metadata. In our "rollup" a few more blocks may need to be accounted to the multiplier, maybe pushing it up to 1.003, or maybe not moving the needle at all because those retained blocks may be amortized over many deltas. In other words, we often write dirty metadata blocks zero times, rounding down modestly.
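
For concreteness, the 1.0016 figure works out as follows, assuming 4 KB blocks (the block size is my assumption here, not stated above):

    \[
      \frac{10\,\text{MB}/4\,\text{KB} + 4}{10\,\text{MB}/4\,\text{KB}}
      = \frac{2560 + 4}{2560} \approx 1.0016
    \]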

The Tux3 filesystem returns

Posted Jan 3, 2013 16:45 UTC (Thu) by butlerm (subscriber, #13312) [Link]

I am curious what you use as a key or identifier to a changed metadata block, e.g. one that has at least one consistent prior location, one delta entry in the log somewhere, and one pending rewrite in a future rollup.

Are you pre-allocating the location of the next version, and identifying it that way? Or perhaps referring to the existing version (by location), plus a list of the (accumulated) deltas? Or do you use some other meta-data-block identifier to correlate the different versions of the meta data block and any logged deltas for recovery purposes?

The Tux3 filesystem returns

Posted Jan 3, 2013 23:07 UTC (Thu) by daniel (guest, #3181) [Link]

We preallocate the location of the next version, your first guess. Let's call this a "retained" metadata block that will be flushed in a future rollup. We say "redirect" when we allocate a new volume position for a read-only block. When we need to modify a clean btree node we redirect it in cache, emit the corresponding balloc log record (thus avoiding writing out the bitmap), modify it, emit the node modification log record, then recursively redirect its parents in the btree access path until we hit a dirty one (usually the immediate parent).

Though you didn't ask, we can estimate the per-delta writeout cost. We emit an allocation record and a modify record per redirect, which can actually be one record with some modest optimizing. Given a reasonably local btree modification pattern, the number of recursive redirects rounds to zero, so our theoretical cost is roughly one log record per dirty block per rollup. The number of dirty metadata blocks written per rollup given a localized burst of modifications is roughly the filesystem tree depth, log_base_196(tree_leaves), a small number that rounds to zero when amortized across many deltas. We can further improve this (rounding closer to zero) by permitting certain dirty metadata to survive rollup, for example, when we need to redirect the root of a file index tree or a high-level node of the inode table btree. This is a future optimization. We can round even closer to zero by bumping our average branching factor up from 196 to 213 using b+tree splitting, if anybody cares to add that.
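
A compact sketch of that redirect walk, with invented structures and a stand-in allocator, might look like this; recursion stops at the first already-dirty ancestor, which is why the amortized cost stays low:

    /* Sketch of the redirect path described above: modifying a clean btree
     * node preallocates a new on-media position for it, logs the allocation
     * and the change, and walks up until an already-dirty parent is found.
     * Structure names and the allocator are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    struct node {
        struct node *parent;
        uint64_t     oldpos, newpos;   /* on-media positions: current, preallocated */
        int          dirty;            /* retained in cache until the next rollup */
    };

    static uint64_t next_free = 1000;  /* stand-in for the block allocator */

    static void log_record(const char *what, uint64_t arg)
    {
        printf("log: %s %llu\n", what, (unsigned long long)arg);
    }

    static void redirect(struct node *n)
    {
        if (n->dirty)
            return;                    /* recursion stops at a dirty ancestor */
        n->newpos = next_free++;       /* preallocate where rollup will write it */
        n->dirty = 1;
        log_record("balloc", n->newpos);
        log_record("redirect node to", n->newpos);
        if (n->parent)
            redirect(n->parent);       /* parent must point at the new position */
    }

    int main(void)
    {
        struct node root = { 0 }, leaf = { .parent = &root };
        root.oldpos = 10;
        leaf.oldpos = 20;
        redirect(&leaf);               /* one leaf change: two nodes redirected */
        redirect(&leaf);               /* second change, same delta: no new records */
        return 0;
    }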

The Tux3 filesystem returns

Posted Jan 4, 2013 3:26 UTC (Fri) by butlerm (subscriber, #13312) [Link]

That sounds great. As I mentioned earlier, many databases do something similar, except on logical overwrite rather than copy on write basis. If you have a transaction that needs to modify an b-tree index block (for example), a copy of the index block is modified in RAM, a log record describing the block change is created, and the redo log is forced to disk whenever a transaction (or group of transactions) is committed.

Then at some future point the database server process forces a more recent version of the index block back to disk at the same location as the old one, either because it needs to free up pages, is running a checkpoint, or is in the process of shutting down. After a checkpoint is complete the versions of the blocks on disk are up-to-date up through the logical sequence number of the checkpoint.

The vulnerability is that if any block is corrupted, it is highly likely that the database will have to be recovered from a coherent snapshot, and the archived redo information generated in the interim re-applied to restore the database to the state it was at the moment of the crash, and then the uncommitted transactions rolled back, typically by reversing or undo-ing all the block changes made by each one.

Filesystems, of course, generally do not have the luxury of requiring administrators to restore a clean snapshot of the underlying block device on such occasions. And that is why (I understand) the EXT4 designers adopted block image journaling rather than block modification journaling, with its accompanying cost in journal traffic. Many databases, when put into "backup mode" switch to block image journaling for the same reason - the backup software will not necessarily acquire a coherent image of each block.

And of course when such a mode is engaged, redo log traffic skyrockets, giving administrators a healthy incentive to take the database out of backup mode as soon as possible, something made much more practical by the advent of filesystem snapshots - if at the cost of pushing much of the overhead (for a database under heavy write load) onto the volume manager or snapshot-capable filesystem instead.

The Tux3 filesystem returns

Posted Jan 6, 2013 3:29 UTC (Sun) by jeltz (guest, #88600) [Link]

PostgreSQL has full_page_writes on by default which I assume is the setting for controlling "block image journaling". Though I am pretty certain it does not matter for making backups.

The Tux3 filesystem returns

Posted Jan 6, 2013 13:34 UTC (Sun) by andresfreund (subscriber, #69562) [Link]

full_page_writes doesn't mean that full pages are written all the time though - it means that the first time a page is touched after a checkpoint it will be logged entirely. If you have write-intensive workloads the system should be configured in a way that it doesn't checkpoint permanently...
The reason it's on by default is that it's important for consistency in the face of OS/HW crashes. Postgres' pages are normally bigger than the disk's sectors and you could get into problems if e.g. one half of a page was written but the other was not and the WAL record only contains information about one of them. As recovery always starts at a checkpoint it's enough to log the page fully once after each checkpoint.

