The Tux3 filesystem returns
From: Daniel Phillips <lkml-AT-phunq.net>
To: linux-kernel-AT-vger.kernel.org
Subject: Tux3 report: New news for the new year
Date: Tue, 01 Jan 2013 03:52:43 -0800
Message-ID: <2597526.usDRg4h3X1@mars>
Hi everybody,

The Tux3 project has some interesting news to report for the new year. In brief, the first time Hirofumi ever put together all the kernel pieces in his magical lab over in Tokyo, our Tux3 rocket took off and made it straight to orbit. Or in less metaphorical terms, our first meaningful benchmarks turned in numbers that meet or even slightly beat the illustrious incumbent, Ext4:

   fsstress -f dread=0 -f dwrite=0 -f fsync=0 -f fdatasync=0 \
       -s 1000 -l 200 -n 200 -p 3

   ext4
   time       cpu      wait
   46.338,    1.244,   5.096
   49.101,    1.144,   5.896
   49.838,    1.152,   5.776

   tux3
   time       cpu      wait
   46.684,    0.592,   1.860
   44.011,    0.684,   1.764
   43.773,    0.556,   1.888

Fsstress runs a mix of filesystem operations typical of a Linux system under heavy load. In this test, Tux3 spends less time waiting than Ext4, uses less CPU (see below) and finishes faster on average. This was exciting for us, though we must temper our enthusiasm by noting that these are still early results and several important bits of Tux3 are as yet unfinished. While we do not expect the current code to excel at extreme scales just yet, it seems we are already doing well at the scale that resembles computers you are running at this very moment.

About Tux3

Here is a short Tux3 primer. Tux3 is a general purpose Linux filesystem developed by a group of us mainly for the fun of it. Tux3 started in the summer of 2008, as a container for a new storage versioning algorithm originally meant to serve as a new engine for the ddsnap volume snapshot virtual device:

   http://lwn.net/Articles/288896/
   "Versioned pointers: a new method of representing snapshots"

As design work proceeded on a suitably simple filesystem with modern features, the focus shifted from versioning to the filesystem itself, as the latter is a notoriously challenging and engaging project. Initial prototyping was done in user space by me and others, and later ran under Fuse, a spectacular drive-by contribution from one Tero Roponen.
Hirofumi joined the team with an amazing utility that makes graphs of the disk structure of Tux3 volumes, and soon took charge of the kernel port. I stand in awe of Hirofumi's design sense, detail work and general developer prowess.

Like a German car, Tux3 is both old school and modern. Closer in spirit to Ext4 than Btrfs, Tux3 sports an inode table, allocates blocks with bitmaps, puts directories in files, and stores attributes in inodes. Like Ext4 and Btrfs, Tux3 uses extents indexed by btrees. Source file names are familiar: balloc.c, namei.c etc. But Tux3 has some new files like filemap.c and log.c that help make it fast, compact, and very ACID.

Unlike Ext4, Tux3 keeps inodes in a btree, inodes are variable length, and all inode attributes are variable length and optional. Also unlike Ext4, Tux3 writes nondestructively and uses a write-anywhere log instead of a journal.

Differences with Btrfs are larger. The code base is considerably smaller, though to be sure, some of that can be accounted for by incomplete features. The Tux3 filesystem tree is single-rooted; there is no forest of shared trees. There is no built-in volume manager. Names and inodes are stored separately. And so on. But our goal is the same: a modern, snapshotting, replicating general purpose filesystem, which, I am happy to say, seems to have just gotten a lot closer.

Front/Back Separation

At the heart of Tux3's kernel implementation lies a technique we call "front/back separation", which partly accounts for the surprising kernel CPU advantage in the above benchmark results. Tux3 runs as two loosely coupled pieces: the frontend, which handles Posix filesystem operations entirely in cache, and the backend, which does the brute work of preparing dirty cache for atomic transfer to media. The frontend shows up as kernel CPU accounted to the Fsstress task, while the backend is largely invisible, running on some otherwise idle CPU.
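The division of labor can be sketched in miniature (illustrative Python with invented names, not the actual kernel code): the frontend mutates only an in-memory cache and hands finished batches of dirty state to an asynchronous backend, which alone touches "media".

```python
# Hypothetical sketch of front/back separation (not Tux3 code): the
# frontend completes every operation in cache; a backend thread does
# the brute work of transferring dirty state to media.
import queue
import threading

class Frontend:
    def __init__(self, backend_queue):
        self.cache = {}              # in-memory view of the filesystem
        self.dirty = {}              # blocks touched since the last handoff
        self.backend_queue = backend_queue

    def write(self, block, data):
        # Completes entirely in cache -- no media I/O on this path.
        self.cache[block] = data
        self.dirty[block] = data

    def commit_delta(self):
        # Hand the accumulated dirty state to the backend and keep going.
        self.backend_queue.put(dict(self.dirty))
        self.dirty.clear()

def backend(backend_queue, media):
    # Runs asynchronously on some otherwise idle CPU.
    while True:
        delta = backend_queue.get()
        if delta is None:            # shutdown sentinel
            break
        media.update(delta)          # atomic transfer of one delta to media

q = queue.Queue()
media = {}
t = threading.Thread(target=backend, args=(q, media))
t.start()

fe = Frontend(q)
fe.write("inode:1", b"hello")
fe.write("data:42", b"world")
fe.commit_delta()
q.put(None)
t.join()
print(media)                         # both updates landed together
```

The frontend's CPU cost is just dictionary updates; everything slow happens on the backend thread, which is roughly why frontend syscall time shows up so small in the benchmark above.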
We suspect that the total of frontend and backend CPU is less than Ext4 as well, but so far nobody has checked. What we do know is that filesystem operations tend to complete faster when they only need to deal with cache and not little details such as backing store.

Front/back separation is like taking delayed allocation to its logical conclusion: every kind of structural change is delayed, not just block allocation. I credit Matt Dillon of Dragonfly fame for this idea. He described the way he used it in Hammer as part of this dialog:

   http://kerneltrap.org/Linux/Comparing_HAMMER_And_Tux3
   "Comparing HAMMER And Tux3"

Hammer is a cluster filesystem, but front/back separation turns out to be equally effective on a single node. Of course, the tricky part is making the two pieces run asynchronously without stalling on each other. Which brings us to...

Block Forking

Block forking is an idea that has been part of Tux3 from the beginning, and roughly resembles the "stable pages" work now underway. Unlike stable pages, block forking does not reduce performance. Quite the contrary: block forking enables front/back separation, which boosted Tux3 Fsstress performance by about 40%.

The basic idea of block forking is to never wait on pages under IO, but to clone them instead. This protects in-flight pages from damage by VFS syscalls without forcing page cache updates to stall on writeback. Implementing this simple idea is harder than it sounds. We need to deal with multiple blocks being accessed asynchronously on the same page, and we need to worry a lot about cache object lifetimes and locking. Especially in truncate, things can get pretty crazy. Hirofumi's work here can only be described by one word: brilliant.

Deltas and Strong Consistency

Tux3 groups frontend update transactions into "deltas".
According to some heuristic, one delta ends and the next one begins, such that all dirty cache objects affected by the operations belonging to a given delta may be transferred to media in a single atomic operation. In particular, we take care that directory updates always lie in the same delta as associated updates such as creating or deleting inode representations in the inode table.

Tux3 always cleans dirty cache completely on each delta commit. This is not traditional behavior for Linux filesystems, which normally let the core VM memory flusher tell them which dirty pages of which inodes should be flushed to disk. We largely ignore the VM's opinion about that and flush everything, every delta. You might think this would hurt performance, but apparently it does not. It does allow us to implement stronger consistency guarantees than typical for Linux. We provide two main guarantees:

 * Atomicity: File data never appears on media in an intermediate state, with the single exception of large file writes, which may be broken across multiple deltas, but with write ordering preserved.

 * Ordering: If one filesystem transaction ends before another transaction begins, then the second transaction will never appear on durable media unless the first does too.

Our atomicity guarantee resembles Ext4's data=journal but performs more like data=ordered. This is interesting, considering that Tux3 always writes nondestructively. Finding a new, empty location for each block written and updating the associated metadata would seem to carry a fairly hefty cost, but apparently it does not.

Our ordering guarantee has not been seen on Linux before, as far as we know. We get it "for free" from Tux3's atomic update algorithm. This could possibly prove useful to developers of file-based databases, for example, mailers and MTAs. (Kmail devs, please take note!)
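The ordering guarantee can be demonstrated with a toy model (hypothetical Python, not Tux3 code): because deltas reach media strictly in sequence and each applies atomically, a crash can only ever expose a prefix of the update history.

```python
# Illustrative model of delta ordering (not Tux3 code): operations are
# grouped into numbered deltas, and deltas reach media strictly in order,
# so whatever point we crash at, only a prefix of history is durable.
def commit_history(deltas, crash_after):
    """Apply deltas to media in order, crashing after `crash_after` commits."""
    media = {}
    for n, delta in enumerate(deltas, start=1):
        if n > crash_after:
            break                # power was cut; later deltas never land
        media.update(delta)      # each delta applies atomically
    return media

deltas = [
    {"fileA": "v1"},             # delta 1
    {"fileB": "v1"},             # delta 2: began after delta 1 ended
    {"fileA": "v2"},             # delta 3
]

# No matter when the crash happens, fileB never appears without fileA,
# because the transaction creating fileA ended first.
for crash in range(len(deltas) + 1):
    state = commit_history(deltas, crash)
    assert "fileB" not in state or "fileA" in state
```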
Logging and Rollup

Tux3 goes out of its way to avoid recursive copy on write, that is, the expensive behavior where a change to a data leaf must be propagated all the way up the filesystem tree to the root, to avoid altering data that belongs to a previously committed consistent filesystem image. (Btrfs extends this recursive copy on write idea to implement snapshots, but Tux3 does not.) Instead of writing out changes to parents of altered blocks, Tux3 only changes the parents in cache, and writes a description of each change to a log on media. This prevents recursive copy-on-write.

Tux3 will eventually write out such retained dirty metadata blocks in a process we call "rollup", which retires log blocks and writes out dirty metadata blocks in full. A delta containing a rollup also tidily avoids recursive copy on write: just like any other delta, changes to the parents of redirected blocks are made only in cache, and new log entries are generated.

Tux3 further employs logging to make the allocation bitmap overhead largely vanish. Tux3 retains dirty bitmaps in memory and writes a description of each allocate/free to the log. It is much cheaper to write out one log block than potentially many dirty bitmap blocks, each containing only a few changed bits.

Tux3's rollup not only avoids expensive recursive copy on write, it optimizes updating in at least three ways:

 * Multiple deltas may dirty the same metadata block multiple times, but rollup only writes those blocks once.

 * Multiple metadata blocks may be written out in a single, linear pass across spinning media.

 * Backend structure changes are batched in a cache friendly way.

One curious side effect of Tux3's log+rollup strategy is that in normal operation, the image of a Tux3 filesystem is never entirely consistent if considered only as literal block images. Instead, the log must be replayed in order to reconstruct dirty cache; then the view of the filesystem tree from dirty cache is consistent.
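Replay-into-cache can be sketched as follows (an illustrative Python model with invented record names, not Tux3's actual log format): recovery reads the literal media image plus the log, and reconstructs the consistent view purely in memory.

```python
# Hypothetical model of log replay as a read-only operation (names are
# invented, not Tux3's on-disk format): replay reconstructs dirty cache
# from logical log records; the media image itself is never modified.
def replay(media_blocks, log):
    cache = dict(media_blocks)   # start from the literal media image...
    for op, key, value in log:   # ...and re-apply each logged change in cache
        if op == "set":
            cache[key] = value
        elif op == "free":
            cache.pop(key, None)
    return cache                 # the consistent view lives here, not on media

media = {"bitmap:0": 0b0011, "inode:7": "old"}
log = [("set", "inode:7", "new"), ("set", "bitmap:0", 0b0111)]

view = replay(media, log)
assert media == {"bitmap:0": 0b0011, "inode:7": "old"}  # media untouched
assert view == {"bitmap:0": 0b0111, "inode:7": "new"}   # cache is consistent
```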
This is more or less the inverse of the traditional view where a replay changes the media image. Tux3 replay is a true read-only operation that leaves media untouched and changes cache instead. In fact, this theme runs consistently through Tux3's entire design. As a filesystem, Tux3 cares about updating cache, moving data between cache and media, and little else.

Tux3 does not normally update the media view of its filesystem tree even at unmount. Instead, it replays the log on each mount. One excellent reason for doing this is to exercise our replay code. (You surely would not want to discover replay flaws only on the rare occasions you crash.) Another reason is that we view sudden interruption as the normal way a filesystem should shut down. We uphold your right to hit the power switch on a computing device and expect to find nothing but consistent data when you turn it back on.

Fast Sync

Tux3 can sync a minimal file data change to disk by writing four blocks, or a minimal file create and write with seven blocks:

   http://phunq.net/pipermail/tux3/2012-December/000011.html
   "Full volume sync performance"

This is so fast that we are tempted to implement fsync as sync. However, we intend to resist that temptation in the long run, and implement an optimized fsync that "jumps the queue" of Tux3's delta update pipeline and completes without waiting for a potentially large amount of unrelated dirty cache to be flushed to media.

Still to do

There is a significant amount of work still needed to bring Tux3 to a production state. As of today, Tux3 does not have snapshots, in spite of that being the main motivation for starting on this in the first place. The new PHtree directory index is designed, not implemented. Freespace management needs acceleration before it will benchmark well at extreme scale. Block allocation needs to be much smarter before it will age well and resist read fragmentation. There are several major optimizations still left to implement.
We need a good fsck that approaches the effectiveness of e2fsck. There is a long list of shiny features to add: block migration, volume growing and shrinking, defragmentation, deduplication, replication, and so on. We have made plausible plans for all of the above, but indeed the devil is in the doing. So we are considering the merits of invoking the "many hands make light work" principle. Tux3 is pretty well documented, and the code base is, if not completely obvious, at least small and orthogonal. Tux3 runs in userspace in two different ways: the tux3 command and fuse. Prototyping in user space is a rare luxury that could almost make one lazy.

Tux3 is an entirely grassroots effort driven by volunteers. Nonetheless, we would welcome offers of assistance from wherever they may come, especially testers.

Regards,

Daniel
The Tux3 filesystem returns
Posted Jan 1, 2013 19:04 UTC (Tue) by juliank (guest, #45896) [Link]
I don't know whether any file system supports something like this already, but I know that ext4 doesn't do it (even if you set commit=<time> and prevent fsync and stuff, it will still commit before <time> has passed).
The Tux3 filesystem returns
Posted Jan 1, 2013 21:01 UTC (Tue) by daniel (guest, #3181) [Link]
The Tux3 filesystem returns
Posted Jan 1, 2013 21:58 UTC (Tue) by Richard_J_Neill (guest, #23093) [Link]
By the time this hits mainstream, will there be any laptops with rotating storage?
The Tux3 filesystem returns
Posted Jan 1, 2013 23:36 UTC (Tue) by dgm (subscriber, #49227) [Link]
Displacing HDs
Posted Jan 2, 2013 0:19 UTC (Wed) by man_ls (guest, #15091) [Link]
Even in the current generation of ultrabooks, SSDs and rotating HDDs are split in similar proportions -- HDDs retain the majority in my latest completely unscientific survey. Perhaps in 4 years SSDs will displace HDDs in the high end, and in 8 years in the low end. Then give it some 4 more years to have a reasonable replacement of the existing base, and I will say that a considerable proportion of laptops will have rotating disks until at least 2025. I certainly hope that tux3 reaches the mainline before that.

This may look like a very long time, but please look at the facts. Standard SSDs are 128 GB, while standard HDDs are 500 GB. Apple e.g. overcharges €300 to have a 256 GB SSD, and €800 for 512 GB; street prices for the latter are at least €300~800. I assume the most expensive disks are worth their higher price in reliability, as in HDDs there is no appreciable price difference. People often need those 500 GB and will buy an external HDD if the internal SSD cannot have them, moving the spinning rust outside the main unit but not eliminating it.
Best case, Moore's Law would suggest a 1/2 price drop every 1.5 years, so it would take about 6 years for SSDs to reach current HDD prices. And HDDs are also getting better all the time, so in 6 years we would probably be looking at 1 TB HDDs as the standard. Say in 7.5 years SSDs reach price parity with HDDs; still it would take an upgrade cycle to remove all the legacy HDDs, which we can optimistically set to 4 years. Again, 12 years.
And this is just in the first world...
Displacing HDs
Posted Jan 2, 2013 4:50 UTC (Wed) by daniel (guest, #3181) [Link]
-- Sent from my Desire Z, Powered by Linux
Displacing HDs
Posted Jan 2, 2013 21:31 UTC (Wed) by cmccabe (guest, #60281) [Link]
http://arstechnica.com/gadgets/2012/12/ssd-prices-are-low...
Even in 2012, if a laptop didn't have an SSD, it wasn't high-end. Similarly for a database box.
Remember that most desktop users aren't running Linux with a minimal desktop environment. They're running some iteration of Windows. For a lot of these users, adding more cores won't help them go faster, since the giant precompiled binaries they run can't (yet) use all the cores. Adding an SSD always helps, though. That's why users are buying SSDs in droves.
Hard drive prices also never quite recovered from the flooding in Thailand. This shouldn't really be surprising: why should manufacturers plow more investment into a dying technology?
http://www.techspot.com/news/48930-hard-drive-prices-to-r...
Hard drives will remain relevant for a while for people who want to store huge amounts of data. But they'll be a specialty item, like liquid-cooled cases or CRT monitors (remember those?) In 5 to 10 years, talking about hard drives will immediately identify you as an old person, just like reminiscing about those blurry CRT monitors we stared at in our youth.
Displacing HDDs
Posted Jan 2, 2013 22:26 UTC (Wed) by man_ls (guest, #15091) [Link]
Note that I am not discussing the relative merits of SSDs and HDDs, just the hard realities at this point in time. I spent more than €100 on a 20 GB SLC drive because I also think that a good SSD is the best investment for a new computer; but they will not be a true replacement until prices are comparable. Just as when LCD monitors cost over €1000 (or $1000, prices are equivalent), they were not a replacement for CRTs but an expensive luxury.

I see nothing in your links that contradicts what I am saying. The first article acknowledges that SSD prices are about $1/GB, while HDDs are $0.05~0.10/GB -- about 10~20 times lower. The second link states that HDD prices will not recover until next year, hardly the time frames we are talking about.
A low end laptop is about €500, and a cheap 500 GB SSD is bound to cost almost that much. Until a comparable SSD costs about 10% of the laptop price it will not be a replacement -- people will just have an external HDD for bulk data. As to the high end, you can redefine it to be "laptop with SSD", but if you look around you will see plenty of €1000+ models with an HDD -- starting with the lower MacBook Pro. There are also many models with a 128 GB SSD, but at 1/4th the capacity they are hardly a replacement.
We are just not there yet, sadly. A breakthrough may come and shorten the time frame, but I don't think that TLCs are it: they will just lower the reliability and may give the whole SSD category a bad name. Not even high-temperature NAND annealing will help as it is (from your first link) at least 10 years off. Perhaps we can agree on a time frame of about a decade until HDDs become a rarity.
Displacing HDDs
Posted Jan 2, 2013 23:02 UTC (Wed) by martinfick (subscriber, #4455) [Link]
Displacing HDDs
Posted Jan 3, 2013 3:32 UTC (Thu) by cmccabe (guest, #60281) [Link]
There was a time when graphics accelerators were unheard-of and everyone just used onboard video. But graphics accelerators didn't have to reach price parity with onboard video to win. They just had to become the best choice for most users. (Nowadays, onboard video is making a comeback due to some excellent chipsets from Intel, but that's another story...)
To sum up: in laptops, phones, music players, and tablets, hard drives are either entirely phased out or nearly there. HDDs will be with us for a while on desktops and servers, but mostly as the new tape drive.
I wonder what kind of performance Tux3 will get on SSD. SSDs do benefit from larger contiguous writes as opposed to small, scattered writes. So to the extent Tux3 can provide that with its deferred I/O mechanism, SSDs will benefit. Of course, SSD firmware also tries to accomplish the same general goal of write coalescing, so there is a risk of burning CPU on something that the firmware is already doing.
Displacing HDDs
Posted Jan 3, 2013 7:27 UTC (Thu) by daniel (guest, #3181) [Link]
Displacing HDDs
Posted Jan 3, 2013 22:50 UTC (Thu) by bronson (subscriber, #4806) [Link]
A perfect description of the emerging FTL insanity. Great quote.
The Tux3 filesystem returns
Posted Jan 2, 2013 7:56 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 8:43 UTC (Wed) by daniel (guest, #3181) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 8:47 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 21:40 UTC (Wed) by daniel (guest, #3181) [Link]
By aggressive caching I meant preemptively reading big linear chunks of volume into host memory, a behaviour we plan to support after more pressing issues are under control. Spindown isn't our immediate priority by any means, but we can probably help make it practical in more situations.
The Tux3 filesystem returns
Posted Jan 2, 2013 11:01 UTC (Wed) by juliank (guest, #45896) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 11:22 UTC (Wed) by keeperofdakeys (guest, #82635) [Link]
For safety, I doubt the kernel would allow a device to spin down while there are dirty pages for that disk anyway.
The Tux3 filesystem returns
Posted Jan 2, 2013 0:03 UTC (Wed) by Tet (subscriber, #5433) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 4:30 UTC (Wed) by rsidd (subscriber, #2582) [Link]
The Tux3 filesystem returns
Posted Jan 2, 2013 5:13 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
The Tux3 filesystem returns
Posted Jan 3, 2013 17:21 UTC (Thu) by nye (guest, #51576) [Link]
Windows now has ReFS, which at first appearances looks pretty good - though it also has a rather 1.0 feel to it, as you might expect. I've not used it myself though as it's only available in Server 2012 (and IIRC Windows 8 can be persuaded to use it, but it's not supported; I may be mistaken there)
To be honest, ZFS isn't exactly a magic bullet either; it's an 'enterprise' filesystem, in both the good and bad senses of the word. Although I do think ZFS is probably the best filesystem currently available, it feels like it was never quite finished[0] and has a very uncertain future.
I was following Tux3 with interest the first time it came around, and only just discovered thanks to this article that the restarted efforts are now on a different mailing list; the website still points to the old one that hasn't had anything posted in ages.
[0] Perhaps that's a little unfair; if I were feeling more generous I might s/finished/polished/. Most notably there are a number of things you might expect to be able to do that require you to destroy your pool and restore the data from backup, which for most people outside of the enterprise environment is likely to be a time-consuming and stressful experience with a lot of downtime.
The Tux3 filesystem returns
Posted Jan 2, 2013 4:48 UTC (Wed) by butlerm (subscriber, #13312) [Link]
I am glad to hear that this technique has been taken up by filesystem developers, and that it is performing well so far. Block modification journaling is relatively common (if not par for the course) in database implementation, for exactly the same reason. The redo log can be written to and committed in a hurry, and the modified blocks forced to disk at any convenient time.
Most databases I am familiar with only force versions of all blocks to disk in a checkpoint process once every thirty minutes or so, because the blocks can be reconstructed from a clean backup by applying the redo log entries. Those reconstituted blocks do not need to be forced to disk then and there either.
Of course the trick with a filesystem is that one is typically journaling only metadata, not data updates, and in a conventional non-copy-on-write filesystem that causes interesting consistency issues that have to be dealt with on recovery.
I am curious, however, how Tux3 deals with recovery of a trashed metadata block without a clean prior image to apply journaled block changes to. Some filesystems journal entire block images for that reason, with the obvious downside of substantially increased log traffic. Is Tux3 using some sort of copy-on-write scheme for the meta-data blocks themselves so that clean prior versions can be obtained for the deltas to be applied to?
The Tux3 filesystem returns
Posted Jan 2, 2013 8:32 UTC (Wed) by daniel (guest, #3181) [Link]
We are thinking in terms of far more frequent rollups than every thirty minutes. More like every two or three seconds. We need to experiment and learn where the knee of the efficiency curve lies. This can easily be a mount time tunable. Real time is not the only criterion, rate of writing and amount of cache and cache pressure from other tasks also matter.
Tux3 is definitely not journalling only metadata (or the logging equivalent) and we have no plans to even provide an option for that. I wouldn't even know how to do that within the Tux3 model. It seems that we are able to provide full data consistency, nearly for free.
One surprise was, we have a compile option to overwrite file data instead of redirect writes nondestructively, and that failed to affect performance detectably in either fsx or fsstress. Perhaps we did not measure hard enough, but for now we seem to get full data consistency more or less for free. We still plan to offer a per-file overwrite vs redirect flag because somebody might want that for reasons other than performance, or maybe there are cases where it really does improve performance.
To clarify, Tux3 always maintains a clean prior image in case the latest commit fails. If the log gets trashed then for sure we have a difficult recovery task ahead of us, which is harder than Ext4 because we don't have fixed positions for different metadata bytes. So we will try to compensate in various ways, with the goal of being able to fsck approximately as reliably. Though to get there we might need to bring in the big guns and convince Ted or Andreas to help out. We should have a nice, upgraded directory index by that time to offer in trade :)
The Tux3 filesystem returns
Posted Jan 2, 2013 20:41 UTC (Wed) by jeltz (guest, #88600) [Link]
The Tux3 filesystem returns
Posted Jan 3, 2013 7:33 UTC (Thu) by daniel (guest, #3181) [Link]
The Tux3 filesystem returns
Posted Jan 6, 2013 3:19 UTC (Sun) by jeltz (guest, #88600) [Link]
The Tux3 filesystem returns
Posted Jan 6, 2013 13:47 UTC (Sun) by andresfreund (subscriber, #69562) [Link]
Postgres actually uses a similar split - the individual backends insert the log entry into memory and the "wal writer" writes them to disk in the background. Only when an individual backend requires some wal records to be written/fsynced (e.g. because it wants the commit record to be safely on disk) and that portion of the wal has not yet been written/fsynced, the individual backend will do so itself.
There are two separate areas of contention here:
- synchronous disk writes/syncs. That can be eased by stuff like batching the required syncs for several backends/transactions in one sync and by lowering consistency requirements a bit (like setting synchronous_commit=off).
- contention around the WAL datastructures. For relatively obvious reasons only one backend can insert into the WAL (in memory!) at the same time, so there can be rather heavy contention around the locks protecting it. There's some work going on to make locking more fine grained, but it's complicated stuff and the price of screwing up would be way too high, so it might take some time.
I don't think there is some fundamental difference here. PG will always have a higher overhead than a filesystem because the guarantees it gives (by default at least) are far stricter, and more essentially because it's layered *above* a filesystem, so it shares that overhead in the first place.
.oO(Don't mention postgres if you don't want to be talked to death :P)
The Tux3 filesystem returns
Posted Jan 6, 2013 21:29 UTC (Sun) by raven667 (subscriber, #5198) [Link]
Only in a few places, when new files are created or the metadata for existing files is changed, by changing the file size for example. Files that are preallocated have very little overhead to (re)write, except for maybe mtime updates.
The Tux3 filesystem returns
Posted Jan 2, 2013 21:46 UTC (Wed) by kjp (guest, #39639) [Link]
It does seem an awful lot like an RDBMS. But even Postgres has some quirks, like indexes never shrinking, or vacuum sometimes running continuously in the background and thus taking forever to free up space on large tables... so your blurb about 'defragmentation' made me realize that for open source projects, zero-admin-maintenance is certainly a very high goal. Cheers.
The Tux3 filesystem returns
Posted Jan 2, 2013 23:11 UTC (Wed) by moltonel (guest, #45207) [Link]
The Tux3 filesystem returns
Posted Jan 3, 2013 0:52 UTC (Thu) by butlerm (subscriber, #13312) [Link]
You can just imagine if access to a directory stalled because another pending transaction had an uncommitted modification to the same directory. Or if pending additions to a directory had to be invisible to other readers, data writes had to be invisible to other readers until committed and so on. There are ways to solve that problem in database system design, but fortunately filesystems generally speaking don't have to deal with it.
The Tux3 filesystem returns
Posted Jan 3, 2013 8:11 UTC (Thu) by daniel (guest, #3181) [Link]
The Tux3 filesystem returns
Posted Jan 3, 2013 16:33 UTC (Thu) by butlerm (subscriber, #13312) [Link]
POSIX filesystem semantics don't have anything like it - in fact they actually forbid it. There is only one view of the filesystem to all user processes, and everything that happens takes effect instantaneously. No isolation, no process specific view of the contents of files or directories, no (user visible) transaction commit or rollback.
Everything a strict POSIX filesystem does with transactions is for recovery purposes, in an attempt to recreate something resembling a consistent snapshot of the system at the time it failed, or at least a working inconsistent one. And certainly one would want to pipeline the assembly and retiring of those transactions, so that the next one can be constructed while previous ones are being retired. I am glad to hear that Tux3 does just that.
The Tux3 filesystem returns
Posted Jan 3, 2013 8:46 UTC (Thu) by daniel (guest, #3181) [Link]
For example we might write ten megabyte data extents, plus four blocks to complete an atomic commit, including one log block, giving a write multiplier of 1.0016. Note that this multiplier includes all metadata. In our "rollup" a few more blocks may need to be accounted to the multiplier, maybe pushing it up to 1.003, or maybe not moving the needle at all because those retained blocks may be amortized over many deltas. In other words, we often write dirty metadata blocks zero times, rounding down modestly.
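For the curious, the 1.0016 figure above works out as follows, assuming 4 KiB blocks (the block size is my inference from the numbers, not stated in the comment):

```python
# Checking the write-multiplier arithmetic: assuming 4 KiB blocks, a
# 10 MB extent is 2560 blocks, and four more blocks (including one log
# block) complete the atomic commit.
extent_blocks = 10 * 1024 * 1024 // 4096   # 2560
commit_blocks = 4                          # including one log block
multiplier = (extent_blocks + commit_blocks) / extent_blocks
print(round(multiplier, 4))                # 1.0016
```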
The Tux3 filesystem returns
Posted Jan 3, 2013 16:45 UTC (Thu) by butlerm (subscriber, #13312) [Link]
Are you pre-allocating the location of the next version, and identifying it that way? Or perhaps referring to the existing version (by location), plus a list of the (accumulated) deltas? Or do you use some other meta-data-block identifier to correlate the different versions of the meta data block and any logged deltas for recovery purposes?
The Tux3 filesystem returns
Posted Jan 3, 2013 23:07 UTC (Thu) by daniel (guest, #3181) [Link]
Though you didn't ask, we can estimate the per-delta writeout cost. We emit an allocation record and a modify record per redirect, which can actually be one record with some modest optimizing. Given a reasonably local btree modification pattern, the number of recursive redirects rounds to zero, so our theoretical cost is roughly one log record per dirty block per delta. The number of dirty metadata blocks written per rollup, given a localized burst of modifications, is roughly the filesystem tree depth, log_base_196(tree_leaves), a small number that itself rounds to zero when amortized across many deltas. We can further improve this (rounding closer to zero) by permitting certain dirty metadata to survive rollup, for example, when we need to redirect the root of a file index tree or a high level node of the inode table btree. This is a future optimization. We can round even closer to zero by bumping our average branching factor up from 196 to 213 using b+tree splitting, if anybody cares to add that.
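To put a number on that tree depth claim, here is a quick sanity check (illustrative Python; the branching factor of 196 is taken from the comment above, the leaf counts are arbitrary examples):

```python
# Quick check of the depth claim: with an average branching factor of
# 196, a btree stays very shallow even over huge numbers of leaves, so
# a localized burst of modifications dirties only a handful of metadata
# blocks per rollup.
def btree_depth(leaves, branching=196):
    """Smallest depth d such that branching**d reaches `leaves`."""
    depth, span = 1, branching
    while span < leaves:
        depth += 1
        span *= branching
    return depth

print(btree_depth(196))     # 1
print(btree_depth(196**3))  # 3 -- about 7.5 million leaves
print(btree_depth(10**9))   # 4 levels suffice for a billion leaves
```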
The Tux3 filesystem returns
Posted Jan 4, 2013 3:26 UTC (Fri) by butlerm (subscriber, #13312) [Link]
Then at some future point the database server process forces a more recent version of the index block back to disk at the same location as the old one, either because it needs to free up pages, is running a checkpoint, or is in the process of shutting down. After a checkpoint is complete the versions of the blocks on disk are up-to-date up through the logical sequence number of the checkpoint.
The vulnerability is that if any block is corrupted, it is highly likely that the database will have to be recovered from a coherent snapshot, and the archived redo information generated in the interim re-applied to restore the database to the state it was at the moment of the crash, and then the uncommitted transactions rolled back, typically by reversing or undo-ing all the block changes made by each one.
Filesystems, of course, generally do not have the luxury of requiring administrators to restore a clean snapshot of the underlying block device on such occasions. And that is why (I understand) the EXT4 designers adopted block image journaling rather than block modification journaling, with its accompanying cost in journal traffic. Many databases, when put into "backup mode" switch to block image journaling for the same reason - the backup software will not necessarily acquire a coherent image of each block.
And of course when such a mode is engaged, redo log traffic skyrockets, giving administrators a healthy incentive to take the database out of backup mode as soon as possible, something made much more practical by the advent of filesystem snapshots - if at the cost of pushing much of the overhead (for a database under heavy write load) onto the volume manager or snapshot-capable filesystem instead.
The Tux3 filesystem returns
Posted Jan 6, 2013 3:29 UTC (Sun) by jeltz (guest, #88600) [Link]
The Tux3 filesystem returns
Posted Jan 6, 2013 13:34 UTC (Sun) by andresfreund (subscriber, #69562) [Link]
The reason it's on by default is that it's important for consistency in the face of OS/HW crashes. Postgres' pages are normally bigger than the disk's sectors, and you could get into problems if e.g. one half of a page was written but the other was not and the WAL record only contains information about one of them. As recovery always starts at a checkpoint, it's enough to log the page fully once after each checkpoint.