
PostgreSQL's fsync() surprise


By Jonathan Corbet
April 18, 2018
Developers of database management systems are, by necessity, concerned about getting data safely to persistent storage. So when the PostgreSQL community found out that the way the kernel handles I/O errors could result in data being lost without any errors being reported to user space, a fair amount of unhappiness resulted. The problem, which is exacerbated by the way PostgreSQL performs buffered I/O, turns out not to be unique to Linux, and will not be easy to solve even there.

Craig Ringer first reported the problem to the pgsql-hackers mailing list at the end of March. In short, PostgreSQL assumes that a successful call to fsync() indicates that all data written since the last successful call made it safely to persistent storage. But that is not what the kernel actually does. When a buffered I/O write fails due to a hardware-level error, filesystems will respond differently, but that behavior usually includes discarding the data in the affected pages and marking them as being clean. So a read of the blocks that were just written will likely return something other than the data that was written.
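As a rough illustration, here is a minimal C sketch (not PostgreSQL source) of the contract being assumed: buffered write() calls followed by an fsync() whose zero return is taken as proof of durability. The file name is invented for the example.

```c
/*
 * A minimal sketch of the assumed contract: if fsync() returns 0,
 * every buffered write issued since the previous successful fsync()
 * is on stable storage.
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char buf[] = "record";
    /* Buffered write: data lands in the page cache, not on disk. */
    if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf)) {
        perror("write");
        return 1;
    }

    /*
     * The assumption under test: a return of 0 here means the data
     * is durable.  As described above, a writeback error may instead
     * have left the pages clean with the data discarded, so success
     * here is not a reliable guarantee.
     */
    if (fsync(fd) != 0) {
        perror("fsync");
        return 1;
    }
    printf("checkpoint durable (or so we believe)\n");
    close(fd);
    return 0;
}
```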

What about error status reporting? One year ago, the Linux Filesystem, Storage, and Memory-Management Summit (LSFMM) included a session on error reporting, wherein it was described as "a mess"; errors could easily be lost so that no application would ever see them. Some patches merged during the 4.13 development cycle improved the situation somewhat (and 4.16 had some changes to improve it further), but there are still ways for error notifications to be lost, as will be described below. If that happens to a PostgreSQL server, the result can be silent corruption of the database.

PostgreSQL developers were not pleased. Tom Lane described it as "kernel brain damage", while Robert Haas called it "100% unreasonable". In the early part of the discussion, the PostgreSQL developers were clear enough on what they thought the kernel's behavior should be: pages that fail to be written out should be kept in memory in the "dirty" state (for later retries), and the relevant file descriptor should be put into a permanent error state so that the PostgreSQL server cannot miss the existence of a problem.

Where things go wrong

Even before the kernel community came into the discussion, though, it started to become clear that the situation was not quite as simple as it might seem. Thomas Munro reported that Linux is not unique in behaving this way; OpenBSD and NetBSD can also fail to report write errors to user space. And, as it turns out, the way that PostgreSQL handles buffered I/O complicates the picture considerably.

That mechanism was described in detail by Haas. The PostgreSQL server runs as a collection of processes, many of which can perform I/O to the database files. The job of calling fsync(), however, is handled in a single "checkpointer" process, which is concerned with keeping on-disk storage in a consistent state that can recover from failures. The checkpointer doesn't normally keep all of the relevant files open, so it often has to open a file before calling fsync() on it. That is where the problem comes in: even in 4.13 and later kernels, the checkpointer will not see any errors that happened before it opened the file. If something bad happens before the checkpointer's open() call, the subsequent fsync() call will return successfully. There are a number of ways in which an I/O error can happen outside of an fsync() call; the kernel could encounter one while performing background writeback, for example. Somebody calling sync() could also encounter an I/O error — and consume the resulting error status.
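To make the failure mode concrete, here is a simplified model of that pattern; the checkpoint_fsync() helper is invented for illustration, not PostgreSQL code. The key point is that the fsync() happens on a descriptor opened after any writeback error.

```c
/*
 * A simplified model of the checkpointer's problematic sequence:
 * another process dirties the file and closes it, writeback fails in
 * the background, and only then does the checkpointer open the file
 * and call fsync(), which reports success because the error predates
 * its open().
 */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: fsync a file the checkpointer does not hold open. */
static int checkpoint_fsync(const char *path)
{
    int fd = open(path, O_RDWR);   /* opened *after* any writeback error */
    if (fd < 0)
        return -1;

    /*
     * An I/O error consumed earlier, during background writeback or
     * by another process's sync(), is invisible on this descriptor:
     * fsync() returns 0, even on 4.13 and later kernels.
     */
    int ret = fsync(fd);
    close(fd);
    return ret;
}

int main(void)
{
    if (checkpoint_fsync("datafile") == 0)
        puts("checkpoint OK -- even if writeback failed earlier");
    return 0;
}
```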

Haas described this behavior as failing to live up to what PostgreSQL expects:

What you have (or someone has) basically done here is made an undocumented assumption about which file descriptors might care about a particular error, but it just so happens that PostgreSQL has never conformed to that assumption. You can keep on saying the problem is with our assumptions, but it doesn't seem like a very good guess to me to suppose that we're the only program that has ever made them.

Joshua Drake eventually moved the conversation over to the ext4 development list, bringing in part of the kernel development community. Dave Chinner quickly described this behavior as "a recipe for disaster, especially on cross-platform code where every OS platform behaves differently and almost never to expectation". Ted Ts'o, instead, explained why the affected pages are marked clean after an I/O error occurs; in short, the most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else. So those pages cannot be kept if the user wants the system to remain usable after such an event.

Both Chinner and Ts'o, along with others, said that the proper solution is for PostgreSQL to move to direct I/O (DIO) instead. Using DIO gives a greater level of control over writeback and I/O in general; that includes access to information on exactly which I/O operations might have failed. Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution. But he also noted that getting there is "a metric ton of work" that isn't going to happen anytime soon. Meanwhile, he said, there are other programs (he mentioned dpkg) that are also affected by this behavior.
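For contrast, here is a minimal sketch of the direct-I/O path the kernel developers suggested. With O_DIRECT the write bypasses the page cache, so an I/O error surfaces on the write itself rather than being consumed by background writeback. Alignment requirements vary by filesystem; 4096 bytes is just a common safe choice, and the file name is illustrative.

```c
/* A minimal O_DIRECT sketch: the page cache is bypassed, so a failed
 * I/O is reported directly to the caller of write(). */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("datafile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    void *buf;
    /* O_DIRECT buffers must be aligned to the logical block size. */
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        fputs("posix_memalign failed\n", stderr);
        return 1;
    }
    memset(buf, 'x', 4096);

    /* An error here reports the failed I/O to the caller, rather
     * than being swallowed during later writeback. */
    if (write(fd, buf, 4096) != 4096)
        perror("write");

    free(buf);
    close(fd);
    return 0;
}
```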

Toward a short-term solution

As the discussion went on, a fair amount of attention was paid to the idea that write failures should result in the affected pages being kept in memory, in their dirty state. But the PostgreSQL developers had quickly moved on from that idea and were not asking for it. What they really need, in the end, is a reliable way to know that something has gone wrong. Given that, the normal PostgreSQL mechanisms for dealing with errors can take over; in its absence, though, there is little that can be done.

One idea that came up a few times was to respond to an I/O error by marking the file itself (in the inode) as being in a persistent error state. Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions, including: when and how would that flag ever be cleared? So this change seems unlikely to happen.

At one point in the discussion, Ts'o mentioned that Google has its own mechanism for handling I/O errors. The kernel has been instrumented to report I/O errors via a netlink socket; a dedicated process gets those notifications and responds accordingly. This mechanism has never made it upstream, though. Freund indicated that this kind of mechanism would be "perfect" for PostgreSQL, so it may make a public appearance in the near future.

Meanwhile, Jeff Layton pondered another idea: setting a flag in the filesystem superblock when an I/O error occurs. A call to syncfs() would then clear that flag and return an error if it had been set. The PostgreSQL checkpointer could make an occasional syncfs() call as a way of polling for errors on the filesystem holding the database. Freund agreed that this might be a viable solution to the problem.
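If that proposal were adopted, the checkpointer's polling might look roughly like the sketch below. Note that syncfs() is a real Linux system call, but having it return an error for any writeback failure since the last check is the proposed behavior, not what kernels did at the time; the database path is an assumption for the example.

```c
/* A sketch of polling syncfs() for filesystem-level errors under
 * Layton's proposal. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Any fd on the filesystem holding the database would do. */
    int fd = open("/var/lib/pgsql/data", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    if (syncfs(fd) != 0) {
        /* Under the proposal: some write on this filesystem failed
         * since the previous syncfs(); enter crash recovery. */
        perror("syncfs");
        return 1;
    }
    close(fd);
    return 0;
}
```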

Any such mechanism will only appear in new kernels, of course; meanwhile, PostgreSQL installations tend to run on old kernels maintained by enterprise distributions. Those kernels are likely to lack even the improvements merged in 4.13. For such systems, there is little that can be done to help PostgreSQL detect I/O errors. It may come down to running a daemon that scans the system log, looking for reports of I/O errors there. Not the most elegant solution, and one that is complicated by the fact that different block drivers and filesystems tend to report errors differently, but it may be the best option available.

The next step is likely to be a discussion at the 2018 LSFMM event, which happens to start on April 23. With luck, some sort of solution will emerge that will work for the parties involved. One thing that will not change, though, is the simple fact that error handling is hard to get right.


PostgreSQL's fsync() surprise

Posted Apr 18, 2018 18:03 UTC (Wed) by cesarb (subscriber, #6266) [Link]

> A call to syncfs() would then clear that flag and return an error if it had been set. The PostgreSQL checkpointer could make an occasional syncfs() call as a way of polling for errors on the filesystem holding the database.

What if two programs did that?

PostgreSQL calls syncfs() and receives no error indication; an error happens, but another process calls syncfs() and clears the flag; PostgreSQL calls syncfs() and receives no error indication again. Oops!

It would be better to have a per-filesystem error counter, instead of a flag: if the error count didn't increase when PostgreSQL checks it again, no error occurred in the meantime, no matter how many other processes have checked for errors.
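A toy user-space model of the counter idea, for illustration: because each process keeps its own last-seen value, one reader checking first cannot hide an error from another.

```c
/* Toy model of a monotonically increasing per-filesystem error
 * counter, with a private cursor per interested process. */
#include <stdio.h>

static unsigned int fs_error_count;   /* would live in the superblock */

static void record_io_error(void)     /* called on each writeback failure */
{
    fs_error_count++;
}

/* Each process compares the counter against its own last-seen value. */
static int errors_since(unsigned int *last_seen)
{
    int new_errors = (fs_error_count != *last_seen);
    *last_seen = fs_error_count;
    return new_errors;
}

int main(void)
{
    unsigned int postgres_seen = fs_error_count;
    unsigned int other_seen = fs_error_count;

    record_io_error();

    /* Another process checking first does not consume the error... */
    printf("other process: %d\n", errors_since(&other_seen));     /* 1 */
    /* ...because PostgreSQL compares against its own cursor. */
    printf("postgres:      %d\n", errors_since(&postgres_seen));  /* 1 */
    return 0;
}
```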

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 18:41 UTC (Wed) by corbet (editor, #1) [Link]

The counter is probably how it will actually be implemented, from my (re)reading of the discussion. I didn't quite describe the mechanism correctly — I think. It's hard to tell for sure since no patches have actually been posted yet.

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 18:43 UTC (Wed) by jlayton (subscriber, #31672) [Link]

I think Jon may have misunderstood what I was proposing (and in truth, Willy first proposed it upstream).

In practice, we'd want to keep an errseq_t in the superblock instead of a flag. That would allow us to ensure that we report an error to syncfs only once per file description. The big issue there though is that we also need another 32-bits per file description (aka struct file) to act as its "cursor" in the error stream, or we need to figure out some way to share the file->f_wb_err field that we use for fsync.

I proposed a draft patch earlier this week (which I meant to send as an RFC) that does the latter. It's based on Willy's suggestion to only report errors from the errseq_t when the fd is an O_PATH open. You can't call fsync on an O_PATH open, so that should be safe (though it is horribly non-obvious from a userland API standpoint):

https://www.spinics.net/lists/linux-fsdevel/msg124527.html
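For readers unfamiliar with errseq_t, here is a much-simplified user-space model of the semantics described above (the real implementation lives in the kernel's lib/errseq.c and packs an error code, a "seen" flag, and a counter into 32 bits). The per-file cursor is the piece that costs the extra space per struct file mentioned in the comment.

```c
/* A simplified model of errseq_t semantics: a shared error record in
 * the superblock, and a cursor per file description that sees each
 * new error exactly once. */
#include <stdio.h>

struct errseq_model {
    unsigned int counter;   /* bumped on each new error */
    int err;                /* most recent error code */
};

static struct errseq_model sb_err;   /* would live in the superblock */

static void errseq_model_set(int err)
{
    sb_err.counter++;
    sb_err.err = err;
}

/* Report an error if the cursor is behind, then advance the cursor. */
static int errseq_model_check_and_advance(unsigned int *cursor)
{
    if (*cursor == sb_err.counter)
        return 0;                    /* nothing new for this cursor */
    *cursor = sb_err.counter;
    return sb_err.err;
}

int main(void)
{
    unsigned int file_a = 0, file_b = 0;   /* per-struct-file cursors */

    errseq_model_set(-5);   /* kernel-style -EIO */
    printf("file a: %d\n", errseq_model_check_and_advance(&file_a)); /* -5 */
    printf("file a: %d\n", errseq_model_check_and_advance(&file_a)); /*  0 */
    printf("file b: %d\n", errseq_model_check_and_advance(&file_b)); /* -5 */
    return 0;
}
```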

Déjà vu?

Posted Apr 18, 2018 18:50 UTC (Wed) by marcH (subscriber, #57642) [Link]

Asynchronous execution is:

- a concurrency nightmare
- an error reporting nightmare
- not so greatly supported by programming languages and systems
- critical for acceptable performance...

Error handling generally has:
- near zero test coverage

:-(

PS: the quality and clarity of this article are stunning. Made me once again feel good about paying for my own subscription (as opposed to just using my company's)

Déjà vu?

Posted Apr 18, 2018 21:23 UTC (Wed) by wahern (subscriber, #37304) [Link]

SQLite has abnormally comprehensive test coverage. For both OOM and I/O errors it tests the immediate error path after every possible failure point. See Section 3, Anomaly Testing, at https://www.sqlite.org/testing.html . Unfortunately, it can't simulate errors in the kernel code itself.

Déjà vu?

Posted Apr 19, 2018 14:27 UTC (Thu) by ringerc (subscriber, #3071) [Link]

It could, to a degree. When testing this, I used dmsetup and the 'error' target to introduce errors. The dmsetup 'flakey' target is also spectacularly useful.

Déjà vu?

Posted Apr 20, 2018 7:46 UTC (Fri) by marcH (subscriber, #57642) [Link]

> error reporting nightmare

By the way Java makes a decent attempt with "Futures"
https://docs.oracle.com/javase/7/docs/api/java/util/concu...

Java was also the first language to have a formal memory model. These "performance" features may explain why Java was more successful on the server side than in the embedded space, for which it was initially targeted.

Ugly[*] and not fun but doing the job!

[*] https://steve-yegge.blogspot.com/2006/03/execution-in-kin...

Déjà vu?

Posted May 4, 2018 4:53 UTC (Fri) by ncm (guest, #165) [Link]

Java was also the first language to have an unimplementable memory model. Oops!

They tried, bless their hearts.

Déjà vu?

Posted May 6, 2018 1:46 UTC (Sun) by marcH (subscriber, #57642) [Link]

Indeed it took Java a long time between trying and succeeding - to be expected when you're first to do... both? Years before others started to merely express a decent level of interest in formalization and standardization?

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 20:30 UTC (Wed) by flussence (subscriber, #85566) [Link]

>Such a change, though, would take Linux behavior further away from what POSIX mandates and would raise some other questions [...]
I don't think that's necessarily a bad thing. POSIXly correct filesystems have surprised users in unpleasant ways in the past; recall early ext4 eating people's DE config files, all because the standard had some undefined behaviour around file writes and renames.

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 21:35 UTC (Wed) by wahern (subscriber, #37304) [Link]

POSIX is a *standard*. Standards include standard pitfalls--whether presently known or not. The alternative is yet another standard (de facto or otherwise) or even more chaos and unpredictability. Also, specifying undefined behavior is significantly different than deviating from mandatory semantics.

The fact that these issues have gone undiscovered and subsequently unattended for so long should disabuse people of the notion that there's sufficient interest or resources in supplanting POSIX as a standard. How many file systems have come and gone? ext4 is the closest thing to a de facto standard in Linux, but it has survived precisely because of its simplicity and by having POSIX compliance as a guide star (intentionally or not), as opposed to chasing new ideas.

The biggest hurdle for any large or complex project is coordinating effort and maintaining focus. Standards help immensely in this regard, even flawed ones.

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 12:12 UTC (Thu) by eru (subscriber, #2753) [Link]

Why not add a flag to enable the nonstandard error behaviour for a particular file? It is not as if Linux didn't already have lots of flags enabling interesting nonstandard features.

PostgreSQL's fsync() surprise

Posted Apr 20, 2018 10:19 UTC (Fri) by anton (subscriber, #25547) [Link]

> POSIXly correct filesystems have surprised users in unpleasant ways in the past; recall early ext4 eating people's DE config files, all because the standard had some undefined behaviour around file writes and renames.

If a standard does not define something, it's up to the implementation to do it; i.e., it's their responsibility. Sufficiently bloody-minded implementors produce unpleasant surprises, and then point to standards or benchmarks as an excuse; but as long as the standard does not require the unpleasant behaviour (in which case it would be defined, not undefined), the implementor has the choice, and therefore the responsibility. Of course, implementors who blame the standard don't want you to recognize this, and often argue as if lack of definition in the standard required them to behave unpleasantly. It doesn't.

I wonder if the "what POSIX mandates" in the article really refers to a mandate by POSIX, or another case of lack of definition that an implementor sees as a welcome opportunity for an unpleasant surprise.

PostgreSQL's fsync() surprise

Posted Apr 20, 2018 18:12 UTC (Fri) by zlynx (guest, #2285) [Link]

If the user-friendly, pleasant behavior is expected then it should be in the standard. If it isn't, there's a reason for that and implementors should be able to be as bloody-minded as they please.

If everyone is expected to be nice instead of following the standards, then there's no point in the current standard and it should be replaced with the "be nice" version.

For example, there are people who expect TCP/IP to deliver their packets in the same sized chunks they were sent. These people are simply wrong. But by the "be nice" standard we'd have to write stupid networking stacks because some people expect behavior that isn't required.

Maybe it's time for a POSIX 2020 standard. But if it isn't in there, don't expect it to work like anything else.

PostgreSQL's fsync() surprise

Posted Apr 21, 2018 14:54 UTC (Sat) by anton (subscriber, #25547) [Link]

Yes, ideally standards would be complete. In practice, they tend to specify just the intersection of the behaviour of the existing implementations (in line with the requirement that a standard should standardize common practice), as well as considering various constraints on outlier systems; e.g., "We want this standard to be implementable on a system with 64KB RAM, and mandating the pleasant behaviour would cost several KB for this subfeature alone, so we leave the behaviour unspecified." And then a bloody-minded implementor for systems that use multiple GBs of RAM uses the lack of specification as justification to implement unpleasant behaviour.

And don't forget that standards are decided through consensus in the committee, so it takes just a few bloody-minded implementors on the standards committee to block any progress towards pleasantness.

> If everyone is expected to be nice instead of following the standards

That's an excellent example of what I mean with "hiding behind the standard", and why I suspect that "what POSIX mandates" is in reality different from what was claimed in the discussion described in the article. If the standards do not specify what the implementation should do ("undefined behaviour" or somesuch), there is nothing in the standard that the implementation could follow, and it's the sole responsibility of the implementor to choose a particular behaviour. If, in such a situation, the implementor chooses to implement unpleasant behaviour, it's his fault, and his fault alone; the standard did not make him do it.

PostgreSQL's fsync() surprise

Posted Apr 24, 2018 16:32 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> If the standards do not specify what the implementation should do ("undefined behaviour" or somesuch), there is nothing in the standard that the implementation could follow, and it's the sole responsibility of the implementor to choose a particular behaviour. If, in such a situation, the implementor chooses to implement unpleasant behaviour, it's his fault, and his fault alone; the standard did not make him do it.

All true, of course, but "unpleasant behavior" can still be a reasonable choice. Any application which *relied* on system-specific "pleasant" behavior would necessarily be non-portable. If "pleasant" behavior is desirable then, IMHO, the right solution is to standardize the behavior so that applications can be written against the standard and not one particular implementation. In the meantime, the most productive choice when undefined behavior is detected is to complain as loudly as possible, or even terminate the process, rather than allow the application to silently continue in an undefined state. This ensures that the application developer is made aware of the issue and has both the opportunity and incentive to fix it. (However, this outcome should remain *undefined* behavior so that this can be changed in the future if and when more pleasant behavior is standardized.) Going out of one's way to make undefined behavior "pleasant" is a form of attractive nuisance, in that it tends to encourage non-portable code.

In the end, an application which relies on a specific implementation of undefined behavior, pleasant or unpleasant, is broken. A particular installation may do the right thing for certain known inputs; one may even be able to prove that it does the right thing for all possible inputs given perfect knowledge of the implementation in use on a particular system. However, the third layer of software[1]—design/logic—is missing: since the application is not in compliance with the standard, one cannot prove that it will work on any standard-compliant system, including future versions of the same system.

[1] http://www.pathsensitive.com/2018/01/the-three-levels-of-...

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 16:22 UTC (Thu) by anton (subscriber, #25547) [Link]

> All true, of course, but "unpleasant behavior" can still be a reasonable choice.

Yes, as mentioned, when implementing on a system with 64KB, you may not be able to afford the pleasantness. But we would not be discussing this topic if all cases of unpleasant behaviour were reasonable.

> Any application which *relied* on system-specific "pleasant" behavior would necessarily be non-portable.

It would be *potentially* non-portable, not necessarily. It would become actually non-portable if an unpleasant implementation appears. But so what? I am pretty keen on portability, but life's too short for unreasonably unpleasant implementations. If your program does not run in 64KB anyway, there is no need to cater to that reasonable unpleasantness; and if you want to cater to unreasonable unpleasantness, it's your time and money to waste (after all, some people write programs in Brainfuck), but I would not recommend it to anyone else.

> If "pleasant" behavior is desirable then, IMHO, the right solution is to standardize the behavior so that applications can be written against the standard and not one particular implementation.

If you think so, go ahead and work on standardizing pleasant behaviours. But as mentioned, there is the issue of constrained systems where you cannot afford the pleasantness. One solution is to specify several levels of the standard. The minimal level allows unpleasantness that is reasonable on constrained systems; a higher level specifies more pleasantness. However, if you have unreasonable implementors in the standards committee, you will be out of luck in your standardization effort.

Concerning reporting when undefined behaviour is performed, that's a relatively pleasant way to deal with the situation. It's not appropriate when the application developer actually wants to rely on a specific behaviour and does not want to "fix" it, but it certainly makes it clear that your implementation is not pleasant enough to run this application.

> In the end, an application which relies on a specific implementation of undefined behavior, pleasant or unpleasant, is broken.

No, it isn't. If it behaves as intended in a specific setting, it's working, not broken. It may be unportable, but that does not make it broken.

> since the application is not in compliance with the standard, one cannot prove that it will work on any standard-compliant system

Most programmers do not formally verify their programs, but instead test them. There is no way to prove that a program is in compliance with a standard by testing, even if the programmer intends to avoid undefined behavior. But even the few programmers that actually use formal verification for their programs cannot prove that their programs comply with most standards (e.g., POSIX), because most standards are not formally specified. So this whole proof issue is a red herring.

> including future versions of the same system.

Any system worth using (e.g., Linux) maintains in future versions the pleasantness it has supported in earlier versions.

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 17:06 UTC (Thu) by zlynx (guest, #2285) [Link]

> Any system worth using (e.g., Linux) maintains in future versions the pleasantness it has supported in earlier versions.

No, because that is an unreasonable limit.

Simply because of implementation limits, ext3 serialized file and directory updates in a certain way and for many years. So people got used to it. But it never applied to ext2, XFS or FAT or literally ANY other filesystem. Not to mention BSD's UFS or Hammer2, or Apple's HFS. Heck, it didn't even apply to ext3 in certain configurations.

And then people tried to require that ext4 work the same way. And btrfs. And even wanted to go back to force XFS to work that way too.

The correct answer is to fsync() everything, which would show how bad ext3 was at that particular operation. All those fsyncs make things slower for people using ext3, but that does not mean fsync is the wrong answer. It just means ext3 was a filesystem with a terrible fsync() implementation that people got used to using.

"Pleasant behavior" is often simply what programmers have become used to. It doesn't make it correct or actually pleasant.

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 22:42 UTC (Thu) by nybble41 (subscriber, #55106) [Link]

> It would be *potentially* non-portable, not necessarily. It would become actually non-portable if an unpleasant implementation appears.

See, you're talking about level 2 (particular implementations). Portable program *design* happens at level 3 (design/logic). If your program relies on behavior which is undefined according to the standard then it is non-portable, regardless of whether other implementations behave the same way. You can't say "this program works on any POSIX-compatible system", for example. You know that it works on Linux version X and maybe BSD version Y, but if someone puts together a new OS which follows all the relevant standards, neither you nor they can be confident that your program will work on it unmodified.

> Most programmers do not formally verify their programs, but instead test them.

Formal verification in this context is a red herring. Tests are also a form of proof, albeit in the weaker courtroom-style, balance-of-evidence sense rather than the strict mathematical sense. The point is that without a standard you don't have a sound basis for reasoning "I called the function with these arguments, therefore the implementer and I both know that it should do this." Standards are how users and implementers of an API communicate. Relying on undefined behavior in your program is like speaking gibberish and expecting the listener to guess what you meant; there is a breakdown in communication, and the problem isn't on the implementer's end.

> Any system worth using (e.g., Linux) maintains in future versions the pleasantness it has supported in earlier versions.

As zlynx already explained, that is an unreasonable expectation and even Linux doesn't always operate that way.

PostgreSQL's fsync() surprise

Posted Feb 14, 2019 21:21 UTC (Thu) by dvdeug (subscriber, #10998) [Link]

A POSIX-compliant system could have malloc just return an error for all calls. A POSIX system that could be reasonable in some circumstances could have malloc return an error for any malloc over a megabyte; the first port of Unix was the Interdata 8/32, with 256kb of memory. There is no non-trivial Unix program that doesn't make assumptions about the POSIX system it's running on.

> if someone puts together a new OS which follows all the relevant standards neither you nor they can be confident that your program will work on it unmodified.

Even POSIX-compatible systems aren't perfectly interchangeable. In the case of a program like PostgreSQL, it's usually important not just that it runs, but that it runs well, and POSIX cannot and does not guarantee speed constraints; even Linux alone can store its filesystems in many different ways on many different media, and some of those combinations may not work in practice for PostgreSQL.

> Standards are how users and implementers of an API communicate.

In theory, but not in reality. Most of the APIs a major program depends on are implemented by one library and have but vague descriptions of how it works outside the source code and behavior of that library. There were many Unixes before POSIX, many C and C++ compilers before the first standard was written down. Many people still depend on specialized features of GNU C, enough that several compilers have to copy those unstandardized features. Standards are wonderful if they're followed, but many are underspecified or just usually ignored. New versions of the C, C++ and Scheme standard have removed features that older standards have mandated because they were not well supported.

A huge example is the fact that most of these standards are written in English, an unstandardized language, not Lojban or even French. How can we know what a standard means if the language it is written in is unstandardized? But, for the most part, we manage.

PostgreSQL's fsync() surprise

Posted Feb 18, 2019 20:09 UTC (Mon) by nybble41 (subscriber, #55106) [Link]

> A POSIX-compliant system could have malloc just return an error for all calls. ... Even POSIX-compatible systems aren't perfectly interchangeable.

True, but irrelevant. I only mentioned POSIX as an example. No one is expecting a complex project like PostgreSQL to work equally well under all POSIX-compliant operating systems; there will be other dependencies.

Regarding the first point, a POSIX-compliant program would check for malloc() errors and either recover or terminate in a well-defined way. The program is portable as long as the behavior is well-defined for all conforming implementations; this is a separate consideration from being *useful*.

>> Standards are how users and implementers of an API communicate.
> In theory, but not in reality. Most of the APIs a major program depends on are implemented by one library and have but vague descriptions of how it works outside the source code and behavior of that library.

What you are describing is a failure to communicate. Programs written this way are inherently non-portable because they are written to fit the specifics of particular implementations. Any change to an implementation can cause any program to break in unspecified ways. This is the problem which standards exist to solve. They allow implementers and users of an interface to agree on roles and responsibilities; implementers can improve their code without worrying about breaking standards-compliant users, and users know which parts of the interface they can rely on and which parts may vary from one implementation (or version) to the next.

> How can we know what a standard means if the language it is written in is unstandardized?

"How can digital logic exist when all electronic components have analog characteristics?" This is bordering on abstract philosophy in the "can two people ever truly communicate" sense, but I'll try to answer it seriously anyway: We distinguish between parts of the language we can rely on for clear communication and parts which, while perhaps useful in other contexts, fail to clearly convey our intent, and build up more complex constructs from elements of the first set. The subset of natural language used for formal standards is actually pretty tightly constrained compared to literature in general. Even so, the dependency on natural language for formal specifications is a weak point and communication does occasionally break down as a result. We have feedback mechanisms in place to detect such breakdowns and correct them by issuing clarifications or revising the standards.

PostgreSQL's fsync() surprise

Posted Feb 19, 2019 0:42 UTC (Tue) by dvdeug (subscriber, #10998) [Link]

Of what interest is a portable program that is not useful? It's trivial to write a portable program; just check uname at the start and exit out on any but the system you're written for. Nobody does that, because it's not useful.

As for which parts may vary from version to version, version 3 may adhere to an entirely different standard than version 2. The fact that there is a standard may do you no good if it's evolving rapidly along with the software.

From the other side, even if you are standards conforming, that may not be enough. A user can expect that qsort sorts, but can they expect that it does so reasonably quickly? How often can you call fsync to maintain a reasonable balance between speed and safety? That's never going to be defined by the standard, but an understanding needs to be reached by the authors of a program like PostgreSQL.

I don't believe it's a question of abstract philosophy. If standards were a tool used in some places and not others in the computing world, and at their best were understood not to be binding on either implementer or user, then it would be reasonable to use unstandardized language in writing standards. But if _all_ APIs are supposed to depend on standards, it is hard to justify writing those standards in an unstandardized language when, again, formal languages like Lojban, or simply standardized ones like French, exist.

PostgreSQL's fsync() surprise

Posted Feb 19, 2019 22:39 UTC (Tue) by nybble41 (subscriber, #55106) [Link]

> Of what interest is a portable program that is not useful?

It's not a matter of either/or. Programs should be both portable *and* useful.

> A user can expect that qsort sorts, but can they expect that it does so reasonably quickly? How often can you call fsync to maintain a reasonable balance between speed and safety? That's never going to be defined by the standard...

Why not? Standards do sometimes specify things like algorithmic complexity. C doesn't specify that for qsort(), unfortunately, but C++ does require std::sort() to be O(n log n) in the number of comparisons. What constitutes a "reasonable balance" is up to the user, but there is no reason in principle why there couldn't be a standard for "filesystems useable with PostgreSQL" which defines similar timing requirements for fsync().

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 16:29 UTC (Thu) by Wol (subscriber, #4433) [Link]

> I wonder if the "what POSIX mandates" in the article really refers to a mandate by POSIX, or another case of lack of definition that an implementator sees as a welcome opportunity for an unpleasant surprise.

As I understand it, POSIX explicitly *avoids* specifying what happens when things go wrong, precisely because POSIX has no idea what's happened.

So a linux standard that says "this is the way we handle errors" will be completely orthogonal to POSIX. And would be a good thing ...

The trouble with POSIX is that it's an old standard that is out of date, and while I believe there is some effort at updating it, there is far too much undefined behaviour out there.

Cheers,
Wol

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 20:47 UTC (Wed) by willy (subscriber, #9762) [Link]

I'm a little disappointed there's been no response to my suggestion that we're actually doing worse than we were before errseq_t went in, from a PostgreSQL PoV.

Before, we would mark the inode as having a writeback error and then at least one caller of fsync would receive that error. So if nobody other than checkpointer was calling fsync and the inode wasn't evicted from memory, checkpointer would see the error.

Now, we assume that anybody opening the file isn't interested in historical errors or they would have had the fd open. Clearly not true for PostgreSQL. What I suggested was that *if* nobody has seen the error yet, then the error is not so historical after all, and we should report it to the new opener.

As I alluded earlier, we can still lose errors this way if the inode was evicted under memory pressure. It just restores some of the earlier behaviour we had.

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 11:12 UTC (Thu) by jlayton (subscriber, #31672) [Link]

I'm not sure I agree that it's really worse. Offering different behavior based on whether the inode got evicted from the cache is pretty nasty, as userland has no way to detect whether that has happened.

That said, I'm not opposed to re-enabling the ability to see unreported errors that occurred prior to the open (and your scheme to do that was pretty clever) if the Pg folks think it's of value in the near term. Maybe we could make that behavior opt-in based on a sysctl or something?

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 21:20 UTC (Wed) by Sesse (subscriber, #53779) [Link]

Will you even see I/O errors that happen between open() and fsync(), if said I/O errors are caused by writes done in another process?

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 21:29 UTC (Wed) by willy (subscriber, #9762) [Link]

Yes, and jlayton's errseq_t work makes this better; every open fd will see every writeback error.

PostgreSQL's fsync() surprise

Posted Apr 18, 2018 21:33 UTC (Wed) by Sesse (subscriber, #53779) [Link]

Hm, I wonder if I heard something about Windows Vista improving I/O error reporting… Who knows what Postgres on Windows experiences.

Failed writeback to removable devices

Posted Apr 19, 2018 6:36 UTC (Thu) by epa (subscriber, #39769) [Link]

> The most common cause of I/O errors, by far, is a user pulling out a USB drive at the wrong time. If some process was copying a lot of data to that drive, the result will be an accumulation of dirty pages in memory, perhaps to the point that the system as a whole runs out of memory for anything else.

That suggests that removable devices should be handled a little differently for writeback. The number of dirty pages should be more strictly capped (perhaps ten megabytes per USB drive, or some other heuristic) so that writing becomes closer to synchronous.

This is also a usability improvement: when copying files around on the hard disk, I don't care if they are still dirty pages in memory to be flushed to disk later. Once the file copy has 'completed' I can go on to the next task. But if copying to a USB stick, 99% of the time it's because you want to take the USB stick out of the computer and take the data somewhere else (or possibly to then boot from that device). Here the user really does need to wait for the data to be written to the device, and it doesn't help if the file copy dialogue box (or 'cp' command) appears to finish but the writeback happens in the background, with no indication to the user of progress so far or a notification when it completes. As the article notes, that can also lead users to remove the USB stick before the pages are flushed, thinking that the operation has completed -- and if lecturing users about this worked, we wouldn't still have the problem after twenty years.

So I suggest for the most common kinds of removable devices that can be identified as such, the kernel should keep a lid on writeback, both for the total number of dirty pages and how long they can hang around before being written out (I suggest one second is reasonable for USB sticks). For non-removable devices which can be identified as such, the kernel could be a bit more careful and not blithely clear the dirty bit on pages that couldn't be flushed because of I/O errors. Yes, I know that in principle you may not be able to tell in advance which devices are removable and which aren't, but this is more a theoretical than a practical concern: it would be sufficient to treat USB-attached drives as removable and ATA/SCSI/whatever ones as non-removable.
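As a user-space approximation of this idea, a copier can bound its own dirty data by forcing periodic writeback with sync_file_range(). Note that sync_file_range() makes no durability or error-reporting guarantees, which is the article's point; the path and chunk size below are illustrative assumptions only.

```c
/* A sketch of bounding dirty data per file: start and wait for
 * writeback after every chunk, so at most ~1MB of this file is ever
 * dirty in the page cache. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define CHUNK (1 << 20)   /* force writeback every 1MB written */

int main(void)
{
    int fd = open("/mnt/usb/backup.img", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    static char buf[CHUNK];
    memset(buf, 0, sizeof(buf));
    off_t written = 0;

    for (int i = 0; i < 10; i++) {
        if (write(fd, buf, CHUNK) != CHUNK) { perror("write"); return 1; }
        written += CHUNK;
        /* Write out the chunk just written and wait for it. */
        if (sync_file_range(fd, written - CHUNK, CHUNK,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) != 0)
            perror("sync_file_range");
    }
    close(fd);
    return 0;
}
```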

Failed writeback to removable devices

Posted Apr 19, 2018 7:07 UTC (Thu) by neilbrown (subscriber, #359) [Link]

One problem with any attempt to keep data around after a write failure is that it will upset memory allocation.
Memory management is predicated on the fact that dirty pages can be cleaned, and clean pages can be freed - in deterministic time. If you mess with that, then deadlocks are just around the corner. Possibly you could set a quota of unwriteable pages and keep pages around as long as there are fewer than the limit, but I doubt that would really end up being helpful.

Failed writeback to removable devices

Posted Apr 19, 2018 9:24 UTC (Thu) by epa (subscriber, #39769) [Link]

Right, but if writing to one of the main system disks fails with I/O errors then I think you have bigger problems and will soon need to reboot anyway?

For removable devices, the data does have to be thrown away after a write error (assuming that the error was due to unplugging the device).

Failed writeback to removable devices

Posted Apr 20, 2018 9:24 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

Can unflushable dirty pages be "poisoned" instead? Just re-use the hwpoison mechanism to kill all the processes that might refer to them. Perhaps make this killing optional through some prctl() option.

Failed writeback to removable devices

Posted Apr 20, 2018 11:43 UTC (Fri) by ringerc (subscriber, #3071) [Link]

From a PostgreSQL point of view that's actually nearly ideal, so long as we can protect the postmaster. Since it doesn't do much regular I/O that should be fine. Plus, on systemd systems the postmaster will get restarted if killed.

We'd receive SIGCHLD for the killed user backend worker(s)/checkpointer/etc, which would trigger crash recovery where we kill all other backends then execute redo. That's perfect. Something portable would be better, of course, but something that covers 95% of users is pretty darn good.

I was unaware of the hwpoison mechanism.

Failed writeback to removable devices

Posted Apr 19, 2018 13:20 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Um. Won't a dirty pages cap have horrible performance implications for those of us who use terabyte USB3 disk drives with rsync for making backups to removable media?

(Yes, they are fairly crappy by disk drive standards, but they have the advantage over any form of backup across a network that as soon as you pull the USB cable, the data is no longer vulnerable to hostile actors elsewhere on the internet. So as a secondary, last-resort backup of an entire system, they are valuable.)

Failed writeback to removable devices

Posted Apr 19, 2018 15:07 UTC (Thu) by farnz (subscriber, #17727) [Link]

Ideally, you'd be able to tune the dirty pages cap to match the throughput the target can handle in a sensible timescale - say 100 ms for removable devices. That way, your device still has a lot of data to handle compared to its throughput, but it'll block applications when they're able to generate dirty pages far faster than your device can handle - no more application finished with 60 seconds of data left to write out to your USB device.

Something like less-annoying background writeback definitely takes you in the right direction…

Failed writeback to removable devices

Posted Apr 27, 2018 10:39 UTC (Fri) by Wol (subscriber, #4433) [Link]

Dunno what relevance this has to my use case - I regularly do multi-gigabyte network copies (24MP raw camera images, HD video etc) - but this absolutely kills my laptop performance.

On a twin-core machine, load average will hit 4 or 5 or 6, and system response basically goes through the floor. Actually, the cause could well be that RAM is flooded, but whatever the cause, it's rather frustrating.

Cheers,
Wol

Failed writeback to removable devices

Posted Apr 19, 2018 16:43 UTC (Thu) by nix (subscriber, #2304) [Link]

External USB drives also have the advantage that they can be swapped out for offsite backup, without blowing your network usage cap trying to back up to some cloud service you don't control. (Also, you can be fairly sure you can get them *back* again, unlike some cloud service you don't control.)

Honestly, I suspect most of us here are doing last-ditch USB drive backups of *something*, at least.

Failed writeback to removable devices

Posted Dec 31, 2020 15:13 UTC (Thu) by andrit (guest, #143916) [Link]

I wonder how FreeBSD (who keeps the pages dirty in memory on writeback failures) handles this...

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 6:52 UTC (Thu) by mjthayer (guest, #39183) [Link]

The first question which comes to my mind here (as always, hoping for educational corrections) is whether this should be PostgreSQL's problem at all. How much more common is the situation described here compared to, say, the data being correctly written but then the storage going bad, or the hardware reporting that the data was correctly written when it was not? Perhaps the user should be responsible for providing a reliable layer under PostgreSQL? I would rather expect PostgreSQL to provide for being as resilient as possible if data corruption does occur, which I am sure it does.

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 7:16 UTC (Thu) by neilbrown (subscriber, #359) [Link]

In a lot of situations, EIO errors are unlikely, and when they happen it probably means that the whole device is lost. In those cases there isn't much PostgreSQL could do except for rejecting all new requests.

In other situations, errors are more likely and less fatal. Writes to NFS don't necessarily return ENOSPC immediately - you might not get that until fsync. If you use thin-provisioning then (apparently) it is possible to get IO errors which will stop happening once the admin plugs in a new device.

Quoting from the email thread:

> This also means AFAICS that running Pg on NFS is extremely unsafe, you MUST
> make sure you don't run out of disk. Because the usual safeguard of space
> reservation against ENOSPC in fsync doesn't apply to NFS. (I haven't tested
> this with nfsv3 in sync,hard,nointr mode yet, *maybe* that's safe, but I
> doubt it). The same applies to thin-provisioned storage. Just. Don't.

However, the whole point of having an OS is to hide these details. You shouldn't *have* to care what sort of filesystem or storage you are using - behavior should be predictable. Unfortunately, that isn't how it works in the real world.

PostgreSQL's fsync() surprise

Posted May 6, 2018 6:04 UTC (Sun) by ssmith32 (subscriber, #72404) [Link]

Hmmm. I dunno, I feel like if you're running a DB of any type, with data that you care about, you definitely should be thinking about your storage layer (whatever is below the OS - in today's world you may not know the actual physical setup, but you should know the performance & reliability characteristics of it).

It's not really the OS's job to make unreliable hardware reliable or slow hardware performant.

And I do agree with the thin provisioning.. for critical data.. just don't.

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 16:18 UTC (Thu) by ringerc (subscriber, #3071) [Link]

It's tricky. PostgreSQL would like not to have to completely crash and burn if one file on one tablespace becomes impossible to properly flush, so something that gives it options would be nice.

But it also needs to be able to know reliably that "all data from last successful flush is now fully flushed", so it can make decisions appropriately. Right now it turns out we can't know that.

Nobody really wants a kernel panic or database crash because we can't fsync() some random session table that gets nuked by the app every 15 minutes anyway, after all. In practice that won't happen because the table is usually created UNLOGGED, but there are always going to be tables you don't want to lose, yet don't want the whole system to grind to a halt over either.

PostgreSQL's fsync() surprise

Posted Apr 26, 2018 18:02 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> It's tricky. PostgreSQL would like not to have to completely crash and burn if one file on one tablespace becomes impossible to properly flush, so something that gives it options would be nice.

FWIW, I don't agree that that's a useful goal. It'd be nice in theory, but it's not even remotely worth the sort of engineering effort it'd require.

> Nobody really wants a kernel panic or database crash because we can't fsync() some random session table that gets nuked by the app every 15 minutes anyway, after all.

I don't think that's a realistic concern. If your storage fails, you're screwed. Continuing to behave well in the face of failing storage would require a *LOT* of work. We'd need timeouts everywhere, we'd need multiple copies of the data etc.

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 13:26 UTC (Thu) by oseemann (subscriber, #6687) [Link]

In this context, a detailed and worthwhile writeup on file system error handling with links to the corresponding research papers can be found here:

https://danluu.com/filesystem-errors/

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 14:32 UTC (Thu) by ringerc (subscriber, #3071) [Link]

Before any users panic, note that you will only run into problems if your storage system fails in an abnormal way, OR you're running a few potentially unsafe configurations that may raise errors on writeback during normal operation.

I suggest taking extra care and doing extra testing if you use:

* Any sort of network block device
* Thin-provisioned storage
* multipath I/O (especially if you haven't set queue_if_no_path etc)

Also, take care not to run out of space in your file system, or test disk-exhaustion behaviour in advance, if you use NFS. Or, preferably, don't do that.

But while this is not cool, it's NOT going to be randomly corrupting PostgreSQL installations all over the place. It's also likely that PostgreSQL is far from the only thing affected.

PostgreSQL's fsync() surprise

Posted Apr 19, 2018 20:30 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

> Andres Freund, like a number of other PostgreSQL developers, has acknowledged that DIO is the best long-term solution.

Worth noting that that'll probably have to be an opt-in configuration. Using DIO one certainly has more control and can get higher performance, but it also requires that the database is more carefully configured. But a lot of people use PostgreSQL without configuring the size of its own buffer cache at all - the OS adaptively providing a second level of caching makes that OK for a lot of scenarios. Postgres can't realistically figure out how much memory it should use on a given system. It doesn't, and shouldn't, have the information to make such a policy decision.

PostgreSQL's fsync() surprise

Posted Apr 20, 2018 14:38 UTC (Fri) by cornelio (guest, #117499) [Link]

It appears only FreeBSD got it (mostly) right.

The advantage of keeping around the most experienced filesystem developer ever.

PostgreSQL's fsync() surprise

Posted Apr 23, 2018 21:00 UTC (Mon) by helsleym (subscriber, #92730) [Link]

I am not contesting your assertion but I am curious -- would you care to elaborate on what FreeBSD does that "[gets] it (mostly) right"?

PostgreSQL's fsync() surprise

Posted Apr 23, 2018 21:19 UTC (Mon) by andresfreund (subscriber, #69562) [Link]

It's not optimal, but what FreeBSD appears to do is to just clear the error *and* mark the buffer as dirty again. So the error will be hit again and again (unless the device is gone):
https://github.com/freebsd/freebsd/blob/master/sys/kern/v...

PostgreSQL's fsync() surprise

Posted Apr 25, 2018 14:12 UTC (Wed) by xxiao (guest, #9631) [Link]

What about MySQL? Does it use DIO and never touch fsync()? Maybe this is not just PostgreSQL-specific?

PostgreSQL's fsync() surprise

Posted May 9, 2018 20:14 UTC (Wed) by nilsmeyer (guest, #122604) [Link]

It can be configured in MySQL with variables like innodb_flush_method and sync_binlog. MySQL also uses a threading model instead of multiple processes, so I suppose some of the issues regarding file descriptors don't crop up, and one would usually use direct I/O (O_DIRECT), bypassing most other caches. Of course this assumes one runs InnoDB; I don't know how RocksDB/MyRocks behaves in this case.

Basically MySQL / InnoDB will manage all the buffering and try to bypass the kernel buffering as much as possible. This is why you usually try to allocate most (like 75/80%) of the memory on a MySQL server to the InnoDB buffer pool.

PostgreSQL's fsync() surprise

Posted May 2, 2018 23:56 UTC (Wed) by gerdesj (subscriber, #5446) [Link]

Perhaps it has been discussed to death before but why not put DBs on some sort of DB oriented storage instead of say xfs/ext{n}/btrfs/fat16?

PostgreSQL's fsync() surprise

Posted May 3, 2018 0:03 UTC (Thu) by andresfreund (subscriber, #69562) [Link]

Which would be?

PostgreSQL's fsync() surprise

Posted May 3, 2018 6:20 UTC (Thu) by zlynx (guest, #2285) [Link]

Raw, unformatted blocks I would suppose. We could call it postgresfs.

PostgreSQL's fsync() surprise

Posted Nov 12, 2018 4:03 UTC (Mon) by immibis (subscriber, #105511) [Link]

That is almost exactly the abstraction that a file is supposed to provide - except that it's a fixed size (and consequently you can't get ENOSPC because you are handling the space allocation yourself). You can still get EIO. Or EUSERPULLEDTHEDRIVEOUT.

PostgreSQL's fsync() surprise

Posted May 3, 2018 11:26 UTC (Thu) by james (subscriber, #1325) [Link]

That means you can only run them on systems with that sort of storage available -- which means
dnf install package-that-uses-postgresql-as-a-database-engine
doesn't have a chance of Just Working.

PostgreSQL's fsync() surprise: Patch proposed

Posted May 3, 2018 22:09 UTC (Thu) by tech2018 (guest, #124143) [Link]

It seems a patch has already been developed (April 24, 2018):
https://patchwork.kernel.org/patch/10358111/

Copyright © 2018, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds