Who's afraid of a big bad optimizing compiler?
This article was contributed by Jade Alglave, Will Deacon, Boqun Feng, David Howells, Daniel Lustig, Luc Maranget, Paul E. McKenney, Andrea Parri, Nicholas Piggin, Alan Stern, Akira Yokosawa, and Peter Zijlstra.
When compiling Linux-kernel code that does a plain C-language load or store, as in "a=b", the C standard grants the compiler the right to assume that the affected variables are neither accessed nor modified by any other thread at the time of that load or store. The compiler is therefore permitted to carry out a large number of transformations, a couple of which were discussed in this ACCESS_ONCE() LWN article, and another of which is described in Dmitry Vyukov's KTSAN wiki page. However, our increasingly aggressive modern compilers produce increasingly surprising code optimizations. Some of these optimizations might be especially surprising to developers who assume that each plain C-language load or store will always result in an assembly-language load or store. Although this article is written for Linux kernel developers, many of these scenarios also apply to other concurrent code bases, keeping in mind that "concurrent code bases" also includes single-threaded code bases that use interrupts or signals.
The ongoing trend in compilers makes us wonder: "Just how afraid should we be?" The following sections should help answer this question:
- Load tearing
- Store tearing
- Load fusing
- Store fusing
- Code reordering
- Invented loads
- Invented stores
- Store-to-load transformations
- Dead-code elimination
- How real is all this?

Quick quiz 1: But we shouldn't be afraid at all for things like on-stack or per-CPU variables, right? Answer
This is followed by the ineludible answers to the quick quizzes.
Load tearing
Load tearing occurs when the compiler uses multiple load instructions for a single access. For example, the compiler could, in theory, compile the load from global_ptr on line 1 of the following code as a series of one-byte loads.
1 ptr = global_ptr;  /* BUGGY if load tearing possible. */
2 if (ptr != NULL && ptr < high_address) /* BUGGY!!! */
3   do_low(ptr);

If some other thread was concurrently setting global_ptr to NULL, the result might have all but one byte of the pointer set to zero, thus forming a "wild pointer". Stores using such a wild pointer could corrupt arbitrary regions of memory, resulting in rare and difficult-to-debug crashes.
Worse yet, on (say) an eight-bit system with 16-bit pointers, the compiler might have no choice but to use a pair of eight-bit instructions to access a given pointer. And even on today's 32-bit and 64-bit systems, a misaligned or too-large access could tear. Because the C standard must support all manner of systems, the standard cannot rule out load tearing in the general case.
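For properly aligned machine-word-sized accesses, however, READ_ONCE() prevents load tearing. As a minimal sketch, the buggy example above could be repaired like this:

ptr = READ_ONCE(global_ptr);
if (ptr != NULL && ptr < high_address)
  do_low(ptr);

Because READ_ONCE() expands to a volatile access, the compiler must emit exactly one full-width load from global_ptr.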
Quick quiz 2: But there are lots of plain loads from shared variables in the Linux kernel. These cannot possibly all be buggy, can they? Answer
Store tearing
Store tearing occurs when the compiler uses multiple store instructions for a single access. For example, one thread might store 0x12345678 to a four-byte integer variable at the same time as another thread stores 0xabcdef00. If the compiler used 16-bit stores for either access, the result might well be 0x1234ef00, which could come as quite a surprise to code loading from this integer. Nor is this a strictly theoretical issue. For example, some CPUs feature small immediate values, and on such CPUs the compiler can be tempted to split a 64-bit store into two 32-bit stores in order to reduce the overhead of explicitly forming the 64-bit constant in a register, even on a 64-bit CPU. There are historical reports of this actually happening in the wild, and there is also a recent report. Note that this tearing can happen even on properly aligned and machine-word-sized accesses, and in this particular case, even for volatile stores. Some might argue that this behavior constitutes a bug in the compiler, but either way it illustrates the perceived value of store tearing from a compiler-writer viewpoint.
Of course, the compiler simply has no choice but to tear some stores in the general case, given the possibility of code using 64-bit integers running on a 32-bit system. But for properly aligned machine-sized stores, WRITE_ONCE() will prevent store tearing.
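As a minimal sketch, assuming a hypothetical shared 64-bit variable shared_val:

u64 shared_val;

void publish_val(void)
{
  /* Volatile store: the compiler may not split it into two 32-bit stores. */
  WRITE_ONCE(shared_val, 0x1234567812345678UL);
}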
Load fusing
Load fusing occurs when the compiler uses the result of a prior load from a given variable instead of repeating the load. Not only is this sort of optimization just fine in single-threaded code, it is often just fine in multithreaded code. Unfortunately, the word "often" hides some truly annoying exceptions, including the one called out in the ACCESS_ONCE() article.
For example, suppose that a realtime system needs to invoke a function named do_something_quickly() repeatedly until the variable need_to_stop is set, and that the compiler can see that do_something_quickly() does not store to need_to_stop. One (unsafe) way to code this might look like:
1 while (!need_to_stop) /* BUGGY!!! */
2   do_something_quickly();
The compiler might reasonably unroll this loop sixteen times in order to reduce the per-invocation overhead of the backwards branch at the end of the loop. Worse yet, because the compiler knows that do_something_quickly() does not store to need_to_stop, the compiler could quite reasonably decide to check this variable only once, resulting in the code shown below:
 1 /* Optimized code */
 2 if (!need_to_stop)
 3   for (;;) {
 4     do_something_quickly();
 5     do_something_quickly();
 6     do_something_quickly();
 7     do_something_quickly();
 8     do_something_quickly();
 9     do_something_quickly();
10     do_something_quickly();
11     do_something_quickly();
12     do_something_quickly();
13     do_something_quickly();
14     do_something_quickly();
15     do_something_quickly();
16     do_something_quickly();
17     do_something_quickly();
18     do_something_quickly();
19     do_something_quickly();
20   }
Once entered, the loop on lines 3-20 will never stop, regardless of how many times some other thread stores a non-zero value to need_to_stop. The result will at best be disappointment, and might also include severe physical damage.
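A sketch of the usual repair: READ_ONCE() forces a fresh load of need_to_stop on each iteration, so even an unrolled loop must keep testing it:

while (!READ_ONCE(need_to_stop))
  do_something_quickly();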
The compiler can fuse loads across surprisingly large spans of code. For example, in this code:
 1 int *gp;
 2
 3 void t0(void)
 4 {
 5   WRITE_ONCE(gp, &myvar);
 6 }
 7
 8 void t1(void)
 9 {
10   p1 = gp; /* BUGGY!!! */
11   do_something(p1);
12   p2 = READ_ONCE(gp);
13   if (p2) {
14     do_something_else();
15     p3 = *gp; /* BUGGY!!! */
16   }
17 }

t0() and t1() run concurrently, and do_something() and do_something_else() are inline functions. Line 1 declares the pointer gp, which C initializes to NULL by default. At some point, line 5 of t0() stores a non-NULL pointer to gp. Meanwhile, t1() loads from gp three times on lines 10, 12, and 15. Given that line 13 finds that gp is non-NULL, one might hope that the dereference on line 15 would be guaranteed never to fault.
Unfortunately, the compiler is within its rights to fuse the reads on lines 10 and 15, which means that if line 10 loads NULL and line 12 loads &myvar, line 15 could dereference NULL, resulting in a fault. Note that the intervening READ_ONCE() does not prevent the other two loads from being fused, despite the fact that all three are loading from the same variable. It might seem that no real compiler would ever do this, but Will Deacon reports that this has actually happened in the Linux kernel.
Quick quiz 3: Why does it matter whether do_something() and do_something_else() are inline functions? Answer
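One sketch of a repair for t1() is to use READ_ONCE() for all three loads from gp, which prevents any pair of them from being fused:

void t1(void)
{
  p1 = READ_ONCE(gp);
  do_something(p1);
  p2 = READ_ONCE(gp);
  if (p2) {
    do_something_else();
    p3 = READ_ONCE(gp); /* Cannot be fused with the load into p1. */
  }
}

Because gp transitions only from NULL to &myvar in this example, a non-NULL p2 now guarantees that the final load also returns a non-NULL pointer.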
Store fusing
Store fusing can occur when the compiler notices a pair of successive stores to a given variable with no intervening loads from that variable. In this case, the compiler is within its rights to omit the first store. This is never a problem in single-threaded code, and it is usually not a problem even in correctly written concurrent code. After all, if the two stores are executed in quick succession, there is very little chance that some other thread could load the value from the first store.
However, there are exceptions, for example as shown below:
 1 void shut_it_down(void)
 2 {
 3   status = SHUTTING_DOWN; /* BUGGY!!! */
 4   start_shutdown();
 5   while (!other_task_ready) /* BUGGY!!! */
 6     continue;
 7   finish_shutdown();
 8   status = SHUT_DOWN; /* BUGGY!!! */
 9   do_something_else();
10 }
11
12 void work_until_shut_down(void)
13 {
14   while (status != SHUTTING_DOWN) /* BUGGY!!! */
15     do_more_work();
16   other_task_ready = 1; /* BUGGY!!! */
17 }
The function shut_it_down() stores to the shared variable status on lines 3 and 8. Assuming that neither start_shutdown() nor finish_shutdown() access status, the compiler could reasonably remove the store to status on line 3. Unfortunately, this would mean that work_until_shut_down() would never exit its loop spanning lines 14 and 15, and thus would never set other_task_ready, which would in turn mean that shut_it_down() would never exit its loop spanning lines 5 and 6, even if the compiler chooses not to fuse the successive loads from other_task_ready on line 5. Although WRITE_ONCE() prevents store fusing, smp_store_release() (or stronger) is often preferable, to ensure that other changes made before the store will be visible to other threads that see the store.
And there are other problems with that code, including code reordering.
Code reordering
Code reordering is a common compilation technique used to combine common subexpressions, reduce register pressure, and improve utilization of the many functional units available on modern superscalar microprocessors. It is also another reason why the code above is buggy. For example, suppose that the do_more_work() function on line 15 does not access other_task_ready. Then the compiler would be within its rights to move the assignment to other_task_ready on line 16 to precede line 14, which might be a great disappointment for anyone hoping that the last call to do_more_work() on line 15 happens before the call to finish_shutdown() on line 7.
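Putting the pieces together, one plausible repair for this example (a sketch, using the acquire/release primitives mentioned below) keeps the stores and loads intact and also supplies the needed ordering:

void shut_it_down(void)
{
  WRITE_ONCE(status, SHUTTING_DOWN); /* Cannot be omitted or fused. */
  start_shutdown();
  while (!smp_load_acquire(&other_task_ready))
    continue;
  finish_shutdown(); /* Ordered after the last do_more_work(). */
  WRITE_ONCE(status, SHUT_DOWN);
  do_something_else();
}

void work_until_shut_down(void)
{
  while (READ_ONCE(status) != SHUTTING_DOWN)
    do_more_work();
  smp_store_release(&other_task_ready, 1); /* Cannot move before the loop. */
}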
It might seem futile to prevent the compiler from changing the order of accesses in cases where the underlying hardware is free to reorder them. For example, even on a single-CPU machine, what would happen if the hardware reorders two accesses and then an interrupt occurs right in the middle? What values would the interrupt handler see?
As it turns out, this isn't a problem. Modern machines have "exact exceptions" and "exact interrupts", meaning that any interrupt or exception will appear to have happened at a specific place in the instruction stream. Consequently, the handler will see the effect of all prior instructions, but won't see the effect of any subsequent instructions. READ_ONCE(), WRITE_ONCE(), and barrier() can therefore be used to control communication between interrupted code and interrupt handlers, independent of any reordering carried out by the underlying hardware. That said, should you write user-space code, the various standards committees would prefer that you use atomics or variables of type sig_atomic_t instead of READ_ONCE() and WRITE_ONCE().
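As a minimal sketch (the flag name is invented here), communication between a task and an interrupt handler can therefore look like this:

static int event_seen; /* Hypothetical flag shared with the handler. */

void my_interrupt_handler(void)
{
  WRITE_ONCE(event_seen, 1);
}

int poll_event(void)
{
  return READ_ONCE(event_seen); /* Exactly one load; cannot be fused or invented. */
}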
However, when interacting with some other CPU, stronger primitives are required, such as smp_load_acquire() and smp_store_release().
Invented loads
Invented loads are illustrated by the code below, in which the compiler has optimized away a temporary variable from the code shown in the load-tearing example above.
1 /* Optimized code */
2 if (global_ptr != NULL &&
3     global_ptr < high_address)
4   do_low(global_ptr);
Quick quiz 4: But line 2 specifically checks for NULL. So how can do_low() possibly be invoked with a NULL pointer? Answer
Invented loads can also be a performance hazard, for example, when a load of a variable in a "hot" cacheline is hoisted out of an if statement. Such hoisting optimizations are not uncommon, and can cause significant increases in cache misses, and thus significant degradation of both performance and scalability.
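As a sketch (all names hypothetical), the hazard arises when a conditional load is made unconditional:

/* Source: the hot-cacheline load executes only when needed. */
if (rarely_true_condition)
  do_stuff_with(hot_shared_var);

/* Possible compiler output: the load is hoisted above the branch,
   dragging the hot cacheline into this CPU's cache on every pass. */
tmp = hot_shared_var;
if (rarely_true_condition)
  do_stuff_with(tmp);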
Avoid invented loads by using READ_ONCE().
Invented stores
Invented stores can occur in a number of situations. For example, a compiler emitting code for work_until_shut_down() in the store-fusing example above might notice that other_task_ready is stored to on line 16 and is not accessed by do_more_work(). If do_more_work() were a complex inline function, it might be necessary to do a register spill, in which case one attractive place for temporary storage is other_task_ready. After all, there are no accesses to it, so what is the harm?
Of course, a non-zero store to this variable at just the wrong time would result in the while loop on line 5 terminating prematurely, again allowing finish_shutdown() to run concurrently with do_more_work(). Given that the entire point of this while loop appears to be to prevent such concurrency, this is not a good thing.
Using a stored-to variable as a temporary might seem outlandish, and we are not aware of any compilers that actually do this, but such invented stores really are permitted by the standard. Nevertheless, readers might be justified in wanting a less outlandish example, which is duly provided below:
1 if (condition)
2   a = 1; /* BUGGY!!! */
3 else
4   do_a_bunch_of_stuff();
A compiler emitting code for this example might know that the value of a is initially zero, which might tempt the compiler to optimize away one branch by transforming this code to something like:
1 /* Optimized code */
2 a = 1;
3 if (!condition) {
4   a = 0;
5   do_a_bunch_of_stuff();
6 }
Quick quiz 5: Ouch! So can't the compiler invent a store to a normal variable pretty much any time it likes? Answer
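A sketch of the usual repair: making the store volatile withdraws the compiler's license to invent the earlier store, because doing so could introduce a data race:

if (condition)
  WRITE_ONCE(a, 1);
else
  do_a_bunch_of_stuff();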
Pre-C11 compilers could invent stores to unrelated variables that happened to be adjacent to written-to variables (see Section 4.2 of Hans Boehm's classic Threads cannot be implemented as a library). This variant of invented stores has been outlawed by the C11 prohibition against compiler optimizations that create data races.
Unfortunately, there is an exception to this rule: if there is a later plain store without some sort of ordering directive beforehand, then a data race involving an invented store necessarily implies that there was already a data race involving that later plain store. In this case, the compiler believes that it is not introducing a data race, but rather expanding on an already-existing data race. And the compiler is OK with this, even if your code is not. For example:
 1 struct foo {
 2   short a;
 3   char b;
 4   char c;
 5 };
 6
 7 void do_something(struct foo *fp)
 8 {
 9   fp->a = 0x1234;
10   fp->b = 0x56;
11   do_something_else();
12   fp->c = 0x42;
13 }
If the definition of do_something_else() is visible to the compiler, and if it contains no ordering directives like barrier() or stronger, then the developer's write to fp->c tells the compiler that there are no concurrent reads or writes to that field, whether that was the developer's intention or not. The compiler would then be within its rights to do the following optimization (assuming a big-endian system):
 1 struct foo {
 2   short a;
 3   char b;
 4   char c;
 5 };
 6
 7 void do_something(struct foo *fp)
 8 {
 9   *(int *)fp = 0x123456ff;
10   do_something_else();
11   fp->c = 0x42;
12 }
The momentary appearance of 0xff might come as quite a surprise to any concurrent loads from fp->c. Please note that this is not a theoretical transformation: A later store to a variable is taken as permission to clobber that variable. In addition, older compilers can and do invent stores to unrelated variables, even without the provocation of a later plain C-language store to such an unrelated variable. Use barrier() or WRITE_ONCE() to avoid all of these types of invented stores.
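For the structure example, one sketch of a repair is to make the final store volatile, which tells the compiler that fp->c may be concurrently accessed:

void do_something(struct foo *fp)
{
  fp->a = 0x1234;
  fp->b = 0x56;
  do_something_else();
  WRITE_ONCE(fp->c, 0x42); /* No longer implies fp->c is unshared. */
}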
Store-to-load transformations
Store-to-load transformations can occur when the compiler notices that a plain C-language store might not actually change the value in memory. For example, consider this code:
 1 int r1, x, y;
 2
 3 void cpu1(void)
 4 {
 5   WRITE_ONCE(y, 1);
 6   smp_mb();
 7   WRITE_ONCE(x, 1);
 8 }
 9
10 void cpu2(void)
11 {
12   r1 = READ_ONCE(x);
13   if (r1 == 1)
14     y = 0; /* BUGGY!!! */
15 }
Here CPU 1 executes cpu1(), which uses WRITE_ONCE() to store the value one to y and then to x, separated by a full memory barrier. CPU 2 executes cpu2(), which uses READ_ONCE() to load x and, only if the result is 1, does a plain store of zero to y on line 14. One would expect that if r1 ends up with the value one, the final value of y must necessarily be zero.
Unfortunately, the compiler is within its rights to transform line 14 into the load-compare-store sequence shown on lines 14 and 15 below:
 1 int r1, x, y;
 2
 3 void cpu1(void)
 4 {
 5   WRITE_ONCE(y, 1);
 6   smp_mb();
 7   WRITE_ONCE(x, 1);
 8 }
 9
10 void cpu2(void)
11 {
12   r1 = READ_ONCE(x);
13   if (r1 == 1)
14     if (y != 0)
15       y = 0;
16 }
Given this code, CPU 2 may reorder the load of y on line 14 before the READ_ONCE() of x on line 12. If it does so, it might observe the original zero value of y and therefore skip the store on line 15. Thus y could indeed end up containing one.
Why would the compiler do such a thing? Please understand that to the best of our knowledge, this transformation is strictly theoretical. However, it does not take too much imagination to see how this might occur given feedback-driven optimization. So if you want your store to remain a store, for current and any future compilers, use WRITE_ONCE() or stronger.
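A sketch of the corresponding fix for cpu2():

void cpu2(void)
{
  r1 = READ_ONCE(x);
  if (r1 == 1)
    WRITE_ONCE(y, 0); /* Must remain an unconditional store. */
}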
Dead-code elimination
Dead-code elimination can occur when the compiler notices that the value from a load is never used, or when a variable is stored to, but never loaded from. This can of course eliminate an access to a shared variable, which can in turn defeat a memory-ordering primitive, which could cause your concurrent code to act in surprising ways. Experience thus far indicates that relatively few such surprises will be at all pleasant. Elimination of store-only variables is especially dangerous in cases where external code locates the variable via symbol tables; the compiler is necessarily ignorant of such external-code accesses, and might thus eliminate a variable that the external code relies on.
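As a sketch of that last hazard (the names are hypothetical), a store-only variable intended for external tools must remain a store:

int debug_state; /* Read only by external tools via the symbol table. */

void update_debug_state(int s)
{
  WRITE_ONCE(debug_state, s); /* A plain store could be eliminated as dead. */
}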
Reliable concurrent code clearly needs a way to cause the compiler to preserve the number, order, and type of important accesses to shared memory, which is why the Linux kernel provides READ_ONCE(), WRITE_ONCE(), barrier(), and a wide variety of memory barriers and atomic read-modify-write operations.
How real is all this?
Some of the transformations called out in the preceding sections are more likely to actually occur than are others.
Transformation | Occurs in the Wild? (Properly Aligned, Machine-Word Sized)
---|---
Load Tearing | Yes
Store Tearing | Yes, for constants (to be fixed?)
Load Fusing | Yes
Store Fusing | Yes
Code Reordering | Yes
Invented Loads | Yes
Invented Stores | In some cases
Store-to-Load Transformations | Unknown
Dead-Code Elimination | Yes
So what is a Linux kernel developer to do? There is a range of possibilities, each of which applies READ_ONCE() and WRITE_ONCE() in different situations:
- Never.
- For any access to any shared variable for which there is a possibility of a data race, and for which it can be clearly shown that specific compiler optimizations could result in bugs.
- For any access to a shared variable for which there is a possibility of a data race for at least one of those accesses.
- For all accesses to all shared variables.
There is without doubt some code somewhere in the Linux kernel corresponding to each of these possibilities. However, developers and maintainers opting for one of the first two possibilities are taking on the responsibility of ensuring that new releases of the compiler won't break their code. For these developers and maintainers, a significant level of fear of the big bad optimizing compiler is a very healthy thing, and the rest of us should hope that they continue to maintain an appropriate level of fear.
Quick quiz 7: This paper has covered all of the transformations that an optimizing compiler can carry out, right? Answer
Quick quiz 8: Given the risk, why not simply require that all accesses to shared variables use READ_ONCE() and WRITE_ONCE()? Answer
On the other hand, developers and maintainers who instead opt for one of the last two possibilities need not fear the big bad optimizing compiler, or at least they need not fear it quite so much. However, they could benefit from a tool that determines when READ_ONCE() and WRITE_ONCE() (or stronger) are needed to defend not just against present-day optimizing compilers, but also the bigger and badder optimizing compilers that the future will bring. The next article in this series describes a recent change to the Linux kernel memory model that does just that.
Acknowledgments
We owe thanks to a surprisingly large number of compiler writers and members of the C and C++ standards committees who introduced us to some of the things a big bad optimizing compiler can do, and to Junchang Wang, SeongJae Park, and Slavomir Kaslev for their help making an earlier draft of this material human-readable. We are also grateful to Mark Figley and Kara Todd for their support of this effort.
Answers to quick quizzes
Quick quiz 1: But we shouldn't be afraid at all for things like on-stack or per-CPU variables, right?
Answer: Although on-stack and per-CPU variables are often guaranteed to be untouched by other CPUs and tasks, the kernel really does allow them to be concurrently accessed in many cases. You do have to go out of your way to make this happen, say by explicitly passing the address of such a variable to another thread, but it's certainly not impossible.
For example, the _wait_rcu_gp() macro uses an on-stack __rs_array[] array of rcu_synchronize structures that, in turn, contain rcu_head and completion structures. The address of the rcu_head structure is passed to call_rcu(), which results in concurrent accesses to this structure, and eventually also to the completion structure.
Similar access patterns may be found for per-CPU variables.
Quick quiz 2: But there are lots of plain loads from shared variables in the Linux kernel. These cannot possibly all be buggy, can they?
Answer: This turns out to be a matter of the context in which these plain loads execute and just how vigilant the developers and maintainers wish to be.
Starting with context, if a given variable is only ever accessed under the protection of a given exclusive lock or mutex, then use of plain loads (and stores, for that matter) is perfectly safe.
Less restrictive contexts also suffice. If stores to a given variable can never execute concurrently with any other accesses to that variable, then use of plain loads and stores is again perfectly safe. For example, if all loads from a given variable are under the protection of a reader-writer lock or mutex, and if all stores to that same variable are under the write-side protection of that same reader-writer lock or mutex, use of plain loads and stores is perfectly safe. Alternatively, if all stores to a given variable are carried out by a given kernel thread, and that same variable is only ever loaded by subsequently spawned child threads, plain loads and stores are yet again perfectly safe. Similarly, if all stores to a given structure happen before it is made visible to readers via rcu_assign_pointer(), and the readers, having gained access via rcu_dereference(), only ever load from that structure, plain loads and stores are once more perfectly safe. There are numerous additional variations on this theme.
But there is no shortage of plain loads in the kernel that really can execute concurrently with stores to that same variable, which brings us to vigilance. For example, if the variable only ever transitions from zero to one, no matter how the compiler dices and slices the load, the result will be either a zero or a one. Give or take the possibility of invented loads, which could produce the effect of both a zero and a one being loaded, though if the value loaded is only used once, one would hope that this confusing possibility would be avoided.
In other words, when using plain loads from shared variables, it is the developers' and maintainers' responsibility to either prevent concurrent stores to that same variable on the one hand or to ensure that the compiler cannot optimize their algorithms out of existence on the other.
So are the Linux kernel's plain loads from shared variables buggy? If the relevant developers and maintainers are either carefully controlling the contexts from which those variables are accessed on the one hand or vigilantly considering what optimizing compilers can do to their code on the other, perhaps not!
Quick quiz 3: Why does it matter whether do_something() and do_something_else() are inline functions?
Answer: Because gp is not a static variable, if either do_something() or do_something_else() were separately compiled, the compiler would have to assume that either or both of these two functions might change the value of gp. This possibility would force the compiler to reload gp on line 15, thus avoiding the NULL-pointer dereference.
In the absence of link-time optimizations (LTO), that is. As optimizing compilers become more aggressive, developers and maintainers must become aggressive about disabling destructive optimizations, whether that be via command-line arguments to the compiler or via source-code decorations such as barrier(), READ_ONCE(), and WRITE_ONCE().
Quick quiz 4: But line 2 specifically checks for NULL. So how can do_low() possibly be invoked with a NULL pointer?
Answer: Imagine the following sequence of events:
- Line 2 loads a non-NULL pointer from global_ptr.
- Some other CPU stores NULL to global_ptr.
- Line 3 loads the newly stored NULL from global_ptr, and this NULL compares less than high_address.
- Surprise! There is now a call to do_low() with a NULL pointer.
Quick quiz 5: Ouch! So can't the compiler invent a store to a normal variable pretty much any time it likes?
Answer: Thankfully, the answer is no. This is because the compiler is forbidden from introducing data races. The case of inventing a store just before a normal store is quite special: it is not possible for some other entity, be it CPU, thread, signal handler, or interrupt handler, to see the invented store unless the code already has a data race, even without the invented store. And if the code already has a data race, it already invokes the dreaded specter of undefined behavior, which allows the compiler to generate pretty much whatever code it wants, regardless of the wishes of the developer.
But if the original store is volatile, as in WRITE_ONCE(), for all the compiler knows, there might be a side effect associated with the store that could signal some other thread, allowing data-race-free access to the variable. By inventing the store, the compiler might be introducing a data race, which it is not permitted to do. And this is one reason why memory-barriers.txt requires WRITE_ONCE() for stores that are to be ordered by control dependencies. Another reason may be gleaned from the Store-to-Load Transformations section.
In the case of volatile and atomic variables, the compiler is specifically forbidden from inventing writes.
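As a minimal sketch of the control-dependency idiom from memory-barriers.txt (flag and data are hypothetical shared variables):

if (READ_ONCE(flag))
  WRITE_ONCE(data, 1); /* WRITE_ONCE() keeps this store ordered after the load. */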
Quick quiz 6: What exactly is a "data race"?
Answer: A data race occurs when there are multiple concurrent accesses to a given variable, at least one of which is a plain C-language access and at least one of which is a store.
Quick quiz 7: This paper has covered all of the transformations that an optimizing compiler can carry out, right?
Answer: Wrong.
There are a great many more, which should not be a surprise given the large number of situations where the C standard specifies undefined behavior, each of which potentially points the way to interesting compiler optimizations. There are some efforts under way to rein in compiler optimizations to at least some extent (for example, here, here [PDF], and here [PDF]), but compiler developers and standards-committee members are not necessarily as supportive of such efforts as might be hoped by maintainers and developers working with concurrent code.
Quick quiz 8: Given the risk, why not simply require that all accesses to shared variables use READ_ONCE() and WRITE_ONCE()?
Answer: One can certainly argue that they should be used more heavily than they currently are, but it is not all that hard to get too much of a good thing. For example, as mentioned in the answer to an earlier quick quiz, any number of in-kernel mechanisms, perhaps most notably locking, can provide mutual exclusion so that READ_ONCE() and WRITE_ONCE() are not needed.
In addition, although READ_ONCE() and WRITE_ONCE() are low cost, they are not free, because they constrain compiler optimizations. For example, the compiler is required to emit the accesses for a pair of consecutive READ_ONCE() invocations in order, and it might well be just fine (and perhaps also cheaper) for those two invocations to be reordered. Some fast paths might therefore need plain C-language accesses, though one would hope that the developers and maintainers would see fit to take pity on people reading their code by providing appropriate comments.
And there are guarantees that the Linux kernel relies on that are provided by usage restrictions rather than by compiler directives. Examples include address and data dependencies, for which the usage restrictions are documented in rcu_dereference.txt as well as control dependencies, for which the usage restrictions are documented in the CONTROL DEPENDENCIES section of memory-barriers.txt.
However, we should continue to expect increasingly aggressive compiler optimizations over time. This will likely increase the development and maintenance burden incurred by those making use of plain C-language loads and stores to shared variables in cases where data races exist. This prospect might help explain why the use of things like READ_ONCE() and WRITE_ONCE() has been increasing steadily within the Linux kernel.
Index entries for this article:
Kernel | ACCESS_ONCE()
GuestArticles | McKenney, Paul E.
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 3:10 UTC (Tue) by ariagolliver (subscriber, #85520) [Link]
- just use assembly (really easy on AMD64)
- cast to uint64_t, do the operation, and cast back. This only triggers implementation-defined behavior, not undefined behavior. So it's easy to write a test to ensure a compiler upgrade doesn't break things silently in the future.
Creating a thorough unit test suite and running w/ ubsan is great for catching any subtleties (like surprising char conversion behaviors)
Everyone I talked to assumed they understood the implications of signed integer overflow being undefined behavior, but by the end I realized no one really did (and I'm certain I don't)
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 6:41 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 8:42 UTC (Tue) by Lekensteyn (subscriber, #99903) [Link]
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p...
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 9:42 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 14:01 UTC (Tue) by foom (subscriber, #14868) [Link]
"If a signed operation would naturally produce a value that is not within the range of the result type, the behavior is undefined. The author had hoped to make this well-defined as wrapping (the operations produce the same value bits as for the corresponding unsigned type), but WG21 had strong resistance against this."
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 14:19 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
However, there is at least the "is_modulo" type trait, which allows an implementation to indicate that it will wrap on overflow.
Other than that, I guess you have to use atomics to make signed overflow defined. :-/
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 16:25 UTC (Tue) by ballombe (subscriber, #9523) [Link]
signed wrapping long x;
unsigned nonwrapping long x;
the same for aliasing rules, store rules, etc.
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 16:48 UTC (Tue) by JoeBuck (subscriber, #2330) [Link]
The reason that C makes integer overflow undefined was for loop optimization; they wanted to make it possible for loops to be as fast as those in Fortran. Consider:
void add_vector(double* c, const double* a, const double* b, int size)
{
  int i;
  for (i = 0; i < size; i++) {
    c[i] = a[i] + b[i];
  }
}
They wanted the compiler to be able to assume that a+size, b+size, c+size (which are really p+size*sizeof(double)) do not overflow, which was an issue on small address space machines (I'm old, and started out on PDP-11 machines which were 16 bit, and even resorted to programming with overlays to make things fit). Part of the difficulty is that C passes arrays as pointers to the first element, and the compiler otherwise doesn't know whether c, a, or b is at the very end of the address space so a two's-complement wraparound requirement results in many special cases if we attempt to vectorize. At least, that was the thinking back when we got into this mess. So the "undefined" assumption means that a loop optimizer is allowed to assume that overflow does not happen.
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 19:27 UTC (Tue) by epa (subscriber, #39769) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 6:47 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 7:25 UTC (Wed) by dunlapg (guest, #57764) [Link]
There was a lot more variety in systems back in the day, including some in which signed integer overflow did interesting things.
But saying, "We'll do an add instruction, but we can't really guarantee what the processor will do" is a completely different thing than saying "Since we don't make any promises about what add does in this circumstance, we'll do something surprising and dangerous which corrupts data and allows people to exploit your system."
The GPL literally says that it makes no warranties about fitness for a particular purpose; but that doesn't give software authors the right to make software that behaves however they want. Imagine if the response to all bug reports was, "Our license literally says that we make no guarantees. This behavior is more convenient for us, so just put up with it." It's a preposterous position to take.
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 14:31 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
But the standard needs to say what will happen on all systems, including systems that have not yet been defined. Me, I join epa in preferring implementation-specified behavior to undefined behavior in this case, but either way, portable software would need to avoid signed-integer overflow. So one can argue that for a lot of software, there isn't much practical difference between implementation-specified behavior and undefined behavior.
But there is a lot of software that isn't intended to be universally portable!
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 14:46 UTC (Wed) by dunlapg (guest, #57764) [Link]
For a given computer system, yes, the results of an add instruction should be defined.
Not sure what it is that you're replying to. I never said anything about the results of an add instruction being defined; I'm quibbling with the range of actions that compiler authors seem to think "undefined" allows them to do.
"Since we don't know what hardware will do on signed integer overflow, we can't promise signed integer overflow will work in any particular way" is reasonable. "Since we don't promise signed integer overflow works in any particular way, we are free to make signed integer overflow behave in completely pathological ways" is not.
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 1:54 UTC (Thu) by brooksmoses (guest, #88422) [Link]
You could say that you would prefer that the compiler _never_ make optimization choices based on a deduction of "undefined behavior means that this can never happen", but it turns out that this deduction has been useful in quite a lot of ways (only one of which is making small loops fast) and people in practice turn out to object strenuously to significant performance regressions from their compiler.
Beyond this, a problem is that compiler optimization engines generally don't have an idea of "I know this information only for purposes of this optimization but not that one", and defining this would require a quite involved "tainted" model to know what information is allowed when. Without that knowledge, the compiler can't distinguish between logical statements that it learned from "this is required for the program to be well-formed" and "this has to be true for the program to be on this code path" or other entirely solid things.
The pathological cases are almost always the result of entirely-reasonable small pieces of logic combining to create something that wasn't expected by either the compiler author or the code author, not something that was intentionally designed in, and determining a small piece of logic to eliminate to prevent the pathological behavior without preventing useful behavior is quite difficult, if not impossible.
Who's afraid of a big bad optimizing compiler?
Posted Jul 20, 2019 13:31 UTC (Sat) by anton (subscriber, #25547) [Link]
> You could say that you would prefer that the compiler _never_ make optimization choices based on a deduction of "undefined behavior means that this can never happen", but it turns out that this deduction has been useful in quite a lot of ways (only one of which is making small loops fast) and people in practice turn out to object strenuously to significant performance regressions from their compiler.

You are insinuating that not performing such "optimizations" produces significant performance regressions. Funnily, for all their talk about the performance advantages of such "optimizations", the advocates of "optimizations" hardly ever produce numbers (and the one time I have seen numbers, they were from an unnamed source for another compiler). One would expect that, if there were significant performance benefits from "optimizations", their advocates would parade them up and down.
If there is no significant performance benefit from "optimizations", there is also no need to keep any "knowledge" around that is derived from the assumption that undefined behaviour does not occur.
In any case, the better approach would be to give optimization hints to programmers. That requires changes in just a few places instead of "sanitizing" the whole program all the time, and the consequences of not doing it are far less severe. E.g., if the use of wraparound int for a local variable would lead to sign-extension operations in hot code, the compiler could suggest to change that local variable to, say, (wraparound) long.
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 9:01 UTC (Thu) by mathstuf (subscriber, #69389) [Link]
You're writing C for a C abstract machine. That machine has undefined behavior for signed integer overflow. Undefined behavior in C means "all bets are off". If this is not what you want, don't use C, convince compilers to provide a "sensible C" dialect, or change C via the standards committee.
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 17:48 UTC (Thu) by mpr22 (subscriber, #60784) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 22, 2019 12:06 UTC (Mon) by rweikusat2 (subscriber, #117920) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 13:52 UTC (Wed) by epa (subscriber, #39769) [Link]
For pointer overflow there are performance speedups from allowing the semantics to be undefined, as the loop example shows. That doesn't apply to integer overflow, as far as I can tell.
Who's afraid of a big bad optimizing compiler?
Posted Jul 20, 2019 13:07 UTC (Sat) by anton (subscriber, #25547) [Link]
> So from the viewpoint of someone in 1990, requiring wrap on signed integer overflow would have required substantial clairvoyance.

Not at all. No significant new architecture with something other than twos-complement representation for signed integers was introduced after 1970. They sure saw during C standardization that the future was twos-complement, but they wanted to support the (then) present, which included descendants of architectures from the 1960s.
That being said, even on these architectures implementing twos-complement would not be that expensive. They already support unsigned integers with wraparound (or C would not have standardized that). For +, -, and *, the same operations can be used for twos-complement arithmetics; inequality comparisons and possibly division would become a little more expensive, though.
Who's afraid of a big bad optimizing compiler?
Posted Jul 21, 2019 11:11 UTC (Sun) by PaulMcKenney (subscriber, #9624) [Link]
If I remember correctly, the usual way to do unsigned arithmetic on ones-complement-only systems was to force the sign bit clear after each addition and subtraction operation. And the ones-complement CDC 6600 systems used floating point to do integer multiplication and division, which required tricks normally used in multiple-precision arithmetic to get a 60-bit unsigned result out of floating-point operations with 48-bit mantissas. One reason that this worked was that there was a separate instruction for normalizing floating-point numbers, as well as another instruction that produced the lower 48 bits of the 96-bit product of two 48-bit mantissas.
Fun times! ;-)
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 21:07 UTC (Tue) by nivedita76 (guest, #121790) [Link]
It would be impossible to implement any well-defined notion of non-wrapping (e.g., unsigned arithmetic that overflows gets absorbed at 2^n-1) efficiently anyway. No sense in pretending otherwise; if you want that, implement a custom C++ class.
Saturating arithmetic
Posted Jul 17, 2019 3:21 UTC (Wed) by jreiser (subscriber, #11027) [Link]
Saturating arithmetic
Posted Jul 17, 2019 3:52 UTC (Wed) by jreiser (subscriber, #11027) [Link]
ADDL EAX, EDX /* EAX += EDX */
SBCL ECX, ECX /* Each bit of ECX gets the Carry from above */
ORL EAX, ECX /* EAX |= ECX; Saturate */
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 18:03 UTC (Fri) by andresfreund (subscriber, #69562) [Link]
Indeed. I did the same for postgres, and it's quite hard to get it right *and* performant. Especially when you want to support 64bit integers and you do not want to rely on casting to unsigned. While I think I got it right on a high level, I'm far from certain that I avoided all subtle possible breakages. IIRC we have three different implementations of e.g. signed 64bit integer multiplication (__builtin_mul_overflow, cast to 128 bit integers, pre-multiplication check for all dangers). I really hate having to do that, when it's imo the language's job to provide something reasonable.
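For example, the __builtin_mul_overflow() variant can be sketched like this (the wrapper name is invented for illustration):

#include <stdbool.h>
#include <stdint.h>

/* Checked signed 64-bit multiply; the GCC/Clang builtin returns true on overflow. */
static inline bool mul_s64_overflows(int64_t a, int64_t b, int64_t *result)
{
  return __builtin_mul_overflow(a, b, result);
}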
Who's afraid of a big bad optimizing compiler?
Posted Jul 27, 2019 9:36 UTC (Sat) by Wol (subscriber, #4433) [Link]
Surely a simple way for the standards committee would simply be to define a bunch of pragmas that said "this implementation-defined feature is required", and say that the compiler must terminate with a fatal error if it's not supported.
Okay, that then means that the poor programmer needs to add a bunch of requirements at the start of his program, but it also means that if he attempts to port to a different architecture that doesn't support his assumptions, the compiler will refuse to compile. Much better than everything appearing to work and then falling over in a heap in production.
Cheers,
Wol
Who's afraid of a big bad optimizing compiler?
Posted Oct 25, 2019 15:14 UTC (Fri) by plugwash (subscriber, #29694) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 13:49 UTC (Tue) by vegard (subscriber, #52330) [Link]
I guess it just doesn't pose much of a problem since you need the combination of "compiler does something unexpected" (which may not happen at all in this case) AND "root writes this value at exactly the wrong moment" (which also may not happen at all).
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 14:26 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 6:49 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 7:44 UTC (Thu) by pbonzini (subscriber, #60935) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 14:47 UTC (Tue) by fuhchee (guest, #40059) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 16:09 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
But those of us eschewing traditional mechanisms must take care to eschew them only when really needed, attractive though dangerous living might be. ;-)
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 3:14 UTC (Wed) by Paf (subscriber, #91811) [Link]
Who is the audience?
Even kernel developers writing parallel code (I am one, I work on the Lustre parallel file system, which is concurrent all over the place, both within and across nodes) should almost never use most of the discussed primitives, nor make many of the discussed assumptions. Don’t assume loads or stores are atomic (even in the sense of not torn) unless you are using an explicit mechanism to make them so (Compilers aside, it’s a kernel, it runs on multiple architectures.) or perhaps if you’re writing arch specific code, which is rare even for kernel developers. In general, it’s a shortsighted assumption.
To my reading, the article also implicitly suggests the use of various advanced concurrency mechanisms like READ/WRITE_ONCE. These are powerful, but they are much harder to reason about than traditional locks with their heavyweight barriers, and the vast majority of the time, their use is not required for performant code. (And of course complexity should only be introduced when necessary.)
I guess this is all to say it seems like an article written for people who are doing work that is just shy of actually implementing concurrency control primitives. And I hope there are very few of those people, since it’s A) very hard to get that sort of thing right, and B) usually irrelevant to performance. In my day job, most of the times I’ve seen memory barriers explicitly invoked in new code have been mistakes or unnecessary cleverness in non-sensitive areas.
Apologies for getting on my soap box, and thank you for another excellent article.
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 7:20 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 7:34 UTC (Wed) by wtarreau (subscriber, #51152) [Link]
So no, it's not irrelevant to performance nor does it only concern those who need to implement their own concurrency primitives. In fact it should be for any developer of concurrent code who notices that his code either gets 20% slower on single-thread performance by just using mutexes at the wrong place, or that the code doesn't scale at all due to excessive cache lines bouncing between cores caused by excess of atomic ops. And sadly there are many people concerned by this, who often discover this the first time from a user report of very bad performance in a corner case.
Please also note that the points there are also valid with signals. And whoever plays with signals to perform various actions (state dump, config reload etc) should be really aware of this before manipulating half-written variables.
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 16:34 UTC (Tue) by wtarreau (subscriber, #51152) [Link]
Excellent article, thanks for sharing this! Actually we've come to the same questions quite a few times inside haproxy (which is purely userland) because once you start to try to reduce the cost of a lock, you quickly fall into that foggy area where you don't clobber the whole memory after each and every small operation and you have to count on the compiler not to do bad things on you after you managed to work around the most obvious cases. And I clearly remember a few times saying "oh, if it would do this it would break so much code that it wouldn't make sense"... Probably some areas need to be revisited. Regarding the stores that become load+store, I'm pretty sure I've seen this a long time ago. I remember being surprised to discover a 32-bit read followed by a 32-bit write instead of just a 8-bit write. I don't remember the exact context but it was probably something around this:

union {
  unsigned int i;
  unsigned char c;
} u;

u.c = a;
return u.i;

This would result in something like this:
mov eax, [u]
mov al, [a]
mov [u], eax
ret            ; return value in eax

In fact it would then be a partial load reordering since the load was expected at the end. But we could imagine architectures where some instructions are only available on full-sized registers and not smaller ones.
Who's afraid of a big bad optimizing compiler?
Posted Jul 16, 2019 16:43 UTC (Tue) by PaulMcKenney (subscriber, #9624) [Link]
And thank you for the example optimization from a simple store to a non-atomic read-modify-write. That could get one's attention! And ruin one's software's day!!! ;-)
There have been a number of systems lacking small accesses, including early DEC Alpha, CDC 6600, and probably quite a few others.
Load/Store tearing in this day and age?
Posted Jul 16, 2019 22:11 UTC (Tue) by pr1268 (subscriber, #24648) [Link]
I don't understand how/why, in this day and age, load and store tearing still occur. At least when loading/storing an n-bit value on an n-bit architecture. Isn't that what the "beauty" of larger-bit architectures is for?
For example, the compiler could, in theory, compile the load from global_ptr on line 1 of the following code as a series of one-byte loads.
70 years of electronic computers, 64-bit architectures are the norm now, and we're still putzing around one byte at a time? Ugh.
IMO loads and stores should be atomic at the assembly level (if not at the metal) for an n-bit value on an n-bit architecture. Just my $0x02.
Thank you to all the contributors for this article. Very enlightening (if not depressing).
Load/Store tearing in this day and age?
Posted Jul 16, 2019 22:58 UTC (Tue) by excors (subscriber, #95769) [Link]
Load/Store tearing in this day and age?
Posted Jul 17, 2019 6:58 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
Load/Store tearing in this day and age?
Posted Jul 18, 2019 4:21 UTC (Thu) by ncm (guest, #165) [Link]
On x86 they devote another few million transistors for each case to help avoid what would be a kernel trap. All the extra transistors strengthen Intel's (and, lately, AMD's and Samsung's) competitive position vs cheaper hardware. It's not free, because that enables a monopoly or oligopoly that may then extract rent or (worse) limit your choices.
In the US, only the former is ever considered actionable harm, despite the law recognizing both. It is artificially difficult to demonstrate the latter, where logically it should instead be assumed.
So, exercising care with alignment affords you more choice in hardware that can run your code fast enough, and safely, which might also enable saving money, too, and also power and heat, because those millions of transistors burn power.
Load/Store tearing in this day and age?
Posted Jul 17, 2019 9:32 UTC (Wed) by comex (subscriber, #71521) [Link]
> IMO loads and stores should be atomic at the assembly level (if not at the metal) for an n-bit value on an n-bit architecture. Just my $0x02.

They typically *are* guaranteed to be atomic at the assembly level, but only if the pointer is properly aligned.
You can see this more easily if you use C11 atomics, which are more explicit about what guarantees they provide. For example, this source code:
#include <stdatomic.h>

int load(_Atomic int *ptr)
{
  return atomic_load_explicit(ptr, memory_order_relaxed);
}
compiles to this assembly (GCC targeting x86-64):
load:
  mov eax, DWORD PTR [rdi]
  ret
All C11 atomic loads/stores are guaranteed to be, well, atomic, but GCC has decided to emit a plain mov instruction. In fact this is valid, because x86 has an architectural guarantee that regular movs (64-bit and smaller) to and from aligned pointers are atomic. (And GCC is allowed to assume that `ptr` is aligned, because using it is undefined behavior if not.) Most common architectures work the same.
On the other hand, many uses of atomics want a stronger memory ordering, e.g. memory_order_acquire. On x86, this *still* just generates a plain mov instruction, because x86 has very strong ordering guarantees for all accesses. But on other architectures it tends to require additional synchronization or a different instruction.
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 15:58 UTC (Wed) by ghorbanian (guest, #129503) [Link]
Why, specifically, 16 times?
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 18:45 UTC (Wed) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 17, 2019 20:14 UTC (Wed) by law@redhat.com (guest, #31677) [Link]
Twiddling optimization flags, or worse yet, writing source that is specific to a set of transformations the compiler does or does not do today is a losing proposition. Such techniques may work today, they may work tomorrow, but they're really a case of writing code to a particular compiler implementation and could well fail in the future as compilers continue to evolve and more aggressively optimize.
The right thing, particularly for user space code, is to learn the C11/C++11 memory model and use those facilities. Do this and you're light years ahead on making code that works today, tomorrow and into the future regardless of what the compiler developers do.
I realize the kernel is special and kernel developers are going to do their own thing, but what they do should not be held up as an example of "the right way" in the general case.
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 4:31 UTC (Thu) by wtarreau (subscriber, #51152) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 19:33 UTC (Thu) by k8to (guest, #15413) [Link]
The point is still a concern, of course.
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 21:19 UTC (Thu) by law@redhat.com (guest, #31677) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 18, 2019 21:38 UTC (Thu) by pbonzini (subscriber, #60935) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 10:53 UTC (Fri) by ksandstr (guest, #60862) [Link]
I'd like to point out that it's not sufficient to tag values subject to concurrent write, but rather all consumers of those values should be marked by use of explicitly atomic primitives. Such as those in stdatomic.h, which will presumably stay consistent[0] despite e.g. future ultra-LTO exposing damage previously hidden by an intermodule boundary. This is because the semantics of concurrently-written data are generally different from those of data in a single-threaded or "classic C" program where it isn't intended that changes to shared state are visible outside of the current thread or the reverse, so it's necessary to indicate non-classic usage to the reader at point of interaction and declaration both.
>The right thing, particularly for user space code, is to learn the C11/C++11 memory model and use those facilities.
The C11 memory model can also be studied as first the "classic C" of single-threaded programs, where the compiler may yield an unrecognizably bizarre but correct result for reasons in the article; and by then laying over a set of diffs for "concurrently interacting C", which is just as the old but allows source to indicate that certain parts expect to publish or consume and should appear accordingly from out-of-thread[1]. Most importantly the latter allows execution of the former unmodified when it's known that concurrent interactions won't occur, such as between callsites of POSIX-style mutexes where synchronization is hidden behind a function call boundary, or indicated to a deeply LTO compiler by syscall or atomic primitive.
This is in contrast to ideas about forcing correct concurrent interactions through manipulation of the compiler output (sans memory-clobbering asm volatile), which seem like a sediment of bubblegum fixes compared to C11 atomics, or to even the C99 idea of the standard language as semantics that constrain the compiler over a naïve 1:1-ish mapping between statement and machine code.
[0] unimplementable ordering semantics aside; how come nobody smelled the cr^Wsufficiently smart compiler from a mile away? (and how come I used a born-defunct option in real code, thinking it'd one day do something besides introduce subtle breakage?)
[1] e.g. in a shared memory segment
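The "memory-clobbering asm volatile" mentioned above is the classic GCC-style compiler barrier, roughly:

    #define barrier() __asm__ __volatile__("" : : : "memory")

It emits no instructions and imposes no ordering on the CPU; it only tells the compiler to assume that all memory may have changed across that point, so values cached in registers must be reloaded.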
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 18:11 UTC (Fri) by andresfreund (subscriber, #69562) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Aug 15, 2019 19:15 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Jul 25, 2019 14:13 UTC (Thu) by jerojasro (guest, #98169) [Link]
I know, the first thing the article says is: "the C standard grants the compiler the right to make some assumptions that end up causing these weird/non-obvious things if you don't guard against them". But I must ask: why can't the compiler grow an optimization mode that does the right thing (and thus avoids all of the issues documented here) in the presence of global variables and concurrent execution? That situation (globals + concurrency) is so common that solving it in the compiler would benefit most, if not all, C users, in both kernel and user space.
Are there any reasons for not fixing this issue (once) in the compiler, other than the amount of work involved (which I guess must be ... not trivial, of course) and this new optimization mode not being standards-compliant?
(I'm not attempting to troll anybody; it just seems to me that what I propose is an obvious solution, and since it's not being used, I'd like to know what obvious thing I'm missing that prevents us from using it.)
Who's afraid of a big bad optimizing compiler?
Posted Jul 25, 2019 18:09 UTC (Thu) by excors (subscriber, #95769) [Link]
Probably the other main issue is performance, particularly with concurrency, where CPUs often require explicit memory barriers if you want guaranteed ordering. If the compiler added memory barriers around every memory access just in case you were using the same memory in another thread, it would be pretty slow. Even if it were only a tiny performance regression, many C programmers care a lot about performance (especially microbenchmark performance) and won't be happy with a new compiler that's measurably slower, and users won't be happy if their application runs slower after updating their kernel. They'd rather have a dangerous but fast compiler, and rely on the programmer being smart enough to avoid those dangerous cases.
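To make the cost concrete, here is a rough sketch; the exact instruction sequences vary by compiler and target:

    #include <stdatomic.h>

    static atomic_int shared;
    static int private;

    void f(void)
    {
            private = 1;              /* plain store: typically a single mov */
            atomic_store(&shared, 1); /* seq_cst store: typically an xchg, or
                                         a mov followed by mfence, on x86 --
                                         considerably more expensive */
    }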
Example, read once may or may not be the "right thing".
Posted Apr 9, 2020 5:45 UTC (Thu) by gmatht (guest, #58961) [Link]
We even have an example in the article of different code making incompatible assumptions. Here the code assumes that the variable need_to_stop will be read many times:
1 while (!need_to_stop) /* BUGGY!!! */
2 do_something_quickly();
The following code is instead assuming that global_ptr won't change. This could be ensured by only reading global_ptr once:
2 if (global_ptr != NULL &&
3 global_ptr < high_address)
4 do_low(global_ptr);
In general it might be hard to determine which of these two contradictory assumptions the code is making.
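One way to make each assumption explicit is the kernel's READ_ONCE(): the loop keeps one load per iteration, while the pointer check reads global_ptr exactly once into a local. A sketch, following the usual kernel idiom:

    while (!READ_ONCE(need_to_stop))
            do_something_quickly();

    ptr = READ_ONCE(global_ptr);
    if (ptr != NULL && ptr < high_address)
            do_low(ptr);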
Example, read once may or may not be the "right thing".
Posted Apr 9, 2020 8:30 UTC (Thu) by geert (subscriber, #98403) [Link]
Example, read once may or may not be the "right thing".
Posted Apr 10, 2020 17:22 UTC (Fri) by zlynx (guest, #2285) [Link]
In any threaded situation you want an atomic access, which might be implemented using volatile, but which requires more than that, such as memory-barrier operations.
Who's afraid of a big bad optimizing compiler?
Posted Jul 25, 2019 19:01 UTC (Thu) by farnz (subscriber, #17727) [Link]
The short answer is performance; each of the transforms in the article permits the generated code to be optimized into much, much faster code, as long as the original C code did what the programmer intended it to do.
That statement ("as long as the original C code did what the programmer intended it to do") is the source of all the pain. The C standard specifies an abstract C machine that directly executes C code, and the job of a compiler is to translate C code into a machine code that has the same semantics as C running on the abstract C machine. However, most C programmers are ignorant of the abstract C machine (not all, but most, including me); instead, they either rely on "my compiler turns it into machine code that does what I want", or "the obvious translation to my preferred machine code does what I want". Each of these interpretations of "what my C means" leads to its own set of problems:
- "My compiler turns it into machine code that does what I want" ignores compiler bugs, portability issues, optimizations that produce reasonable results for your test vectors but not inputs you're not testing etc. In other words, writing a compiler that complies with this level of expectation means being bug-for-bug compatible with all old versions; if that's what you want/need, why are you updating your compiler to begin with? Even new CPU support can break this level.
- "The obvious translation to my preferred machine code does what I want" is more insidious; the issue is that they're reading their C code as-if it's implemented as assembly macros that generate the machine code they want. The problem here is that C is not a macro assembler - it's a full blown portable programming language - and while their assumptions probably hold true for a specific compiler version and optimization settings on a single processor model, they fall over on other processor models and with other optimizations that are allowed by the C language.
In short, it's hard, because of C's legacy as "basically one step above a macro assembler for the PDP-11".
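As a concrete illustration of that gap (a sketch, not from the comment): on the abstract machine, a plain flag-wait loop contains no synchronization, so the compiler may legally hoist the load out of the loop entirely:

    int stop;   /* plain, non-atomic flag */

    void wait_for_stop(void)
    {
            while (!stop)   /* the compiler may load stop once and turn
                             * this into: if (!stop) for (;;);  -- an
                             * infinite loop that never observes the
                             * other thread's store */
                    ;
    }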
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 15:08 UTC (Fri) by jerojasro (guest, #98169) [Link]
But after reading both of your comments, I realized (I hope correctly) that "globals" are *both*: things declared as global, *and also* anything allocated on the heap. And that last part is what makes automating all of these checks/guards such a performance hit, and what makes it worth bothering the programmer with handling the unsafe situations manually.
Man I'm glad I just concatenate strings for a living.
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 15:32 UTC (Fri) by farnz (subscriber, #17727) [Link]
It's a bit worse than that - things in C are global in the sense of "need concurrency-awareness baked-in" if they are syntactically global, or if there is ever a pointer to the thing. That last covers all heap allocations, and any stack allocations whose address you take with &, and in turn means that you have to add barriers etc to all heap allocations plus some stack allocations.
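For example (an illustrative sketch; register_callback() and do_work() are hypothetical opaque functions):

    void register_callback(int *p);
    void do_work(void);

    void f(void)
    {
            int counter = 0;  /* address never taken: provably thread-private */
            int flag = 0;

            register_callback(&flag); /* &flag escapes; the compiler must now
                                         assume unknown code can modify it... */
            while (!flag)             /* ...so flag must be reloaded after
                                         each opaque call to do_work() */
                    do_work();

            (void)counter;
    }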
And, of course, this is only going to help code that does not execute correctly on the C machine because it ignores the C machine's semantics. We don't want to strengthen the semantic guarantees of the C machine, since that results in the output code running slower (it needs more barriers, as more of the "internal" effects become externally visible). So it only actually helps developers who don't understand what they're telling the computer to do - while this may be a common state, it's not necessarily something to encourage.
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 18:18 UTC (Fri) by andresfreund (subscriber, #69562) [Link]
I'd say that C11/C++11 made a large step in that direction, by having a formalized memory model, and builtin atomics. Before that there really was no way to not rely on compiler implementation details to get correctly working (even though formally undefined) concurrent programs.
It does require you however to actually use the relevant interfaces.
I don't quite see how you'd incrementally get to a language that doesn't have any of these issues without making it close to impossible to incrementally move applications towards that hypothetical version of C. I mean, there's basically no language that allows the use of shared memory and doesn't require escape hatches from its safety mechanisms to implement fast concurrent data structures (e.g. Rust needing to go to unsafe for core pieces). And the languages that come closest require enough of a different approach that it's hard to imagine C moving towards it.
That's not to say that C/C++ have sufficiently progressed towards allowing one to at least opt into safety. The C11/C++11 memory model and the atomics APIs are a huge step, but they happened at the very least ten years too late (while some of the formalisms were developed somewhat more recently, there ought at least to have been some progress before then). And there are plenty of other issues where no proper facilities are provided (e.g. signed-integer-overflow handling, mentioned in nearby comments).
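On the signed-overflow point: standard C still provides no checked arithmetic, so code typically reaches for compiler-specific escape hatches such as the GCC/Clang builtins. A sketch:

    #include <stdbool.h>

    bool add_overflows(int a, int b, int *sum)
    {
            /* Returns true if a + b overflowed; *sum receives the
             * wrapped (two's-complement) result either way. */
            return __builtin_add_overflow(a, b, sum);
    }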
Who's afraid of a big bad optimizing compiler?
Posted Jul 29, 2019 16:44 UTC (Mon) by PaulMcKenney (subscriber, #9624) [Link]
So there are great opportunities for innovations in the area of locating old concurrent C code in need of an upgrade! :-)
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 13:09 UTC (Fri) by topimiettinen (guest, #133428) [Link]
-Topi
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 17:18 UTC (Fri) by excors (subscriber, #95769) [Link]
E.g. for any shared data structure, there's probably some single-threaded initialisation code that sets it up before it's exposed to other threads. If the structure was declared with volatile/atomic fields, the compiler may add barriers in the initialisation code that the programmer knows are unnecessary. So the programmer might choose to explicitly use READ_ONCE/WRITE_ONCE, improving performance but increasing the risk of missing a case where it's actually required. Which way is "better" depends on how you weigh performance vs correctness.
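A sketch of that initialize-then-publish pattern with kernel-style primitives (struct foo and gp are illustrative names):

    struct foo {
            int a;
            int b;
    };

    struct foo *gp;     /* NULL until publication */

    void init_and_publish(void)
    {
            struct foo *p = kmalloc(sizeof(*p), GFP_KERNEL);

            if (!p)
                    return;
            p->a = 1;   /* plain stores: no other thread can see p yet */
            p->b = 2;
            smp_store_release(&gp, p);  /* publish; readers pair this with
                                           smp_load_acquire() or
                                           rcu_dereference() */
    }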
Who's afraid of a big bad optimizing compiler?
Posted Jul 26, 2019 17:57 UTC (Fri) by andresfreund (subscriber, #69562) [Link]
More important than initialization are probably all the accesses inside sections of code holding a lock, where you likely do not want unnecessary repeat accesses to memory just because a variable is referenced twice. Because acquiring a spinlock/mutex/whatever commonly acts as a barrier of some sort (depending on the implementation, somewhere between a full memory barrier, acquire/release barrier, compiler barrier, and nothing), it is *often* unnecessary to use READ_ONCE / WRITE_ONCE from within. Although if there are other accesses working without that lock, or if you have a more fine-grained locking scheme (say exclusive, exclusive write, read), it is possibly still necessary to use them even within a locked section.
There's also the fact that volatile on data structures itself often ends up requiring annoying casts to get rid of it, which then makes the code more fragile too (unless you use some kind of smart macro that keeps the type the same except for removing the volatile).
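Roughly, the distinction being drawn looks like this (kernel-style, illustrative names):

    spinlock_t lock;
    int shared_count;

    void locked_update(void)
    {
            spin_lock(&lock);
            shared_count++;   /* plain access: the lock excludes all other
                                 writers, and the lock operations themselves
                                 act as compiler/memory barriers */
            spin_unlock(&lock);
    }

    int lockless_peek(void)
    {
            /* No lock held: mark the access so the compiler cannot
             * tear, fuse, or re-load it. */
            return READ_ONCE(shared_count);
    }

And, per the caveat above, once lockless_peek() exists, the locked writer may need WRITE_ONCE(shared_count, shared_count + 1) as well, so that the lockless reader cannot observe a torn store.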
Who's afraid of a big bad optimizing compiler?
Posted Aug 10, 2019 13:28 UTC (Sat) by dcoutts (subscriber, #5387) [Link]
You have this combination:
* C semantics doesn't talk about shared variables
* almost all variables are private
* C compilers want to optimise everything
* assuming variables are shared would destroy many optimisations
Which ends up as a mess when you do have shared variables. If they were explicitly identified, then the compiler could do the right things with them (given the machine's memory model), and since there are so few of them, it would not affect optimisations in general.
A language which, ironically, gets imperative programming with shared mutable variables right is the functional language Haskell. It has two types of shared mutable variables: MVars, which are like a mutable variable protected by a lock, and TVars, which are part of the Software Transactional Memory concurrency feature. Of course, these are also library APIs that are able to enforce safe access (reading the MVar and taking the lock are inseparable, and TVars can only be read inside STM transactions).
But even in a low level language where you don't try to enforce safe concurrency, simply distinguishing shared variables from other variables seems like it would go a long way to resolving this confusion and the C compiler's difficulties.
Who's afraid of a big bad optimizing compiler?
Posted Aug 13, 2019 15:17 UTC (Tue) by jezuch (subscriber, #52988) [Link]
And, I would guess, Rust. Which is not surprising because this was an explicit design goal.
Who's afraid of a big bad optimizing compiler?
Posted Aug 15, 2019 19:19 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link]
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 10:10 UTC (Fri) by farnz (subscriber, #17727) [Link]
I wonder if there's a human-language barrier here; Rust uses "unsafe" code to mean "the human in front of the computer is responsible for proving the absence of data races and invalid pointer dereferences". "The Nomicon" describes what, exactly, unsafe opens up to you - basically, safe Rust contains no way to trigger the dreaded Undefined Behaviour, while unsafe Rust provides six ways to trigger UB if you misuse language features. The idea is that you carefully encapsulate the risk of UB into well-validated and tested code, and provide a safe Rust interface on top that is guaranteed not to have UB no matter how it's used.
For example, the implementation of a RwLock is full of unsafe code. It has to be to implement a concurrency primitive - the underlying platform is a C ABI, which is implicitly unsafe in Rust, as the compiler cannot verify that your C is free of UB. However, as a user of RwLocks, I don't need to care about the unsafety of RwLock internals - I can trust that the humans who've reviewed and tested that code ensured that safety guarantees are met.
I would expect the same to apply to advanced synchronization - you have to use unsafe to implement it, because you're doing things the compiler cannot verify are defined behaviour, but the resulting primitives are safe because there's no way to misuse them. There is an implementation of RCU in Rust that shows how this works - internally, it's unsafe; for starters, it uses the bottom 2 bits of a pointer as tag values, and it keeps raw pointers around, converting them back and forth to references (effectively safe pointers), but the external interface is entirely safe Rust.
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 12:54 UTC (Fri) by excors (subscriber, #95769) [Link]
It may be more accurate to say it can't *directly* trigger Undefined Behaviour. It can still indirectly trigger UB if it calls into imperfectly-implemented Unsafe code with an interface that's marked as Safe.
I expect it takes a lot of discipline and skill to write non-trivial Unsafe modules that guarantee it's absolutely impossible to trigger UB through their Safe interface. https://doc.rust-lang.org/nomicon/safe-unsafe-meaning.html gives the example of BTreeMap with a Safe-but-incorrect Ord, which is the kind of subtlety that seems likely to trip up many programmers writing their own Unsafe code in pursuit of performance. And there's the situation described in https://dev.to/deciduously/the-trials-and-tribulations-of... where a popular Rust framework had "an understanding gap regarding what unsafe means in Rust", leading to much unhappiness.
(It still seems better than C/C++ where you have to apply discipline and skill to *all* code, if you don't want your application to crash all the time. Rust tells you where that attention needs to be focused. But you still need good programmers and good review processes to avoid safety problems sneaking in.)
Based on zero practical experience of Rust, I imagine there are cases where it's infeasible to provide a guaranteed-Safe interface to an interesting bit of Unsafe code (like particularly clever synchronisation primitives, though RcuCell may be a counterexample). You can give it an Unsafe interface and make it the caller's responsibility to guarantee safety, because it's easier to make guarantees at a higher level, but the tradeoff is you've now got more code running unsafely. Rust seems to rely on the hypothesis that, in practice, unsafe code can be kept small and isolated, so a large majority of the codebase gets the benefits of guaranteed safety - is there much evidence for or against that yet? (particularly when writing kernel-like code that has to deal with a lot of fundamentally unsafe hardware)
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 13:24 UTC (Fri) by farnz (subscriber, #17727) [Link]
In terms of evidence for unsafety not growing, there's Redox OS, which has so far been successful at encapsulating unsafe code (at least in terms of apparent safety - it'd take a formal audit to be confident that there are no erroneous safe wrappers). Similarly, Servo (and the parts backported to Firefox) show that it's also practical to keep unsafe to a small area when working in a big project like a web browser.
My practical experience of Rust is that I have yet to encounter a piece of code which has to be unsafe and cannot provide a safe interface to the rest of your project; I have encountered several places where unsafe is used because it's easier to violate UB guarantees than to design a good interface. As an example, I've written several FFIs to internal interfaces where the C code has a "flags tell you which pointers in this structure are safe to access" design, and I've ended up creating a new datastructure that has nested Rust structs and makes use of Option to identify which sub-structs are valid right now.
Personally, I see "unsafe" as a review marker; it tells you that in this block of code, the programmer is making use of features that could lead to UB if abused (the Rust compiler warns about unnecessary use of unsafe), and that you need to be confident that they are upholding the "no UB via Safe code" guarantee that the compiler expects. It's by no means a perfect system, and as Actix shows, it can be abused, but it does show you where you need to hold the code author to a higher standard than normal.
FWIW, having spent the last 2 years working in Rust, I think it's a decent advance on C; I spend longer getting code to actually compile than I did with C11 or C++17, but it then works (both when tested and in production) more often than my C11 or C++17 did.
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 14:06 UTC (Fri) by mathstuf (subscriber, #69389) [Link]
A note is that the amount of *thinking* is decreased overall. Rust front-loads it on the compilation phase (with a decent amount of hand-holding), while C likes to back-load it on the debugging phase with a possibly unbounded timesink (where I, at times, feel like gdb is watching me like the Cheshire cat). The number of times our Rust stuff has fallen over (mostly "unresponsive"[1] or hook deserialization errors[2]) is less than a dozen. But an error (not a crash!) in the deserializer is much easier to diagnose than the "someone threaded a `NULL` *here*‽" questions I find myself asking in C/C++.
[1] Someone pushed a branch as a backport that caused the robot to go checking ~1000 commits with ~30 checks each. The thinking here was actually about how to detect that we should avoid checking so much work (since the branch is obviously too new to backport). It was churning through it as best it could, but it was still taking ~20 minutes to do the work.
[2] GitLab (and GitHub) webhook docs are so anemic about the types they use that it is sometimes surprising that a field is nullable. I can't wait for GraphQL to become more widespread :) .
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 14:10 UTC (Fri) by mathstuf (subscriber, #69389) [Link]
I suspect things like RcuCell are fine because you can still teach Rust about them through an API. For my keyutils library[1], it's a bit harder because, while there's no pointer juggling, the lifetimes of things are tied to kernel objects outside of the Rust code's control. Luckily, synchronization is handled by the kernel as well, so I don't have to worry about *that*, but it does mean that your "owned" data is not really owned, since it's just a handle. It also means that pretty much every function needs to return a Result<>, because basically any operation can end up with an `ENOKEY` result if something else deletes the key out from underneath you.
Who's afraid of a big bad optimizing compiler?
Posted Aug 16, 2019 15:06 UTC (Fri) by PaulMcKenney (subscriber, #9624) [Link]
It will be interesting to see Writing A Kernel Driver in Rust at the upcoming Linux Plumbers Conference. From what I have seen, things like split counters, sequence locks, and base RCU (as opposed to RCU use cases like RcuCell) have been carefully avoided in earlier Rust-language Linux-kernel drivers. Maybe they will take them head-on in this talk. Either way, looking forward to it!
I agree that encapsulating difficult-to-get-right things into libraries is a most excellent strategy where it applies, but there are often things that don't fit well into a safe API. Again, I find a number of Rust's approaches to be of interest, but I do not believe that Rust is quite there yet. As I have said in other forums, my current feeling is that current Rust is to the eventual low-level concurrency language as DEC Alpha is to modern microprocessors.
Who's afraid of a big bad optimizing compiler?
Posted Aug 15, 2019 19:11 UTC (Thu) by PaulMcKenney (subscriber, #9624) [Link]
But both C and C++ do now allow shared variables to be designated as such, for example, using C _Atomic and C++ atomic<T>. Although these are not perfect, they can be quite helpful, especially given that they allow easy use of lighter-weight and more-scalable synchronization mechanisms than MVars and TVars appear to enable.
The problem is that these C and C++ features were not introduced into either standard until 2011, some decades after both languages were first used to write concurrent code. So there is quite a bit of production code that does not mark shared variables, because there was no way to do so at the time that code was written.
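For example (a sketch; the names are illustrative), the designation is per-object, so only the genuinely shared field pays for atomicity:

    #include <stdatomic.h>

    struct stats {
            long scratch;       /* ordinary field: fully optimizable */
            _Atomic long hits;  /* shared: all accesses are atomic */
    };

    void bump(struct stats *s)
    {
            s->scratch++;       /* plain code generation */
            atomic_fetch_add_explicit(&s->hits, 1, memory_order_relaxed);
                                /* lightweight atomic increment; atomicity
                                   without any extra ordering */
    }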
So again, this is a golden opportunity for people who can come up with ways of locating vulnerable concurrent code in large production concurrent C and C++ code bases!
Who's afraid of a big bad optimizing compiler?
Posted May 10, 2020 16:39 UTC (Sun) by rep_movsd (guest, #100040) [Link]
Who's afraid of a big bad optimizing compiler?
Posted May 10, 2020 21:16 UTC (Sun) by farnz (subscriber, #17727) [Link]
Couple of reasons:
- volatile is not necessarily defined clearly enough on all architectures to be what you want. In the event that it's not sufficient on a given architecture, you can make ACCESS_ONCE do the right thing. I don't believe this applies to the kernel, but it's something to be aware of.
- Even on architectures where volatile does do the right thing, it causes pessimisation of all accesses to that object, not just those that need ACCESS_ONCE semantics.
A different coding style wouldn't need to worry about the second case, as it would cast away volatile instead of using ACCESS_ONCE for cases that don't need ACCESS_ONCE semantics, but for better or worse, the Linux coding style doesn't work that way.
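For reference, the kernel's ACCESS_ONCE() was essentially a per-access volatile cast:

    #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))

Only the accesses wrapped in ACCESS_ONCE() are treated as volatile, unlike declaring the object itself volatile, which pessimizes every access. It has since been superseded in the kernel by READ_ONCE() and WRITE_ONCE().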