Aleksey Shipilёv, @shipilev, aleksey@shipilev.net

Caution
This is a very long transcript for a very long talk. The talk’s running time is close to 120-150 minutes, and the transcript may take even longer to digest thoughtfully. Please plan your reading accordingly. There are five large sections which are mostly independent of each other; that should help to split up the reading.
Warning
The talk is correct to the best of my knowledge, but parts of it may still be incorrect, misguided, or plain wrong. The only true source of truth is The Java Language Specification itself. My understanding of JMM was proven incorrect more than once over the past five years. If you see something wrong in the post, don’t hesitate to drop me a note.
Note
This post is also available in ePUB and mobi.

Preface

The Java Memory Model is the most complicated part of the Java spec that must be understood by at least library and runtime developers. Unfortunately, it is worded in such a way that it takes a few senior guys to decipher it for each other. Most developers, of course, are not using the JMM rules as stated, and instead build a few constructions out of its rules, or worse, blindly copy the constructions from senior developers without understanding the limits of their applicability. If you are an ordinary guy who is not into hardcore concurrency, you can skip this post and read high-level books, like "Java Concurrency in Practice". If you are one of the senior folks interested in how all this works, read on!

This post is a transcript of the "Java Memory Model Pragmatics" talk I gave this year at different conferences, mostly in Russian. There seems to be a limited supply of conferences in the world which can accommodate such a long talk, and since I needed some background reading to point to for my JMM Workshop at JVMLS this year, I decided to transcribe it.

We will reuse a lot of the slides, and I’ll try to build the narrative based on them. Sometimes I’ll just skip over the slides without narrative when they are self-explanatory. The slides are available in Russian and English. The slides below are rasterized, but have a nice native resolution. Zoom in if they are unreadable. Sane browsers smartly resize the images, and more details are visible when zoomed in. (I would make the illustrations in SVG, but my iPad crashes when trying to render 150+ of them on one page!)

I would like to thank Brian Goetz, Doug Lea, David Holmes, Sergey Kuksenko, Dmitry Chyuko, Mark Cooper, C. Scott Andreas, Joe Kearney and many others for helpful comments and corrections. The example section on final fields contains the info untangled by Vladimir Sitnikov and Valentin Kovalenko, and is the excerpt from their larger talk on Final Fields Semantics.

Intro

First, a simple detector slide. Hey there, @gakesson, waves!


If you read just about any language spec, you will notice it can be logically divided into two related, but distinct parts. First, a very easy part, is the language syntax, which describes how to write programs in the language. Second, the largest part, is the language semantics, which describes exactly what a particular syntax construct means. Language specs usually describe the semantics via the behavior of an abstract machine which executes the program, so the language spec in this manner is just an abstract machine spec.


When your language has storage (in the form of variables, heap memory, etc.), the abstract machine also has storage, and you have to define a set of rules concerning how that storage behaves. That’s what we call a memory model. If your language does not have explicit storage (e.g. you pass the data around in call contexts), then your memory model is darn simple. In storage-savvy languages, the memory model is there to answer a simple question: "What values can a particular read observe?"


In sequential programs, that seems a vacuous question to ask: since you have the sequential program, the stores into memory come in some given order, and it is obvious that the reads should observe the latest writes in that order. That is why people usually encounter memory models only in multi-threaded programs, where this question becomes complicated. However, memory models matter even in the sequential cases (although they are often cleverly disguised in the notion of evaluation order).


Take, for example, the infamous case of undefined behavior in a C program that packs a few increments between sequence points. This program can satisfy the given assert, but it can also fail it, or otherwise summon nasal demons. One could argue that the result of this program can differ because the evaluation order of the increments differs, but that would not explain, e.g., the result of 12, when neither increment saw the value written by the other. This is the memory model concern: what value should each increment see (and, by extension, what should it store).


Either way, when presented with the challenge of implementing a particular language, we can go one of two ways: interpretation, or compilation of the abstract machine onto the target hardware. Both interpretation and compilation are connected via Futamura Projections anyway.

The practical takeaway is that both interpreters and compilers are tasked with emulating the abstract machine. Compilers are usually blamed for screwing up the memory models and multi-threaded programs, but interpreters are not immune, either. Failing to run an interpreter according to the abstract machine spec may result in memory model violations. The simplest example: cache the field values over volatile reads in an interpreter, and you are done for. This takes us to an interesting trade-off.


The very reason why programming languages still require smart developers is the absence of hypersmart compilers. "Hyper" is not an overstatement: some of the problems in compiler engineering are undecidable, that is, non-solvable even in theory, let alone in practice. Other interesting problems may be theoretically feasible, but not practical. Therefore, to make practical (optimizing) compilers possible, we need to cause some inconvenience in the language. The same goes for hardware, since (at least for Turing machines) it is just the algorithms cast in silicon.


To elaborate on this thought, the rest of the talk is structured as follows.


Part I. Access Atomicity

What Do We Want

The simplest thing to understand in JMM is the access atomicity guarantee. To specify this more or less rigorously, we need to introduce a bit of notation. In the example on this slide, you can see the table with two columns. This notation reads as follows. Everything in the header happened already: all variables are defined, all initializing stores committed, etc. The columns are different threads. In this example, Thread 1 stores some value V2 into global variable t. Thread 2 reads the variable, and asserts the read value. Here, we want to make sure the reading thread observes only one of the known values, not some mix of the two.
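
In plain Java, the property we are after looks roughly like this (my reconstruction of the notation, not the actual slide):

    class AccessAtomicity {
        long t;                         // "header": already initialized to 0

        void thread1() {
            t = -1L;                    // store a known value: all 64 bits set
        }

        void thread2() {
            long r = t;                 // a single read...
            assert r == 0L || r == -1L; // ...should see one of the known values,
                                        // never a mix of halves of the two
        }
    }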


What Do We Have

This seems to be a very obvious requirement for a sane programming language: how can you possibly violate this, and why would you? Here is why.

To maintain atomicity under concurrent accesses, you have to at least have machine instructions operating on operands of the given width, otherwise the atomicity is broken at the instruction level: if you need to split the accesses into several sub-accesses, they can interleave. But even if you have the desired-width instructions, they still can be non-atomic: for example, the atomicity guarantees for 2- and 4-byte reads are unknown for PowerPC (they are implied to be atomic).


Most platforms, however, do guarantee atomicity for up to 32-bit accesses. This is why we have a compromise in JMM which relaxes the atomicity guarantees for 64-bit values. Of course, there are still ways to enforce atomicity for 64-bit values, e.g. by pessimistically acquiring the lock on update and read, but that will come at a cost, and so we provide an escape hatch: users put volatile where they need atomicity, and VM and hardware work together to preserve it, no matter what the costs are.


On most hardware it is not enough, however, to have the desired-width operations to maintain atomicity. For example, if data access causes multiple transactions to memory, the atomicity is off, even though we executed a single access instruction. In x86, for example, the atomicity is not guaranteed if the read/write spans two cache lines, since it requires two memory transactions. This is why generally only the aligned reads/writes are atomic, which forces VMs to align the data.

In this example, printed by JOL (Java Object Layout), we can see the long field being allocated at offset 16 from the object start. Coupled with object alignment of 8 bytes, we have a perfectly aligned long. Now, it would not violate the memory model to put the long at offset 12, if we know it is not volatile, but that will only work on x86 (other platforms may violently disagree about performing misaligned accesses), and possibly at a performance disadvantage.
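
If you want to produce such a layout dump yourself, JOL is used roughly like this (a sketch; the Holder class is my own stand-in for the class on the slide):

    import org.openjdk.jol.info.ClassLayout;

    public class LayoutDump {
        static class Holder {
            int i;      // small field, likely packed right after the object header
            long l;     // expected at an 8-byte-aligned offset, e.g. 16
        }

        public static void main(String[] args) {
            // prints field offsets, alignment, and padding for the instance
            System.out.println(ClassLayout.parseInstance(new Holder()).toPrintable());
        }
    }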


Test Your Understanding

Let’s test our understanding with a simple quiz. Recall that setting a long to -1L sets all of its 64 bits to 1.
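
The quiz code is along these lines (a reconstruction, since the slide itself is not reproduced here):

    import java.util.concurrent.atomic.AtomicLong;

    class AtomicLongQuiz {
        AtomicLong al = new AtomicLong();   // starts at 0

        void thread1() {
            al.set(-1L);                    // all 64 bits set to 1
        }

        void thread2() {
            long r = al.get();
            // Quiz: is this guaranteed to hold, and what makes it work?
            assert r == 0L || r == -1L;
        }
    }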

Answer (select over to reveal): No magic is involved; a volatile long field inside AtomicLong guarantees this. This is required by the language spec, and no special treatment for AtomicLong from the VM side is needed for this sample to work.


Value Types and C/C++

In Java, we are "lucky" to have the built-in types of small widths. In other languages which provide value types, the type width is arbitrary, which presents interesting challenges for the memory model.

In this example, C++ follows C compatibility by supporting structs. C++11 additionally supports std::atomic, which requires access atomicity for every Plain Old Data (POD) type T. So, if we do a trick like this in C++11, the implementations are forced to deal with atomically writing and reading the 104-byte memory blocks. There are no machine instructions which can guarantee atomicity at these widths, so implementation should resort to either CAS-ing, or locking, or something else.

(It gets even more interesting since C++ allows separate compilation: now the linker is tasked with the job of figuring out what locks/CAS-guards are used by this particular std::atomic. I am not completely sure what happens if threads execute the code generated by different compilers in the example above.)


JMM Updates

This section covers the atomicity considerations for the updated Java Memory Model. See a more-thorough explanation in a separate post.

In 2014, do we want to reconsider the 64-bit exception? There are a few use cases where racy updates to long and double make sense, e.g. in scalable probabilistic counters. Developers may reasonably hope that long/double accesses are atomic on 64-bit platforms, but they nevertheless require volatile to be portable in case the code is accidentally run on 32-bit platforms. Marking the fields volatile means paying the cost of memory barriers.

In other words, volatile is overloaded with two meanings: a) access atomicity, and b) memory ordering. You cannot get one without getting the other as baggage. One can speculate on the costs of removing the 64-bit exception. Since VMs already handle access atomicity separately by emitting special instruction sequences, we can hack the VM into unconditionally emitting atomic instruction sequences when required.


It takes some time to understand this chart. We measure reads and writes of longs, once for each of the three access modes (plain, volatile, and via Unsafe.putOrdered). If we implement the feature correctly, there should be no difference on 64-bit platforms, since the accesses are already atomic. Indeed there is no difference between the colored bars on 64-bit Ivy Bridge.

Notice how heavyweight a volatile long write can be. If I only wanted atomicity, I would still be paying the cost of memory ordering.


It gets more complicated when dealing with 32-bit platforms. There, you will need to inject special instruction sequences to get the atomicity. In the case of x86, FPU loads/stores are 64 bits wide even on 32-bit platforms. You pay the cost of "redundant" copies, but not that much.


On non-x86 platforms, we also have to use alternative instruction sequences to regain atomicity, with a predictable performance impact. Note that in this case, as well as in the 32-bit x86 case, volatile is a bit slower with enforced atomicity, but that’s a systematic error, since we also need to dump the values into a long field to prevent some compiler optimizations.


Part II. Word Tearing

What Do We Want

Word tearing is related to access atomicity.

If two variables are distinct, then the actions on them should also be distinct, and should not be affected by the actions on adjacent elements. How can this example break? Quite simple: if our hardware cannot access a distinct array element, it will be forced to read several elements, modify one element in the bunch, and then put the entire bunch back.

If two threads are doing the same dance on their separate elements, it might happen that one thread stores its stale copy of the bunch back to memory, overwriting an element just updated by the other thread. This may and will cause lots of headaches for unsuspecting users, because without the clear provisions in the language spec, runtimes are free to apply transformations that can lead to hard-to-diagnose bugs.
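
In Java terms, the guarantee we want is roughly this (a sketch):

    class WordTearing {
        boolean[] flags = new boolean[2];   // two distinct, adjacent variables

        void thread1() {
            flags[0] = true;                // touches only element 0
        }

        void thread2() {
            flags[1] = true;                // touches only element 1
        }

        void afterJoiningBothThreads() {
            // Java forbids word tearing: neither store may wipe out the other
            assert flags[0] && flags[1];
        }
    }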


What Do We Have

If we want to prohibit word tearing, we need hardware support for accesses of a given width. In the simplest scenario of a boolean[] array or a group of boolean fields, you can’t readily access a single memory bit on most hardware, since the lower addressability bound is usually a single byte.


Remarkably, you have to explain word tearing to programmers these days. Most of the systems programmers from days before are intimately familiar with it, and do understand the horror of chasing such a bug in a real system.

Therefore, Java, shooting to be a sane language, forbids word tearing. Period. Bill Pugh (FindBugs is the baby he is best known for, but he was also the lead for JSR 133, the JMM overhaul) was quite articulate about that. I was chasing a word-tearing bug in a C++ program once — NOT FUN.

This requirement seems rather easy to satisfy on current hardware: the only data type you may care about is boolean, which you would probably make occupy a full byte instead of a single bit. Of course, you also need to tame any compiler optimizations which may buffer reads and writes along with the adjacent data.


Most people look to the docs for the allowed range of primitive values and infer the machine-representation widths from there. But that only implies the minimum machine width needed to represent, say, the 2^64 possible values of long. It does not oblige a runtime to actually allocate 8 bytes per long; it could, in principle, use 128-byte longs, if that were practical for some weird reason.

However, most runtimes I know of are practical, and the machine representation widths closely fit the value domains, wasting no space. boolean, as I said before, is the only exception to this rule. JOL tries to figure out the actual machine widths, and you can see the scales on this slide. The numbers are the bytes taken by reference, boolean, byte, short, char, int, float, long, and double, respectively — exactly what we would expect them to be. Other platforms may turn out to be…​ strange.


Test Your Understanding

Answer (select over to reveal): Any of (true, true), (false, true), (true, false), because BitSet stores the bits densely in long[] arrays and uses bit magic to access a particular bit. Winning greatly in memory footprint, it gives up the word-tearing guarantees of the language. (BitSet Javadocs say multi-threaded usages should be synchronized, so this is arguably an artificial example.)
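
For reference, the quiz program is along these lines (my reconstruction):

    import java.util.BitSet;

    class BitSetQuiz {
        BitSet bs = new BitSet();

        void thread1() {
            bs.set(0);          // bit 0 lives in the same long word...
        }

        void thread2() {
            bs.set(1);          // ...as bit 1: a racy read-modify-write underneath
        }

        void afterJoiningBothThreads() {
            System.out.println(bs.get(0) + ", " + bs.get(1));   // which pairs are possible?
        }
    }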


Layout Control and C/C++

Quite a few people want to control the memory layout of a particular class for a better footprint in marginal cases, and/or better performance. But in a language that allows an arbitrary layout for its variables, you cannot cheaply forbid word tearing: you would have to pay the price, as in this example.

There are no machine instructions that can write 7 bits at once, or read only 3 bits at once, so implementations would need to get creative if they are tasked with avoiding word tearing. C/C++11 allow you to use this sharp tool, but tell you that once you start, you are on your own.


JMM Updates

Nobody disputes that word tearing should remain forbidden.


Part III: SC-DRF

What Do We Want

Now we are getting to the most interesting part of a memory model: reasoning about program reads at large. It would be natural to think that programs are executing their statements in some global order, sometimes switching between the threads. This is a really simple model, and Lamport has defined it for us already: sequential consistency.


Notice the highlight. Sequential consistency does not mean the operations were executed in a particular total order! (The stronger strict consistency provides that). It is only important that the result is indistinguishable from some execution which has the total order of operations. We call executions like these sequentially consistent executions, and their results are called sequentially consistent results.


SC apparently gives us the opportunity to optimize the code. Since we are not bound by an actual total execution order, but only have to pretend to have one, we can do funny optimizations. For example, this program transformation does not break SC: there is obviously an SC execution of the original program which yields the same result (assuming nobody cares about the values of a and b anymore).

Notice that SC allows us to shrink the set of possible executions. At the extreme, we are free to choose a single order and stick to it.


What Do We Have

However, the optimizability under SC is overrated. Notice that current optimizing compilers, not to mention hardware, only care about the current instruction stream. So, if we have two reads in the instruction stream, can we reorder them as in this example and maintain SC?


Turns out, you can’t. If another part of the program stores values into a and b, then read reordering breaks SC. Indeed, the original program executing under SC can only have results matching (*, 2) or (0, *), but the modified program, even executed in a totally ordered manner, yields (1, 0), baffling developers expecting SC from their code.
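
To make this concrete, here is a sketch of the program in question (reconstructed from the outcomes described above, so the details may differ from the slide):

    class ReadReordering {
        int a, b;                   // both start at 0

        void writer() {
            b = 2;
            a = 1;
        }

        void reader() {
            int r1 = a;             // the two reads under discussion
            int r2 = b;
            // Under SC with this read order, (r1, r2) = (1, 0) is impossible:
            // seeing a == 1 implies b == 2 has already been written.
            // Swap the two reads, and (1, 0) becomes observable even in a
            // totally ordered execution.
        }
    }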


You see, then, to figure out whether even a very simple transformation is plausible, you need sophisticated analysis, which does not readily scale to realistic programs. In theory, we can have a smart global optimizer (GMO) that can perform this analysis. I think the existence of a GMO is closely tied to the existence of Laplace’s Demon :)

But since we don’t have a GMO, all optimizations are conservatively forbidden for fear of inadvertently violating SC, and that costs performance. So what? We can go without those transformations, right? Unlikely: even the very basic transformations would be forbidden. Think about it: can you keep a variable in a register if that effectively eliminates the reads elsewhere in the program, i.e. does reordering?


…​and while we can forbid some of the optimizations in compilers to stop them from wreaking havoc in otherwise SC programs, hardware cannot be easily negotiated with. Hardware already reorders lots of stuff, and provides only costly escape hatches to inhibit reorderings ("memory barriers"). Therefore, a model which does not give leeway on what transformations are possible and what optimizations are encouraged would not realistically run with decent performance. For example, if we were to require sequential consistency in the language, we would probably have to pessimistically emit memory barriers around almost every single memory access, in order to slay hardware attempts at "optimizing".


Moreover, if your program contains races, current hardware does not guarantee any particular outcome from those conflicting operations. Hans Boehm and Sarita Adve stand firm on this.


Therefore, to accommodate the reality into the model with plausible performance, we need to weaken it.


Java Memory Model

This is where things get significantly more complicated. Since the language spec should cover all possible programs expressible in the language, we can’t really provide a finite number of constructions which are guaranteed to work: their union would leave blind spots in the semantics, and blind spots are bad.

Therefore, the JMM tries to cover all possible programs at once. It does so by describing the actions which an abstract program can perform, and those actions describe what outcomes can be produced when executing a program. Actions are bound together in executions, which combine actions with additional orders describing the action relationships. This feels very ivory-tower-esque, so let’s get to the example right away.


Program Order

The very first order is Program Order (PO). It orders the actions in a single thread. Notice the original program, and one of the possible executions of this program. There, the program can read 1 from x, fall through to the else branch, store 1 to z, and then go on to read something from y.
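
A reconstruction of the kind of program on that slide (variable names and values are my guesses, inferred from the narration):

    class ProgramOrderExample {
        int x, y, z;

        void thread1() {
            int r1 = x;         // e.g. read 1 from x...
            if (r1 == 2) {
                y = 1;
            } else {
                z = 1;          // ...fall through to the else branch, store 1 to z
            }
            int r2 = y;         // ...then go on to read something from y
        }
    }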


Program order is total (within one thread), i.e. each pair of actions is related by this order. It is important to understand a few things.

The actions linked together in program order are not precluded from being "reordered". In fact, it is a bit confusing to talk about reordering of actions at all: one probably intends to speak of statement reordering in a program, which generates new executions. It will then be an open question whether the executions generated by this new program violate the provisions of the JMM.

The program order does not, and I repeat, does not provide any ordering guarantees. The only reason it exists is to provide the link between possible executions and the original program.


This is what we mean. Given the simple schematics of actions and executions, you can construct an infinite number of executions. These executions are detached from any reality; they are just the "primordial soup", containing everything possible by construction. Somewhere in this soup float the executions which can explain a particular outcome of the given program, and the set of all such plausible executions covers all plausible outcomes of the program.


Here is where Program Order (PO) jumps in. To filter out the executions we can take to reason about the particular program, we have intra-thread consistency rules, which eliminate all unrelated executions. For instance, in the example above, while the illustrated execution is abstractly possible, it does not relate to the original program: after reading 2 from x, we should have written 1 to y, not to z.


Here is how we can illustrate this filtering. Intra-thread consistency is the very first execution filter, one which most people apply implicitly in their heads when dealing with JMM. You may notice at this point that JMM is a non-constructive model: we don’t build up the solution inductively, but rather take the entire realm of executions and filter it down to the ones we are interested in.


Synchronization Order

Now we begin to build the part of the model which really orders stuff. In weak memory models, we don’t order all the actions, we only impose a hard order on a few limited primitives. In the JMM, those primitives are wrapped in their respective Synchronization Actions: volatile reads and writes, lock and unlock operations, thread start and join, and a few others.


Synchronization Order (SO) is a total order which spans all synchronization actions. But this is not the most interesting part about this order. The JMM provides two additional constraints: SO-PO consistency, and SO consistency. Let’s unpack these constraints using a trivial example.


This is a rather simple example derived from the Dekker Lock. Try to think what outcomes are allowed and why. After that, we’ll move on to analyzing it with the JMM.
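
The example is roughly as follows, with both variables volatile so that every access is a synchronization action (a reconstruction of the slide):

    class DekkerFragment {
        volatile int x, y;          // both start at 0

        void thread1() {
            x = 1;
            int r1 = y;
        }

        void thread2() {
            y = 1;
            int r2 = x;
        }

        // Spoiler from the slides that follow: with volatile x and y, the outcome
        // (r1, r2) = (0, 0) is impossible, because it would require a cycle in SO.
    }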


The slides below are self-explanatory, and we’ll just skip them over:


Now if we look at these rules more closely, we’ll notice an interesting property. SO-PO consistency tells us that the effects in SO are visible as if the actions are done in program order. SO consistency tells us to observe all the actions preceding in the SO, even those that happened in a different thread. It is as if SO-PO consistency tells us to follow the program, and SO consistency allows us to "switch between threads" with all effects trailing us. Mixed with the totality of SO, we arrive at an interesting rule:


Synchronization Actions are sequentially consistent. In a program consisting only of volatile accesses, we can reason about the outcomes without deep thinking. Since SAs are SC, we can construct all the action interleavings, and figure out the outcomes from there. Notice there is no "happens-before" yet; SO is enough to reason.


IRIW ("independent reads of independent writes") is another good example of SO properties. Again, all operations yield synchronization actions. The outcomes may be generated by enumerating all the interleavings of program statements. Only a single quad is forbidden by that construction: the one in which we would observe the writes of x and y in different orders in different threads.
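
The IRIW litmus test looks roughly like this (all fields volatile):

    class IRIW {
        volatile int x, y;          // both start at 0

        void writer1() { x = 1; }
        void writer2() { y = 1; }

        void reader1() {
            int r1 = x;
            int r2 = y;
        }

        void reader2() {
            int r3 = y;
            int r4 = x;
        }

        // Enumerating interleavings of these synchronization actions allows every
        // combination except (r1, r2, r3, r4) = (1, 0, 1, 0): that quad would mean
        // the two readers saw the writes to x and y in opposite orders.
    }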

The real takeaway was best summed up by Hans Boehm. If you take an arbitrary program, no matter how many races it contains, and sprinkle enough volatile-s around that program, it will eventually become sequentially consistent, i.e. all the outcomes of the program would be explained by some SC execution. This is because you will eventually hit a critical moment when all the important program actions turn into synchronization actions, and become totally ordered.


To conclude with our Venn diagram, SO consistencies filter out the executions with broken synchronization "skeletons". The outcomes of all the remaining executions can be explained by program-order-consistent interleavings of synchronization actions.


Happens-Before

While providing a good basis to reason about programs, SO is not enough to construct a practical weak model. Here is why.


Let us analyze a simple case. Given all we have learned so far about SO, do we know if the (1, 0) outcome is allowed?
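
The example, reconstructed: x is a plain field, g is volatile.

    class Publication {
        int x;                      // plain field
        volatile int g;             // volatile, hence a synchronization action

        void thread1() {
            x = 1;
            g = 1;
        }

        void thread2() {
            int r1 = g;
            int r2 = x;
            // Question: having read r1 == 1, are we guaranteed r2 == 1,
            // i.e. is the outcome (r1, r2) = (1, 0) allowed?
        }
    }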


Let’s see. Since SO only orders the actions over g, nothing prevents us from reading either 0 or 1 from x. Bad…​


We need something to connect the thread states, something which will drag the non-SA values along. SO is not usable for that, because it is not clear when and how it drags the state along. So, we need a clear-cut suborder of SO which describes the data flow. We call this suborder synchronizes-with order (SW).


It is rather easy to construct SW. SW is a partial order, and it does not span all the pairs of synchronization actions. For example, even though two operations on g on this slide are in SO, they are not in SW.


SW only pairs the specific actions which "see" each other. More formally, the volatile write to g synchronizes-with all subsequent reads from g. "Subsequent" is defined in terms of SO, and therefore, because of SO consistency, the write of 1 only synchronizes-with the reads that observe that 1. In this example, we see the SW between two actions. This suborder gives us the "bridge" between the threads, but it applies only to synchronization actions. Let’s extend this to other actions.


Intra-thread semantics are described by Program Order. Here it is.


Now, if we construct the union of the PO and SW orders, and then transitively close that union, we get the derived order: Happens-Before (HB). HB in this sense acquires both inter-thread and intra-thread semantics. PO leaks the information about sequential actions within each thread into HB, and SW leaks the points where the state "synchronizes". HB is a partial order, and allows for the construction of equivalent executions with reordered actions.


Happens-before comes with yet another consistency rule. Remember the SO consistency rule, which stated that synchronization actions should see the latest relevant write in SO. Happens-before consistency is similar in its application to the HB order: it dictates what writes can be observed by a particular read.


HB consistency is interesting in that it allows races. When no races are present, we can only see the latest preceding write in HB. But if we have a write unordered in HB with respect to a given read, then we can also see that (racy) write. Let’s define it more rigorously.


The first part is rather relaxing: we are allowed to observe the writes that happened before us, or any other unordered write (that’s a race). This is a very important property of the model: we specifically allow races, because races happen in the real world. If we forbade races in the model, runtimes would have a hard time optimizing code, because they would need to enforce order everywhere.

Notice how that disallows seeing writes ordered after the read in HB order.


The second part puts additional constraint on seeing the preceding writes: we can only see the latest write in happens-before order. Any other write before that is invisible to us. Therefore, in the absence of races, we can only see the latest write in HB.


The consequence of HB consistency is to filter out yet another subset of executions, keeping those which observe only what we allow them to observe. HB extends over non-synchronized actions, and therefore lets the model embrace all actions in the executions.


This is what SC-DRF is all about: if we have no races in the program — that is, all reads and writes are ordered either by SO or HB — then the outcome of this program can be explained by some sequentially consistent execution. There is a formal proof of the SC-DRF property, but we will rely on an intuitive understanding of why this should be true.


Happens-Before: Publication

The examples above were rather highbrow, but that is how the language spec is defined. Let’s look at an example to understand this more intuitively. Take the same code example as before, and analyze it with the HB consistency rules.


This execution is happens-before consistent: read(x) observes the latest write in HB. The outcome (1, 1) is therefore plausible.


This execution is happens-before consistent, as we read the default value of x. We omitted from this chart the HB edge coming from the default initialization, which synchronizes-with the first action in every thread.


Somewhat surprisingly, the execution with outcome (0, 1) is also happens-before consistent, even though there is no transitive HB between the read and the write of x. We just read the value via the race — remember the first part of the HB consistency definition.


And this execution fails to adhere to HB consistency, and therefore cannot be used to reason about the program outcomes. Therefore this outcome is impossible. Notice that we eliminated (1, 0) from the four possible outcomes, and that effectively means we are forced to observe x as 1, if we observed g as 1.


It will hurt our brains to figure out HB orders for real programs, so instead we can derive some simple rules. The source of a synchronizes-with edge is called "release", and the destination is called "acquire". HB contains SW, and therefore, HB spanning different threads also starts at "release", and ends at "acquire".

Release can be thought of as the operation which releases all the preceding state updates into the wild, and acquire is the paired operation which receives those updates. So, after a successful acquire, we can see all the updates preceding the paired release.

Because of the constructions we laid out before, it only works if acquire/release happened on the same variable, and we actually saw the written value. The quiz below further explores this.


Happens-Before: Test Your Understanding

Let’s play a bit more realistically here. Suppose you have a wrapper class which stores (mail-boxes, if you will) some value of type T. Obviously, you have the setter which takes the value, and the getter which returns it. In most programs, reads vastly outnumber writes (otherwise why are you storing the value?), so synchronized getters may become scalability bottlenecks.
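
A sketch of the kind of class being discussed (my reconstruction, with a set-once setter; the exact shape of the slide code may differ):

    class Box<T> {
        private T val;

        public synchronized void set(T v) {
            if (val != null)
                throw new IllegalStateException("already set");
            val = v;
        }

        public synchronized T get() {
            return val;
        }
    }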


People come with their profilers, look at the code and argue: well, it’s just a simple value T, we store it under synchronization, caches are flushed by synchronization, and so we can skip synchronization on read.

Is that true? Answer (select over to reveal): There is certainly a release action on monitor unlock in the setter, but the acquire action is missing in the getter. Therefore, the memory model does not mandate that the values stored before the store to val be visible after we read val in another thread — very bad news if those were the stores into the fields of the object referenced by val.


Acquire barrier is missing, you say? OK, let us add one, since we "know" the compiler emits one for a volatile read.
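
That "fix" might look like this (a sketch of the anti-pattern being described):

    class Box<T> {                              // same as before, except for the getter
        private static volatile int BARRIER;    // dummy volatile, read only for its "barrier"
        private T val;

        public synchronized void set(T v) {
            if (val != null)
                throw new IllegalStateException("already set");
            val = v;
        }

        public T get() {
            int ignored = BARRIER;              // hoped-for "acquire barrier"
            return val;                         // plain read, no lock
        }
    }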

Is this broken? If so, why? Answer (select over to reveal): In current practice, it works for a given conservative VM implementation, but JMM-wise, since we do not acquire on the same variable that we released on, this is not guaranteed. In short, a smarter VM can see that you do not use the value just read, can therefore pretend it did not see any updates to BARRIER, and can eliminate the read altogether.


This is a correct way to do this. Marking the field volatile provides the release action in the setters, and the paired acquire in the getters. This allows us to drop synchronized in the getters, and leave only the lightweight volatile.
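
In code, the correct variant is roughly:

    class Box<T> {
        private volatile T val;     // volatile: the write is a release, the read is an acquire

        public synchronized void set(T v) {
            if (val != null)
                throw new IllegalStateException("already set");
            val = v;                // release
        }

        public T get() {
            return val;             // acquire; no lock needed here
        }
    }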

Is synchronized in the setter still required? Answer (select over to reveal): Yes, because the setter requires mutual exclusion: it should set the val only once.


JMM Interpretation: Roach Motel

It may be hard for an optimizing compiler to figure out if a particular optimization breaks the provisions of the JMM. Some advanced compilers may construct the memory flows directly. But regular compiler folks need a set of simple rules to figure out whether something is allowed or not. The JSR 133 Expert Group created The JSR 133 Cookbook For Compiler Writers to cover this.

It is important to note that the Cookbook is a set of conservative interpretations, not the JMM itself. We will talk briefly about how those interpretations may be derived.


Consider a program, which can be represented by this template execution. The first two types of reorderings are simple:


These rules effectively allow pushing the code into the acquire/release blocks, e.g. pushing the code into the locked regions, which enables lock coarsening without violating JMM.


Another two types of reorderings are conservatively forbidden. Note that they are not forbidden by JMM itself, but we have to forbid them whenever local analysis is not able to prove the move correct (in some cases, e.g. field stores in constructors, it can):
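
As a rule-of-thumb illustration of this "roach motel" semantics (my sketch, not an example from the Cookbook): ordinary accesses may move into an acquire/release-delimited region, but not out of it.

    class RoachMotel {
        int a, b;
        final Object lock = new Object();

        void example() {
            a = 1;                  // MAY be moved down, into the locked region
            synchronized (lock) {   // monitor enter = acquire ... monitor exit = release
                int r = a;
                b = r + 1;          // conservatively, these stay inside: moving them
            }                       // above the acquire or below the release is forbidden
            int r2 = b;             // MAY be moved up, into the locked region
        }
    }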


Test Your Understanding

Let us try some real examples again.
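
The quiz program is along these lines (a reconstruction based on the answer below):

    class ReadyQuiz {
        int x;
        volatile boolean ready;

        void writer() {
            x = 42;
            ready = true;           // volatile store: release
            x = 43;                 // racy store, unordered with the reader's read of x
        }

        void reader() {
            while (!ready);         // volatile load: acquire
            System.out.println(x);
        }
    }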

What can this code print? Answer (select over to reveal): There is a synchronizes-with edge between the store of ready and the read that observes ready == true. Therefore, we can see the latest write in the HB order, and that is 42. However, we can also see the out-of-HB (racy) write, and that also brings us 43.


Now we drop volatile.

What can this code print? Answer (select over to reveal): Any of the written values is possible, because we can observe any of them via the race, and we can also see nothing at all if the while loop is reduced to while(true).


Benchmarks

Of course, what’s the use of me posting anything without benchmarks? We want to quantify at least some of the costs. It does not strike me as a good idea to measure the absolute numbers, and therefore we will only show a few important high-level points. The benchmarks are driven by JMH, and we assume you are familiar with it.


Let us start with a "hoisting" benchmark. We would like to run a synthetic test which naively computes v times v. The difference lies within the sharing of underlying Storage, and v volatility. Not surprisingly, when we are reading stuff, it seems like sharing is not important.

The volatile test cases are significantly slower. However, it is not the cost of the volatiles themselves, but rather the consequence of the implementation (conservatively) not hoisting s.v out of the loop, since hoisting would move the reads before the acquires (see the "Roach Motel" above). Pre-reading s.v into a local variable and measuring again is left as an exercise for the reader.
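
This is not the actual benchmark from the talk, but a minimal JMH sketch of the same shape (the "unshared" variants would use @State(Scope.Thread) instead):

    import org.openjdk.jmh.annotations.*;

    public class HoistingBench {

        @State(Scope.Benchmark)             // one instance shared by all benchmark threads
        public static class SharedStorage {
            int plainV = 42;
            volatile int volatileV = 42;
        }

        @Benchmark
        public int plain(SharedStorage s) {
            int acc = 0;
            for (int i = 0; i < 100; i++)
                acc += s.plainV * s.plainV;         // the read may be hoisted out of the loop
            return acc;
        }

        @Benchmark
        public int volatileRead(SharedStorage s) {
            int acc = 0;
            for (int i = 0; i < 100; i++)
                acc += s.volatileV * s.volatileV;   // volatile reads stay inside the loop
            return acc;
        }
    }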


For a writing test, we can start incrementing the same variable. We do a bit of backoff to stop bashing the system with writes, and here we can observe the difference both between the shared/unshared cases and between the volatile/non-volatile cases. One would expect the volatile tests to lose across the board; however, it is the shared tests that are losing. This reinforces the idea that data sharing is what you should avoid in the first place, not volatiles.


JMM Updates

Current mainstream languages seem to adopt SC-DRF across the board. However, there is evidence that strictly supporting SC-DRF might not be profitable for all scenarios. For example, Linux RCU relaxes some of the constraints with very good performance improvements on weakly-ordered platforms, and arguably does that without hurting usability much.

So the question for the next JMM update is: can/should we relax SC-DRF to get more performance?


Part IV: Out of Thin Air

What Do We Want

It would seem that once SC-DRF is established, we are good to go. Any transformation is valid in between synchronization actions, because if there was an unannotated race in the code, the behavior was non-SC to begin with. Teach runtimes the SC-DRF rules and you are good, right?


What Do We Have

However, this is only part of the truth. Some transformations still break SC. Consider this program: somewhat surprisingly, it is correctly synchronized, since all SC executions have no races, and the only possible result is (0, 0)
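
Reconstructing the program from the description (the slide may differ in small details, e.g. in the exact condition):

    class OutOfThinAir {
        int a, b;                   // both start at 0; no volatiles anywhere

        void thread1() {
            int r1 = a;
            if (r1 != 0)
                b = 42;
        }

        void thread2() {
            int r2 = b;
            if (r2 != 0)
                a = 42;
        }

        // In every sequentially consistent execution, both reads return 0, neither
        // branch is taken, and no write races with any read; hence the program is
        // correctly synchronized, and (r1, r2) = (0, 0) is the only SC result.
    }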


Now, an optimizing runtime/hardware comes in and tries to speculate over the branch. It might be a good idea at times to speculatively execute the branch and back off if speculation failed.

Suppose we did this speculation in the code.
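
One way such a speculation could look for the first thread, replacing thread1() in the sketch above (again, my sketch of the problematic transformation, not a real compiler's output):

    void thread1_speculated() {
        b = 42;                     // store done early, betting the branch will be taken
        int r1 = a;
        if (r1 == 0)
            b = 0;                  // speculation failed: try to undo the early store
        // The trouble: the second thread may observe b == 42 in the window before
        // the undo, write a = 42, and thereby make r1 == 42, so the speculation
        // justifies itself, and the undo never happens.
    }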


Let’s now execute the modified program. The second thread runs unmodified. If we run this program in the order depicted on the slide, we will get (42, 42), because the speculation has turned itself into a self-justifying prophecy! It seems as if 42 came out of thin air.

This example seems artificial, until you realize the variable a could easily be, say, System.securityManager, and we had just undermined platform security guarantees! Scary.


To cover for that, JMM forbids out-of-thin-air (OoTA) values. To constructively forbid OoTA, you need some notion of causality (i.e. "what caused what"), which is tricky to introduce in a model that tries to escape the deadly embrace of global time.


The entire section in JLS 17.4.8 tries to rigorously specify the "commit semantics", which additionally validates the executions for causality violations. We won’t dive into details here, so enjoy this nice Bill-looking guy trying to explain it.


Commit semantics provide the last filter over the soup of executions. Executions violating the causality requirements cannot be used to reason about the program.


This takes us to the final picture. In order to test if an execution is plausible under JMM, you need to see if it passes all the requirements. Note, however, that you can quickly branch-and-bound the set of considered executions based on the failure of a particular test. In most cases we don’t even get to commit semantics, because all the executions that passed the other filters yield only the desired outcomes, and we don’t care to distinguish between them anymore.


OoTA and C/C++

Remarkably, Java so far seems to be the only platform which tried to specify what OoTA really means. Not that Java is very successful with that, given the very complicated and sometimes counter-intuitive causality model.


JMM Updates

Therefore, in the next JMM update the largest question of all is being asked: can we reformulate/fix this to be more bullet-proof, concise, and understandable?


Part V: Finals

Test Your Basic Understanding

Since we learned a lot about JMM already, let’s start with this simple quiz. What does this program print? Answer (select over to reveal): Nothing, 0, 42, or it throws an NPE (!). Racy reads abound, and we can really read any value written: either the default one, or the one written in the constructor. We can even observe (a != null) first and go on to print, only to find that the second racy read returned (a == null), setting us up for an NPE.
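
For reference, the quiz program is along these lines (my reconstruction):

    class FinalQuiz {
        static class A {
            int x;                  // note: not final (yet)
            A() { x = 42; }
        }

        static A a;                 // racy publication point

        static void thread1() {
            a = new A();
        }

        static void thread2() {
            if (a != null)                  // first racy read of a
                System.out.println(a.x);    // re-reads a, then reads x: both racy
        }
    }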


What Do We Want

We obviously want to modify the object declaration in such a way that we only get 42 (or nothing). You can guess what hides under five question marks there, right?


We need this to protect ourselves from races. We cannot afford to catch fire and break object invariants if an object receiver acts maliciously — otherwise we can’t write secure code.

Nowadays, some brave folks dance around the races trying to optimize performance. Go read Hans again, please.


What Do We Have

Final fields are remarkably simple to implement, compared to how we need to spec them. On most architectures, it is enough to put a memory barrier at the end of the constructor, and to tie the loads of the object's fields to the load of the original reference via a data dependency. Done.


The specification, however, gets rather complicated. We have to reference the constructors in an otherwise syntax-oblivious spec, and introduce special freeze actions at the end of the constructor. Intuitively, this freeze action "sticks" the fields with the values initialized in the constructor. It does not, however, prevent the fields from being modified (you can still circumvent finality through Reflection); freeze is only about the initializing stores.


And here is how it is formally specified. Notice that w and r2 may or may not be the write and the read of the final field; they might just as well be the write and read of an ordinary, non-final field. What really matters is that the subchain containing the freeze action F, some action a, and r1 which reads the final field — all together makes r2 observe w.

Notice that two new orders, the dereference chain (dr) and the memory chain (mc), are needed to qualify how the frozen values propagate.


We "only" need to make sure that all paths from a target read lead up to a write of frozen value via F and the final field read.


Constructive Example

This example was greatly untangled by Vladimir Sitnikov and Vladimir Kovalenko, kudos to them! Here is a visualization based on their analysis:


Premature Publication

This is a great example from Jeremy Manson’s POPL paper. There, the first thread initializes the object and stores 42 to the final field f, then "leaks" the object reference through p, and only then properly publishes it via q.

Conventional wisdom suggests that final field guarantees evaporate with premature publication, but really, the third thread only observes the fully-constructed object, and we can find only the proper final path. (See the example above for analogy.)

The second thread, however, breaks out of the final path when reading through p, and therefore may observe the non-frozen value. It is somewhat surprising that its read through q can also observe the non-frozen value. This is formally allowed by the properties of the dr and mc orders, and has a pragmatic reason:


The pragmatic reason is that runtimes may cache final field values once they have read them! Which means that if the compiler discovered that p and q alias the same object, then it can say r3 = r2, and be done with it. So, if we observed the under-constructed object once, our thread becomes tainted, and all hell breaks loose.


Test Your Understanding (tricky)
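
The quiz class is, reconstructed, something like this: the final field gets its value outside the constructor body (e.g. via a field initializer), and the object is published racily as in the earlier quiz.

    class InitializerQuiz {
        static class A {
            final int x = 42;       // field initializer, not an explicit constructor store
        }

        static A a;

        static void thread1() { a = new A(); }

        static void thread2() {
            if (a != null)
                System.out.println(a.x);    // is 42 still guaranteed here?
        }
    }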

Notice that the spec talks about initializations in constructors, and here we have something else. Answer (select over to reveal): Of course, we will see either 42, or nothing. Field initializers and instance initializers fire in the course of instance initialization, and are arguably part of the constructor.


JMM Updates

There are quite a few problems with final-s, mostly with their orthogonality with regard to other JMM features. It is particularly interesting how to achieve visibility for initializing stores in a constructor if the field is already volatile. (Hint: as the spec currently stands, volatile alone is not enough.)

It is also interesting, from a pedagogical standpoint, whether we should ostracize users who forget to declare their write-once-in-constructor fields final, and whose code then blows up on non-x86 platforms.

Therefore, the next JMM update needs to decide whether we should extend the final field guarantees to all fields initialized in all constructors.


Benchmarks

This section covers the final field considerations for the updated Java Memory Model. See the more-thorough explanation in a separate post.

Of course, we would like to rigorously quantify what it would cost to mark all fields final. Since final field stores require memory barriers in the constructors on weakly-ordered platforms, we also use an ARM host as a testing platform.


Here are the benchmarks: "chained" calls N constructors up the hierarchy, initializing a single field per class; "merged" initializes all N fields at once in a single constructor. Fields can be plain or final. We test with different N-s to see if performance changes in a sane manner.
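
The two shapes look roughly like this for N = 2 (field names and types are my stand-ins):

    // "chained": N constructors up the hierarchy, one field initialized per class
    class C1            { final int f1; C1() { f1 = 1; } }
    class C2 extends C1 { final int f2; C2() { f2 = 2; } }

    // "merged": a single class initializing all N fields at once
    class M {
        final int f1, f2;
        M() { f1 = 1; f2 = 2; }
    }

    // the "plain" variants are identical, minus the final keyword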


x86, being a Total Store Order machine, does not need memory barriers here, and so the difference between all four variants is within the measurement error, regardless of N.


On weakly-ordered machines, final involves a real memory barrier, and the barrier cost is clearly visible as the additional execution time on the green bar. Moreover, we have barriers in each superclass constructor, which explains why the red bar takes linearly more time. We can teach the VM to merge the barriers, though, after which the cost of enforcing the final semantics would drown in the allocation costs.


Conclusion

When we were dealing with a nasty JMM bug, Doug dropped a gem of wisdom that should sum up this talk nicely. It would also be good if people using concurrency constructs were able to figure out why and when those constructs work. Hopefully, this talk has improved your understanding of the JMM.


There are some known problems we would like to address in JMM…​


…​which gave rise to the "Java Memory Model update" effort.


At the end of the day, a few useful links for readers:


Phew.