Java's new Z Garbage Collector (ZGC) is very exciting

Sadiq Jaffer & Richard Warburton

Java 11 has recently been feature frozen and contains some really great features; there's one in particular we'd like to highlight. The release contains a brand new garbage collector, ZGC, developed by Oracle, which promises very low pause times on multi-terabyte heaps. In this article we'll cover the motivation for a new GC, give a technical overview and explore some of the really exciting possibilities ZGC opens up.

So why the need for a new GC? After all, Java 10 already ships with four that have been battle-tested for years and are almost endlessly tunable. To put this in perspective: G1, the most recent GC in HotSpot, was introduced in 2006. At that time the biggest AWS instance available was the original m1.small, packing 1 vCPU and 1.7GB of RAM; today AWS will happily rent you an x1e.32xlarge with 128 vCPUs and an incredible 3,904GB of RAM. ZGC's design targets a future where these kinds of capacities are common: multi-terabyte heaps with low (<10ms) pause times and a limited impact (<15%) on overall application throughput. The mechanisms by which it achieves this can also be extended in the future to support some exciting possibilities, such as multi-tiered heaps (i.e. DRAM for hot objects and NVMe flash for less frequently accessed ones) or compressed heaps.

GC Terminology

To understand where ZGC fits in with the existing collectors and how it can achieve this, we’ll need to go over some terminology first. Garbage Collection at its most basic involves identifying memory that is no longer in use and making it available for re-use. Modern collectors carry out this process in several phases and they tend to be described as follows:

  • Parallel - while the JVM is running there are application threads and garbage collector threads. A parallel phase is one carried out by multiple GC threads, i.e. the work is split up between them. It says nothing about whether the GC threads might be overlapping with running application threads.

  • Serial - a serial phase is carried out on a single GC thread. As with parallel above, it says nothing about whether the work overlaps with currently running application threads.

  • Stop The World - in a stop the world phase, the application threads are suspended in order for the gc to carry out its work. When you experience GC pauses this is due to a Stop The World phase.

  • Concurrent - if a phase is concurrent then the GC can carry out its work at the same time the application threads are progressing with their work. Concurrent phases are complex because they need to be able to deal with application threads potentially invalidating their work before the phase completes.

  • Incremental - an incremental phase can run for a period and terminate early, due to some condition such as a time budget or a higher-priority GC phase that needs carrying out, while still having done productive work. This is in contrast to a phase that needs to fully complete for its work to have been productive.

Inherent trade-offs

It’s worth pointing out that all of these properties involve trade-offs. For example, a parallel phase will utilise multiple GC threads to carry out work, but in doing so incurs coordination overhead between those threads. Likewise, a concurrent phase won’t pause application threads, but may involve significantly more overhead and complexity to deal with application threads concurrently invalidating its work.

ZGC

Now that we understand the properties of different GC phases, let’s explore how ZGC works. To achieve its goals, ZGC uses two techniques new to HotSpot garbage collectors: coloured pointers and load barriers.

Pointer colouring

Pointer colouring is a technique that stores information in pointers (or, in Java parlance, references) themselves. This is possible because on 64-bit platforms (ZGC is 64-bit only) a pointer can address vastly more memory than a system can realistically have, so some of the spare bits can be used to store state. ZGC restricts itself to 4TB heaps, which require 42 bits of address, leaving 22 bits of possible state, of which it currently uses four: finalizable, remap, mark0 and mark1. We’ll explain what they’re used for in a bit.
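To make the idea concrete, here's a minimal sketch of pointer colouring using plain longs. The bit positions below are assumptions for illustration and don't necessarily match HotSpot's actual layout:

```java
public class ColouredPointer {
    // Assumed layout: low 42 bits hold the address, then one metadata bit each.
    static final long ADDRESS_MASK = (1L << 42) - 1;
    static final long MARK0       = 1L << 42;
    static final long MARK1       = 1L << 43;
    static final long REMAP       = 1L << 44;
    static final long FINALIZABLE = 1L << 45;

    // Attach a colour bit to an address.
    static long colour(long address, long bit) {
        return (address & ADDRESS_MASK) | bit;
    }

    // Recover the plain address by masking out the metadata bits.
    static long address(long pointer) {
        return pointer & ADDRESS_MASK;
    }

    static boolean hasColour(long pointer, long bit) {
        return (pointer & bit) != 0;
    }

    public static void main(String[] args) {
        long raw = 0x1234567L;
        long marked = colour(raw, MARK0);
        System.out.println(address(marked) == raw);   // the address survives colouring
        System.out.println(hasColour(marked, MARK0)); // and the state is recoverable
    }
}
```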

One problem with pointer colouring is that it can create additional work when dereferencing a pointer, because the information bits need to be masked out first. Some platforms, like SPARC, have built-in hardware support for pointer masking, so it’s not an issue there, but x86 has no such support, so the ZGC team use a neat multi-mapping trick.

Multi-mapping

To understand how multi-mapping works, we need to briefly explain the difference between virtual and physical memory. Physical memory is the actual RAM available to the system, usually the capacity of the DRAM chips installed. Virtual memory is an abstraction that gives applications their own (often isolated) views onto physical memory. The operating system is responsible for maintaining the mappings between ranges of virtual memory and physical memory; it does this through a page table, with the processor’s Memory Management Unit (MMU) and Translation Lookaside Buffer (TLB) translating the addresses requested by applications.

Multi-mapping involves mapping different ranges of virtual memory to the same physical memory. Since by design only one of remap, mark0 and mark1 can be 1 at any point in time, it’s possible to do this with three mappings. There’s a nice diagram in the ZGC source for this.
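As a rough sketch of why this removes the masking cost, imagine the page table as a map from virtual range to physical range: each coloured view of the heap is a separate virtual range, but all three point at the same physical memory, so a coloured pointer can be followed directly without stripping its metadata bits. The addresses, bit positions and `translate` helper here are illustrative, not ZGC's real implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class MultiMapping {
    static final long MARK0 = 1L << 42, MARK1 = 1L << 43, REMAP = 1L << 44;

    // A toy "page table": virtual range base -> physical range base.
    // All three coloured views of the heap share one physical range.
    static final Map<Long, Long> pageTable = new HashMap<>();
    static {
        long physicalBase = 0L;
        pageTable.put(MARK0, physicalBase);
        pageTable.put(MARK1, physicalBase);
        pageTable.put(REMAP, physicalBase);
    }

    // What the MMU effectively does for us: resolve a coloured virtual
    // address to physical memory, with no explicit masking by the program.
    static long translate(long virtualAddress) {
        long rangeBase = virtualAddress & (MARK0 | MARK1 | REMAP);
        long offset = virtualAddress & ((1L << 42) - 1);
        return pageTable.get(rangeBase) + offset;
    }

    public static void main(String[] args) {
        long offset = 0x1000L;
        // The same object, addressed through any colour, hits the same memory:
        System.out.println(translate(offset | MARK0) == translate(offset | MARK1));
        System.out.println(translate(offset | MARK1) == translate(offset | REMAP));
    }
}
```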

Load barriers

Load barriers are pieces of code that run whenever an application thread loads a reference from the heap (i.e. accesses a non-primitive field of an object):

void printName(Person person) {
    String name = person.name;  // triggers the load barrier because
                                // we’ve loaded a reference from the heap
    System.out.println(name);   // no load barrier directly
}

In the above code, the line that assigns name involves following the person reference to the object data on the heap and then loading the reference to name that it contains. It is at this point that the load barrier is triggered. The second line, which prints to the screen, won’t directly cause the load barrier to trigger: name is a local variable, so no reference is loaded from the heap. There may, however, be other load barriers triggered when referencing System and out, or inside println.

This is in contrast to the write barriers used by other GCs, such as G1. The load barrier’s job is to examine the reference’s state and potentially carry out some work before returning the reference (or maybe even a different reference) to the application. In ZGC it does this by testing the loaded reference to see whether certain bits are set, depending on the current phase. If the reference passes the test then no additional work is carried out; if it fails, some phase-specific task is carried out before the reference is returned to the application.
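A minimal sketch of that check-then-act structure, with references again modelled as longs. The `goodMask`, the bit positions and the `slowPath` helper are all hypothetical, for illustration only:

```java
public class LoadBarrier {
    static final long ADDRESS_MASK = (1L << 42) - 1;
    static final long MARK0 = 1L << 42, MARK1 = 1L << 43, REMAP = 1L << 44;

    // The "good" colour changes from phase to phase; here we pretend the
    // current phase expects the mark0 bit to be set.
    static long goodMask = MARK0;
    static int slowPathTaken = 0;

    static long load(long reference) {
        if ((reference & goodMask) != 0) {
            return reference;       // fast path: nothing to do
        }
        return slowPath(reference); // phase-specific work
    }

    static long slowPath(long reference) {
        slowPathTaken++;
        // e.g. queue for marking, relocate or remap, then re-colour the
        // reference so later loads take the fast path:
        return (reference & ADDRESS_MASK) | goodMask;
    }

    public static void main(String[] args) {
        long stale = 0x2000L | REMAP;  // coloured for a previous phase
        long healed = load(stale);     // slow path runs once
        System.out.println((healed & goodMask) != 0);
        System.out.println(load(healed) == healed); // fast path from now on
    }
}
```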

Marking

Now we understand what these two techniques are, let’s look at a ZGC GC cycle. The first part of the cycle is Marking. Marking involves finding and labelling in some way all heap objects that could be accessible to the running application, in other words finding the objects that aren’t garbage.

ZGC’s Marking is broken up into three phases. The first phase is a Stop The World one where the GC roots are marked as being live. GC roots are things like local variables, which the application can use to access the rest of the objects on the heap. If an object isn’t reachable by a walk of the object graph starting from the roots then there’s no way an application could access it and it is considered garbage. The set of objects accessible from the roots is referred to as the live set. The GC roots marking step is very short, as the total number of roots is often relatively small.

GC roots leading to the live set

Once that phase has completed, the application resumes and ZGC begins the next phase, which concurrently walks the object graph and marks all accessible objects. During this phase the load barrier tests every loaded reference against a mask that determines whether it has been marked yet for this phase; if a reference hasn’t been marked, it is added to a queue for marking.

After the walk is complete, there’s a final, brief, Stop The World phase which handles some edge cases (which we’re going to ignore for now) and then marking is complete.

Relocation

The next major part of the GC cycle is relocation. Relocation involves moving live objects around in order to free up sections of the heap. Why move objects rather than just fill in the gaps? Some GCs do in fact do that, but it has the unfortunate consequence that allocation becomes much more expensive: each time an allocation is requested, the allocator needs to search for a free space in which to place the object. In contrast, if large contiguous chunks of memory can be freed up, then allocation becomes as simple as incrementing (or ‘bumping’) a pointer by the amount of memory the object requires.
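Bump-pointer allocation itself can be sketched in a few lines. The `BumpAllocator` class and its addresses are illustrative, not JVM internals:

```java
public class BumpAllocator {
    private long top;          // next free address in the region
    private final long limit;  // end of the free region

    BumpAllocator(long base, long size) {
        this.top = base;
        this.limit = base + size;
    }

    // Allocation is just a bounds check plus a pointer increment ("bump") -
    // no searching for a suitable hole, unlike free-list allocation.
    long allocate(long bytes) {
        if (top + bytes > limit) {
            return -1; // region exhausted; a real GC would grab a fresh page
        }
        long address = top;
        top += bytes;
        return address;
    }

    public static void main(String[] args) {
        BumpAllocator page = new BumpAllocator(0x1000, 256);
        long a = page.allocate(32);
        long b = page.allocate(64);
        System.out.println(a == 0x1000);      // first object at the base
        System.out.println(b == 0x1000 + 32); // next object immediately after
    }
}
```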

ZGC divides the heap up into pages, and at the start of this phase it concurrently selects a set of pages whose live objects need to be relocated. Once the relocation set has been selected, there is a Stop The World pause in which ZGC relocates any objects in the relocation set that are referenced by a root (local variables, etc.) and remaps their references to the new locations. As with the earlier Stop The World step, the pause time involved here depends only on the number of roots and the ratio of the size of the relocation set to the total live set of objects, which is normally reasonably small. It doesn’t scale with the overall size of the heap, as it does in many collectors.

After moving any roots that needed it, the next phase is concurrent relocation. During this phase GC threads walk the relocation set and relocate all the objects in the pages it contains. Application threads can also relocate objects in the relocation set if they try to load them before the GC has relocated them; this is achieved through the load barrier (which is triggered on loading references from the heap), as detailed in the following flowchart:

Flowchart for remapping

This ensures that all references the application sees will have been updated and there is no possibility for the application to be operating on an object that is concurrently being relocated.
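The flowchart's logic can be sketched as follows; the `relocationSet`, `forwardingTable` and address values are hypothetical stand-ins for ZGC's internal data structures:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class RelocatingBarrier {
    // Objects (here, single addresses) selected for relocation this cycle.
    static final Set<Long> relocationSet = ConcurrentHashMap.newKeySet();
    // Old address -> new address for objects that have already been moved.
    static final ConcurrentMap<Long, Long> forwardingTable = new ConcurrentHashMap<>();
    static long nextFreeAddress = 0x9000L;

    static long load(long reference) {
        if (!relocationSet.contains(reference)) {
            return reference; // not being relocated: fast path
        }
        // Relocate on first access ("copy" the object to a new slot);
        // later loads of the same object reuse the forwarded address.
        return forwardingTable.computeIfAbsent(reference,
                old -> nextFreeAddress++);
    }

    public static void main(String[] args) {
        relocationSet.add(0x2000L);
        long first = load(0x2000L);   // application thread relocates it
        long second = load(0x2000L);  // forwarding table returns same copy
        System.out.println(first == second);
        System.out.println(load(0x3000L) == 0x3000L); // not in the set
    }
}
```

Because the barrier hands back the new address, the application can never observe an object's stale location once relocation of that object has begun.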

The GC threads will eventually have relocated all objects in the relocation set, but there may still be references pointing to the old locations of these objects. The GC could walk the object graph and remap all references to their new locations, but this is an expensive step. Instead, it is combined with the next marking phase: during that walk, if a reference is found not to have been remapped, it is remapped and then marked as live.
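A sketch of that combined remap-and-mark step, again using a hypothetical forwarding table left over from the previous relocation phase:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class RemapOnMark {
    // Forwarding information left behind by the previous relocation phase.
    static final Map<Long, Long> forwardingTable = new HashMap<>();
    static final Set<Long> marked = new HashSet<>();

    // During the next mark, a stale reference is remapped first and then
    // marked, so no separate remapping pass over the object graph is needed.
    static long markAndRemap(long reference) {
        long current = forwardingTable.getOrDefault(reference, reference);
        marked.add(current);
        return current;
    }

    public static void main(String[] args) {
        forwardingTable.put(0x2000L, 0x9000L);
        System.out.println(markAndRemap(0x2000L) == 0x9000L); // remapped, then marked
        System.out.println(markAndRemap(0x5000L) == 0x5000L); // already up to date
        System.out.println(marked.contains(0x9000L));
    }
}
```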

Recap

Trying to reason about the performance characteristics of a complex garbage collector such as ZGC in isolation is difficult, but from the earlier sections it should be clear that nearly all the pauses we’ve covered involve work that depends on the set of GC roots rather than on the live set of objects, the heap size or the amount of garbage. The final pause in the marking phase, which deals with mark termination, is an exception to this, but it is incremental: if its time budget is exceeded, the GC reverts to concurrent marking until mark termination can be attempted again.

Performance

So how does it perform? Stefan Karlsson and Per Liden gave some preliminary numbers at their Jfokus talk earlier this year. ZGC’s SPECjbb 2015 throughput numbers are roughly comparable with those of the Parallel GC (which optimises for throughput), but with an average pause time of 1ms and a maximum of 4ms. This is in contrast to G1 and Parallel, which had average pause times in excess of 200ms.

However, garbage collectors are complex beasts and benchmark results might not generalise to real-world performance. We’re looking forward to testing ZGC ourselves to get an idea of how its performance varies by workload.

Future possibilities

Coloured pointers and load barriers provide some interesting future possibilities.

Multi-tiered heaps and compression

With flash and non-volatile memory becoming much more common, one possibility is to allow for multi-tiered heaps in the JVM where objects that are technically live but rarely if ever used are stored on a slower tier of memory.

This could be made possible by expanding the pointer metadata to include some usage-counter bits and using this information to decide where to move an object that requires relocation. The load barrier could then retrieve the object from storage if it is required in the future.

Alternatively, instead of relocating objects to slower tiers of storage, objects could be kept in main memory in a compressed form. The load barrier could decompress such an object and allocate it into the heap when it is requested.

The State of ZGC

At the time of writing, ZGC is marked as experimental. You can take it for a spin using one of the Java 11 Early Access builds (http://jdk.java.net/11/), but it’s worth pointing out that it can take a while to iron out all the kinks in a new GC: it was at least three years between G1’s release and its eventual promotion from experimental to supported.

Summary

With servers possessing hundreds of gigabytes to terabytes of RAM becoming more widely available, Java’s ability to make effective use of such heaps is increasingly important. ZGC is an exciting new garbage collector designed to offer very low pause times on large heaps. It does this through the use of coloured pointers and load barriers, techniques new to HotSpot that also open up some interesting future possibilities. It will be available as experimental in Java 11, but you can try it out now using an Early Access build.

Further resources

As mentioned earlier, Stefan Karlsson and Per Liden’s Jfokus talk is definitely worth a watch.

Dominik Inführ has also written a good article on some of the lower level details in ZGC, with links in to the source.

Thanks

We appreciate Per Liden’s help in reviewing drafts of this article.
