Possible alternative compiler backend: Cretonne

Hi everyone,

So I’ve been having some interesting conversations with @sunfishcode, and I wanted to talk about an idea that he floated to me and try to get some feedback. The idea is to add an alternate backend to LLVM based on Cretonne. Cretonne is a compiler IR, similar to LLVM, being implemented (in Rust!) for use with WASM. It is intended for use in SpiderMonkey, Mozilla’s JS engine. The idea is that if you load WASM, it will be converted to Cretonne and optimized – similarly, when executing JS, Cretonne can be used as the backend for the JIT compiler (at least in aggressive stages). However, it’s important to emphasize that Cretonne is intended to be independent of SpiderMonkey, and in particular not to be limited to WASM/jit use-cases. I’m sure @stoklund an say more.

Short term, the idea would be for Cretonne, to serve as the IR when doing “debug” compilation (not “release”). It has a lot of advantages over LLVM in this respect:

  • It is defined for speed. Trust me when I tell you that the SpiderMonkey people care about latency.
  • We have more control and input over it, so it should be easier to ensure that our needs are met.
  • It will help us validate that Rust is independent from LLVM.

Longer term, we might be able to use Cretonne for release targets as well. It also has some other advantages that come into play at that point:

  • It does not have undefined behavior. At all.
  • We can have better control of its alias analysis, which may be useful.

There are of course disadvantages which suggest that we will want to continue using LLVM for a long time:

  • LLVM has a lot of existing optimizations
    • But as long as we use it when compiling in release mode, that’s not a problem
  • LLVM supports a lot of architectures
    • So we’ll still need to use it sometimes, even for debug builds

If people are excited about this idea, @sunfish and @stoklund are potentially interested in pushing it forward. It’d be great to think over what the necessary steps are. He and I drafted a few major blockers:

  • The easy part is extending rustc:
    • create the cretonne IR from MIR
  • The harder parts are extending Cetonne:
    • support writing out a native .o file
    • debuginfo / DWARF output
    • unwinding

Probably we would start by focusing on linux, as it is the most straightforward, then try to support windows next. Naturally the existing LLVM targets (both debug/release) would continue to be supported in the meantime (and probably forever; certainly for the foreseeable future).

Thoughts?

cc @brson, @kripken

63 Likes

This is a great idea. It’s possible to get massively faster codegen times this way, and Cretonne being written in Rust is obviously a huge bonus in this context.

I have proposed a variant on this, that instead of going Rust=>Cretonne goes Rust=>WebAssembly. That could have similar benefits in compile times, but otherwise has a bunch of pluses and minuses. I could elaborate if there’s interest (but wouldn’t want to divert the focus here otherwise.)

2 Likes

Faster compile times are always a plus!

Second, it sounds like it might allow a good compromise between compile time and optimisations. There are some projects where optimisations are pretty important just to run tests.

1 Like

I'd like a Rust REPL, to use it almost as in Python/Julia for exploratory programming. Can Cretonne lead to a good enough REPL?

I agree that it's important to not tie a system language too much to a single back-end. How about also using the GCC back-end?

I am a GCC contributor, if there's interest in a GCC backend I would be able to help. Although, I must say, I know little of Rust at this point so I wouldn't be able to carry this project forward by myself. Someone with a deep knowledge of Rust would need to mentor.

5 Likes

This seems really cool. Like @leonardo I think a Rust REPL would be a really awesome tool, and I’m interested in knowing if this could lead to that.

My only question is how this fits into the 2017 roadmap, since it seems like a big project.

1 Like

I'm a big fan of this project, and of adding backends generally, and of creating alternate compiler implementations even more generally. For those interested in the Binaryen route there's some work already in mir2wasm.

To the question at hand of cretonne, it seems to me there's an opportunity to make some powerful advances in the state of the art wrt high-performance code generation: the industry seems to be pretty content that LLVM is good enough, but we know it has some major limitations, primarily that it is slow and that it suffers from the undefined behavior spectre. If a competitive code generator appeared that didn't suffer these problems it could make a big splash. There's probably a lot of trepidation that the amount of work to catch up to LLVM would be massive, but we do have the advantage of writing cretonne in Rust, which is much easier to maintain and rely on than C++, and we have a viable incremental strategy of supporting debug mode x86 first and slowly competing with LLVM. Having a fast code generator is a critical step to solving slow compile times, which is Rust's Achilles heel.

I suspect this wouldn't be a major focus of the Rust team members this year, but hopefully there are others interested in making progress with our support (and cretonne development is ongoing regardless of what Rust is doing).

16 Likes

What does it mean to have an assembly language “without undefined behavior”? Surely you can still do use-after-free and the like?

It limits the assumptions the compiler can make when optimizing. For example, signed integer overflow is undefined in C, so the compiler can optimize a + 1 < a to false, even though that condition can be true in an implementation that wraps on overflow.

6 Likes

Indeed. If you use a store instruction to write '7' as your return address, something bad will happen when you try to return. Cretonne removes focus from the undefined-behavior approach to optimizations, but there are still invariants that shouldn't be broken. Cretonne differs from LLVM on these points:

  • Arithmetic instructions either trap or produce a value.
  • There is no concept of undef or poison values in the type system.
  • There is no unreachable instruction to indicate that a code path is impossible. Use a trap instead.
  • Dereferencing an invalid pointer will cause the resulting machine code program to do the same. It doesn't cause Cretonne to assume that is can't happen.

The LLVM optimizations that exploit undefined behavior don't have counterparts in Cretonne. It is not the case that Cretonne has a new innovative way of performing these optimizations safely.

12 Likes

Ok, thanks for elaborating!

I wrote a document comparing Cretonne to LLVM.

I am quite excited about this project. It would be great for Cretonne to have multiple clients because it forces us to define the right abstractions and not let too many client-specific hacks creep in. For Rust, I believe Cretonne can provide debug builds that are both faster and that generate faster code than LLVM’s debug builds.

Debug information

A JIT compiler like IonMonkey does speculative optimizations that sometimes turn out to be incorrect, and a running function needs to be de-optimized. When this happens, the values of local variables are derived from the state of the optimized code and copied into an interpreter stack frame where execution continues.

This means that at certain safe points in the optimized code, it must be possible to compute the value of all local variables that are in scope. This is very similar to the requirements when breaking a debug-compiled Rust program in a debugger. The debugger can display the values of all local variables.

In LLVM, this is achieved with a -O0 build which generates code that keeps all local variables in a designated stack slot at all times. Such code is very slow and very large—register allocation is an important optimization.

A debug build that uses safe-point-preserving optimizations to generate accurate DWARF should run much faster than everything-on-the-stack -O0 code while being just as debuggable.

Schedule

Cretonne is a work in progress, and it can’t do much of anything at the moment. The first goal is to be able to compile WebAssembly as a second-tier compiler in SpiderMonkey. I expect we will get to that point sometime in 2017.

REPL

Cretonne can be used as a JIT for a REPL as can LLVM. I expect Cretonne to provide a better tradeoff between latency and optimizations than LLVM. Of course I have no data to back that up at this time.

Higher-level optimizations

Cretonne does not aim to compete with LLVM in the -O3 optimization space, but it could be a component of a compiler that did. I agree there is an interesting opportunity here, in particular if we can exploit Rust’s explicit aliasing model.

Cretonne’s intermediate representation isolates functions so they can be compiled in parallel. This also means that it can’t really express inter-procedural optimizations like inlining which is essential to optimizing Rust. There’s nothing wrong with inlining one Cretonne function into another, but the process needs to be guided by a call graph representation that is external to Cretonne.

There is an interesting challenge in defining a compiler architecture that permits concurrent inter-procedural optimizations as well as incremental compilation. I hope Cretonne can be a component of such an architecture—it does not attempt to define it.

25 Likes

I would personally be super excited about a new backend for Rust, especially one written in Rust and super fast. I'm particularly excited about even faster debug builds through an alternate backend like Cretonne.

That... sounds amazing!

So one thing I just thought of is that we could perhaps run both Cretonne and LLVM for release builds. Most of our time in LLVM is spent just because we're sending such a huge wad of code to LLVM, but if Cretonne could very quickly rip through most of that to greatly reduce the size of the IR we send to LLVM, then Cretonne+LLVM could arguably be faster than just LLVM in release mode.

Now release mode builds aren't necessarily the critical piece to make faster, just an idea :slight_smile:

12 Likes

This is really interesting - I wonder how it compares to WebKit’s own B3.

I’ve had some thoughts on an IR that doesn’t share LLVM’s problems (as I understand them) for a while now and Cretonne matches my expectations almost entirely, when it comes to the type system - aggregates only serve to allow sub-optimal assumptions and are otherwise a pointless price to pay - so this makes me very happy indeed. (LLVM devs thought I was mad, ha!)

The one thing that having no aggregate types implies is no aggregate constant expressions (at least for LLVM). What I came up with as a replacement for LLVM constants (not that LLVM devs would even consider it) was described in my design for an interpretable abstract Rust machine, which miri (developed by @scott and oli_obk) already implements.

The gist of it, especially for a backend as opposed to an interpreter, is to have globals be byte buffers with relocations (just like you’d emit in the binary you give a linker) that point to other globals, as opposed to having nested aggregates with getelementptr constant expressions embedded in them.

I can’t find anything related to globals or constants in Cretonne’s source so maybe I’m off the mark.

I did get to see the integer indexing (as opposed to pointers with complex ownership rules), which is another thing that makes me quite happy. If bounds checking ever gets too costly you could consider using the indexing crate - it would work remarkably well for any kind of analysis that doesn’t need to modify anything, since you only need to bound-check everything once and then provide the sound compiler-checked bound-less indexing.

EDIT: Ah, one thing I forgot: the design using various side-tables (EntityMap) is nicely extendable and one place this really matters is debuginfo: wasting exactly 0 bytes per instruction when it’s not in use is really important.
On top of that, one complaint I hear about removing complex types from LLVM is that you lose information needed to debug LLVM IR - the answer there is debuginfo!
If you already have code to emit debuginfo in your front-end, why not make use of it in the back-end for presenting a richer IR?
LLVM’s is terrible, all you see is !dbg !123 and have to look up !123, and repeat this recursively because it’s a tree of metadata nodes.

The opportunity here is to annotate the textual IL form with all sorts of details that can be lazily reconstructed from debuginfo. For example, (*ptr).z.0.foo[5] might be GEP ptr, 0, 2, 0, 1, 5 in LLVM and offset 41 in Cretonne - but you can store the compact offset and use debuginfo to reproduce (*ptr).z.0.foo[5] in an IL comment - nicer than LLVM’s and more efficient underneath!

IOW, Cretonne can have its cake and eat it too - did I mention how happy I am this is happening? :smile:

12 Likes

Cough goblin

Would give me a kick in the butt to finish the rest of the binary formats. Adding a writer API wouldn't be too bad either, I think, depending on requirements.

So yea, I'd be interested in helping :slight_smile:

8 Likes

Five years from now how much work will be needed to add inlininig to Cretonne?

To clarify, you're talking about lowering MIR to Cretonne IR, letting Cretonne optimize that, and then translating the resulting Cretonne IR to LLVM IR? I have doubts about the economic viability of developing such a translator, since there's already a path forward for reducing IR size: MIR-level optimizations. Unlike Cretonne, MIR optimizations can work pre-monomorphization, so they have the potential for much more "bang for the buck". Moreover, if Cretonne's optimizations really are as simple as is expected from a mid-tier JIT compiler, Rust's own MIR optimizations may cover much of the same ground, leaving Cretonne little room for improvement.

8 Likes

Globals in WebAssembly are different, and I haven't gotten to them yet. Like stack objects, you can't take their address. I think they'll end up looking a lot like stack objects in Cretonne:

  • Globals accessed by the function are declared in the preamble, including a name à la FunctionName. A byte size is optional, and there is a constant flag and probably an alignment.
  • Instructions global_load and global_store work like the stack slot equivalents, and their offsets are verified for sized globals. For unsized globals, these instructions are not sandbox-safe.
  • A global_addr instruction is used when lowering from WASM, like stack_addr.

Regarding data initializers, I think your idea makes sense. Here's an LLVM example using somewhat contrived C++ code, but a constant array of enum variants is not unreasonable in Rust:

extern int earray[];

union U {
    int *p;
    struct S { float c; int *p; } b;
};

U myarray[2] = {
    { .p = &earray[5] },
    { .b = { 3.14, &earray[17] } },
};

To initialize this array, Clang actually has to declare it as a struct type (my formatting):

%union.U = type { %struct.U::S }
%struct.U::S = type { float, i32* }

@earray = external global [0 x i32], align 4
@myarray = global <{ { i32*, [8 x i8] }, %union.U }> <{
    { i32*, [8 x i8] } {
        i32* bitcast (i8* getelementptr (i8, i8* bitcast ([0 x i32]* @earray to i8*), i64 20) to i32*),
        [8 x i8] undef
    },
    %union.U {
        %struct.U::S {
            float 0x40091EB860000000,
            i32* bitcast (i8* getelementptr (i8, i8* bitcast ([0 x i32]* @earray to i8*), i64 68) to i32*)
        }
    }
}>, align 16 

The resulting assembly is much nicer:

	.section	__DATA,__data
	.globl	_myarray                ## @myarray
	.align	4
_myarray:
	.quad	_earray+20
	.space	8
	.long	1078523331              ## float 3.1400001
	.space	4
	.quad	_earray+68

The LLVM initializer expression does have the advantage that it can specify undef bits, which LLVM's optimizers will make use of. However, Cretonne won't, so a bag of bytes with relocations makes sense to me.

Sometimes optimizers want to constant fold loads from constant data, and that's where LLVM's representation makes sense. If the declared type matches the loaded type, the load can simply be replaced with the constant expression from the initializer. It is important that this kind of constant folding works when relocations are present. Constant folding vtable loads can enable more inlining.

1 Like

Except LLVM can’t always fold loads, not easily though. Bytes + relocations ensure O(lg R) load folding if it is safe to do so, where R is the number of relocations (because you have to do a binary search or something similar). For undef there’s a mask, I was thinking bit-level for LLVM, while miri uses byte-level because it denies using those values anyway.

1 Like

By then, multi-core CPUs will have been mainstream for 20 years, so let's assume a single-threaded compiler is off the table.

Inlining is a global optimization that interacts with (and modifies) the whole call graph. This means that its design depends a lot on the architecture you choose for a compiler that supports concurrent and incremental compilation efficiently. We probably want something like:

  • Multiple threads working on the call graph simultaneously. (And safely, and deterministically).
  • Don't require everything to be in memory at the same time. Some programs are large.
  • Parts of the call graph can be reused from the previous compilation because nothing they depend on has changed.

For compiling C/C++ with LLVM, ThinLTO is a really good bid. I think Rust can do better because it doesn't have to deal with C++'s legacy, but it's a high bar.

So I think inlining is something that evolves along with the compiler / optimizer architecture. We can start by build building a single-threaded, non-incremental inliner pretty easily.