WhereRustcSpendsItsTime

So a couple weeks ago I was a little stung by the quote from This Week In Rust: “Rust compilation is so slow that I can fix the bugs while it still compiles the crates”. On the one hand, I have unfond memories of waiting for a Typescript project to compile, pack (aka link), minify (aka optimize), and so on, over and over, on every change. At least if it had been Rust I’d have been able to fix the bugs while it was doing this. On the other hand, it’s also mostly true: compiling Rust is heckin’ slow. So I’ve decided to dust off a backburner project for a while, and figure out just where rustc spends most of its time.

There’s one key thing that made me curious about all this in the first place… Once upon a time I read an ancient, lost reddit post where an unknown compiler dev mentioned that rustc was fairly stupid in how it generated its output, and created a whole lot of unnecessary IR code that it then just shoved into LLVM to clean up. This caught my attention enough to stick in my mind for years, at least in concept. I have no idea if this is still true (I vaguely remember that it was talked about shortly after 1.0), to what extent it was true (I recall the figure of 100’s of megabytes of IR generated for even fairly simple things), or if this was ever true (I might have imagined the whole thing). But the time has come for me to find out.

Assumptions

First off, what do I actually know about rustc? Not a whole heck of a lot, but I know a few things about compilers in general. Compilers are one of the most fun and most fundamental types of programs because they’re essentially a gigantic pure function: inputs go in (source code) and outputs come out (machine code). In between there’s not a lot of user interaction, network stuff, I/O, or anything like that, just computation. Compilers are traditionally structured as a linear pipeline of transformations that occur in distinct steps. The traditional compiler pipeline is: lex text to tokens -> parse tokens to a syntax tree -> transform syntax tree to intermediate representation (IR) -> transform IR to more/different IR, optimizing it along the way -> transform final IR into machine code. This is not always 100% accurate, since these phases can overlap more than a little, but it is always a good place to start. I know that currently in rustc, there’s several stages of IR: HIR (high-level intermediate representation), MIR (mid-level intermediate representation), and then LLVM IR. I would expect there to be a LIR someday, but I don’t know if it exists yet. I don’t yet know exactly what gets represented where, or how, but I think MIR contains enough information to do borrow-checking, and I’ve at least dinked around a little with LLVM so I have some clue of what it looks like.

I do know that there are some TOP MEN working on optimizing rustc from the micro-optimization and benchmark level, so I don’t need to look too deeply into the specifics of data structures and stuff. I can just assume that if someone hasn’t looked at the details of some part of the compilation process, its on a to-do list somewhere. That suits me fine, since I’m more interested in the holistic view (“steps X and Y take up most of the time because of A and B”) than the specific view (“a lot of time is spent accessing this particular hashtable in this one place”). I’m also not actually going to try to do any optimizatio myself, just try to figure out where time gets spent.

And, of course, I think I know that most of rustc’s time is spent in code optimization.

Watching the compiler do stuff

I’m going to be using two key test programs: oorandom which is not quite trivial but is very small and simple, and nalgebra which is big and complicated and uses an enormous amount of interesting and hairy metaprogramming stuff to lean hard on Rust’s optimization. We are using oorandom 11.1.0 and nalgebra 0.18.1, and rustc 1.40.0-nightly 20cc75272 2019-10-09.

Okay, so first off, building oorandom in debug mode takes about 0.25 seconds, and building nalgebra takes about 25 seconds. Hopefully that makes a sufficiently interesting spread of samples! Considering nalgebra is about 100x larger in terms of significant lines of code, it’s semi-sane that it takes 100x more time. Compiling them in release mode takes 0.2 seconds and 22 seconds, respectively. Now that’s interesting since I’d expect a debug compile to be faster than release, and usually it is. However, it’s not interesting enough to dig into it further to see if it’s more than experimental error; we’re just going to compile in release mode all the time, since the cost of optimization is something I want to look at. Second off, nalgebra has some dependencies it needs to build as well, but it looks like rustc’s built-in profiling stuff only measures the crate you’re actually building, not its dependencies, which is what we want. Excellent.

Third off, we can get a high-level overview of where rustc spends its time with the following invocation: cargo +nightly rustc -- -Z time-passes. This is basically the information I want to start with, so huzzah! For oorandom, it produces this giant whack of output:

  time: 0.001; rss: 64MB        parsing
  time: 0.000; rss: 64MB        attributes injection
  time: 0.000; rss: 64MB        recursion limit
  time: 0.000; rss: 64MB        plugin loading
  time: 0.000; rss: 64MB        plugin registration
  time: 0.000; rss: 64MB        pre-AST-expansion lint checks
  time: 0.000; rss: 64MB        crate injection
    time: 0.007; rss: 74MB      expand crate
    time: 0.000; rss: 74MB      check unused macros
  time: 0.007; rss: 74MB        expansion
  time: 0.000; rss: 74MB        maybe building test harness
  time: 0.000; rss: 74MB        AST validation
  time: 0.000; rss: 74MB        maybe creating a macro crate
  time: 0.001; rss: 77MB        name resolution
  time: 0.000; rss: 77MB        complete gated feature checking
  time: 0.000; rss: 77MB        lowering AST -> HIR
  time: 0.000; rss: 77MB        early lint checks
    time: 0.000; rss: 79MB      validate HIR map
  time: 0.001; rss: 79MB        indexing HIR
  time: 0.000; rss: 79MB        load query result cache
  time: 0.000; rss: 79MB        dep graph tcx init
    time: 0.000; rss: 79MB      looking for entry point
    time: 0.000; rss: 79MB      looking for plugin registrar
    time: 0.000; rss: 79MB      looking for derive registrar
  time: 0.000; rss: 79MB        misc checking 1
  time: 0.001; rss: 82MB        type collecting
  time: 0.000; rss: 82MB        impl wf inference
    time: 0.000; rss: 92MB      unsafety checking
    time: 0.000; rss: 92MB      orphan checking
  time: 0.007; rss: 92MB        coherence checking
  time: 0.001; rss: 92MB        wf checking
  time: 0.003; rss: 94MB        item-types checking
  time: 0.012; rss: 96MB        item-bodies checking
    time: 0.000; rss: 96MB      match checking
    time: 0.000; rss: 96MB      liveness checking + intrinsic checking
  time: 0.001; rss: 96MB        misc checking 2
  time: 0.007; rss: 102MB       MIR borrow checking
  time: 0.000; rss: 102MB       dumping Chalk-like clauses
  time: 0.000; rss: 102MB       MIR effect checking
  time: 0.000; rss: 102MB       layout testing
    time: 0.000; rss: 102MB     privacy access levels
    time: 0.000; rss: 102MB     private in public
    time: 0.000; rss: 102MB     death checking
    time: 0.000; rss: 102MB     unused lib feature checking
      time: 0.005; rss: 109MB   crate lints
      time: 0.000; rss: 109MB   module lints
    time: 0.005; rss: 109MB     lint checking
    time: 0.000; rss: 109MB     privacy checking modules
  time: 0.006; rss: 109MB       misc checking 3
  time: 0.001; rss: 113MB       metadata encoding and writing
      time: 0.000; rss: 113MB   collecting roots
      time: 0.003; rss: 113MB   collecting mono items
    time: 0.003; rss: 113MB     monomorphization collection
    time: 0.000; rss: 113MB     codegen unit partitioning
    time: 0.003; rss: 128MB     codegen to LLVM IR
    time: 0.000; rss: 130MB     assert dep graph
    time: 0.000; rss: 130MB     serialize dep graph
  time: 0.008; rss: 130MB       codegen
    time: 0.001; rss: 136MB     llvm function passes [oorandom.d5irq2oa-cgu.0]
    time: 0.003; rss: 136MB     llvm function passes [oorandom.d5irq2oa-cgu.1]
    time: 0.004; rss: 136MB     llvm module passes [oorandom.d5irq2oa-cgu.0]
    time: 0.016; rss: 135MB     llvm module passes [oorandom.d5irq2oa-cgu.1]
    time: 0.003; rss: 135MB     LTO passes
    time: 0.005; rss: 135MB     codegen passes [oorandom.d5irq2oa-cgu.0]
    time: 0.016; rss: 139MB     LTO passes
    time: 0.012; rss: 139MB     codegen passes [oorandom.d5irq2oa-cgu.1]
  time: 0.055; rss: 140MB       LLVM passes
  time: 0.000; rss: 140MB       serialize work products
  time: 0.001; rss: 140MB       linking
time: 0.134; rss: 119MB         total
    Finished release [optimized] target(s) in 0.25s

Whew, that’s a lot of stuff! I’m counting 69 steps total, about half of which are listed at the top level. It also isn’t fully precise, since Cargo takes 0.25 seconds and the compiler’s total is 0.134 seconds, but if you add all the values there it adds up to 0.199 seconds and if you add all the toplevel values it’s 0.113 seconds. So, there’s some measurement error to be aware of; these are quick and dirty values. Regardless, for oorandom there’s basically four things that take time: “LLVM passes”, “item-bodies checking”, “codegen” and “expansion”.

Let’s look at nalgebra:

  time: 0.092; rss: 88MB        parsing
  time: 0.000; rss: 88MB        attributes injection
  time: 0.000; rss: 88MB        recursion limit
  time: 0.000; rss: 88MB        plugin loading
  time: 0.000; rss: 88MB        plugin registration
  time: 0.007; rss: 89MB        pre-AST-expansion lint checks
  time: 0.000; rss: 89MB        crate injection
    time: 0.245; rss: 159MB     expand crate
    time: 0.000; rss: 159MB     check unused macros
  time: 0.245; rss: 159MB       expansion
  time: 0.000; rss: 159MB       maybe building test harness
  time: 0.010; rss: 159MB       AST validation
  time: 0.000; rss: 159MB       maybe creating a macro crate
  time: 0.186; rss: 184MB       name resolution
  time: 0.022; rss: 184MB       complete gated feature checking
  time: 0.125; rss: 225MB       lowering AST -> HIR
  time: 0.023; rss: 225MB       early lint checks
    time: 0.024; rss: 234MB     validate HIR map
  time: 0.127; rss: 234MB       indexing HIR
  time: 0.000; rss: 234MB       load query result cache
  time: 0.000; rss: 234MB       dep graph tcx init
    time: 0.000; rss: 234MB     looking for entry point
    time: 0.000; rss: 234MB     looking for plugin registrar
    time: 0.000; rss: 234MB     looking for derive registrar
  time: 0.056; rss: 248MB       misc checking 1
  time: 0.104; rss: 255MB       type collecting
  time: 0.009; rss: 257MB       impl wf inference
    time: 0.001; rss: 273MB     unsafety checking
    time: 0.002; rss: 273MB     orphan checking
  time: 1.639; rss: 273MB       coherence checking
  time: 3.593; rss: 295MB       wf checking
  time: 0.357; rss: 293MB       item-types checking
  time: 7.038; rss: 309MB       item-bodies checking
    time: 0.270; rss: 317MB     match checking
    time: 0.042; rss: 317MB     liveness checking + intrinsic checking
  time: 0.312; rss: 317MB       misc checking 2
  time: 6.173; rss: 382MB       MIR borrow checking
  time: 0.000; rss: 382MB       dumping Chalk-like clauses
  time: 0.003; rss: 382MB       MIR effect checking
  time: 0.000; rss: 382MB       layout testing
    time: 0.067; rss: 384MB     privacy access levels
    time: 0.033; rss: 384MB     private in public
    time: 0.035; rss: 386MB     death checking
    time: 0.007; rss: 386MB     unused lib feature checking
      time: 0.038; rss: 386MB   crate lints
      time: 0.049; rss: 389MB   module lints
    time: 0.088; rss: 389MB     lint checking
    time: 0.108; rss: 389MB     privacy checking modules
  time: 0.338; rss: 389MB       misc checking 3
  time: 1.080; rss: 417MB       metadata encoding and writing
      time: 0.003; rss: 417MB   collecting roots
      time: 0.006; rss: 417MB   collecting mono items
    time: 0.009; rss: 417MB     monomorphization collection
    time: 0.000; rss: 417MB     codegen unit partitioning
    time: 0.013; rss: 448MB     codegen to LLVM IR
    time: 0.000; rss: 448MB     assert dep graph
    time: 0.000; rss: 448MB     serialize dep graph
  time: 0.026; rss: 448MB       codegen
    time: 0.001; rss: 448MB     llvm function passes [nalgebra.6tz72kds-cgu.2]
    time: 0.003; rss: 455MB     llvm function passes [nalgebra.6tz72kds-cgu.1]
    time: 0.003; rss: 455MB     llvm module passes [nalgebra.6tz72kds-cgu.2]
    time: 0.001; rss: 455MB     llvm function passes [nalgebra.6tz72kds-cgu.0]
    time: 0.001; rss: 454MB     llvm module passes [nalgebra.6tz72kds-cgu.0]
    time: 0.029; rss: 452MB     llvm module passes [nalgebra.6tz72kds-cgu.1]
    time: 0.002; rss: 451MB     LTO passes
    time: 0.002; rss: 451MB     LTO passes
    time: 0.003; rss: 451MB     codegen passes [nalgebra.6tz72kds-cgu.0]
    time: 0.005; rss: 453MB     codegen passes [nalgebra.6tz72kds-cgu.2]
    time: 0.042; rss: 455MB     LTO passes
    time: 0.056; rss: 456MB     codegen passes [nalgebra.6tz72kds-cgu.1]
  time: 0.149; rss: 456MB       LLVM passes
  time: 0.000; rss: 456MB       serialize work products
  time: 0.010; rss: 459MB       linking
time: 21.782; rss: 423MB                total

…now that’s interesting. Even building in release mode, codegen and LLVM don’t seem to take much time, 0.2 seconds of the total. Instead 19 of the 22 seconds are spent here:

  time: 1.639; rss: 273MB       coherence checking
  time: 3.593; rss: 295MB       wf checking
  time: 0.357; rss: 293MB       item-types checking
  time: 7.038; rss: 309MB       item-bodies checking
  time: 0.312; rss: 317MB       misc checking 2
  time: 6.173; rss: 382MB       MIR borrow checking

This is NOT what rustc spending most of its time in codegen looks like! And ALSO not what oorandom’s results look like! That was so contrary to expectations that I tried again with an older version of the nightly compiler I had lying around, to see if something weird had regressed recently. It gave the same results. So then I tried building a different project entirely that I knew took a while to build, ggez 0.5.1. ggez doesn’t really do much fancy metaprogramming stuff, but does use a lot of other libraries, so apparently finalizing and linking each of those got counted individually so time-passes spit out 1000+ lines of various “llvm passes” and “codegen passes”. The things that took a long time in nalgebra, “item-bodies checking” and “MIR borrow checking”, still were significant minorities in ggez that together took nearly a second, but the vast majority of time was spent in codegen and LLVM, at 6+ seconds each.

Desperate to find some sort of consistent pattern, I tried to build a simple hello-world program using the rendy graphics library (release 0.4), which like ggez has a large number of fairly straightforward dependencies but like nalgebra also does a big honkin’ pile of generic specialization as well. That took 15 seconds to build, 9 of which was spent in “codegen”. Neither this nor ggez spent anywhere near as much time in typechecking and borrow-checking as nalgebra did.

And when I built an actual binary with rendy it also spent several seconds in the “linking” stage, which hasn’t been a significant contributor at all when just building libraries, even libraries like ggez that one would expect to be linked with a bunch of other libraries.

…So with five different projects we get five different sorts of results. The plot thickens.

Take a deep breath…

Okay, rustc’s -Z codegen-passes is a decent place to start but the more you look at it the more obvious it becomes that it’s just an approximation that doesn’t stand up to close scrutiny too well. If we’ve learned anything by now it’s that this is complicated. So:

  • Validation and borrow-checking is usually a significant minority of the time consumed, but only a majority in weird codebases like nalgebra.
  • Codegen and LLVM takes a lot of time sometimes but not others
  • Linking binaries takes a lot of extra time but linking libraries not so much

What is also important is what we don’t really see much of:

  • Parsing is negligible. All the books on compiler writing that say parsing is a performance bottleneck are out of date. While Rust is relatively easy to parse and I’m sure a C++ compiler spends a lot more work on it, parsing is not a performance problem for a well-designed language with modern hardware and algorithms.
  • Macro expansion is always in the list enough to tip the scales, but is always overwhelmed other things. If we were looking at structopt or serde or something else very macro-heavy that might change though. (Heck, why don’t we check? Nope, serde’s numbers look a lot like nalgebra’s, and something using structopt doesn’t spend significantly more time in macro expansion than anything else. Might be that you need to actually use the macros quite a lot for the cost of the macros to add up, and just defining even complex ones doesn’t do much.)
  • There doesn’t seem to be much time spent in doing things like building IR’s and transforming them compared to everything else. As far as I can tell at this point, going from HIR to MIR and from MIR to LLVM IR takes trivial amounts of time. Either doing this is cheap/well-optimized, or rustc just doesn’t do a lot of it.

We need information. Information! Before we go too much deeper though, let’s look at memory. The original assertion that got me interested in this was about how much data was produced in each step of compilation, after all. And since rustc’s time-passes conveniently spits out RSS numbers along with its timings (resident set size, basically how much memory the OS thinks the program is using), we might as well use them. RSS is not a great way to measure memory usage, but we can start there. So, graphs!

RSS line graph
RSS line graph

First off: rustc is a little bit of a hog, taking 50+ megabytes just doing nothing. Most of this is probably just the actual code of the executable; most of the code seems to be in a DLL, librustc_driver.so, which is 60 MB. Though there’s also an LLVM DLL along with it that’s 80 megs, so. There’s a lot of code here.

Second, there’s a continual upward trend of memory usage through the program’s execution, which isn’t necessarily surprising, but there seems to be a couple more notable jumps. Let’s look at a normalized version of the graph, where the memory usage is scaled by the max memory used during that run:

Normalized RSS line graph
Normalized RSS line graph

Aha, yes. Four distinct jumps, each more or less present in all programs, but different magnitudes in each. I’ve outlined ’em in the graph. The graph isn’t quite exact ’cause making it involved some copy-pasting by hand, and as mentioned, -Z codegen-passes isn’t the most reliable tool, but it gives us something to look for. The compiler stages associated to each jump seem to be (roughly) “expansion”, “coherence checking”, “MIR borrow checking”, and of course “codegen”+“LLVM passes”. Not much more to say without going deeper, just that some of the time-intensive passes are also quite memory-intensive.

…before the plunge.

I love the Rust community’s documentation habits. Keep leading from the front, compiler team.

It appears that -Z codegen-passes is basically now made obsolute by a more precise (and hopefully more accurate) self-benchmarking method, a -Z self-profile flag with a set of tools appropriately called measureme. This outputs some binary profiling data that then gets crunched and displayed by one of measureme’s tools, summarize. Or even better, it can output ♥flame graphs♥.

(Right click on an image and select “view image” and you will be able to zoom in on various bits in the SVG.)

So we have an interesting spread. In oorandom, we get about 35% static analysis, somewhere in the range of 20-25% optimization and codegen, and the rest is dross – including 30% or so of time not accounted for, but since that’s only 0.05 seconds or so I’m not too worried. nalgebra spends over 90% of its measured time on static analysis, most of which is “mir_borrowck” and “type-check crate”, which again fits more or less what we saw from our previous attempt. And ggez, which I’m including anyway now since since it’s apparently a different enough workload to be interesting, spends 23% of its time in static analysis, a whopping 30ish% of the time unaccounted for (that’s about 15 seconds!), and the rest is basically all LLVM optimization, LTO, and code generation. What’s interesting here that didn’t quite show up as well before is there’s nearly 10% under “exported symbols”, which looks like it involves a lot of monomorphization stuff.

Dang, better tools make this way, way easier. But when building my rendy hello-world program there’s an even bigger gap of unaccounted-for stuff, almost 50%! So, there’s plainly plenty of room for making the tools better. (It also makes clear that monomorphization also happens as part of “codegen_crate”, and that particular crate spends about 10% of its time on it; as I’d halfway expected, instantiating giant piles of generics takes a lot of work.)

What -Z self-profile doesn’t give us though, is memory usage. Valgrind’s Massif tool seems perfect for what I want, but for some reason it doesn’t seem to give any output at all when running with rustc or cargo programs. There are some docs that note that you may need to compile rustc with jemalloc = false “for it to work well”, so maybe we should try that. Massif does work with other Rust programs, so let’s try building a rustc without jemalloc…? It looks like that’s the default on Linux, but we should make sure anyway…

But now that I think of it, that won’t answer the question I started this whole thing to answer, which is… how gnarly is the LLVM IR that rustc produces? Or, more formally, how much LLVM IR gets fed into the start of LLVM’s optimizer, versus how much of it comes out the other end and gets turned into machine language? This actually just got mentioned recently in an update by one of the rustc optimization TOP MEN: “There’s been a long-running theory that the LLVM IR produced by the Rust compiler’s front end is low quality, that LLVM takes a long time to optimize it, and more front-end optimization could speed up LLVM’s code generation.” So, even the pro’s don’t know the answer for sure! We HAVE to at least try to find out.

To do this, we have to add a bit more instrumentation to rustc.

Digging ourselves deeper

So we have to build our own rustc. This is something I’ve done once or twice, but I am far from comfortable with the workflow. However, as usual, the documentation is great. I love the official Rust teams so much. I’m using the default config.toml config on the assumption that this is a sane starting place. Looks like it is, so after a git clone it’s just a matter of ./x.py build.

Building from scratch took 29 minutes on a Ryzen 1700. Whew! That includes downloading some dependencies and doing a full bootstrap though, which compiles the compiler at least twice, basically building the new version of the compiler with the current one, and then rebuilding the new compiler with itself to make sure it still actually works. If we’re just hacking around for benchmarking we can omit that second build. Again, the way to do this is in the docs: ./x.py build -i --stage 1 src/libstd, which gives us our changed compiler and updated std in only 6 minutes. And doing ./x.py check to only do typechecking takes less than 2 minutes from scratch, and a only a few seconds to check a small change. Great.

I also tried building rustc on my Ryzen 2400G file server/HTPC, mostly out of morbid curiosity, but that machine only has 6 GB of memory available and so thrashed itself to death halfway through. Alas! It is worth noting that rustc plus the standard libraries and tools (rustdoc, rustfmt, etc) is about 1.3 million lines of Rust code. And LLVM adds another 8 million lines or so of C/C++, even if most of it is stuff not used directly in rustc such as tests, benchmarks… uh, and all of the clang compiler, apparently. I guess it’s just easier for rustc’s build system to download the whole LLVM package than to try to only pull the parts it actually needs.

…Hmm, for comparison, gcc 9.2.0 is somewhere in the 5-8 million line range depending on how you measure, Go 1.13.3 is 2 million lines of Go, and the Roslyn 3.2.0 C#+VB compiler is about 5-6 million lines. So yeah, rustc is a bit of a chonker compared to most Rust programs, but certainly isn’t unreasonable by the standards of professional-grade compilers in 2019. It might still be quite valuable indeed to make a small, portable Rust compiler someday, along the lines of pcc or lcc, even if it only implements a subset of the language. mrustc is an alternative Rust compiler written in C++, but unfortunately it doesn’t seem super active these days; it’s about 110,000 lines of code. Naturally, lines of code is a terrible measurement so we’re not really comparing apples to apples here, but it’s still interesting to see that all these compilers fall into the same order of magnitude.

It’s actually pretty interesting to just read through the rustc book because rustc is in the process of becoming more and more of an incremental compiler, which drastically changes the structure. Instead of the “pure function” of source code -> object code I described at the start of this, it’s turning into a “query based” architecture, which as the name suggests, is a lot more database-y. I’m still grappling with the idea, but it sounds like instead of a step-by-step computation pipeline it turns the compilation process into a series of queries, where a query can be something like “get me the type for this variable” or “get me the compiled object code for this expression” or something, and each query can either do work to compute this or just return memoized results that have already been done elsewhere. This is naturally for the purpose of not needing to recompile everything when a program changes, just the bits that changed. It does seem like it would let the compiler do things in a more out-of-order fashion, and queries that depend on each other’s results will automatically sync waiting on results to be produced and cached, so you don’t even have to express the dependencies explicitly. Not a lot of compiler textbooks out there that build compilers this way! But it seems a logical way to do things because an incremental compiler doesn’t just do a pile of work and then is finished, batch-style; it is more of an interactive program where different parts of it may be fetching or altering data out of sync with each other.

I am reminded of Lisp systems, and their Smalltalk cousins, which are very much on-the-fly, query-based compilers that keep code and metadata in an in-memory database. And Lisp systems traditionally excel at interactive usage, live modification, and fast turnaround. Coincidence? I think not.

Anyway, back to business! This query kind of system is foreign to me, but the docs at least give me an idea of what’s going on. Alas, there’s limits to the majesty of all mortals, including the rustc team, and so the doc chapter for the codegen stuff is pretty empty. From here on it’s up to us, and librustc_codegen_llvm seems like the place to start, so let’s put on our hip waders and make an actual change to some actual source code and see what happens…

impl CodegenBackend for LlvmCodegenBackend {
    fn init(&self, sess: &Session) {
        ...
        dbg!("FLARGLE!");
    }
    ...

That looks like a good first change to make, and fortunately it compiles. Now how do we actually build stuff with our altered compiler? Ah, you can tell rustup to define a new toolchain. From your rustc source dir: rustup toolchain link stage1 build/x86_64-unknown-linux-gnu/stage1/. Then we go to nalgebra’s source dir and do cargo +stage1 build, and while building nalgebra it prints out some brand new and super helpful debug statements. Perfect.

…man, starting to dig through the source code of a really large open source program is so weird. It’s like wandering around a giant cathedral that’s being constantly renovated and repaired and maintained over the course of years by a giant team of invisible crafters and architects, who mostly communicate via notes and designs pinned to the walls in various places. (Sounds quite Mathic, really.) Did you know that rustc has most of the groundwork laid for supporting other backends besides LLVM? I didn’t, but it’s described very nicely by someone named Denis Merigoux in a random README file. It looks like src/librustc_codegen_ssa/ has an API for these backends to implement and a bunch of backend-independent utility code, and src/librustc_codegen_llvm/ actually implements this API for the LLVM backend. There’s tons of knobs and stuff to fiddle, but the bits I want to tinker with seem pretty simple. There’s a type CodegenContext<T> that does the actual work, and then you define your own T type for it (in this case, LlvmCodegenBackend) and implement a few traits on it, and you’re good to go. One of these traits provides useful-sounding methods like codegen() and optimize(), and those sound like exactly the methods we’re interested in. The only complexity then comes from the fact that LLVM mutates everything in place and it’s horrible.

Okay, there’s limits to how far I want to go into this right now, so I’m going to just print out instruction counts for a LLVM Module (compilation unit, hopefully a crate) before and after LLVM is done optimizing it. What I REALLY want is to see the initial LLVM IR generated from MIR to see how pessimal it looks, but that sounds like the start of a whole new project. Let’s start simple. There’s a Module::getInstructionCount() method in LLVM, but there’s no FFI binding to call it from Rust, so heck, let’s just write one. Some fairly simple additions to rustllvm/RustWrapper.cpp and librustc_codegen_llvm/llvm/ffi.rs and it builds, runs without crashing, and produces output that seems at least vaguely sensible. Annoyingly, due to the aforementioned query-based architecture it’s hard to figure out what the exact sequencing and dependencies between passes are. However, some poking around has me at least pretty sure that optimize() and codegen() are what we’re interested in and we can print some instruction counts during them without anything else blowing up. The only other likely-looking thing is some LTO-type stuff, and I’m going to ignore it since LTO isn’t turned on by default and I’m not super interested in it right now. So, optimize() and codegen() look like promising places to start.

 > cargo clean; cargo +stage1 rustc --release -- -C codegen-units=1
   Compiling oorandom v11.1.0 (/home/icefox/src/oorandom)
Module oorandom.4cri1ig3-cgu.0, 654 instr's, before opt
Module oorandom.4cri1ig3-cgu.0, 272 instr's, after opt
Module oorandom.4cri1ig3-cgu.0, 272 instr's, before codegen
Module oorandom.4cri1ig3-cgu.0, 277 instr's, after codegen
    Finished release [optimized] target(s) in 0.23s

Holy cow, it works. And even outputs something vaguely plausible! -C codegen-units=1 is there ’cause otherwise rustc multithreads the build and breaks it into two parts, even for something as small as oorandom, and each part has their own instruction counts. The sums come out the same regardless of the number of codegen-units though, so we’re probably doing something right, but let’s just do codegen-units=1 to keep life simple. Let’s look at nalgebra:

Module nalgebra.buzfxn0q-cgu.0, 1336 instr's, before opt
Module nalgebra.buzfxn0q-cgu.0, 925 instr's, after opt
Module nalgebra.buzfxn0q-cgu.0, 925 instr's, before codegen
Module nalgebra.buzfxn0q-cgu.0, 923 instr's, after codegen

That’s not many actual instructions. Carrying on from the theme of our earlier results, it looks like building nalgebra alone doesn’t actually produce much code, it produces a bunch of generics and type definitions and such that then need to be specialized by being used somewhere before the compiler will actually do more with them. So the 35 kLOC nalgebra produces less than 1000 actual LLVM instructions – if I’ve done these counts correctly. Meanwhile, for the 14 kLOC ggez, the results look like this:

Module ggez.b51tqrhr-cgu.0, 597,300 instr's, before opt
Module ggez.b51tqrhr-cgu.0, 198,942 instr's, after opt
Module ggez.b51tqrhr-cgu.0, 198,942 instr's, before codegen
Module ggez.b51tqrhr-cgu.0, 226,453 instr's, after codegen

Just a bit beefier! And we see that on the more codegen-heavy examples that LLVM’s optimization steps do cut the number of LLVM instructions by 1/2 to 1/3, and then a bit more stuff gets added during codegen, ~10% or so. Recall from our -Z self-profile graphs that codegen is part of where monomorphization happens, so it is logical that it could add more instructions. So for my rendy thing that uses giant piles of complex generics, I’d expect that step to add a lot of instructions:

Module ggraphics.9o1oe99j-cgu.0, 1,329,549 instr's, before opt
Module ggraphics.9o1oe99j-cgu.0,   267,528 instr's, after opt
Module ggraphics.9o1oe99j-cgu.0,   267,528 instr's, before codegen
Module ggraphics.9o1oe99j-cgu.0,   320,861 instr's, after codegen

Hmm, after-codegen is only 20% more instructions than before-codegen, while for ggez it was 13% more, so not as big a difference as I expected. Still goes in the direction we expected though, so there’s at least a BASIC sanity check! But dang, the post-optimization code size is 20% the pre-optimization code size! Number of instructions alone does not provide much information, since Eris only knows what what kind of code LLVM would find “easy” to optimize vs “hard”, let alone where rustc’s output falls on that spectrum. But making it cut the size of your program by a factor of 5 still seems like a good way to make your optimizer do a lot of work!

To continue this work from the angle of “what does the program look like before and after LLVM gets its paws on it”, my next thought would be to do statistics to instruction mixes or sizes of basic blocks or other stuff to try to figure out what’s going on. After all, if you have an intractably huge pile of complex data, you can just throw statistics at it for a while and hopefully get some clue of what it looks like. And I can’t think of many things that better fit the description of “intractably huge pile of complex data” than 1,329,549 LLVM instructions. However, that’s really the brute force kind of solution, and I would probably learn a lot more by actually taking the time to understand what’s going on instead of just describing its results. But that’s something for another time. Still, even this small bit has been pretty interesting.

It would be neat to see similar statistics statistics for clang compiling C or C++ as well, though. However, my experience has been that it is really quite difficult to compare Rust to C/C++ compilers, because it’s very difficult to find projects in both C/C++ and Rust that are close analogues of each other. Even when they do the same thing, they tend to do it in very different ways.

Encore!

Okay, I am hitting the edges of both my knowledge and patience, but there’s one last thing I want to try to do to close the loop. We’ve found and poked at the part of the compiler that goes from LLVM IR to actual machine code, looking at its optimization steps. Now I want to try things the other way around and find the code where we go from MIR to LLVM, and see what it looks like.

Aha, it’s in librustc_codegen_ssa. This is… hard to get into, because I am not familiar with MIR and how it’s built, and so I don’t know how the types in here fit together very well. There is a BuilderMethods trait in that sub-library that appears to be what you implement for a particular backend to turn MIR into that backend’s IR though, and the librustc_codegen_llvm module implements it. The implementation doesn’t look particularly complicated or obscure actually, it just translates MIR basic blocks and math instructions and whatnot into LLVM equivalents. It looks like MIR and LLVM IR are not terribly different, which isn’t surprising in retrospect since MIR was designed by people who’d been working a lot with LLVM already. Nothing in that trait impl seems to do a lot of weird work or turn single instructions into many, besides load_operand() and write_operand_repeatedly() for some reason. So the LLVM IR you get out seems like it should be pretty similar to the MIR you put in. That’s both easier and less informative than I had hoped.

So, that’s as far as I feel like going with this. I guess it’s a short encore. Figuring out how MIR really works is out of scope, at least until I decide to write my own rustc backend. …Which I now know how to do. Hmmmm…

Conclusions

Well, that was an adventure! What have we learned? Not a GREAT deal of concrete stuff, I’m afraid, but a lot of general things:

  • Production grade compilers are a lot more complicated than they look! …but good tools (and docs!) can make them comprehensible even to a non-expert.
  • “Why is Rust slow to compile” is a very complicated question with a lot of possible answers, depending on your workload.
  • A lot of work in compiling a Rust library gets pushed off until the library is actually used in an executable!
  • Strong typing, borrow checking, monomorphization and all that jazz can take up a lot of time when used heavily…
  • …but often you can also just blame LLVM and you won’t be very wrong.

From our (still sort of crude) benchmarks, one thing that consistently consumes a medium-ish chunk of time is MIR borrow checking, which makes me sad because if people somehow think that borrow checking is slow then other languages will be more hesitant to adopt it. But upon contemplation I’m not really worried about it ’cause MIR borrow checking is a very new feature, basically a better rewrite of the existing borrow checker to be more sound and convenient. Since it’s brand new, it probably hasn’t really been profiled and optimized much yet. I might not expect a 10x improvement in its performance, but 2-3x getting squeezed out of it next time someone takes a good hard look wouldn’t surprise me much. (Edit from the future: Apparently the compiler team has actually optimized it pretty heavily precisely because they didn’t want people to think the new borrow checker was slow, so the current implementation probably won’t see large improvements. Alas!)

As for our original thesis of “does rustc generate piles of mediocre code and then just let LLVM clean it up?”, the tentative conclusion appears to be “maybe yes”, just based on before-and-after-optimization code sizes. And we’ve learned that to really improve this situation, it has to be done at or before the level of MIR, not in the MIR-to-LLVM or LLVM-optimization steps. But since codegen is only one part of the picture, even if we make the fastest compiler backend in the universe, nalgebra and similarly metaprogramming-heavy crates will still take a lot of time to typecheck.

There’s still a lot to learn about the codegen side, though, which is frankly what I’m more interested in. Does any other compiler really do better? What LLVM optimizations do the most work on reducing the big heap of input code? Which take the most time to perform? You’d have to know the answer to those to make a meaningful start at improving rustc’s MIR output to make things compiler faster. Then you get deeper into details such as, are there any patterns that rustc generates that LLVM particularly likes or dislikes? Since MIR and LLVM IR seem reasonably similar, is it really going to be faster to generate better MIR than to optimize LLVM IR? And can we take advantage of Rust’s properties to perform better optimizations than LLVM can do alone?

I’m not going to answer these questions, at least not any time soon! This was always intended to be a pretty light touch on the topic, I don’t have time and energy spending weeks digging into it, much as I’d like to. But like all good research, finding out some of the answers to this has raised more questions to think about.

To the next adventure!