UNFINISHED PAPER

But I wanted to get it out there sooner rather than later

Some Strategies For Fast Lexical Analysis when Parsing Programming Languages

Sean Barrett, 2015-05-01

Abstract: Some techniques I've used to make lexical analysis faster: enhancing state machine traversals with additional inner-loop operations to minimize branch mispredictions, and restricting location-tracking-for-diagnostics to only file byte-offsets.

Contextual questions

1. Why not just use lex/flex?

I've generally found the use of lexer & parser generators like lex & yacc to be painful; they generate source code, which complicates the build process (e.g. on Windows where developers don't have them available by default); yacc is twitchy (shift-reduce conflicts, error handling, etc.); the regular expressions in lex/flex can get annoying (e.g. for strings with backslash-escaped elements).

I'm not the only one who reacts this way; for example lcc uses a recursive descent parser and a hand-coded lexer.

But the proof is in the pudding: what if we can just go faster than flex, and the speed of lexing matters? Here are the results I got in 2006 when I did this work with a C parser:

Runtime performance lexing ~7,500,000 lines of C code using MSVC 6 in 2006:

lexer                        input reading        performance
flex (default)               fread 8KB blocks     13.56 s
stb_lex (w/symbol hashing)   fread whole file      4.84 s
stb_lex                      fread whole file      4.23 s
flex -F (fast)               fread 8KB blocks      3.07 s
flex -f (full)               fread 8KB blocks      2.92 s
handcoded                    fread whole file      2.45 s
handcoded mmap               mmap()                2.14 s
wc                           fread 64KB blocks     1.73 s

This isn't an apples-to-apples comparison because the handcoded table system does much more, but since it's also the fastest, the extra work only strengthens the comparison. And note that these are purely synthetic tests that just do lexical analysis in a loop, without any actual parsing occurring.

The exact details don't matter, though; the trade-offs for performance change with new hardware. The rankings might be different if I ran the same tests with modern compilers on modern hardware. I'm not going to, though; I just want to talk about the ideas behind the "handcoded" lexer, and some other ideas, so people looking at making fast lexers maybe have some extra tools in their toolboxes.

2. Who cares about 1 extra second for parsing 7.5 MLoC?

Not you, obviously, or you wouldn't ask the question.

I do, though.

When Go was first being demonstrated (which was after I did this work), I rolled my eyes when everybody was all excited about how it was supposedly designed for compilation speed—because it compiled something like 3x or 4x slower than the compiler I was working on at the time.

3. No, seriously, does lexing speed matter that much?

Maybe? Maybe not? You'll have to measure.

Back when I was learning this stuff (in the days when we all drove Model Ts), when lex generated the state machine as code, e.g. goto statements as edges(!), and C compilers were maybe still single-pass for debug builds, the common argument was that the lexical analyzer processed more data than any other phase of the compiler. The parsing processed tokens, and there were many fewer tokens than there were characters. Yacc was linear in the number of input tokens. Even if you built an abstract syntax tree and codegen'd from that, still most things were basically O(N) for their N inputs, and so lexical analysis, with the biggest N, was important to get right.

Generally I don't think people think this is true any more, because compilers do a lot more now. But if you're not optimizing, I'm not sure there's really a good excuse to be doing that much more to make the rest of it so much slower that the lexical analyzer doesn't matter at all. (But I guess maybe it could be that slow.) It also depends on the language, e.g. C++ is presumably slower to parse than C.

Put it this way: if it takes two minutes to compile 7.5 million lines of code, then saving one second of lexical analysis probably isn't worth it. But if it takes five seconds to compile that code? Then it's totally worth it.

4. What about stb_c_lexer.h?

stb_c_lexer.h is a "classic" hand-coded lexer, one which switches on the first character of each token and then uses additional hand-coded logic to determine the actual token. This is very different from a hand-coded table-driven state-machine approach (indeed I wouldn't normally call code doing if()s on the next character a 'state machine' at all; the state is implicit in the program counter).

I don't recommend the stb_c_lexer.h approach for speed, but the obvious downside of a hand-coded state-machine is the difficulty of modification and maintenance, so I went with the simpler, slower approach for stb_c_lexer.h, which is more intended for quick-and-dirty or small programs. If you're making a huge 500,000 line compiler you can probably afford to write a real lexer. (And it's possible to change stb_c_lexer.h to something faster with the same API, if it turns out anyone cares.)

5. What about the C preprocessor?

I'll talk about the interactions with this after I present the main ideas.

A hand-coded state-machine based lexer

What is a lexer? To simplify parsing, modern language parsers are divided into two components: one step, the lexical analyzer (lexer), breaks the input down into tokens, and the second step, the parser, matches the stream of tokens against a context-free grammar without any worry about the details of the individual characters that make up those tokens. We're just talking about that tokenization step.

When in 2006 I wrote the lexer I'm about to describe, I was working under the following assumptions which were different from when I learned about lexers in college around 1988 or so:

A naive hand-coded lexer

A naive hand-coded lexer, using the convention that single-character tokens use their character code as their token ID, looks something like this:

   top:
   // skip whitespace
   while (isspace(*p_src)) {
      if (*p_src == '\n')
         ++line_number;
      ++p_src;
   }
   switch (*p_src++) {
      case '?': return '?';

      case ':': return ':';

      case '/':
         if (p_src[0] == '/') {
            ... parse single-line comment ...
            goto top;
         } else if (p_src[0] == '*') {
            ... parse multi-line comment ...
            goto top;
         } else if (p_src[0] == '=') {
            ++p_src;
            return TOKEN_div_eq;
         }
         return '/';

      case '<':
         if (p_src[0] == '<') {
            if (p_src[1] == '=') {
               p_src += 2;
               return TOKEN_shift_left_eq;
            } else {
               p_src += 1;
               return TOKEN_shift_left;
            }
         } else if (p_src[0] == '=') {
            p_src += 1;
            return TOKEN_less_eq;
         }
         return TOKEN_less;

      case '_': case 'A': case 'B': /* ... */ case 'z':
         ... parse identifier ...
         return lookup_identifier(identifier_buffer, id_length);

      ... etc. ...
   }

(Note that here I've written it as if p_src is a global variable; a proper implementation would copy p_src into a (register-friendly) local within the body of the function, then write to the global at the end. I've also not included correct handling of CR, LF, and CR+LF, although all my lexers do include this.)

In each case, we use the first character of a token to effectively jump to a block of code that hand-parses that kind of token. This makes the code very straightforward to follow. It uses no tables so it is cache-friendly, but it is very branch-mispredict heavy. The rule of thumb is that a loop over e.g. a comment or an identifier is basically a counted loop with a random length, so with a modern branch predictor we expect exactly one mispredict each time through the loop (the branch always predicts that the loop will continue, so the one time we exit the loop, the branch mispredicts).

So here we see a branch mispredict every time we finish skipping whitespace, another branch mispredict every time we hit a newline, another on the switch's computed jump, and another at the end of each loop that scans out a comment or an identifier.

Using a state machine

Let's start constructing our own table-driven state-machine along the lines of how flex works. The tables can be hand-coded to represent the logic used by the naive approach above, or you could machine generate them from regular expressions, or whatever.

The inner-loop of a typical parsing state machine looks like this:

   state = START;
   do {
      int ch = *p_src++;
      state = transition[state][ch];
   } while (!is_final_state[state]);

If we use this to parse out tokens in our lexical analyzer, the code still needs to do token-type-specific processing, like looking up identifiers, computing numeric values for literals, etc. To do this, we still need to follow it with a switch statement:

   ... skip whitespace ...

   state = START;
   p_token_begin = p_src;

   do {
      int ch = *p_src++;
      state = transition[state][ch];
   } while (!is_final_state[state]);

   switch (state) {
      default: return token_for_state[state];
      case STATE_identifier:
         return lookup_identifier(p_token_begin, p_src - p_token_begin);
      ... etc...
   }

Note that this is different from the switch in the naive version; that one switched on the first character of the token and fully parsed the token by hand, whereas this one runs after the token has already been fully parsed to its end.

We've actually maybe made branch misprediction worse in many cases, as we now branch mispredict when exiting the state machine loop, and then we immediately branch mispredict doing the computed jump to the correct block for the token type. Also note that we actually rescan many token types, like literals, if we have to further process them into literal values, which will most likely cause us to branch mispredict again when those counted loops terminate. (This is typical of a flex-based parser: your token-specific code runs after the token is parsed.)

We can improve the first redundant misprediction by combining the jumps; the do...while can be replaced by a computed branch which jumps to the state-machine processing step for non-final states, or jumps to the token-specific block for final states. Using gcc's computed goto (labels as values), this is straightforward, but getting this behavior out of MSVC is unlikely.
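Here's a rough sketch of what that combined dispatch might look like with gcc's labels-as-values extension; the table and label names (state_target, do_identifier, and so on) are made up for illustration, not taken from the measured lexer:

   // Sketch using gcc's &&label extension. state_target[] maps non-final
   // states to &&advance and each final state to a token-specific label;
   // since &&label is only meaningful inside the containing function, the
   // table gets filled in on the function's first call (not shown).
   static void *state_target[NUM_STATES];
   int state = START;
   int ch;

   advance:
   ch    = *p_src++;
   state = transition[state][ch];
   goto *state_target[state];   // one indirect jump replaces the loop test
                                // plus the switch's computed jump

   do_identifier:
   return lookup_identifier(p_token_begin, p_src - p_token_begin);

   do_simple_token:
   return token_for_state[state];

The point of the exercise is that the "am I done?" test and the per-token dispatch fold into a single indirect branch.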

I'll come back to this later, but let's first focus on getting the inner parsing loop faster.

A smaller state-machine inner loop

Again, here is the inner loop we have so far, this time also recording where the token begins:

   state = START;
   p_token_begin = p_src;
   do {
      int ch = *p_src++;
      state = transition[state][ch];
   } while (!is_final_state[state]);

Here we assume our language has some out-of-band character; traditionally this is NUL (ch == '\0'), which we can guarantee appears at the end of the input buffer by writing one there (but as mentioned we maybe can't with mmap).

The state numbering is arbitrary, so we optimize this by placing the final states at the beginning of the range:

   state = START;
   p_token_begin = p_src;
   do {
      int ch = *p_src++;
      state = transition[state][ch];
   } while (state > LAST_FINAL_STATE);

The tables for this are rather large; suppose we have 30 states, then we have 30*256 table entries assuming 8-bit characters. Assuming we will get better caching behavior by using less cache (though this may not matter on a synthetic test with nothing else touching primary cache), we divide the input characters up into equivalence classes where every character in the class has the same transitions; for example, most alphabetic characters behave the same.

   state = START;
   p_token_begin = p_src;
   do {
      int ch = *p_src++;
      int eq_cls = equivalence_class[ch];
      state = transition[state][eq_cls];
   } while (state > LAST_FINAL_STATE);

Note that this only really works for ASCII; if we want special behavior for multibyte UTF-8 characters, each byte of the character needs a separate equivalence class, and if we have enough distinctly parsed multibyte UTF-8 characters (for example, to support Unicode identifier characters), we may end up with a very large number of equivalence classes, making this not really much of a performance win. (Supporting Unicode identifier characters will also add a lot more states, blowing up the table size on the other axis. I'll come back to this.)
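For concreteness, here is roughly how an equivalence_class[] table might get filled in at startup; this is just a sketch, and the class names are only illustrative (the actual list I used appears later in the article):

   unsigned char equivalence_class[256];

   void build_equivalence_classes(void)
   {
      int ch;
      // default: lump every character not otherwise listed into one class
      for (ch = 0; ch < 256; ++ch)
         equivalence_class[ch] = C_misc;
      for (ch = 'a'; ch <= 'z'; ++ch) equivalence_class[ch] = C_letter;
      for (ch = 'A'; ch <= 'Z'; ++ch) equivalence_class[ch] = C_letter;
      equivalence_class['_'] = C_letter;
      for (ch = '1'; ch <= '9'; ++ch) equivalence_class[ch] = C_digit;
      equivalence_class['0']  = C_zero;     // '0' is separate for octal/hex/float handling
      equivalence_class[' ']  = C_white;
      equivalence_class['\t'] = C_white;
      equivalence_class['\r'] = C_cr;
      equivalence_class['\n'] = C_lf;
      equivalence_class[0]    = C_eof;      // the out-of-band end-of-buffer byte
      // ... one class per symbol character that needs distinguishing ...
   }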

A faster state-machine lexer

Now it's time to get tricky. In the current approach, we need a whitespace-skipper at the front which will mis-predict once on exit, and we treat comments as pseudo-tokens and (maybe) parse them with the state machine, but then jump back to the whitespace skipper after parsing them. (None of this code is shown in the above inner loop example.)

So, the trick we can do here is to include the comment-parsing and whitespace-skipping in a state machine loop. But note that we need to know where the actual token starts. So as a first attempt we make two state machines, one for the whitespace & comments and one for the token:

   // parse comments and whitespace
   state = START_WHITESPACE;
   do {
      int ch = *p_src++;
      int eq_cls = whitespace_equivalence_class[ch];
      state = whitespace_transition[state][eq_cls];
   } while (state > LAST_WHITESPACE_FINAL_STATE);

   --p_src; // back up to first character of token

   state = START;
   p_token_start = p_src; // record where this token started

   do {
      int ch = *p_src++;
      int eq_cls = equivalence_class[ch];
      state = transition[state][eq_cls];
   } while (state > LAST_FINAL_STATE);

   switch (token_for_state[state]) {
      ...
   }

Now we still have three mispredicted branches, one for each loop end and one for the switch.

I used separate tables for the two machines so that they can be as small as possible, but we could use the same tables for both (by having the states be non-overlapping).

If we do that, then we can use a single loop for both, removing a mispredicted branch. To do this we have to move the update code that exists between the loops above inside the loop. If we do this the way it works above, it's pretty gross, although out-of-order processors might still run it well:

   char *p_token_start, *dummy;
   char **token_start_table[2] = { &dummy, &p_token_start }; 
   state = START;

   do {
      int ch = *p_src++;
      int eq_cls = equivalence_class[ch];
      state = transition[state][eq_cls];
      *token_start_table[is_first[state]] = p_src; // usually writes to dummy, writes to p_token_start once
   } while (state > LAST_FINAL_STATE);

   switch (token_for_state[state]) {
      ...
   }

Fortunately, there's a much simpler way; instead of recording a pointer to the start of the token, we record the length of the token. With two loops this would mean incrementing the length in the second loop but not the first; in the combined loop, it's just:

   int token_len = 0;
   do {
      int ch     = *p_src++;
      int eq_cls = equivalence_class[ch];
      state      = transition[state][eq_cls];
      token_len += in_token[state];
   } while (state > LAST_FINAL_STATE);

   p_token_start = p_src - token_len;
   switch (token_for_state[state]) {
      ...
   }

We don't always need the token start location (for example, after parsing a single-character symbol), so we move the "p_token_start" computation inside the switch, into only the cases that need it.

Line numbers

Now, one thing I've casually omitted here is keeping track of the number of lines in the file. We do this because we want to attach to each parsed token its location in the file, which the compiler can use for generating diagnostic messages. In the original naive hand-coded parser, this required extra tests. Once again, we can do it here with some extra operations in the inner loop:

   do {
      int eq_cls = equivalence_class[*p_src++];
      state      = transition[state][eq_cls];
      token_len += in_token[state];
      num_lines += is_newline[eq_cls];
   } while (state > LAST_FINAL_STATE);

However, that can't correctly handle CR+LF (counting every CR and every LF would count a CR+LF pair as two lines, while counting only LF mishandles other line-ending conventions); to handle CR+LF (which all of my programs always do) we need a little state machine. Rather than run a second state machine, we require that the main parse state distinguish this (the only cases are when parsing whitespace, and when parsing multi-line comments, so this only adds 4 states to the state machine):

   do {
      int eq_cls = equivalence_class[*p_src++];
      state      = transition[state][eq_cls];
      token_len += in_token[state];
      num_lines += is_first_char_of_newline[state];
   } while (state > LAST_FINAL_STATE);

Micro-optimization

Instead of

      switch (token_for_state[state])
we can obviously just do
      switch (state)
and save a little bit of speed/tables.

Let's consider this line:

      state = transition[state][eq_cls];

Assuming there are at most 256 states, then this is a table of bytes, so it's reasonably compact. However, it still implicitly requires a shift or a multiply to handle the two-dimensionality. (If it weren't byte outputs, it would require another shift as well to convert from indices to byte offsets.)

We can optimize this if we "pre-shift" one of the inputs:

      state = transition[state + eq_cls];
which maps to a single x86 instruction.

Nominally we tend to think of this as "a table of transitions per state", and grouping it the way I showed it originally, transition[state][eq_cls], suggests preshifting/premultiplying 'state'.

However, in the inner loop, we index into multiple tables by 'state' and we don't index into any other tables by 'eq_cls', so we should actually reverse the table to be transition[eq_cls][state] and then premultiply eq_cls. This also allows us to directly switch on 'state' as mentioned above and keep the (compiler's) jump table for that small.

If the product of the number of states and the number of equivalence classes is no more than 256, then we can still fit eq_cls in a single byte, keeping the equivalence_class[] table small. (Actually, they can be slightly bigger and still fit.)
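To make those last two points concrete, here's a sketch of the table setup; NUM_STATES, NUM_CLASSES, and the function name are placeholders, and it assumes the premultiplied values still fit in a byte:

   // transition[] is the reversed table transition[eq_cls][state] flattened
   // into one dimension; the class table stores eq_cls * NUM_STATES so the
   // inner loop only needs an add.
   unsigned char premultiplied_equivalence_class[256];
   unsigned char transition[NUM_CLASSES * NUM_STATES];

   void premultiply_classes(void)
   {
      int ch;
      for (ch = 0; ch < 256; ++ch)
         premultiplied_equivalence_class[ch] = equivalence_class[ch] * NUM_STATES;
   }

   // the inner-loop step then becomes:
   //    state = transition[state + premultiplied_equivalence_class[*p_src++]];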

The "handcoded" lexer

The above has evolved the state-machine lexer into roughly what the "handcoded" entry in the table at the beginning of this article measured:

   token_len = 0;
   state = START;   
   do {
      int eq_cls = premultiplied_equivalence_class[*p_src++];
      state = transition[state + eq_cls];
      token_len += in_token[state];
      num_lines += first_char_of_newline[state];
   } while (state > LAST_FINAL_STATE);

   switch (state) {
      ... using 'p_src - token_len' to recover token start ...
   }

In the "handcoded" lexer reported at the beginning of the article, I have:

Contents of the state machine

Up until now, I've been assuming this state machine is fully lex-like, in that it fully parses each token. Remember that the switch in the state-machine code is switching on the state and we've been assuming that state fully implies the token type.

However, nothing stops us from parsing only some of a token with the state machine, then falling through to the switch, and using C code to do the rest of the parsing. In the "handcoded" line in the table above, that is how literals are handled, which is why it mentions that the values of literals are computed, whereas they aren't in the flex version.

For example, if we see a " character, the state machine transitions to a 'saw a " character' state which is final. This exits the state machine, and then the program switch()es to code which then manually parses out the string and handles \-escapes in the string.
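The string case might look something like the following sketch; this is not the actual code (buffer-overflow checks, the full set of escape sequences, and error reporting are omitted, and names like string_buffer are invented):

      case STATE_saw_double_quote: {
         // the state machine stopped just past the opening quote
         char *out = string_buffer;
         for (;;) {
            int c = *p_src++;
            if (c == '"' || c == '\0')
               break;                        // closing quote or end of buffer
            if (c == '\\') {
               switch (*p_src++) {
                  case 'n':  *out++ = '\n'; break;
                  case 't':  *out++ = '\t'; break;
                  case '\\': *out++ = '\\'; break;
                  case '"':  *out++ = '"';  break;
                  // ... remaining escapes ...
                  default:   /* diagnose a bad escape */ break;
               }
            } else {
               *out++ = (char) c;
            }
         }
         *out = 0;
         return TOKEN_string_literal;
      }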

Number processing

Here are the states used for number processing:

   S_seen_zero,     // non-final, output of start state on '0'
   S_saw_digit,     // final, output of start state on '1'-'9'
   S_lone_zero,     // final, output of S_seen_zero if next char isn't part of numeric literal
   S_hex_start,     // final, output of S_seen_zero if next char is 'x'
   S_octal_start,   // final, output of S_seen_zero if next char is '0'-'9'
   S_float_start,   // final, output of S_seen_zero if next char is '.'

In each case, the code in the switch then parses the input, e.g. using strtod() or strtol(), with special code to recognize overflows so it can warn about them.
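For instance, the hex case might look roughly like this; it's a sketch with invented names (warn_at, token_int_value), and it relies on <stdlib.h> and <errno.h>:

      case S_hex_start: {
         char *end;
         errno = 0;
         // p_token_start points at the leading '0'; strtoul accepts the 0x prefix
         unsigned long v = strtoul(p_token_start, &end, 16);
         if (errno == ERANGE)
            warn_at(p_token_start, "hex literal too large");
         token_int_value = v;
         p_src = end;            // resume lexing just past the literal
         return TOKEN_int_literal;
      }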

Identifier processing

In the lcc hand-coded lexer, keywords like 'if' and 'int' are processed by having the first-character switch() have an 'i' case which then parses these tokens out; meanwhile, user-defined names like 'in_token' are handled with a separate hash table. In hand-coded lexers that I write, I instead generally let the keywords go through the same path as identifiers; the symbol table stores the keywords and their corresponding token values. This may be less efficient for keyword processing since the hash table path will scan the identifier multiple times, but it may still be as fast as the branch-heavy lcc approach. (And in C code, keywords are probably much rarer than identifiers; and certainly it makes very little sense for 'int' and 'int32_t' to go through different paths.)
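As a sketch of that arrangement, with a made-up symbol-table API, the keywords are inserted once at startup with their token codes, and the identifier path just returns whatever the table hands back:

   // made-up symbol-table API; the point is just that keywords are inserted
   // up front with their token codes
   static const struct { const char *name; int token; } keywords[] = {
      { "if", TOKEN_if }, { "int", TOKEN_int }, { "return", TOKEN_return },
      // ... etc ...
   };

   void seed_keywords(void)
   {
      int i;
      for (i = 0; i < (int) (sizeof(keywords)/sizeof(keywords[0])); ++i)
         symbol_table_insert(keywords[i].name, strlen(keywords[i].name),
                             keywords[i].token);   // insert returns the stored token
   }

   int lookup_identifier(char *start, int len)
   {
      int token = symbol_table_find(start, len);    // 0 if not present
      if (token == 0)
         token = symbol_table_insert(start, len, TOKEN_identifier);
      return token;              // either a keyword's token or TOKEN_identifier
   }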

In the state-machine based approaches, however, parsing out keywords with the state-machine might make more sense, although it will add more states and more character equivalence classes to handle it, thus growing the tables significantly, so I didn't do it here (or for the flex or stb_lex parsers).

My hand-coded lexer actually has two different implementations of identifiers. In the first version, which I believe (but am not sure) is the version used to compute the times in the table at the beginning, as soon as a single identifier character is seen, we break out of the loop and jump to the identifier handler. The handler parses out the identifier while simultaneously computing the hash value (with my standard "one-at-a-time hash" which uses a circular-shift by 7 and an add of the new character), thus avoiding traversing the identifier multiple times during lexing. (No actual hash-table lookup occurs at this point; that will require at least one string comparison.)
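From that description, the identifier handler looks roughly like this; it's a sketch, and is_id_char[], the 32-bit hash width, and lookup_identifier_hashed are assumptions:

      case STATE_saw_identifier_char: {
         // scan the identifier and hash it in the same pass:
         // rotate-left by 7, then add the new character
         char *start = p_src - 1;           // first character of the identifier
         char *p = start;
         unsigned int hash = 0;
         while (is_id_char[(unsigned char) *p]) {
            hash = ((hash << 7) | (hash >> 25)) + (unsigned char) *p;
            ++p;
         }
         p_src = p;                         // continue lexing after the identifier
         return lookup_identifier_hashed(start, (int) (p - start), hash);
      }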

The second one actually uses the main state machine loop to parse the identifier and compute a (weaker) hash value, but I believe this is not the one I used. It does this by adding the following code to the inner loop:

      hash &= hashmask[state];   // hashmask[] is all-ones in the identifier state, 0 otherwise
      hash = 2*hash + lastchar;  // lastchar is the character just consumed, i.e. p_src[-1]

Obviously the hash function chosen here was designed to be as fast as possible (it can be done with a single x86 LEA), even though it is kind of terrible about forgetting history. In 64-bit, though, it will still have a full history of the last 56 characters, so it's certainly viable.

There is also only one state (the identifier state) which doesn't mask it to 0, so the first line could possibly be done more efficiently with a conditional move, but expressing this robustly in C is tricky.
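One branchless way to write it is below, assuming the identifier state is called S_identifier and the hash is unsigned; whether the compiler actually turns this into a conditional move (or anything better than the table load) is up to the compiler:

      // all-ones when we're in the identifier state, zero otherwise,
      // replacing the hashmask[] lookup; S_identifier is a made-up name
      hash &= (unsigned long long) 0 - (state == S_identifier);
      hash  = 2*hash + lastchar;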

Symbol processing

Let's consider the parsing of one-character and two-character symbol-based tokens like '+=' and '<'.

There are three general strategies we can use for parsing these fixed-length multi-character tokens:

  1. fully parse them and end up with unique states for each token
  2. fully parse them and end up with shared states that need decoding
  3. incompletely parse them and do clean up at the end

The strategy taken in the "handcoded" parser is #2. This allows us to shrink the table for better cache performance, although it is not small enough for the transition-table entries to be bytes; instead they are shorts.

I did later explore a version of strategy #3 which allowed transitions to fit in bytes, but I don't have performance results for that parser.

Here's the set of character equivalence classes I used:

   C_white,      // space, tab
   C_cr,
   C_lf,
   C_equable,    // ^ !
   C_equal,      // =
   C_letter,     // A-Za-z_
   C_x,          // x X
   C_zero,       // 0
   C_digit,      // 1-9
   C_plus,
   C_and,
   C_or,
   C_hash,
   C_hyphen,
   C_greater,
   C_lesser,
   C_slash,
   C_star,
   C_misc,        // all symbols not otherwise listed
   C_period,
   C_single_quote,
   C_double_quote,
   C_eof,

Note that basically every symbol character gets its own equivalence class; this is needed so we can precisely parse every token. For example, although '-' and '+' are similar, with += and -= and ++ and --, they are also different, since -> is a token but +> isn't. The only exceptions are '^' and '!', which can only appear by themselves or followed by '=' (^= and !=), so they're placed in the same equivalence class. This wouldn't work for a lexer of type #1, but it works for #2 and #3, as I'm about to explain.

For a lexer of type #1, I would need a distinct output state for every distinct multi-character symbol token. For this lexer, however, I use many fewer states. The states distinctly encode which initial characters have been seen (distinguishing the state of having seen '+' from '-'), but allow them to collapse to a single state meaning e.g. "I've seen a two-character token with one character of lookahead" (e.g. for ">>") or "I've seen a two-character token with no characters of lookahead" (e.g. for "->").

The processing routines for these final states then compute the final token value. For a single-character token, they just fetch the character out of the input buffer and return that, since we are using the character values as the token values. For a two-character token, the code does another table lookup, this time on a trivial perfect hash of the last two characters, to get the final token. Since the set of two characters that might be in the buffer are restricted to the set of characters that can transition there, simply adding the ASCII values of the two characters produces a perfect hash function ranging from 76 ("&&") to 248 ("||"), so this table lookup uses 173 bytes (or 346 bytes if the token types returned are >= 256).
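A sketch of that decode, for the no-lookahead case; the table and state names are invented, and the table is seeded at startup from the list of two-character tokens:

      // two_char_token[] has 173 entries (or shorts, if token values exceed a byte)
      // and is seeded at startup, e.g. two_char_token['<' + '<' - 76] = TOKEN_shift_left;
      case STATE_two_char_no_lookahead: {
         // p_src points just past the two characters of the token
         int sum = (unsigned char) p_src[-2] + (unsigned char) p_src[-1];
         return two_char_token[sum - 76];   // 76 == '&' + '&', the smallest sum
      }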

For parsing other languages with different two-character tokens, if adding the characters isn't sufficient to distinguish all tokens, a more complicated perfect hash function could be constructed, or extra states can be introduced so there are no collisions with the "add the characters" hash function, i.e. collisions are routed to different states that use different tables. In the worst case, a full table covering all allowable characters could be created.

Improving beyond the original results

There are a number of techniques which you might experiment with to improve on the lexer described above; whether these techniques are an improvement might depend on the interaction with other features and with the architecture and micro-architecture of the processor you're running on.

An orthogonal tactic

A strategy I use in most lexers I write these days, including stb_c_lexer.h, is to omit line-counting entirely. For the above hand-coded lexer, this removes one line from the inner loop (a table lookup and an add) and frees up a register.

The requirement from the parser is that the lexer needs to associate with each token information sufficient to present diagnostics in the case of errors or warnings. Normally this is done by counting lines, and tagging each input token with the line number (and filename) that the token came from. For better diagnostics, the character position within the line may also be recorded, which requires counting that as well.

An alternative strategy is to simply tag the token with the offset from the beginning of the file where the token started. (For example, in terms of the above handcoded lexer, this might be something like computing file_offset = p_token_begin - p_file_start; and associating it with a token.) If and only if diagnostics are required, we can re-scan the file from the beginning to determine which line number (and possibly what character offset within the line) that position in the file is located at.
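Here's a sketch of that recomputation, ignoring '#line' directives and CR-only line endings (which a real version would have to handle, as noted below):

   // given a byte offset into the file, recover a 1-based line and column
   // by rescanning from the start of the file
   void offset_to_line_col(const char *file_start, long offset, int *line, int *col)
   {
      long i, line_start = 0;
      int n = 1;
      for (i = 0; i < offset; ++i) {
         if (file_start[i] == '\n') {
            ++n;
            line_start = i + 1;
         }
      }
      *line = n;
      *col  = (int) (offset - line_start) + 1;
   }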

If displaying diagnostics is rare, saving the line-number processing (and offset-within-line processing) can be an overall performance improvement. If the only diagnostics are errors, and you stop after one error, it's totally worth it. If you print 20 or 50 errors, you'll want the diagnostic calculator to cache the results (e.g. only ever perform a single line-number-calculating scan per file). And there are some other downsides:

Note that the reparse-for-diagnostics doesn't need to do a proper parse, it mostly just has to count lines and offsets while scanning, but it does have to do '#line' and '#file' processing. On the other hand, the lex pass itself can ignore '#line' and '#file'.