2007-06-11 – Parsing expression grammars

PEGs are an interesting new formalism for grammars. Unlike most older formalisms, which are based on Chomsky’s generative grammars, their starting point is recognizing strings that are in a language, not generating them. As such they are a closer match to what we usually want a grammar for. The practical effect of this is that they naturally avoid common ambiguities without external rules, such as C’s if/else ambiguity or the various rules about greediness imposed on regexes (e.g. perl’s matching rules versus POSIX’s longest-leftmost rule, discussed in Friedl’s book). Even though PEGs can recognize some non-context-free languages (e.g. anbncn) they can be matched in linear time using a packrat parser (which can be implemented very beautifully in Haskell).

Bryan Ford’s 2004 POPL paper establishes the formal foundations of PEGs and defines a concrete syntax for them, fairly similar to ABNF. The key differences are: the choice operator is ordered (prefers to match its left-hand sub-expression); repetition operators are maximally greedy and don’t backtrack (so the second a in a*a can never match); and it includes positive and negative lookahead assertions of the form &a and !a (like (?=a) and (?!a) in perl).

It occurs to me that there is a useful analogy hidden in here, that would be made more obvious with a little change in syntax. Instead of a / b write a || b, and instead of &a b write a && b. With || and && I am referring to C’s short-cutting “orelse” and “andalso” boolean operators - or rather the more liberal versions that can return more than just a boolean, since a PEG returns the amount of input matched when it succeeds. This immediately suggests some new identities on grammars based on De Morgan’s laws, e.g. !a || !b === !(a && b). Note that !!a =/= a because the former never consumes any input, so not all of De Morgan’s laws work with PEGs. This also suggests how to choose the operators to overload when writing a PEG parser combinator library for C++ (which has a much wider range of possibilities than Lua).

⇐ 2007-06-05 ⇐ Opening new emacs frames ⇐ ⇒ A case for Lua ⇒ 2007-06-22 ⇒