C preprocessor expressions

With reference to the C standard, describe the syntax of controlling expressions in C preprocessor conditional directives.

I've started work on a "C preprocessor partial evaluator". My aim is to re-do unifdef with better infrastructure than 1980s line-at-a-time stdio, so that it's easier to implement a more complete C preprocessor. The downside, of course, is that the infrastructure (Lua and Lpeg) is more than ten times bigger than the whole of unifdef.

The main feature I want is macro expansion; the second feature I want is #if expression simplification. The latter leads to the question above: exactly what is allowed in a C preprocessor conditional directive controlling expression? This turns out to be more tricky than I expected.

What actually triggered the question was that I "know" that sizeof doesn't work in preprocessor expressions because "obviously" the preprocessor doesn't know about details of the target architecture, but I couldn't find where in the standard it says so.

My reference is ISO JTC1 SC22 WG14 document N1570, which is very close to the final committee draft of ISO 9899:2011, the C11 standard.

Preprocessor expressions are specified in section 6.10.1 "Conditional inclusion". Paragraph 1 says:

The expression that controls conditional inclusion shall be an integer constant expression except that: identifiers (including those lexically identical to keywords) are interpreted as described below;¹⁶⁶⁾ and it may contain [defined] unary operator expressions [...]

¹⁶⁶⁾ Because the controlling constant expression is evaluated during translation phase 4, all identifiers either are or are not macro names — there simply are no keywords, enumeration constants, etc.

The crucial part that I missed is the parenthetical "including [identifiers] lexically identical to keywords" - this applies to the sizeof keyword, as footnote 166 obliquely explains.

... A brief digression on "translation phases". These are specified in section 5.1.1.2, which lists 8 phases. Now, if you have done an undergraduate course in compilers or read the dragon book, you might expect this list to include things like lexing, parsing, symbol tables, something about translation to and optimization of object code, and something about linking separately compiled units. And it does, sort of. But whereas compilers are heavily weighted towards the middle and back end, C standard translation phases focus on lexical and preprocessor matters, to a ridiculous extent. I find this imbalance quite funny (in a rather dry way) - after such a lengthy and detailed build-up, the last two items in the list are almost, "and then a miracle occurs", especially the last sentence in phase 7 which is a bit LOL WTF.

6. Adjacent string literal tokens are concatenated.

7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.

8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
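To make the early phases a bit more concrete, here is a minimal sketch of phases 4 and 6 cooperating. The macro GREETING and the function greeting() are names invented for this illustration, not anything from the standard.

```c
/* GREETING is a hypothetical macro, made up for this example. */
#define GREETING "hello, "

/* Phase 4 expands GREETING to the string literal "hello, ".
   Phase 6 then concatenates the two adjacent string literals,
   so phase 7 only ever sees a single literal. */
const char *greeting(void)
{
    return GREETING "world";
}
```

The ordering matters: the macro has to be expanded before there are two adjacent literals to join.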

So, anyway, what footnote 166 is saying is that in the preprocessor there is no such thing as a keyword - the preprocessor has a sketchy lexer that produces "preprocessing-token"s (as specified in section 6.4) which are a simplified and looser approximation of the compiler's "token"s, which mostly don't turn up until translation phase 7.
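One way to see how sketchy that lexer is: the following sketch stringifies something that is a perfectly good preprocessing token but could never be converted into a phase 7 token. STR and odd_pp_number() are hypothetical names chosen for this example.

```c
/* STR is a hypothetical helper macro: the # operator turns the
   preprocessing tokens of its argument into a string literal
   during phase 4. */
#define STR(x) #x

/* 1.2.3abc is a single pp-number according to the grammar in
   section 6.4.8, even though it is not a valid integer or floating
   constant. It is stringified before phase 7 tries to convert it
   into a token, so this compiles. */
const char *odd_pp_number(void)
{
    return STR(1.2.3abc);
}
```

If the same text appeared outside the macro argument, the conversion from preprocessing token to token in phase 7 would fail.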

Paragraph 1 said identifiers are interpreted as described below, which refers to this sentence in section 6.10.1 paragraph 4:

After all replacements due to macro expansion and the defined unary operator have been performed, all remaining identifiers (including those lexically identical to keywords) are replaced with the pp-number 0, and then each preprocessing token is converted into a token. The resulting tokens compose the controlling constant expression which is evaluated according to the rules of 6.6.
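A small sketch of what this replacement means for enumeration constants, which footnote 166 mentions. The names ANSWER, PP_SEES_ANSWER and the two functions are invented for illustration.

```c
/* ANSWER is a hypothetical enumeration constant. It only comes into
   existence in phase 7, long after #if has been evaluated. */
enum { ANSWER = 42 };

/* In phase 4, ANSWER is just an identifier that is not a macro name,
   so it is replaced by 0 and the condition reads 0 == 42. */
#if ANSWER == 42
#define PP_SEES_ANSWER 1
#else
#define PP_SEES_ANSWER 0
#endif

/* What the preprocessor concluded. */
int pp_sees_answer(void)
{
    return PP_SEES_ANSWER;
}

/* What the compiler proper concludes from the same comparison. */
int compiler_sees_answer(void)
{
    return ANSWER == 42;
}
```

The two functions disagree: the preprocessor's version returns 0 and the compiler's version returns 1.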

This means that if you try to use a keyword (such as sizeof) in a preprocessor expression, it gets replaced by zero and (usually) turns into a syntax error. And this is why compilers produce less-than-straightforward error messages like 'error: missing binary operator before token "("' if you try to use sizeof.
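Here is a sketch of the cases where the replacement does not produce a syntax error, which makes the zero visible. KEYWORDS_BECAME_ZERO and keywords_became_zero() are made-up names; some compilers can warn about this sort of thing if asked (for example with an option such as -Wundef).

```c
/* Neither int nor unsigned is a macro name, so in phase 4 each one is
   replaced by 0 and the condition reads: 0 == 0 && 0 + 1 == 1 */
#if int == 0 && unsigned + 1 == 1
#define KEYWORDS_BECAME_ZERO 1
#else
#define KEYWORDS_BECAME_ZERO 0
#endif

/* A condition such as sizeof(int) == 4 is deliberately left out:
   after the replacement it no longer parses as an expression. */

int keywords_became_zero(void)
{
    return KEYWORDS_BECAME_ZERO;
}
```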

Smash-keyword-to-zero has another big implication which is a bit more subtle. Section 6.6 specifies constant expressions, and paragraphs 3 and 6 are particularly relevant to the preprocessor.

3 Constant expressions shall not contain assignment, increment, decrement, function-call, or comma operators, except when they are contained within a subexpression that is not evaluated.

It is normal for real preprocessor expression parsers to implement a simplified subset of the C expression syntax which simply lacks support for these forbidden operators. So, if you put sizeof(int) in a preprocessor expression, that gets turned into 0(0) before it is evaluated, and you get an error about a missing binary operator. If you write 0(0) itself where the compiler expects an integer constant expression, you will get errors complaining that integers are not functions or that function calls are not allowed in integer constant expressions.
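For contrast, here is a sketch of how a size check is commonly written on each side of the phase 4 / phase 7 divide. This is a common workaround rather than anything the text above prescribes; INT_HAS_32_BITS and int_has_32_bits() are invented names. It relies on section 5.2.4.2.1, which requires the <limits.h> macros to be usable in #if directives.

```c
#include <limits.h>

/* Phase 4: sizeof is unavailable, but INT_MAX is a macro that expands
   to something the preprocessor can evaluate. */
#if INT_MAX >= 2147483647
#define INT_HAS_32_BITS 1
#else
#define INT_HAS_32_BITS 0
#endif

/* Phase 7: sizeof is a real operator here, so the check can be
   written directly (C11 _Static_assert). */
_Static_assert(sizeof(int) * CHAR_BIT >= 16, "int is at least 16 bits");

int int_has_32_bits(void)
{
    return INT_HAS_32_BITS;
}
```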

6 An integer constant expression shall have integer type and shall only have operands that are integer constants, enumeration constants, character constants, sizeof expressions whose results are integer constants, _Alignof expressions, and floating constants that are the immediate operands of casts.

Re-read this sentence from the point of view of the preprocessor, after identifiers and keywords have been smashed to zero. There aren't any enumeration constants, because they are identifiers, which is to say zero. Similarly there aren't any sizeof or _Alignof expressions. And there can't be any casts because you can't write a type without at least one identifier. (One situation where smash-keyword-to-zero does not cause a syntax error is an expression like (unsigned)-1, which becomes (0)-1, and I bet that turns up in real-world preprocessor expressions.) And since there can't be any casts, there can't be any floating constants.
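The (unsigned)-1 case can be sketched like this, comparing what the preprocessor and the compiler make of the same text. PP_CAST_RESULT and the two function names are hypothetical.

```c
/* In phase 4, unsigned is replaced by 0, so the condition is
   (0)-1 > 0, which is -1 > 0: false. */
#if (unsigned)-1 > 0
#define PP_CAST_RESULT 1
#else
#define PP_CAST_RESULT 0
#endif

/* What the preprocessor decided. */
int pp_cast_result(void)
{
    return PP_CAST_RESULT;
}

/* In phase 7 the same text is a real cast: (unsigned)-1 is UINT_MAX,
   which is greater than zero. */
int compiler_cast_result(void)
{
    return (unsigned)-1 > 0;
}
```

So the expression is accepted in both places but silently means something different, which is arguably worse than a syntax error.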

And therefore the preprocessor does not need any floating point support at all.

I am slightly surprised that such a fundamental simplification requires such a long chain of reasoning to obtain it from the standard. Perhaps (like my original question about sizeof) I have overlooked the relevant text.

Finally, my thanks to Mark Wooding and Brian Mastenbrook for pointing me at the crucial words in the standard.