dotat unifdef: expand macros, reduce #ifdefs

Some notes on differences between unifdef version 2 and version 3.

Improvements in unifdef 3

The main new features are macro expansion and expression simplification.

Newer C++ features such as binary integer literals, raw string literals, and user-defined literals are now handled properly.

Instead of working one line at a time, unifdef now loads the whole file into memory, so it can handle preprocessor directives that span multiple lines.
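
As an illustration of the kind of input this is about, here is a sketch of a single #if directive spread over three physical lines, once by backslash-newline and once by a block comment. The macro and function names (FEATURE_A, FEATURE_B, MULTILINE_RESULT, multiline_result, check_multiline) are made up for this example and do not come from unifdef.

```c
#include <assert.h>

#define FEATURE_A 1
#define FEATURE_B 0

/* One #if directive written across three physical lines: the first
 * line is continued with backslash-newline, and the block comment on
 * the second line carries the directive on to the third. */
#if defined(FEATURE_A) && \
    FEATURE_A /* a comment inside a directive
                 may itself span lines */ && !FEATURE_B
#define MULTILINE_RESULT 1
#else
#define MULTILINE_RESULT 0
#endif

/* Reports which branch the compiler's preprocessor selected. */
int multiline_result(void)
{
    return MULTILINE_RESULT;
}

/* Aborts unless the three lines were read as one directive. */
void check_multiline(void)
{
    assert(multiline_result() == 1);
}
```

A tool that reads one line at a time sees only the first fragment of the controlling expression here, so it cannot evaluate the directive reliably.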

The expression evaluator is more correct: for instance, it uses intmax_t instead of long, and it knows about signed vs unsigned arithmetic and undefined behaviour. It now supports ?: and character constants.
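
The following sketch shows what these rules look like in #if expressions as a C compiler evaluates them; it illustrates the standard semantics the evaluator aims to match, not unifdef's own code. All macro and function names are made up for this example.

```c
#include <assert.h>

/* In #if, integers have type intmax_t or uintmax_t, so this shift is
 * well defined even where int or long is only 32 bits wide. */
#if (1 << 40) > 0
#define WIDE_SHIFT_OK 1
#else
#define WIDE_SHIFT_OK 0
#endif

/* Signed arithmetic gives -1, but with an unsigned operand the
 * subtraction wraps around to the largest uintmax_t value. */
#if (0 - 1) < 0 && (0u - 1) > 0
#define SIGNEDNESS_OK 1
#else
#define SIGNEDNESS_OK 0
#endif

/* The conditional operator and character constants are both allowed
 * in a controlling expression. */
#if ('a' < 'b' ? 10 : 20) == 10
#define CONDITIONAL_OK 1
#else
#define CONDITIONAL_OK 0
#endif

int wide_shift_ok(void)  { return WIDE_SHIFT_OK; }
int signedness_ok(void)  { return SIGNEDNESS_OK; }
int conditional_ok(void) { return CONDITIONAL_OK; }

/* Aborts if the compiler's preprocessor disagrees with the comments. */
void check_if_arithmetic(void)
{
    assert(wide_shift_ok() && signedness_ok() && conditional_ok());
}
```

An evaluator that computes in long can get the first test wrong on a platform where long is 32 bits, and one that ignores the u suffix gets the second test wrong.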

Regressions in unifdef 3

Support for non-C-like languages has been dropped, along with the -iD, -iU, and -t options, which (selectively) disabled lexing of strings and comments.

Complement mode, -c, has been dropped. It reversed which lines were deleted and which were kept. (You can script it up with comm(1).)

Short-circuit evaluation of && and || cannot be disabled.
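
Short-circuiting matters in a partial evaluator because a symbol that is neither defined nor undefined on the command line has an unknown value. Below is a minimal sketch of short-circuit logic over three truth values. It assumes a simple enum representation, and the names tri, tri_and, tri_or, and tri_demo are hypothetical, not taken from the unifdef source.

```c
#include <assert.h>

/* Hypothetical three-valued truth type for a partial evaluator. */
enum tri { TRI_FALSE, TRI_TRUE, TRI_UNKNOWN };

/* a && b: a known-false operand decides the result, even when the
 * other operand is unknown. */
enum tri tri_and(enum tri a, enum tri b)
{
    if (a == TRI_FALSE || b == TRI_FALSE)
        return TRI_FALSE;
    if (a == TRI_TRUE && b == TRI_TRUE)
        return TRI_TRUE;
    return TRI_UNKNOWN;
}

/* a || b: a known-true operand decides the result, even when the
 * other operand is unknown. */
enum tri tri_or(enum tri a, enum tri b)
{
    if (a == TRI_TRUE || b == TRI_TRUE)
        return TRI_TRUE;
    if (a == TRI_FALSE && b == TRI_FALSE)
        return TRI_FALSE;
    return TRI_UNKNOWN;
}

/* A few cases where short-circuiting resolves or fails to resolve
 * an expression with an unknown operand. */
void tri_demo(void)
{
    assert(tri_and(TRI_FALSE, TRI_UNKNOWN) == TRI_FALSE);
    assert(tri_or(TRI_UNKNOWN, TRI_TRUE) == TRI_TRUE);
    assert(tri_and(TRI_TRUE, TRI_UNKNOWN) == TRI_UNKNOWN);
}
```

Disabling short-circuiting would amount to returning TRI_UNKNOWN whenever either operand is unknown. That stricter behaviour is what can no longer be selected.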

Changes in unifdef 3

Diagnostics are different. Error exit codes are more specific.

The -d debugging/diagnostics option now takes an argument.

Why rewrite unifdef?

Version 2 works line-at-a-time, which makes it hard to handle C well. So old unifdef has a bunch of limitations related to preprocessor directives that span multiple lines. And there's a load of extra complexity involved in detecting when these limitations are triggered, and handling them gracefully. The unifdef parser state machine is at least twice as big because of this.

One of the worst parts of C's lexical syntax is that backslash-newline can occur anywhere. Old unifdef's line-at-a-time design means it has to give up when it encounters backslash-newline, so the rest of its lexer does not have to deal with the full implications. This means it isn't just a matter of better buffering to change the old code to work better with multi-line preprocessor directives: lots of other code can't support it either.
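
To make "anywhere" concrete, here is a small sketch of valid C in which backslash-newline splits a directive name, a macro name, a number, and a keyword. The names SPLIT_NAME, split_name, and check_split_name are made up for this example.

```c
#include <assert.h>

/* Line splicing happens before tokens are formed, so the next four
 * lines are read as "#define SPLIT_NAME 42". The continuation lines
 * must start in column one: leading spaces would break the tokens
 * apart. */
#def\
ine SPLIT_\
NAME 4\
2

/* The same splicing applies in ordinary code: this is "return". */
int split_name(void)
{
    ret\
urn SPLIT_NAME;
}

/* Aborts unless the spliced pieces were joined as described. */
void check_split_name(void)
{
    assert(split_name() == 42);
}
```

A lexer that wants to handle this has to be prepared for a splice between any two characters, which is why it affects every part of the scanner and not only the buffering.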

And there are some embarrassing shortcuts. I think the worst one is that strings are treated the same as comments, because they can't legitimately occur in #if directives, so it's convenient to pretend strings don't exist in a similar way to comments.

But despite being a bit crappy, unifdef is successful and its limitations don't stop it being useful on real-world code. And it's economical, about 1300 lines of code (not counting comments and blank lines).

The main features I want are under the headline idea of a "partial preprocessor", i.e. macro expansion and expression simplification. They both require infrastructure that the old code lacks.

My other aim is a bit more esoteric: to make unifdef conform much more closely to the standards (de facto as well as de jure). The success of old unifdef shows that this isn't necessary for a tool to be useful, but old unifdef definitely needs manual help in difficult situations. Other authors of C source analysis tools have written about the difficulties of getting a tool that works in the lab to be sufficiently trouble-free in the real world. Maybe unifdef is too simple for it to have this problem, and the effort to improve it will be a waste; or maybe it's so obviously limited that it doesn't get pushed hard. Maybe we'll find out which...

About version numbers

In 2002 I started working on unifdef, using CVS because that was the version control system used by the various BSDs. In the first few years, its release version number was 1.NNN, which was just the CVS revision of unifdef.c. There's a tradition (going back to SCCS) of embedding the version control revision number in the source file, and unifdef used this to include a version string in the binary that could be read by SCCS what or RCS ident.

In 2010, I uplifted unifdef to git, which does not have CVS-style revision numbers. So I replaced the CVS $Keyword$ tags with manufactured ones containing the output from git describe, and I decided to bump the major version to 2.N.

The v1/v2 major version bump was partly for administrivial reasons, but it also made better historical sense. Dave Yost's pre-ANSI-C unifdef was clearly version 1, and my rewrite was version 2. But I was 8 years late in applying that logic, partly because unifdef v2 evolved directly from unifdef v1.

Version 3 is a complete rewrite, and drops some command line options, so the major version number bump is fully justified. (And you can still see it using what or ident.)