unifdef: expand macros, reduce #ifdefs
Some notes on differences between unifdef version 2 and version 3.
The main new features are macro expansion and expression simplification.
Newer C++ features such as binary integer literals, raw string literals, and user-defined literals are now handled properly.
Instead of working one line at a time, unifdef now loads the whole
file into memory, so it can handle preprocessor directives that span
multiple lines.
The expression evaluator is more correct: for instance, it uses
intmax_t instead of long, and it knows about signed vs unsigned
and undefined behaviour. It now supports ?: and character constants.
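As a rough illustration of why signedness matters in #if (this is standard C behaviour, not code taken from unifdef): since C99 the preprocessor evaluates integers as intmax_t or uintmax_t, so a comparison that mixes the two converts the signed operand to unsigned. The macro and function names below are invented for the example, and the character comparison assumes an ASCII-compatible character set.

```c
#include <stdint.h>

/* -1 is converted to uintmax_t here, so it becomes a huge positive value. */
#if -1 < 0u
#define MINUS_ONE_LESS_THAN_0U 1
#else
#define MINUS_ONE_LESS_THAN_0U 0
#endif

/* ?: and character constants are both valid in #if expressions. */
#if ('a' < 'b' ? 1 : 0)
#define TERNARY_ON_CHARS 1
#else
#define TERNARY_ON_CHARS 0
#endif

/* The same comparison spelled out with the types the preprocessor uses. */
static int same_comparison_at_run_time(void)
{
    intmax_t lhs = -1;
    uintmax_t rhs = 0;
    return (uintmax_t)lhs < rhs;
}
```

An evaluator that does everything in signed long would take the first branch of the first #if, which disagrees with what a conforming compiler does. Some compilers warn about the sign change in that directive; the warning is expected.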
Support for non-C-like languages has been dropped. This removes the -iD,
-iU, and -t options, which (selectively) disabled lexing of strings
and comments.
Complement mode, -c, which reversed which lines were deleted and
which were kept, has also been dropped. (You can script it up with comm(1).)
Short-circuit evaluation of && and || can no longer be disabled.
Diagnostics are different. Error exit codes are more specific.
The -d debugging/diagnostics option now takes an argument.
Version 2 works line-at-a-time, which makes it hard to handle C well.
So old unifdef has a bunch of limitations related to preprocessor
directives that span multiple lines. And there's a load of extra
complexity involved in detecting when these limitations are triggered,
and handling them gracefully. The unifdef parser state machine is at
least twice as big because of this.
One of the worst parts of C's lexical syntax is that backslash-newline
can occur anywhere. Old unifdef's line-at-a-time design means it has
to give up when it encounters backslash-newline, so the rest of its
lexer does not have to deal with the full implications. This means it
isn't just a matter of better buffering to change the old code to work
better with multi-line preprocessor directives: lots of other code
can't support it either.
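To make the backslash-newline problem concrete, here is a minimal sketch of line splicing as the C standard describes it in translation phase 2. It is not how either version of unifdef is implemented; splice_lines is a hypothetical helper, and it ignores complications such as CRLF line endings and trigraphs.

```c
#include <stddef.h>

/*
 * Copy src to dst, deleting every backslash-newline pair.
 * dst must have room for strlen(src) + 1 bytes.
 * Returns the length of the spliced text.
 */
static size_t splice_lines(const char *src, char *dst)
{
    size_t n = 0;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;           /* drop the splice entirely */
        } else {
            dst[n++] = *src++;  /* ordinary character: keep it */
        }
    }
    dst[n] = '\0';
    return n;
}
```

For example, the input "#de\\\nfine FOO 1\n" (a directive name split across two physical lines) splices to "#define FOO 1\n". A tool that rewrites source code cannot simply discard the splices like this, because it has to reproduce the original text in its output; that is part of why the lexer as a whole has to be aware of them.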
And there are some embarrassing shortcuts. I think the worst one is
that strings are treated the same as comments, because they can't
legitimately occur in #if directives, so it's convenient to pretend
strings don't exist in a similar way to comments.
But despite being a bit crappy, unifdef is successful and its
limitations don't stop it being useful on real-world code. And it's
economical, about 1300 lines of code (not counting comments and blank
lines).
The main features I want are under the headline idea of a "partial preprocessor", i.e. macro expansion and expression simplification. They both require infrastructure that the old code lacks.
My other aim is a bit more esoteric: to make unifdef conform much
more closely to the standards (de facto as well as de jure). The
success of old unifdef shows that this isn't necessary for a tool to
be useful, but old unifdef definitely needs manual help in difficult
situations. Other authors of C source analysis tools have written
about the difficulties of getting a tool that works in the lab to be
sufficiently trouble-free in the real world. Maybe unifdef is too
simple for it to have this problem, and the effort to improve it will
be a waste; or maybe it's so obviously limited that it doesn't get
pushed hard. Maybe we'll find out which...
In 2002 I started working on unifdef, using CVS because that was the
version control system used by the various BSDs. In the first few
years, its release version number was 1.NNN, which was just the CVS
revision of unifdef.c. There's a tradition (going back to SCCS) of
embedding the version control revision number in the source file, and
unifdef used this to include a version string in the binary that
could be read by SCCS what or RCS ident.
In 2010, I uplifted unifdef to git, which does not have CVS-style
revision numbers. So I replaced the CVS $Keyword$ tags with
manufactured ones containing the output from git describe, and I
decided to bump the major version to 2.N.
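The embedding convention can be sketched roughly as follows. This is an illustration of the general technique, not the actual contents of unifdef's source: the keyword name, the version value, and the get_version accessor are all made up. what(1) searches a file for the marker "@(#)", and ident(1) searches for patterns of the form "$Keyword: text $".

```c
/*
 * A version string that both what(1) and ident(1) can find in the
 * compiled binary. The value is a placeholder; in a git-based build
 * it might be generated from the output of git describe.
 */
static const char version_string[] =
    "@(#) $Version: example-1.0 $";

/* Referencing the string keeps the compiler from discarding it. */
static const char *get_version(void)
{
    return version_string;
}
```

With a string like this compiled in, running what or ident on the binary should print the embedded text.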
The v1/v2 major version bump was partly for administrivial reasons,
but it also made better historical sense. Dave Yost's pre-ANSI-C
unifdef was clearly version 1, and my rewrite was version 2. But I
was 8 years late in applying that logic, partly because unifdef v2
evolved directly from unifdef v1.
Version 3 is a complete rewrite, and deprecates some command line
options, so the major version number bump is fully justified.
(And you can still see it using what or ident.)