The other day I learned about the Rust crate lexopt, which describes itself as,
“A pathologically simple command line argument parser.
Most argument parsers are declarative: you tell them what to parse, and they do it. This one provides you with a stream of options and values and lets you figure out the rest.”
For “pathologically simple” I still rather like getopt(3) despite its lack of support for long options. Aaron S Cohen wrote getopt around 1979, and it was released into the public domain by AT&T in 1985. A very useful 50-ish lines of code! It still has almost everything required by POSIX nearly four decades later.
But the description of lexopt made me think getopt() could be simpler. The insight is that the string of options that you have to pass to getopt() is redundant with respect to the code that deals with the return values from getopt(). What if you just get rid of the options string?
I thought I would try it. Turns out, not much is lost in getting rid of the options string, and a few things are gained.
My new code is about half the size of getopt(), and has more functionality. I’m going to show how this was done, because it’s short (ish), not because it is interesting. Then I’ll try to tease out a lesson or two.
examples
A typical getopt() man page example starts,
while ((ch = getopt(argc, argv, "bf:")) != -1) {
    switch (ch) {
It’s one of the uglier loop conditions that commonly occur in C. There’s a minor footgun if you declare char ch on systems where char is unsigned, because ch can then never compare equal to -1, but that’s rare. There’s also an issue that getopt() has generally been vague about how to restart the loop – it doesn’t expose all the necessary internal state.
My alternative function for getting an option from argv is called argv_opt(). Its loop header looks like,
for (char opt = argv_opt(argv); opt; opt = argv_opt(NULL)) {
    switch (opt) {
A slightly longer but more basic idiom. A straightforward way to restart the loop. A more sensible return value when the loop is finished: '\0' cannot be a valid option whereas -1 could (in the 1990s) have been 'ÿ', which is byte 0xFF in Latin-1.
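As a rough illustration of why -1 is an awkward end marker for a char, here is a small sketch. The helper names are made up for this example, and the signed result assumes the usual 8-bit two's-complement char.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: the value a loop condition compares against -1
 * after getopt()'s int result has been stored in an unsigned char and
 * promoted back to int. */
static int seen_through_unsigned_char(int result) {
    unsigned char ch = (unsigned char)result;
    return ch;
}

/* Hypothetical helper: the same round trip through a signed char.
 * Assumes an 8-bit two's-complement char. */
static int seen_through_signed_char(int result) {
    signed char ch = (signed char)result;
    return ch;
}

/* Self-check of the claims in the text. */
static void sentinel_demo(void) {
    /* Unsigned char: -1 turns into UCHAR_MAX, so "!= -1" is always true
     * and the option loop would never finish. */
    assert(seen_through_unsigned_char(-1) == UCHAR_MAX);
    assert(seen_through_unsigned_char(-1) != -1);

    /* Signed char: the Latin-1 byte for 'ÿ' (0xFF) reads back as -1,
     * so an option "-ÿ" would look like the end of the options. */
    assert(seen_through_signed_char(0xFF) == -1);

    /* '\0' terminates every argv string, so it cannot appear as an
     * option character and is unambiguous as an end marker. */
    assert(seen_through_signed_char('\0') == 0);
}
```

Calling sentinel_demo() from a test or from main() runs the checks; nothing is printed when they pass.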
When an option needs a value, getopt() sets optarg for you:
    case 'f':
        file = optarg;
        break;
However argv_opt() doesn’t know if an option needs a value. You have to tell it by calling argv_val() to consume the value – and you might find out it is missing:
    case('f'):
        file = argv_val();
        if(file == NULL)
            argv_err(usage, opt, "requires an argument");
        break;
By default, getopt() prints error messages for you, but you still have to handle the error. The usual way is to write a usage() function, or write it inline like,
    case '?':
        fprintf(stderr,
            "usage: example [-b] [-f file] args...\n");
        exit(1);
My functions include a small helper that prints the message and exits.
static const char *usage =
    "example [-b] [-f file] args...";
// ... for ... switch ...
    case('h'):
        argv_err(usage, opt, NULL);
    default:
        argv_err(usage, opt, "is not recognized");
If options can conflict or if they are required, then argv_err() helps to keep error messages consistent.
    if(aflag && bflag)
        argv_err(usage, 'a', "conflicts with '-b'");
    if(inplace && !file)
        argv_err(usage, 'i', "requires '-f'");
implementation
The code is slightly more than half the length of trad getopt(). I decided it’s simpler not to try to be re-entrant.
#include <stdio.h>
#include <stdlib.h>
#include <stdnoreturn.h>
static char **vec;
static int arg, pos;
char argv_opt(char *argv[]) {
When (re-)starting the loop we normally want to skip the program name, but not if that would make us overrun the end of argv.
    if(argv != NULL) {
        vec = argv;
        arg = vec[0] ? 1 : 0;
        pos = 0;
    }
The pos variable keeps track of where we were inside a cluster of options like -ab. Move to the next argument when we reach the end of the cluster.
    if(pos > 0 && vec[arg][pos + 1] == '\0') {
        arg++;
        pos = 0;
    }
When we reach the end of the options, pos is set to -1. An argument that doesn’t start with - ends the options. An argument that is just "-" often indicates stdin, and needs to be returned as a non-option argument.
    if(pos < 0 || vec[arg] == NULL ||
       vec[arg][0] != '-' || vec[arg][1] == '\0') {
        pos = -1;
        return('\0');
    }
An argument "--" indicates the end of the options and is skipped.
    if(vec[arg][1] == '-' && vec[arg][2] == '\0') {
        arg++;
        pos = -1;
        return('\0');
    }
The previous two clauses are basically the same as the first half-dozen lines of original getopt(). Most of the rest of the code in getopt() is scanning the options string. But we don’t have that so we can just bump to the next option. This is where most of the code size reduction happens.
    return(vec[arg][++pos]);
}
Option values are a little more flavoursome than I showed in the examples: there’s also a function for getting an optional value, for things like sed -i in-place editing. Optional values require a GNU extension for getopt().
char *argv_optval(void) {
    char *val = &vec[arg][pos + 1];
    arg++;
    pos = 0;
    return(val);
}
To get a mandatory option value, look for an optional value, or if it isn’t present, use the next argument (taking care not to overrun).
char *argv_val(void) {
    char *val = (pos > 0) ? argv_optval() : "";
    return(val[0] ? val : vec[arg] ? vec[arg++] : NULL);
}
There’s not much to the error handler:
noreturn void
argv_err(const char *usage, char opt, const char *err) {
    if(err) fprintf(stderr, "option '-%c' %s\n", opt, err);
    fprintf(err ? stderr : stdout, "usage: %s\n", usage);
    exit(err ? 1 : 0);
}
That’s all!
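To check that the fragments fit together, here is one way they might be assembled into a single file along with the example loop from earlier. Three things are assumptions of this sketch rather than anything the article shows: the struct settings and parse_example() wrapper are hypothetical, the functions are marked static to keep them file-local, and the first non-option argument is found by reading the static arg directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdnoreturn.h>

static char **vec;   /* argument vector being scanned */
static int arg, pos; /* current argument; position inside a cluster */

static char argv_opt(char *argv[]) {
    if(argv != NULL) {                 /* (re-)start the scan */
        vec = argv;
        arg = vec[0] ? 1 : 0;
        pos = 0;
    }
    if(pos > 0 && vec[arg][pos + 1] == '\0') { /* end of a cluster */
        arg++;
        pos = 0;
    }
    if(pos < 0 || vec[arg] == NULL ||
       vec[arg][0] != '-' || vec[arg][1] == '\0') {
        pos = -1;                      /* non-option argument or end */
        return('\0');
    }
    if(vec[arg][1] == '-' && vec[arg][2] == '\0') {
        arg++;                         /* skip "--" */
        pos = -1;
        return('\0');
    }
    return(vec[arg][++pos]);
}

static char *argv_optval(void) {
    char *val = &vec[arg][pos + 1];    /* rest of the cluster, maybe "" */
    arg++;
    pos = 0;
    return(val);
}

static char *argv_val(void) {
    char *val = (pos > 0) ? argv_optval() : "";
    return(val[0] ? val : vec[arg] ? vec[arg++] : NULL);
}

static noreturn void
argv_err(const char *usage, char opt, const char *err) {
    if(err) fprintf(stderr, "option '-%c' %s\n", opt, err);
    fprintf(err ? stderr : stdout, "usage: %s\n", usage);
    exit(err ? 1 : 0);
}

/* Hypothetical result type for this sketch. */
struct settings {
    int bflag;
    const char *file;
    char **rest; /* first non-option argument (assumption: vec + arg) */
};

static const char *usage = "example [-b] [-f file] args...";

/* Hypothetical wrapper around the option loop from the examples. */
static struct settings parse_example(char *argv[]) {
    struct settings s = { 0, NULL, NULL };
    assert(argv != NULL);
    for(char opt = argv_opt(argv); opt; opt = argv_opt(NULL)) {
        switch(opt) {
        case('b'):
            s.bflag = 1;
            break;
        case('f'):
            s.file = argv_val();
            if(s.file == NULL)
                argv_err(usage, opt, "requires an argument");
            break;
        case('h'):
            argv_err(usage, opt, NULL);
        default:
            argv_err(usage, opt, "is not recognized");
        }
    }
    s.rest = vec + arg;
    return(s);
}
```

A main() would call parse_example(argv) and then walk s.rest until it reaches the terminating NULL. Both "-ffile" and "-f file" end up in s.file, and "--" is consumed rather than returned.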
discussion
When I understood the idea of lexopt, I thought that getting rid of getopt()’s semi-declarative option string might simplify things, but I was surprised by how much code could be deleted. There wasn’t much code to start with!
This experience reminds me of the end-to-end argument, which says,
“functions placed at low levels of a system may be redundant or of little value when compared with the cost of providing them at that low level.”
That is straightforwardly the thing that motivated this hack. And when I was refining argv_val() and argv_err() I repeatedly found that there was already logic in the caller that made features in the lower-level code redundant. Making the API less redundant meant there was less need for communication across the API boundary, so removing the redundant features caused cascading simplifications.
It’s interesting that most successors to getopt() have aimed to be more declarative, but what argv_opt() does is remove its most declarative feature. The name lexopt refers to parsing technology, treating the job of reading the command line as lexical analysis. The declarative approach adds a metalanguage that describes the command line language, and a metalanguage interpreter that runs the declarative description in order to interpret the command line. Two layers of little languages! That’s a lot of machinery for such a small task. No wonder removing the metalayer is simpler, because the lower layer isn’t (usually) complicated enough to justify so much machinery.
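For a sense of what the metalanguage interpreter amounts to at getopt()'s scale, here is a rough sketch of looking up one option in an options string such as "bf:", where a letter followed by a colon takes a value. The function and enum names are hypothetical, and this is not code from any real getopt(); it only illustrates the kind of scanning that argv_opt() no longer needs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of an option character. */
enum opt_kind { OPT_UNKNOWN, OPT_FLAG, OPT_VALUE };

/* Hypothetical helper: interpret a getopt()-style options string.
 * "bf:" declares -b as a plain flag and -f as taking a value. */
static enum opt_kind optstring_lookup(const char *optstring, char opt) {
    assert(optstring != NULL);
    /* ':' is syntax, not an option; '\0' would match the terminator. */
    if(opt == '\0' || opt == ':')
        return(OPT_UNKNOWN);
    const char *found = strchr(optstring, opt);
    if(found == NULL)
        return(OPT_UNKNOWN);
    return(found[1] == ':' ? OPT_VALUE : OPT_FLAG);
}
```

With argv_opt() the same three answers come from the caller's switch: a case that just sets a flag, a case that calls argv_val(), and the default case.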
I didn’t look at the lexopt implementation until after I wrote argv_opt(). It’s 1000 lines of code! Plus 1000 lines of tests! It’s not 2x bigger, it’s 20x bigger! As well as long options, I’m pretty sure it handles things I haven’t even thought of yet. (Good Rust code is like that.) But I suppose lexopt is comparing itself to clap, which is notorious for turning simple command line utilities into monsters.
Anyway, you can use the code in this article under 0BSD or MIT-0 licences, but I recommend you use something better instead – not C, or if you must C, standard getopt().