Tony Finch – blog


The other day I learned about the Rust crate lexopt which describes itself as,

A pathologically simple command line argument parser.

Most argument parsers are declarative: you tell them what to parse, and they do it. This one provides you with a stream of options and values and lets you figure out the rest.

For “pathologically simple” I still rather like getopt(3) despite its lack of support for long options. Aaron S Cohen wrote getopt in around 1979, and it was released into the public domain by AT&T in 1985. A very useful 50-ish lines of code! It still has almost everything required by POSIX nearly four decades later.

But the description of lexopt made me think getopt() could be simpler. The insight is that the string of options that you have to pass to getopt() is redundant with respect to the code that deals with the return values from getopt(). What if you just get rid of the options string?

I thought I would try it. Turns out, not much is lost in getting rid of the options string, and a few things are gained.

My new code is half the size or less of getopt(), and has more functionality. I’m going to show how this was done, because it’s short (ish), not because it is interesting. Then I’ll try to tease out a lesson or two.

examples

A typical getopt() man page example starts,

    while ((ch = getopt(argc, argv, "bf:")) != -1) {
        switch (ch) {

It’s one of the more ugly loop conditions that commonly occurs in C. There’s a minor footgun if you declare char ch on systems where char is unsigned, but that’s rare. There’s also an issue that getopt() has generally been vague about how to restart the loop – it doesn’t expose all the necessary internal state.
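
To make the footgun concrete, here is a minimal sketch that does not call getopt() at all. The names options_exhausted(), stops_with_int(), stops_with_unsigned_char() and demo_footgun() are made up for illustration. It only shows what happens to the -1 end marker when it is stored in an unsigned char, which is what plain char is on some ABIs.

```c
#include <assert.h>

/* Stand-in for a getopt() that has run out of options (hypothetical name). */
static int options_exhausted(void) {
    return -1;
}

/* Does the result still compare equal to -1 after being stored in `int`? */
int stops_with_int(void) {
    int ch = options_exhausted();
    return ch == -1;
}

/* Same question for `unsigned char`: -1 is converted to UCHAR_MAX on the
   way in, and the comparison then promotes that value back to int. */
int stops_with_unsigned_char(void) {
    unsigned char ch = (unsigned char)options_exhausted();
    int widened = ch;               /* the promotion the comparison applies */
    return widened == -1;
}

/* Self-check: only the int version would terminate the loop. */
void demo_footgun(void) {
    assert(stops_with_int() == 1);
    assert(stops_with_unsigned_char() == 0);
}
```

Declaring ch as int, as the man page examples do, avoids the problem whatever the signedness of char.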

My alternative function for getting an option from argv is called argv_opt(). Its loop header looks like,

    for (char opt = argv_opt(argv); opt; opt = argv_opt(NULL)) {
        switch (opt) {

A slightly longer but more basic idiom. A straightforward way to restart the loop. A more sensible return value when the loop is finished: '\0' cannot be a valid option whereas -1 could (in the 1990s) have been 'ÿ'.

When an option needs a value, getopt() sets optarg for you:

        case 'f':
            file = optarg;
            break;

However, argv_opt() doesn’t know if an option needs a value. You have to tell it by calling argv_val() to consume the value – and you might find out it is missing:

        case('f'):
            file = argv_val();
            if(file == NULL)
                argv_err(usage, opt, "requires an argument");
            break;

By default, getopt() prints error messages for you, but you still have to handle the error. The usual way is to write a usage() function, or write it inline like,

        case '?':
            fprintf(stderr,
                    "usage: example [-b] [-f file] args...\n");
            exit(1);

My functions include a small helper that prints the message and exits.

    static const char *usage =
        "example [-b] [-f file] args...";

    // ... for ... switch ...

        case('h'):
            argv_err(usage, opt, NULL);
        default:
            argv_err(usage, opt, "is not recognized");

If options can conflict or if they are required, then argv_err() helps to keep error messages consistent.

    if(aflag && bflag)
        argv_err(usage, 'a', "conflicts with '-b'");
    if(inplace && !file)
        argv_err(usage, 'i', "requires '-f'");

implementation

The code is slightly more than half the length of trad getopt(). I decided it’s simpler not to try to be re-entrant.

    static char **vec;
    static int arg, pos;

    char argv_opt(char *argv[]) {

When (re-)starting the loop we normally want to skip the program name, but not if that would make us overrun the end of argv.

        if(argv != NULL) {
            vec = argv;
            arg = vec[0] ? 1 : 0;
            pos = 0;
        }

The pos variable keeps track of where we were inside a cluster of options like -ab. Move to the next argument when we reach the end of the cluster.

        if(pos > 0 && vec[arg][pos + 1] == '\0') {
            arg++;
            pos = 0;
        }

When we reach the end of the options, pos is set to -1. An argument that doesn’t start with - ends the options. An argument that is just "-" often indicates stdin, and needs to be returned as a non-option argument.

        if(pos < 0 || vec[arg] == NULL ||
           vec[arg][0] != '-' || vec[arg][1] == '\0') {
            pos = -1;
            return('\0');
        }

An argument "--" indicates the end of the options and is skipped.

        if(vec[arg][1] == '-' && vec[arg][2] == '\0') {
            arg++;
            pos = -1;
            return('\0');
        }

The previous two clauses are basically the same as the first half-dozen lines of original getopt(). Most of the rest of the code in getopt() is scanning the options string. But we don’t have that so we can just bump to the next option.

This is where most of the code size reduction happens.

        return(vec[arg][++pos]);
    }
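
As a rough check of the walk through clusters, "--" and restarts, the sketch below repeats argv_opt() from above and adds two helpers that are not part of the article: argv_rest(), which assumes the static arg index is left pointing at the first non-option argument, and collect_opts(), which just records the option letters it is handed. Treat both as illustrative assumptions rather than part of the API.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parser state and argv_opt(), repeated from the article. */
static char **vec;
static int arg, pos;

char argv_opt(char *argv[]) {
    if (argv != NULL) {                 /* (re)start: skip the program name */
        vec = argv;
        arg = vec[0] ? 1 : 0;
        pos = 0;
    }
    if (pos > 0 && vec[arg][pos + 1] == '\0') {     /* end of a cluster */
        arg++;
        pos = 0;
    }
    if (pos < 0 || vec[arg] == NULL ||
        vec[arg][0] != '-' || vec[arg][1] == '\0') { /* non-option or "-" */
        pos = -1;
        return '\0';
    }
    if (vec[arg][1] == '-' && vec[arg][2] == '\0') { /* "--" is skipped */
        arg++;
        pos = -1;
        return '\0';
    }
    return vec[arg][++pos];
}

/* Hypothetical helper: where the non-option arguments start. */
char **argv_rest(void) {
    return &vec[arg];
}

/* Hypothetical helper: copy each option letter into out, NUL-terminated. */
void collect_opts(char *argv[], char *out, size_t size) {
    size_t n = 0;
    assert(size > 0);
    for (char opt = argv_opt(argv); opt; opt = argv_opt(NULL))
        if (n + 1 < size)
            out[n++] = opt;
    out[n] = '\0';
}

/* Self-check: "-ab -c" yields a, b, c; "--" is consumed; "-file" is left. */
void demo_cluster(void) {
    char *args[] = {"example", "-ab", "-c", "--", "-file", NULL};
    char seen[8];
    collect_opts(args, seen, sizeof seen);
    assert(strcmp(seen, "abc") == 0);
    assert(strcmp(argv_rest()[0], "-file") == 0);
}
```

Calling collect_opts() again with a different vector restarts the scan, because a non-NULL argv resets vec, arg and pos.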

Option values are a little more flavoursome than I showed in the examples: there’s also a function for getting an optional value, for things like sed -i inplace editing. Optional values require a GNU extension for getopt().

    char *argv_optval(void) {
        char *val = &vec[arg][pos + 1];
        arg++;
        pos = 0;
        return(val);
    }

To get a mandatory option value, look for an optional value, or if it isn’t present, use the next argument (taking care not to overrun).

    char *argv_val(void) {
        char *val = (pos > 0) ? argv_optval() : "";
        return(val[0] ? val : vec[arg] ? vec[arg++] : NULL);
    }
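
The sketch below exercises both value functions on the usual spellings: "-ffile", "-f file", a missing value, and a sed-style optional suffix. It repeats the three functions from the article. The value_of() and demo_values() helpers are hypothetical and exist only to drive the loop; value_of() does not consume values for options other than the one asked for, so it is not a general-purpose parser.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Parser state and functions, repeated from the article. */
static char **vec;
static int arg, pos;

char argv_opt(char *argv[]) {
    if (argv != NULL) {
        vec = argv;
        arg = vec[0] ? 1 : 0;
        pos = 0;
    }
    if (pos > 0 && vec[arg][pos + 1] == '\0') {
        arg++;
        pos = 0;
    }
    if (pos < 0 || vec[arg] == NULL ||
        vec[arg][0] != '-' || vec[arg][1] == '\0') {
        pos = -1;
        return '\0';
    }
    if (vec[arg][1] == '-' && vec[arg][2] == '\0') {
        arg++;
        pos = -1;
        return '\0';
    }
    return vec[arg][++pos];
}

/* The rest of the current argument, possibly the empty string. */
char *argv_optval(void) {
    char *val = &vec[arg][pos + 1];
    arg++;
    pos = 0;
    return val;
}

/* The rest of the current argument, else the next one, else NULL. */
char *argv_val(void) {
    char *val = (pos > 0) ? argv_optval() : "";
    return val[0] ? val : vec[arg] ? vec[arg++] : NULL;
}

/* Hypothetical helper: return the value of option `want`, read with
   argv_optval() when `optional` is nonzero and argv_val() otherwise.
   NULL means the option was absent or a mandatory value was missing. */
char *value_of(char *argv[], char want, int optional) {
    char *val = NULL;
    assert(argv != NULL);
    for (char opt = argv_opt(argv); opt; opt = argv_opt(NULL))
        if (opt == want)
            val = optional ? argv_optval() : argv_val();
    return val;
}

/* Self-check of the common spellings. */
void demo_values(void) {
    char *joined[]  = {"example", "-ffile", NULL};
    char *split[]   = {"example", "-f", "file", NULL};
    char *missing[] = {"example", "-f", NULL};
    char *suffix[]  = {"sed", "-i.bak", "script", NULL};
    char *bare[]    = {"sed", "-i", "script", NULL};
    assert(strcmp(value_of(joined, 'f', 0), "file") == 0);
    assert(strcmp(value_of(split, 'f', 0), "file") == 0);
    assert(value_of(missing, 'f', 0) == NULL);
    assert(strcmp(value_of(suffix, 'i', 1), ".bak") == 0);
    assert(strcmp(value_of(bare, 'i', 1), "") == 0);  /* "script" untouched */
}
```

An optional value has to be attached to its option, which is why "-i script" yields an empty suffix and leaves "script" as an ordinary argument.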

There’s not much to the error handler:

    // needs <stdio.h>, <stdlib.h>, and <stdnoreturn.h> (C11)
    noreturn void
    argv_err(const char *usage, char opt, const char *err) {
        if(err) fprintf(stderr, "option '-%c' %s\n", opt, err);
        fprintf(err ? stderr : stdout, "usage: %s\n", usage);
        exit(err ? 1 : 0);
    }

That’s all!

discussion

When I understood the idea of lexopt, I thought that getting rid of getopt()’s semi-declarative option string might simplify things, but I was surprised by how much code could be deleted. There wasn’t much code to start with!

This experience reminds me of the end-to-end argument, which says,

functions placed at low levels of a system may be redundant or of little value when compared with the cost of providing them at that low level.

That is straightforwardly the thing that motivated this hack. And when I was refining argv_val() and argv_err() I repeatedly found that there was already logic in the caller that made features in the lower-level code redundant. Making the API less redundant meant there was less need for communication across the API boundary, so removing the redundant features caused cascading simplifications.

It’s interesting that most successors to getopt() have aimed to be more declarative, but what argv_opt() does is remove its most declarative feature. The name lexopt refers to parsing technology, treating the job of reading the command line as lexical analysis. The declarative approach adds a metalanguage that describes the command line language, and a metalanguage interpreter that runs the declarative description in order to interpret the command line. Two layers of little languages! That’s a lot of machinery for such a small task. No wonder removing the metalayer is simpler, because the lower layer isn’t (usually) complicated enough to justify so much machinery.

I didn’t look at the lexopt implementation until after I wrote argv_opt(). It’s 1000 lines of code! Plus 1000 lines of tests! It’s not 2x bigger, it’s 20x bigger! As well as long options, I’m pretty sure it handles things I haven’t even thought of yet. (Good Rust code is like that.) But I suppose lexopt is comparing itself to clap, which is notorious for turning simple command line utilities into monsters.

Anyway, you can use the code in this article under 0BSD or MIT-0 licences, but I recommend you use something better instead – not C, or if you must C, standard getopt().