Everyone quotes command line arguments the wrong way

Background

At one time or another, we all need to pass arbitrary command line arguments to some program. These arguments could be filenames, debugging flags, hostnames, or any other kind of information: the point is that we are to take a string and make sure some child program receives exactly that string in its argv array no matter what this string contains. The task is harder than it appears.

For better or for worse1, Windows knows about only one command line string for each process. Because one string is not terribly useful, libraries conspire to provide the illusion of multiple command line arguments: before creating a subprocess, a program combines all argument strings into one command line string, and the newly born subprocess, before calling main, splits this string into arguments and passes the arguments as argv. In principle, each program can parse the command line string differently, but most use the convention that CommandLineToArgvW and the Microsoft C library understand. This convention is a good one because it provides a way to encode any command line argument as part of a command line string without losing information.

The problem is that there is no ArgvToCommandLineW. How do we construct an argument string understood by CommandLineToArgvW?

Test program

For exposition's sake, we'll be using this small program to generate the example output below:

 
#include <stdio.h>
int __cdecl
wmain(int, wchar_t** argv)
{
    for (int arg = 0; argv[arg]; ++arg) {
        wprintf (L"%d: [%s]\n", arg, argv[arg]);
    }
}

Popular solutions: clear, simple, and wrong

The C runtime library is useless

Our first instinct should be to look for a library function that has already solved the problem. Were we to conduct this search, we would quickly find functions provided by the C runtime specifically for running subprocesses: the _exec and _spawn families. These functions appear to be precisely what we need: they take an arbitrary number of distinct command line arguments and promise to launch a subprocess. Unfortunately and counter-intuitively, these functions do not quote or process these arguments: instead, they're all concatenated into a single string, with arguments separated by spaces, and this string is then passed to the child, where it's eventually interpreted by CommandLineToArgvW. This approach works only for arguments that do not themselves contain spaces.

Thus, if we run child.exe like this:

 _wspawnlp (_P_WAIT, L"child.exe", L"child.exe", L"argument 1", L"argument 2", NULL)

child.exe receives five command line parameters:

 
0: [child.exe]
1: [argument]
2: [1]
3: [argument]
4: [2]

That's not what we want! So, while the C runtime process-launching functions appear to be what we need, they're actually useless for solving the command line argument problem.

Adding quotation marks is insufficient

Having seen the problems with the previous approach, one might suggest that we heed the advice provided in the C runtime documentation and surround arguments containing spaces with double-quote characters. This solution is also wrong.

Recall2 that the CommandLineToArgvW convention3 stipulates that arguments containing spaces be surrounded by double quotation marks. Following the above approach and surrounding arguments with quotation marks produces good results for simple cases, and many people stop here.

 
child.exe argument1 "argument 2"  "\some\path with\spaces"

is correctly interpreted as:

 
0: [child.exe]
1: [argument1]
2: [argument 2]
3: [\some\path with\spaces]

So far, so good: but what if our arguments are more complex? Bear in mind that our convention also stipulates that a double quotation mark that is part of an argument be preceded by a backslash, and that a backslash immediately preceding the quotation mark that terminates the argument (and so is not part of it) be preceded by another backslash4. So, if we follow our simplistic approach above and use this as our command line:

 
child.exe argument1 "she said, "you had me at hello""  "\some\path with\spaces"

then the child sees:

 
0: [child.exe]
1: [argument1]
2: [she said, you]
3: [had]
4: [me]
5: [at]
6: [hello]
7: [\some\path with\spaces]

We don't want that either! The problem becomes more insidious when quotes are unbalanced:

 
child.exe argument1 "argument"2" argument3 argument4

0: [child.exe]
1: [argument1]
2: [argument2 argument3 argument4]

Arguments ending with backslashes also lead to undesired interpretations:

 
child.exe "\some\directory with\spaces\" argument2

0: [child.exe]
1: [\some\directory with\spaces" argument2]

Many popular programs (including command shells, the authors of which really should know better) use this simple approach. Developers test only with simple argument strings, leaving users confused and puzzled when their command lines are occasionally mangled.
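
To make the failures above easier to reproduce, here is a minimal sketch of the parsing convention as this article describes it. The name ParseCommandLine is hypothetical, and the function is only a simplified model, not the real CommandLineToArgvW: it assumes that only spaces and tabs separate arguments, and it ignores any special treatment real implementations give to the program name or to doubled quotation marks inside a quoted region.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Simplified model of the convention described above (hypothetical helper,
// not a Windows API). Rules modeled:
//   - spaces and tabs outside quotes separate arguments;
//   - 2n backslashes followed by " yield n backslashes, and the " toggles
//     quoting;
//   - 2n+1 backslashes followed by " yield n backslashes and a literal ";
//   - backslashes not followed by " are copied unchanged.
std::vector<std::wstring> ParseCommandLine(const std::wstring& CommandLine)
{
    std::vector<std::wstring> Args;
    std::wstring Current;
    bool InQuotes = false;
    bool HaveArg = false;  // true once the current argument has started

    for (std::size_t i = 0; i < CommandLine.size(); ) {
        const wchar_t Ch = CommandLine[i];

        if (Ch == L'\\') {
            // Count the run of backslashes, then look at what follows it.
            std::size_t Count = 0;
            while (i < CommandLine.size() && CommandLine[i] == L'\\') {
                ++i;
                ++Count;
            }
            HaveArg = true;
            if (i < CommandLine.size() && CommandLine[i] == L'"') {
                Current.append(Count / 2, L'\\');
                if (Count % 2 == 1) {
                    Current.push_back(L'"');  // escaped, literal quote
                    ++i;
                }
                // Even count: the quote is handled on the next iteration.
            } else {
                Current.append(Count, L'\\');
            }
        } else if (Ch == L'"') {
            InQuotes = !InQuotes;
            HaveArg = true;  // "" on its own is an empty argument
            ++i;
        } else if (!InQuotes && (Ch == L' ' || Ch == L'\t')) {
            if (HaveArg) {
                Args.push_back(Current);
                Current.clear();
                HaveArg = false;
            }
            ++i;
        } else {
            Current.push_back(Ch);
            HaveArg = true;
            ++i;
        }
    }
    if (HaveArg) {
        Args.push_back(Current);
    }
    return Args;
}
```

Feeding this model the command lines from this section should reproduce the mangled argument lists shown above; for instance, the "\some\directory with\spaces\" example comes back as two arguments rather than three.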

The correct solution

We've seen that properly quoting an arbitrary command line argument is non-trivial, and that doing it incorrectly causes subtle and maddening problems. The function below properly quotes an argument; translate it into your language and coding style of choice.

 
#include <string>

void
ArgvQuote (
    const std::wstring& Argument,
    std::wstring& CommandLine,
    bool Force
    )
    
/*++
    
Routine Description:
    
    This routine appends the given argument to a command line such
    that CommandLineToArgvW will return the argument string unchanged.
    Arguments in a command line should be separated by spaces; this
    function does not add these spaces.
    
Arguments:
    
    Argument - Supplies the argument to encode.

    CommandLine - Supplies the command line to which we append the encoded argument string.

    Force - Supplies an indication of whether we should quote
            the argument even if it does not contain any characters that would
            ordinarily require quoting.
    
Return Value:
    
    None.
    
Environment:
    
    Arbitrary.
    
--*/
    
{
    //
    // Unless we're told otherwise, don't quote unless we actually
    // need to do so --- hopefully avoid problems if programs won't
    // parse quotes properly
    //
    
    if (Force == false &&
        Argument.empty () == false &&
        Argument.find_first_of (L" \t\n\v\"") == Argument.npos)
    {
        CommandLine.append (Argument);
    }
    else {
        CommandLine.push_back (L'"');
        
        for (auto It = Argument.begin () ; ; ++It) {
            unsigned NumberBackslashes = 0;
        
            while (It != Argument.end () && *It == L'\\') {
                ++It;
                ++NumberBackslashes;
            }
        
            if (It == Argument.end ()) {
                
                //
                // Escape all backslashes, but let the terminating
                // double quotation mark we add below be interpreted
                // as a metacharacter.
                //
                
                CommandLine.append (NumberBackslashes * 2, L'\\');
                break;
            }
            else if (*It == L'"') {

                //
                // Escape all backslashes and the following
                // double quotation mark.
                //
                
                CommandLine.append (NumberBackslashes * 2 + 1, L'\\');
                CommandLine.push_back (*It);
            }
            else {
                
                //
                // Backslashes aren't special here.
                //
                
                CommandLine.append (NumberBackslashes, L'\\');
                CommandLine.push_back (*It);
            }
        }
    
        CommandLine.push_back (L'"');
    }
}

To construct a command line string for a program from arbitrary arguments, we encode each argument (including the program name) with the above function and follow all but the last with a single space. We can then pass the resulting string as the lpCommandLine parameter to CreateProcess and be confident that the child process will decode each argument to exactly the arguments we were initially given.
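
As a rough illustration of that joining step, the sketch below builds a whole command line from a list of arguments. BuildCommandLine is a hypothetical helper name, and the ArgvQuote body here is only a condensed restatement of the function above, repeated so that the example compiles on its own.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Condensed restatement of ArgvQuote from above, repeated only so that this
// sketch is self-contained.
void ArgvQuote(const std::wstring& Argument, std::wstring& CommandLine, bool Force)
{
    if (!Force && !Argument.empty() &&
        Argument.find_first_of(L" \t\n\v\"") == std::wstring::npos) {
        CommandLine.append(Argument);
        return;
    }

    CommandLine.push_back(L'"');
    std::size_t Backslashes = 0;  // pending run of backslashes
    for (wchar_t Ch : Argument) {
        if (Ch == L'\\') {
            ++Backslashes;  // defer: the meaning depends on what follows
            continue;
        }
        if (Ch == L'"') {
            // Double the pending backslashes and escape the quote itself.
            CommandLine.append(Backslashes * 2 + 1, L'\\');
        } else {
            // Backslashes aren't special here.
            CommandLine.append(Backslashes, L'\\');
        }
        CommandLine.push_back(Ch);
        Backslashes = 0;
    }
    // Double trailing backslashes so the closing quote stays a metacharacter.
    CommandLine.append(Backslashes * 2, L'\\');
    CommandLine.push_back(L'"');
}

// Hypothetical helper: encode every argument (program name included) and
// separate the encoded arguments with single spaces.
std::wstring BuildCommandLine(const std::vector<std::wstring>& Args)
{
    std::wstring CommandLine;
    for (std::size_t i = 0; i < Args.size(); ++i) {
        if (i != 0) {
            CommandLine.push_back(L' ');
        }
        ArgvQuote(Args[i], CommandLine, false);
    }
    return CommandLine;
}
```

One practical detail when handing the result to CreateProcessW: that function may modify the buffer passed as lpCommandLine, so pass a writable copy of the string rather than a pointer to constant data.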

cmd.exe

One might conclude that we're done: after all, we now understand how to send arbitrary strings through CreateProcess so that they emerge unchanged on the other side. But life is not that simple. Often, we can't directly supply our command line to CreateProcess, but instead must pass it through a level of indirection before it reaches our intended child process. The most common indirection is a trip through the venerable cmd.exe, which we encounter when we use the system function, construct a script for later execution, write a makefile for nmake, and do many other things. Because the quoting rules of CommandLineToArgvW differ from those of cmd, we cannot give a command line intended for the former to the latter and expect our arguments to survive the trip. Because this difference is subtle, it's easy to forget about it and leave latent bugs in programs that interact with cmd.

cmd is essentially a text preprocessor: given a command line, it makes a series of textual transformations, then hands the transformed command line to CreateProcess. Some transformations replace environment variable names with their values. Some transformations, such as IO redirection, have useful side effects. Some transformations, such as those triggered by the &, ||, and && operators, split command lines into several parts. It's important to note that cmd doesn't know or care about command line arguments per se: to cmd, the world is made up of whole command lines. Like the post office delivering postcards, cmd doesn't try to understand what it handles: it merely does its limited job (via these transformations), then leaves it to another program to figure out what the final command line means.

All of cmd's transformations are triggered by the presence of one of the metacharacters (, ), %, !, ^, ", <, >, &, and |. " is particularly interesting: when cmd is transforming a command line and sees a ", it copies a " to the new command line, then begins copying characters from the old command line to the new one without checking whether any of these characters is a metacharacter. This copying continues until cmd either reaches the end of the command line, runs into a variable substitution, or sees another ". In the last case, cmd copies a " to the new command line and resumes normal processing. This behavior is almost, but not quite, like what CommandLineToArgvW does with the same character4; the difference is that cmd does not know about the \" sequence and begins interpreting metacharacters earlier than we would expect. It should be apparent why the commands below produce the indicated output:

 
C:\> child "hello world" >\\.\nul

C:\> child "hello"world" >\\.\nul
0: [child]
1: [helloworld >\\.\nul]

C:\> child "hello\"world" >\\.\nul
0: [child]
1: [hello"world]
2: [>\\.\nul]

C:\>

If we rely on cmd's "-behavior to protect arguments, quotation marks inside those arguments will produce unexpected behavior. If we pass untrusted data as command line parameters, then the bugs caused by this convention mismatch become security issues:

 
C:\> child "malicious argument\" &whoami"
0: [child]
1: [malicious argument"]
ntdev\dancol

Here, cmd is interpreting the & metacharacter as a command separator because, from its point of view, the & character lies outside the quoted region. whoami, of course, can be replaced by any number of harmful commands. Note that this command is properly formatted for use with CreateProcess: it's the passing through cmd that causes trouble.
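
The mismatch can be made concrete with a small model of cmd's quote tracking. This is only a sketch under stated assumptions: FirstExposedMetacharacter is a hypothetical name, the model toggles its quoted state on every " (ignoring any preceding backslash, as cmd does), and it does not model variable substitution, which cmd still performs inside quoted regions.

```cpp
#include <cstddef>
#include <string>

// Simplified model (hypothetical helper): returns the index of the first
// metacharacter that lies outside every "-delimited region as cmd sees it,
// or std::wstring::npos if there is none. A result other than npos means
// cmd would act on that character instead of passing it to the child.
std::size_t FirstExposedMetacharacter(const std::wstring& CommandLine)
{
    // The metacharacters listed above, minus " itself, which is tracked
    // separately below.
    static const std::wstring Metacharacters = L"()%!^<>&|";

    bool InQuotes = false;
    for (std::size_t i = 0; i < CommandLine.size(); ++i) {
        const wchar_t Ch = CommandLine[i];
        if (Ch == L'"') {
            // cmd does not know about \", so every " toggles the state.
            InQuotes = !InQuotes;
        } else if (!InQuotes &&
                   Metacharacters.find(Ch) != std::wstring::npos) {
            return i;
        }
    }
    return std::wstring::npos;
}
```

Under this model, the malicious command line above reports the & as exposed, while in the two hello-world examples that contain an embedded quotation mark the > is reported as protected, which is consistent with the redirection not taking place there.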

A better method of quoting

While the " metacharacter cannot fully protect metacharacters in our command lines against unintended shell interpretation, the ^ metacharacter can. When cmd transforms a command line and sees a ^, it ignores the ^ character itself and copies the next character to the new command line literally, metacharacter or not. That's why ^ works as the line continuation character: it tells cmd to copy a subsequent newline as itself instead of regarding that newline as a command terminator. If we prefix with ^ every metacharacter in an argument string, cmd will transform that string into the one we mean to use.

 
C:\> child ^"malicious argument\^"^&whoami^"
0: [child]
1: [malicious argument"&whoami]

It's important to also ^-escape " characters: otherwise, cmd would literally copy ^ characters appearing between " pairs, mangling the argument string in the process:

 
C:\> child "malicious-argument\^"^&whoami"
0: [child]
1: [malicious-argument\^&whoami]

In effect, because cmd's " handling is useless for our purposes, we use ^ to tell cmd to not attempt to be smart about quote detection.
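
A minimal sketch of this second escaping layer might look like the following. EscapeForCmd is a hypothetical name; the function simply applies the rule stated above, prefixing with ^ every metacharacter from the list given earlier, and it assumes its input has already been encoded for CommandLineToArgvW.

```cpp
#include <string>

// Hypothetical helper: prefix every cmd metacharacter, including " and ^
// themselves, with ^ so that cmd copies it to the new command line
// literally. Apply this only to text that cmd will actually interpret.
std::wstring EscapeForCmd(const std::wstring& CommandLine)
{
    static const std::wstring Metacharacters = L"()%!^\"<>&|";

    std::wstring Escaped;
    Escaped.reserve(CommandLine.size() * 2);
    for (wchar_t Ch : CommandLine) {
        if (Metacharacters.find(Ch) != std::wstring::npos) {
            Escaped.push_back(L'^');
        }
        Escaped.push_back(Ch);
    }
    return Escaped;
}
```

Applied to the already-quoted argument "malicious argument\"&whoami", this produces the ^-escaped form used in the example above. The order matters: encoding for CommandLineToArgvW comes first and ^-escaping second, because cmd strips the ^ characters before the child ever sees the string.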

Conclusion

In general, we can safely pass arbitrary command line arguments to programs, provided we take a few basic precautions.

Do:

  1. Always escape all arguments so that they will be decoded properly by CommandLineToArgvW, perhaps using my ArgvQuote function above.
  2. After step 1, if and only if the resulting command line will be interpreted by cmd, prefix each shell metacharacter (or each character) with a ^ character.

Do not:

  1. Simply add quotes around command line arguments without any further processing.
  2. Allow cmd to ever see an unescaped " character.

Notes

1 Worse.
2 You did follow my links above, yes?
3 I know you didn't.
4 Just to be clear: CommandLineToArgvW neither knows nor cares about cmd's metacharacters and looks only for " and \.