I'VE GOT THE BYTE ON MY SIDE

57005 or alive

Git for Windows accidentally creates NTFS alternate data streams

Jul 20, 2016 git neat windows bug

As part of the small minority of devs at my company who primarily run Windows, I’m accustomed to working around occasional Unix-specific behaviors in our build and deployment systems. Cygwin makes most stuff just work, I can fix simple incompatibilities myself, and as a last resort I can always boot into OSX for a while if needed.

One oddity that took me quite some time to diagnose, though, was Git’s strange behavior when dealing with files in our repo whose names contained a colon.

What happens when you sync a file with a colon in the filename?

Besides the initial drive prefix (e.g. C:\), Windows does not permit the colon character in file or directory names. Unix has no such restriction. So what happens if a Git repo of Unix origin contains a file with a colon in the name, and that repo is cloned on a Windows machine?

I’ve created a sample repo that contains a single file foo:bar with the content hello. When you clone the repo with a default installation of Git for Windows, you get no errors or warnings:

C:\src
> git clone https://github.com/latkin/filetest.git
Cloning into 'filetest'...
remote: Counting objects: 3, done.
remote: Total 3 (delta 0), reused 0 (delta 0), pack-reused 3
Unpacking objects: 100% (3/3), done.
Checking connectivity... done.

Instead of a file named foo:bar, though, you get a file named foo, with nothing in it:

C:\src
> cd .\filetest\
C:\src\filetest
> dir -force

    Directory: C:\src\filetest

Mode                LastWriteTime         Length Name
----                -------------         ------ ----
d--h--        7/17/2016   5:53 PM                .git
-a----        7/17/2016   5:47 PM              0 foo

That’s kind of strange on its own, but even more peculiar is that Git has a different opinion of what things look like:

C:\src\filetest
> git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)

        foo

nothing added to commit but untracked files present (use "git add" to track)

Git notices the untracked file foo, but seems to think that foo:bar is present and contains the expected content. How strange…

To confuse matters further, when you enable the Git config option core.fscache (which is enabled by default in version 2.8.2 and later), the reported state of the working set suddenly changes, and foo:bar now shows up as missing:

C:\src\filetest
> git config core.fscache true
C:\src\filetest
> git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add/rm <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        deleted:    foo:bar

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        foo

no changes added to commit (use "git add" and/or "git commit -a")

That’s what we originally would have expected given that only the foo file was created when we cloned the repository, but why is it different from core.fscache = false? And why was this empty file foo created in the first place?

Alternate data streams

The root cause of all this is a relatively obscure NTFS feature called alternate data streams.

Briefly, files in NTFS are not simple buckets of data, but rather collections of one or more data streams. What we normally think of as a file’s contents is really the contents of the primary, unnamed stream. One can also create and add data to alternate, named streams. These streams are directly addressable by appending :streamname to the normal file path. For example, the stream MyStream in file qwerty.txt can be accessed via the path qwerty.txt:MyStream.
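To make the addressing scheme concrete, here is a minimal sketch of how a name like qwerty.txt:MyStream breaks down. The function split_stream_path and the StreamPath type are my own illustrative names, not Windows APIs, and the sketch assumes a bare file name with no drive prefix and no stream type suffix.

```python
from typing import NamedTuple, Optional


class StreamPath(NamedTuple):
    file_name: str
    stream_name: Optional[str]  # None means the primary, unnamed stream


def split_stream_path(name: str) -> StreamPath:
    """Split 'file:stream' into its parts (simplified model, not a Windows API)."""
    # Assumption: 'name' is a bare file name, so the first colon (if any)
    # separates the file from the stream rather than marking a drive letter.
    file_name, sep, stream_name = name.partition(":")
    if not sep or not stream_name:
        return StreamPath(file_name, None)
    return StreamPath(file_name, stream_name)


print(split_stream_path("qwerty.txt:MyStream"))
print(split_stream_path("qwerty.txt"))
```

Under this model, foo:bar is simply the file foo plus the stream bar, which is how the rest of this post reads it.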

So although foo:bar is not a legal Windows file name, Windows file APIs are nonetheless happy to accept it for read and write operations because it is indeed legal as a path to something in the filesystem, namely the bar alternate stream of the file foo.

What Git does

Once you are aware of alternate data streams, Git’s behavior starts to make sense.

When cloning, Git naively blasts content into the path foo:bar. That is a 100% legal path, so no errors are raised by the OS. The result is a file foo with no content in the primary data stream (hence reported as length 0), but 6 bytes in an alternate stream bar:

C:\src\filetest
> Get-Item .\foo -Stream * | ft Stream,Length

Stream Length
------ ------
:$DATA      0
bar         6

C:\src\filetest
> cat .\foo:bar
hello
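The same effect can be reproduced without Git. The sketch below is a stand-in for what the checkout step effectively does, not Git’s actual code: it opens the path foo:bar for writing and then lists the directory. The helper name write_colon_path is hypothetical.

```python
import os
import tempfile


def write_colon_path(directory: str) -> list:
    """Write b'hello' plus a newline to 'foo:bar' and return the directory listing."""
    target = os.path.join(directory, "foo:bar")
    with open(target, "wb") as handle:
        handle.write(b"hello\n")
    return sorted(os.listdir(directory))


with tempfile.TemporaryDirectory() as scratch:
    # On an NTFS volume the listing is expected to contain only 'foo', because
    # the six bytes land in the 'bar' stream. On typical Unix file systems the
    # colon is an ordinary character and the listing contains 'foo:bar'.
    print(write_colon_path(scratch))
```

Either way the write succeeds without an error, which is exactly why nothing looks wrong at clone time.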

When checking the status of the working set, Git uses different algorithms depending on whether core.fscache is enabled.

When core.fscache is false, file metadata checks are done one at a time, ultimately invoking GetFileAttributesEx for each path. Git has no clue it’s even dealing with an alternate stream, because these file APIs behave exactly the same as they would with a normal file path. Does foo:bar exist? Yep! Does the last modified time on foo:bar match what Git expects? Yep! Is the content of foo:bar what Git expects? Yep! Well alright, that file must be unchanged.

When core.fscache is true, Git pre-caches file metadata per directory, then reads it from the cache instead of invoking file APIs directly. This leads to a different view of the world: when enumerating files in the containing directory, Windows only mentions foo, since that’s the only file present. Thus the cache, when asked for the metadata of foo:bar, believes this file does not exist.
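The difference between the two strategies can be illustrated with a toy in-memory model. This is only a sketch of the idea, not Git’s implementation: the directory dictionary and the functions exists_direct, build_listing_cache and exists_cached are all hypothetical names invented for the example.

```python
# Toy in-memory model of one NTFS directory: file name -> {stream name: bytes}.
# The empty stream name "" stands for the primary, unnamed stream.
directory = {"foo": {"": b"", "bar": b"hello\n"}}


def exists_direct(path: str) -> bool:
    """Per-path check: the file system resolves 'file:stream' itself."""
    file_name, _, stream_name = path.partition(":")
    return stream_name in directory.get(file_name, {})


def build_listing_cache() -> set:
    """Directory enumeration reports file names only, never stream names."""
    return set(directory)


def exists_cached(path: str, cache: set) -> bool:
    """Cache check: the full string must match an enumerated file name."""
    return path in cache


cache = build_listing_cache()
for path in ("foo", "foo:bar"):
    print(path, exists_direct(path), exists_cached(path, cache))
```

In this model both lookups agree that foo exists, but only the direct lookup finds foo:bar. The cached lookup never sees it, because enumeration yields just foo.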

Conclusion

In my opinion, this is all rather silly and should never have been allowed to happen in the first place. Git should simply detect the bogus filename, issue an error, and never even attempt to write the file to disk. This is how other valid-in-Unix-but-invalid-in-Windows filenames are handled already (e.g. a file named \Windows\System32\crypt32.dll will be blocked). Such files would then (correctly) be reported as missing from the working set, regardless of core.fscache setting.
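The kind of check I have in mind is tiny. The following is an illustrative sketch only, not the code from my PR and not anything in Git itself; the function name is_checkout_safe_on_windows is made up for the example, and it assumes repository-relative paths that use / as the separator.

```python
def is_checkout_safe_on_windows(repo_path: str) -> bool:
    """Return False if the repository path contains a colon (hypothetical check)."""
    # Repository paths are relative and never carry a drive prefix, so any
    # colon would be interpreted by NTFS as a stream separator.
    return ":" not in repo_path


for candidate in ("foo", "foo:bar", "docs/readme.txt", "docs/a:b/readme.txt"):
    verdict = "ok" if is_checkout_safe_on_windows(candidate) else "rejected"
    print(f"{candidate}: {verdict}")
```

A rejected path would never be written to disk, so it would show up as missing no matter which status algorithm is in use.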

I opened a bug against Git for Windows to track this issue, and provided a PR with a fix, but these have sat dormant with no feedback for the past 4 months. This week I’m making noise again on the PR; hopefully that will spur some action by the maintainers.