Some thoughts about git

I was originally planning to witter about distributed version control vs. centralized version control, especially the oft-neglected problem of breaking up a large cvs / svn / p4 repository. This was partly triggered by Linus's talk about git at Google, in which he didn't really address a couple of questions about how to migrate a corporate source repository to distributed version control. But in the end I don't think I have any point other than the fairly well-known one that distributed version control systems work best when your systems are split into reasonably small, self-contained modules, one per repository. Most systems are modular, even if all the modules are in one huge central repository, but the build and system integration parts can often get tightly coupled to the repository layout, making it much harder to decentralize.

Instead I'm going to wave my hands a bit about the ways in which git has unusual approaches to distributed version control, and how bzr in particular seems to take diametrically opposing attitudes. I'm not saying one is objectively better than the other, because most of these issues are fairly philosophical and for practical purposes they are dominated by things like quality of implementation and documentation and support.

Bottom-up

Git's design is very bottom-up. Linus started by designing a repository structure that he thought would support his goals of performance, semantics, and features, and worked upwards from there. The upper levels, especially the user interface, were thought to be of secondary importance and something that could be worked on and improved further down the line. As a result it has a reputation for being very unfriendly to use, but that problem is pretty much gone now.

Other VCSs take a similar approach: for example, hg is based on its revlog data structure, and darcs has its patch algebra. However, bzr seems to be designed from the top down, starting with a user interface and a set of supported workflows, and viewing its repository format and performance characteristics as of secondary importance and something that can be improved further down the line. As a result it has a reputation for being very slow.

Amortization

Most VCSs have a fairly intricate repository format, and every operation that writes to the repository eagerly keeps it in the canonical efficient form. Git is unusual because its write operations add data to the repository in an unpacked form, which makes writing cheaper but makes reading from the repository gradually less and less efficient - until you repack the repo in a separate heavy-weight operation to make reads faster again. (Git will do this automatically for you every so often.) The advantage of this is that the packed repository format isn't constrained by any need for incremental updates, so it can optimise for read performance at the expense of making packs more complex to write, without slowing down common write operations. Bzr, being the opposite of git, seems to do a lot more up-front work when writing to its repository than other VCSs, e.g. to make annotation faster.
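To make the loose-then-repack idea concrete, here is a toy sketch in Python. It is not git's on-disk format: the class name ToyStore, the plain SHA-1 of the raw bytes as object id, and the single concatenated pack with an offset index are all assumptions made for illustration. Real packs also delta-compress objects against each other, which this leaves out, and the toy does not model the read slowdown itself, only the split between a cheap write path and a separate batch rewrite.

```python
import hashlib
import zlib


class ToyStore:
    """Toy object store: cheap 'loose' writes plus a separate repack step."""

    def __init__(self):
        self.loose = {}   # object id -> compressed bytes, one entry per write
        self.pack = b""   # all packed objects, concatenated
        self.index = {}   # object id -> (offset, length) into self.pack

    def write(self, data: bytes) -> str:
        # Cheap path: compress the object and drop it next to the others.
        # Nothing that already exists is reorganised.
        oid = hashlib.sha1(data).hexdigest()
        if oid not in self.loose and oid not in self.index:
            self.loose[oid] = zlib.compress(data)
        return oid

    def read(self, oid: str) -> bytes:
        # Look in the pack first, then fall back to the loose objects.
        if oid in self.index:
            offset, length = self.index[oid]
            return zlib.decompress(self.pack[offset:offset + length])
        return zlib.decompress(self.loose[oid])

    def repack(self) -> None:
        # Heavy path: rebuild the whole pack from scratch. Because it is
        # rewritten in one go, its layout never has to support in-place updates.
        objects = {oid: self.read(oid) for oid in list(self.index) + list(self.loose)}
        pack = bytearray()
        index = {}
        for oid in sorted(objects):
            blob = zlib.compress(objects[oid], 9)
            index[oid] = (len(pack), len(blob))
            pack += blob
        self.pack, self.index, self.loose = bytes(pack), index, {}
```

The point of the design is visible in repack(): it is free to pick whatever layout suits readers, because write() never has to touch the pack.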

Thus git has two parallel repository formats, loose and packed. Other VCSs may have multiple repository formats, but use only one at a time, and introduce new formats to satisfy feature or performance requirements. Repository format changes are a pain, and happily git's format stabilized very early on - unlike bzr's.

Laziness

As well as being slack about how it writes to its repository, git is also slack about what it writes. There has been an inclination in recent VCSs towards richer kinds of changeset, with support for file copies and renames or even things like token renames in darcs. The bzr developers think this is vital. Git, on the other hand, doesn't bother storing that kind of information at all, and instead lazily calculates it when necessary. There are some good reasons for this, in particular that developers will often not bother to be explicit about rich change information, or the information might be lost when transmitting a patch, or the change might have come from a different VCS that doesn't encode the information. This implies that even VCSs that can represent renames still need to be able to infer them in some situations.

Git's data structure helps to make this efficient: it identifies files and directories by a hash of their contents, so if two hashes are the same it doesn't need to look any closer to find differences because there aren't any - and the same hash turning up under a different name implies a copy or rename. This means that you should not rename or copy a file and modify it in the same commit, because the hashes then no longer match and git's rename inference gets harder. Similarly, if you rename a directory, don't modify any of its contents (including renames and permissions changes) in the same commit.
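A rough sketch of why an unmodified rename is the cheap case. git_blob_id computes the id git gives a file's contents (SHA-1 over a "blob <size>" header, a NUL byte, and the content). Everything else is a stand-in: infer_renames is a hypothetical helper, the snapshots are plain dicts of path to bytes, and difflib's ratio only approximates the idea of a similarity score; git's own heuristic is different. The 0.5 threshold is an assumption chosen to echo git's default 50% similarity cut-off. Copies are not handled.

```python
import hashlib
from difflib import SequenceMatcher


def git_blob_id(content: bytes) -> str:
    """Id git assigns to file contents: SHA-1 of 'blob <size>', a NUL, then the bytes."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def infer_renames(old: dict, new: dict, threshold: float = 0.5) -> list:
    """Guess renames between two snapshots, each mapping path -> bytes.

    Returns (old_path, new_path, score) tuples. Hypothetical helper, not git's code.
    """
    deleted = {p: c for p, c in old.items() if p not in new}
    added = {p: c for p, c in new.items() if p not in old}
    renames = []

    # Pass 1, cheap: a deleted and an added path with the same id are
    # the same contents, so no byte-by-byte comparison is needed.
    by_id = {git_blob_id(c): p for p, c in deleted.items()}
    for path in sorted(added):
        source = by_id.get(git_blob_id(added[path]))
        if source in deleted:
            renames.append((source, path, 1.0))
            del deleted[source]
            del added[path]

    # Pass 2, expensive: the contents changed too, so every remaining
    # pair has to be compared to find the most similar candidate.
    for path in sorted(added):
        best, best_score = None, threshold
        for candidate, old_content in deleted.items():
            score = SequenceMatcher(None, old_content, added[path]).ratio()
            if score >= best_score:
                best, best_score = candidate, score
        if best is not None:
            renames.append((best, path, best_score))
            del deleted[best]
    return renames
```

A pure rename is settled in the first pass by a dictionary lookup. Rename-plus-edit falls through to the second pass, which compares every leftover pair and can still guess wrong or give up below the threshold.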

Mercurial also uses hashes to identify things, but they aren't pure content hashes: they include historical information, so they can't be used to identify files with the same contents but different histories. Thus efficiency forces hg to represent copies explicitly.
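The difference between the two kinds of hash can be sketched as follows. content_id depends on the bytes alone. history_id mixes two parent ids into the digest, in the style of hg's revlog node ids (SHA-1 over the two parent ids in sorted order, followed by the text); treat that formula as an assumption here rather than a specification. The function names and NULL_ID are made up for the example.

```python
import hashlib

NULL_ID = b"\0" * 20  # stand-in for "no parent"


def content_id(text: bytes) -> str:
    """Pure content hash: equal bytes always give an equal id."""
    return hashlib.sha1(text).hexdigest()


def history_id(text: bytes, p1: bytes = NULL_ID, p2: bytes = NULL_ID) -> bytes:
    """Hash that also covers the parent ids, so it depends on history."""
    first, second = sorted((p1, p2))
    return hashlib.sha1(first + second + text).digest()


# Two files with different histories that end up with identical contents.
same = b"identical contents\n"
root_a = history_id(b"first version of a\n")
root_b = history_id(b"first version of b\n")

# The content ids agree, so comparing them is enough to spot the duplicate.
assert content_id(same) == content_id(same)

# The history-aware ids differ, so they say nothing about equal contents.
assert history_id(same, root_a) != history_id(same, root_b)
```

With ids of the second kind, spotting that two paths hold the same contents means comparing the contents themselves, which is why recording the copy explicitly is the cheaper option.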

Any more?

I should say that I know very little about bzr, and nothing about tla, mtn, or bk, so if any of the above is off the mark or over-states git's weirdness, then please correct me in a comment!