Lessons from PostgreSQL's Git transition

October 12, 2010

This article was contributed by Josh Berkus

The PostgreSQL Project finally switched from CVS to Git in September 2010, and did its first release based on the new Git repository on October 5. Making the switch happen took years and resulted in at least one near-disaster. Other projects that are contemplating, or working on, a transition in their version control system may find useful lessons in how PostgreSQL fared.

A History of CVS

Switching version control systems is a relatively straightforward process for a young project, but not for an old one. From 1996 through mid-September 2010, the PostgreSQL Project was developed using CVS. In fact, the date the CVS server went live — July 8, 1996 — is generally considered the "birthday" of the open-source project, which came out of the ten-year-old university project. Twenty-one major releases and 154 minor releases were committed, branched, and packaged on CVS. A large web and development infrastructure existed around CVS, as well as a multi-step, multi-role release procedure.

In 2004, as Subversion was beginning to become popular, PostgreSQL contributors first started to argue about switching away from CVS. However, Subversion was not seen as sufficiently mature at the time, or as offering enough advantages over CVS. This discussion, and occasional flamewars, continued to crop up on the main "pgsql-hackers" mailing list. Those who wanted to migrate off of CVS split into multiple camps based on the system they preferred, including Subversion, Arch, and Monotone. And, of course, a large group of developers saw no reason to change at all.

As with many projects which operate by rough consensus, where there is no consensus, there is no action. The community verdict was to wait for one version control system to become the clear leader. In retrospect, this turned out to have been a good decision.

First Git Mirror

In 2007, one of the git-cvsimport maintainers from New Zealand decided to set up a persistent and frequently updated Git mirror for the PostgreSQL CVS tree. This mirror was extremely popular and one Google Summer of Code student even did his project using the Git mirror instead of the CVS repository. An increasing number of PostgreSQL contributors started using Git mirrors.

In 2008 a few members of the PostgreSQL web team decided to set up an "official" git.postgresql.org, using FromCVS for conversion, and Gitweb with custom administration scripts for the web version. This was done without obtaining community consensus, and caused some controversy. But once the flames died out, many people started using the repository. However, the mirror became a source of frustration for the developers who depended on it. Synchronization with CVS was often undependable, with changes made in CVS failing to show up in Git.

While one or two Subversion mirrors were created well before the Git mirror, and there was even a Bazaar mirror on Launchpad, none of these were at all popular. By mid-2008, it was clear that a majority of PostgreSQL developers favored an eventual switch to Git.

Evaluation

pgCon is the annual international PostgreSQL conference in Ottawa, Canada. By pgCon 2009, PostgreSQL was one of only a handful of major projects still using CVS. It was time to switch to something, and a long discussion ensued at the developer meeting. With a version of Git available for Windows, a major hurdle had been cleared, but several issues remained to be solved, most of them having to do with project infrastructure.

First, the project has an automated regression testing infrastructure called the PostgreSQL Buildfarm. This network of donated servers and virtual machines does daily or hourly CVS checkouts, builds PostgreSQL, and runs regression tests. The Buildfarm would need to be updated to use Git, which was a challenge because not all of the operating systems represented in the build farm (such as UnixWare and AIX) had Git packages available.

The second challenge was the code review process. Using CVS, PostgreSQL contributors submitted their patches in context diff format to the pgsql-hackers mailing list and the CommitFest application. The committers did not want to change this process.

It was also unclear whether it would be possible to recreate past releases from Git.

Decision to Switch

By pgCon 2010, these issues had been resolved. The Buildfarm code had been patched, and the web team planned to set up a CVS mirror of Git for the build farm members who could not run Git. Developers had tested several back branches, and tweaked the conversion process until it was possible to produce a back-branch release which was identical to the one produced by CVS.
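The article does not say how the developers checked that a release built from Git matched the one built from CVS. One simple way to make that kind of check, offered here only as a sketch, is to export the same release from both systems and compare file contents by hash. The function names and the two directory arguments below are hypothetical, not part of the PostgreSQL tooling.

```python
import hashlib
import os


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 of its contents."""
    digests = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata, which legitimately differs.
        dirnames[:] = [d for d in dirnames if d not in {".git", "CVS"}]
        for name in filenames:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            digests[os.path.relpath(path, root)] = digest
    return digests


def compare_trees(cvs_root, git_root):
    """Return (only_in_cvs, only_in_git, differing) as sorted lists of paths."""
    cvs = tree_digests(cvs_root)
    git = tree_digests(git_root)
    only_cvs = sorted(set(cvs) - set(git))
    only_git = sorted(set(git) - set(cvs))
    differing = sorted(p for p in set(cvs) & set(git) if cvs[p] != git[p])
    return only_cvs, only_git, differing
```

Three empty lists would mean the two exports are byte-for-byte identical; anything else names the files to inspect by hand.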

The Buildfarm developers, primarily Andrew Dunstan, worked on transitioning build farm members to using the Git mirror in advance of the switchover. However, the git.postgresql.org mirror proved too unreliable for this purpose, and Andrew had to set up another mirror on GitHub using some custom scripts. That took longer than the changes to the Buildfarm client code. But it worked, and the build farm servers started to convert to Git.

Accordingly, those at the developer meeting set a date: on August 18th, 2010, PostgreSQL would switch over. The Git repository would become the canonical code base and the CVS repository would become the mirror. This date was based on the assumption that 9.0 would be released by then, and August 18th would be after the first Alpha release for 9.1. Suitably, at pgCon Andrew did a talk on "Git for PostgreSQL developers".

Failure to Switch Over

By August 13th, things were looking somewhat different; 9.0 was not released yet. But everything else was scheduled, so the conversion went ahead. First, the developers froze the CVS tree. Next, Magnus Hagander employed the cvs2git tool to do a final conversion from CVS to Git. Then all of the PostgreSQL developers were asked to test the new repository. Things looked OK.

Then, the day before the Git repository was expected to be opened to new patches, committer Robert Haas noticed a problem:

The first few revs look OK, but [then] you get to this:

2010-02-28
        PostgreSQL...
        This commit was manufactured by cvs2svn to create branch REL8_3_STABLE

Prior to that commit, this history is nonsense - it appears to be the history of our 9.0 development prior to that date. I would say we're going back to good old CVS.

It seems that, where we had ported patches from later versions to earlier versions, cvs2git had manufactured inappropriate merge commits. Nobody had noticed because it didn't affect the head of any branch, just the history.

We reverted to CVS, and postponed switchover until after 9.0 was released.

The Second Conversion

Over the next few weeks, web team member Magnus Hagander worked with cvs2git developers Max Bowsher and Michael Haggerty to clean up issues with the conversion. In addition to the false merge commits, there were quite a number of other artificial commits and weird history in the converted version, such as branches which didn't exist and reappearing deleted files. A good portion of this was due to the fact that not only were we running an old version of CVS, but repository steward Marc Fournier and others had also done "CVS surgery" a number of times to fix problems over the years.

A second conversion test broke the ability to recreate any release. We discovered the new Debian server we were running it on did not default to ISO date format and had changed all the date strings for the releases. Fixed. A fair amount of "dirty history" in CVS simply could not be converted and needed to be patched in CVS before conversion. That was patched by Tom Lane.

Magnus, Max, Robert, and Tom stuck with it and resolved all of the issues which were considered roadblocks. cvs2git's next release will contain several improvements which are a result of our conversion. Several community members also added documentation to the PostgreSQL wiki on how we would use Git for development, including how to use git.postgresql.org, and how to commit.

On September 20, we released version 9.0 of PostgreSQL. Since Magnus was not heavily involved in the 9.0 release, he was able to spend his time preparing. So, on September 21st, we switched to Git.

Not Done Yet

However, the actual day of conversion was hardly the end. There was still a lot of minor cleanup to do. Contributors had to prune fictitious branches, move tags which appeared in the wrong place, and clean up many other issues.

Because the community wanted to preserve old releases exactly as they had been in CVS, we needed to preserve the file header tags in each file and remove them only from the "tip" of each branch. These are the comments which CVS automatically creates in each file which look like this:

    * $PostgreSQL: pgsql/src/tools/fsync/test_fsync.c,v 1.27 2010/02/26 02:01:39 momjian Exp $

Git doesn't use these tags since they don't work with atomic multi-file commits, so they needed to be removed from current development without removing them from the history.
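As a rough illustration of that cleanup, and not the script the project actually used, the sketch below drops any line that contains nothing but comment decoration and an expanded keyword. The function name and the set of recognized comment prefixes are assumptions made for the example.

```python
import re


def strip_keyword_lines(text, keyword="PostgreSQL"):
    """Drop lines holding only comment decoration and an expanded CVS keyword."""
    pattern = re.compile(
        r"^[ \t]*(?:\*|#|--|//)?[ \t]*\$"
        + re.escape(keyword)
        + r":[^$\n]*\$[ \t]*$"
    )
    kept = [
        line
        for line in text.splitlines(keepends=True)
        # Match against the line without its newline so the anchor is exact.
        if not pattern.match(line.rstrip("\r\n"))
    ]
    return "".join(kept)
```

Run only against the checked-out tip of each branch and committed as an ordinary change, a filter like this leaves every historical revision, and therefore every old release, untouched.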

Administrators also had to fix people's usernames, since contributors had freely used idiosyncratic nicknames and incomplete user information for git.postgresql.org before it was the canonical repository. These were changed to their main e-mail addresses with their real names. And, most of all, hackers who had waited until the conversion to learn Git were asking for command help on the mailing lists.

After the conversion, the project had to resolve policies and contents for .gitignore files, which tell Git what kinds of files to ignore. These are used to prevent developers from accidentally sending in patches containing editor backup files or build artifacts.

One issue which still isn't resolved is the CVS mirror for the Git repository. git-cvsserver turned out to have serious scalability limits, which make it unable to support the build farm servers. As a result, Andrew Dunstan required all of the build farm machine owners to migrate their nodes to Git immediately rather than gradually. However, there are still a few machines which are very valuable for testing and cannot run Git and, so far, there is no solution for those.

No Merge Commits

PostgreSQL had introduced a new workflow for reviewing patches in 2008, called "CommitFests" — a bi-monthly process where committers and reviewers clear out the pending patch queue. By 2010, the bugs had been worked out of the CommitFests and everyone thought they were working well. Among other things, the new workflow was helping train new contributors. So nobody wanted to change it. Also, many committers were okay with the migration only if they could still review patches the way they were used to.

The result was the adoption of a policy which will surprise veteran Git users. Per Magnus's blog:

... We still allow any developers (and committers) to use whatever parts of git they want as they develop, but for commits going into the main tree, we are making a number of restrictions ... We will not allow merge commits ... We will not use the author field in git to tag it with the patches original author ... we will require that author and committer are always set to the same thing, and we will then credit the author(s) (along with the reviewer(s)) in the commit message ...

Yes, that's correct. No merge commits. To submit a patch, extract it as a context diff and e-mail it. Committers are to apply the patch under their own names, without branch history. The project has decided, more-or-less, to use Git like it was CVS as far as commits to the main repository are concerned. Rather than adapt the PostgreSQL project's workflow to Git, Git would be adapted to the project's workflow.
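The two rules that can be checked mechanically are small enough to state as code. The sketch below is not the project's actual server-side hook; the Commit record and the function name are hypothetical, and it assumes the commit metadata has already been read out of the repository by some other means.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Commit:
    sha: str
    parents: Tuple[str, ...]  # more than one parent means a merge commit
    author: str
    committer: str


def policy_violations(commit: Commit) -> List[str]:
    """List the ways a commit breaks the no-merge, author-equals-committer rules."""
    problems = []
    if len(commit.parents) > 1:
        problems.append("merge commit")
    if commit.author != commit.committer:
        problems.append("author differs from committer")
    return problems
```

An empty list means the commit would be acceptable under the policy as quoted; crediting the patch author and reviewers in the commit message remains a human convention that a check like this cannot verify.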

It's possible, even likely, that eventually the PostgreSQL project will move towards the "normal Git workflow" where branches and merges would be used for feature work. But it definitely won't happen this year.

The First Release

The conversion wasn't really complete until the project did a release from it. That happened on Tuesday, October 5th with a combined security update. Unsurprisingly, there was a problem.

Security releases usually have fairly tight timing; from the moment someone commits the security patches, the issues fixed are publicly visible. Due to that, and limited familiarity with the Git commands, the source packager missed the final commit on the latest branch (9.0.1). That caused the release to be delayed by a day.

Git can't be blamed for everything, though. The release was delayed for another 2 hours because the Subversion repository which holds the www.postgresql.org web code locked up. But the release did go out.

Benefits of Git

Now that the migration is mostly done, many people are discovering the benefits of working with an up-to-date, distributed version control tool.

Robert Haas has documented how to get commit summaries and sizes from Git. He wrote a Perl script (which Tom Lane improved) that allows you to produce a changelog suitable for release notes from Git. Andrew Dunstan also found new ways to sync his local repository.

The pgAdmin project has also switched to Git and the pgWeb project will hopefully soon follow.

In the future, Git should allow developers both to work on longer-lived forks more easily and to test them more fully; we've already seen this with synchronous replication and SE-Postgres. Hopefully the translation teams will also be able to take advantage of forks on git.postgresql.org to collaborate on translating the docs and messages. Most importantly, Git should help prevent bit-rot in features which take months or years to develop.

Best of all, most PostgreSQL developers were able to continue hacking away without being involved in the switchover at all.

Lessons Learned

Assuming that there are any projects out there that have not yet switched to their distributed version control system of choice, here are a few things to learn from our migration:

Start with a Git mirror.
Designate a specific "Git migration team". Make sure they have lots of free time.
Your first attempt to migrate will probably fail, so you need to be prepared for more than one.
Changing your infrastructure, workflow, and build tool dependencies is harder than the repository conversion.
Make friends with the conversion tool authors.
Write lots of docs about the new tools and workflow.
The more history you have on your current system, the more work conversion is going to be.
Things which are broken in your current history are not going to fix themselves when you migrate.
When testing the conversion, make sure to look at more than HEAD and branch-tips.
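For the last point, one cheap check is to scan the full history of every branch for commits that the conversion tool created itself, since those are where the false merges described earlier were hiding. The sketch below is an assumption-laden illustration: the function name is hypothetical, and it expects one commit per line as a hash, a tab, and the subject, which is what `git log --all --format='%H%x09%s'` prints.

```python
MARKER = "This commit was manufactured by cvs2svn"


def manufactured_commits(log_text):
    """Return (sha, subject) pairs for commits whose subject carries the marker.

    Expects one commit per line in the form "<sha><TAB><subject>".
    """
    hits = []
    for line in log_text.splitlines():
        sha, sep, subject = line.partition("\t")
        # Lines without a tab are not in the expected format; skip them.
        if sep and MARKER in subject:
            hits.append((sha, subject))
    return hits
```

Manufactured commits are not wrong in themselves, since the tool creates them legitimately for branches and tags, so the result is a reading list for a human reviewer rather than a list of errors.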

The biggest lesson, though, is not to be in a hurry! It was over three years from PostgreSQL's first Git mirror to final conversion, and 16 months of actual preparation. If you take your time and are ready to retry things that don't work the first time, you should be able to have a successful migration to Git.

Merge Commits

Posted Oct 12, 2010 19:22 UTC (Tue) by daglwn (guest, #65432)

I'd like to know more about why merge commits caused problems with patch review. Banning merge commits seems like a huge step backwards. It throws away much of git's power. I always cringe when someone suggests taking an automated process and making it manual. There's more potential for human error.

Merge Commits

Posted Oct 12, 2010 19:40 UTC (Tue) by mingo (subscriber, #31122)

> ... We still allow any developers (and committers) to use whatever parts of git they want as they develop, but for commits going into the main tree, we are making a number of restrictions ... We will not allow merge commits ... We will not use the author field in git to tag it with the patches original author ... we will require that author and committer are always set to the same thing, and we will then credit the author(s) (along with the reviewer(s)) in the commit message ...

Ouch, indeed this looks like a broken Git workflow.

The 'cannot review patches' argument appears to be a bad excuse - trees with merge commits are just as easy to review as linear trees. (In fact often they are easier to review as they show the natural progress of a feature instead of some artificial after-the-fact representation of it. True history is also easier to debug and bisect, etc.)

From this list alone it appears to me that someone is trying to keep central control/power, and got surprised during the Git conversion that a distributed SCM works against that.

They won't enjoy the full power of Git unless they start handling their contributors as equals and allow them to become sub-maintainers - with merge commits, true history, etc.

Merge Commits

Posted Oct 12, 2010 19:48 UTC (Tue) by dmk (guest, #50141)

Give them time. They first have to de-cvs themselves...

Merge Commits

Posted Oct 12, 2010 19:52 UTC (Tue) by corbet (editor, #1)

That's my impression too, after having questioned this on their mailing list a couple months or so ago. I think it's mostly a matter of not wanting to change too many things at once. The tool change is now done; one assumes that the workflow changes will come in their own time.

Merge Commits

Posted Oct 13, 2010 2:54 UTC (Wed) by njs (subscriber, #40338)

Heck, I *like* it when my RDBMS developers are super-conservative and risk-averse...

Merge Commits

Posted Oct 13, 2010 6:46 UTC (Wed) by mingo (subscriber, #31122)

I have no problems with RDBMS developers being conservative and progressing slowly and meticulously.

We should be careful to not base that kind of healthy conservatism on misunderstandings though - and the merge arguments seem to stem from misunderstandings of Git workflows.

In any case i'd like to congratulate the PostgreSQL project for making the difficult transition to Git - i don't think they will regret it! :-)

Merge Commits

Posted Oct 15, 2010 7:22 UTC (Fri) by dark (guest, #8483)

No, no, an essential part of conservatism is the assumption that new things are not yet fully understood. If they (as a project) have misunderstandings about Git workflows then that is an excellent reason to stick with their current workflows for now.

Merge Commits

Posted Oct 15, 2010 7:52 UTC (Fri) by mingo (subscriber, #31122)

> No, no, an essential part of conservatism is the assumption that new things are not yet fully understood. If they (as a project) have misunderstandings about Git workflows then that is an excellent reason to stick with their current workflows for now.

Saying that "we are sticking with our existing workflow because we don't understand the Git workflow yet" is of course fine and is a valid approach, but that is not what they did: instead they explicitly claimed things about the Git workflow which are simply not true, and justified their steps with those (incorrect) assumptions.

Claiming/believing things that are not true is obviously not a productive element of 'conservatism'.

Merge Commits

Posted Oct 13, 2010 2:32 UTC (Wed) by yarikoptic (subscriber, #36795)

Yeap... and then some time they would discover

git diff branch1...branch2
and
git log branch1..branch2

and why those two are of the "same kind" whereas

git diff branch1..branch2
git log branch1..branch2

are not quite brothers ;-)

Merge Commits

Posted Oct 12, 2010 20:30 UTC (Tue) by jberkus (guest, #55561)

As I said elsewhere in the article, the goal was to NOT change the current workflow for patch approval. Keep in mind that the PostgreSQL project has a smaller ecosystem than Linux; there are only 26 committers and around 100 active major contributors, so the centralization is not considered a problem.

More importantly, several committers did not even try Git until the migration happened. Before we can even consider changing workflows, all contributors will need to be comfortable with Git. That's at least 6 months off, which really means for the 9.2 development cycle *at the soonest*.

The changes to the Commitfests don't need to be dramatic; in a lot of ways, linking to a git snapshot would be much easier than the current e-mail-and-link-to-archive method. However, when you have people who have 14 years of experience reviewing context-diff patches for the project, they're not going to adjust quickly to another method. And there's no reason to make them adjust quickly, either.

Merge Commits

Posted Oct 12, 2010 20:50 UTC (Tue) by daglwn (guest, #65432)

> And there's no reason to make them adjust quickly, either.

Oh, but there is. As others have pointed out, a lot of tools rely on various git conventions. It's easy enough to create a context diff from git. That seems like a separate issue from how the merge is actually done. I don't see any reason not to use git's merge power to make life so much easier.

Merge Commits

Posted Oct 13, 2010 11:55 UTC (Wed) by marcH (subscriber, #57642)

> As I said elsewhere in the article, the goal was to NOT change the current workflow for patch approval.

Considering all the problems you have been through this sounds more than reasonable. One thing at a time. Moreover time just works for you now.

Thanks for a great article.

Merge Commits

Posted Oct 13, 2010 11:59 UTC (Wed) by marcH (subscriber, #57642)

> trees with merge commits are just as easy to review as linear trees. (In fact often they are easier to review as they show the natural progress of a feature instead of some artificial after-the-fact representation of it.

I do not see how a linear tree with serialized features is more difficult to review. AFAIK the only information lost is the concurrent progress of different features... how is this harmful?

> True history is also easier to debug and bisect, etc.

Please explain that too. An example maybe?

Merge Commits

Posted Oct 13, 2010 12:25 UTC (Wed) by mingo (subscriber, #31122)

> I do not see how a linear tree with serialized features is more difficult to review. AFAIK the only information lost is the concurrent progress of different features... how is this harmful?

I pull quite a few trees from sub-maintainers and I generally find non-rebased trees easier to review, for multiple reasons:

- The timeline is visible. Was the feature done on a single day? Done over several days, weeks or months? Which bit took the most time?

- Bugs are visible and give me, the maintainer, a way to see the natural stability (and the natural problem points) of a feature - helping me judge whether to merge something or not. If i see a tree that has been rebased on the day it got sent to me i lose this kind of info.

- Progression of the feature is more visible: it usually starts with a 'baby feature' commit, then goes down towards maturity.

- I can pull something that i know is old enough and is reasonably stable - looking at the timestamps. With a rebased tree you never really know. It might be fine - or not.

So as long as a tree is maintained in an orderly fashion (i.e. it does not have messy changelogs and messy bugs and messy merges, etc.) a tree with true history is much more valuable to maintainers (and future reviewers/bugfinders) than a rebased tree.

> > True history is also easier to debug and bisect, etc.
> Please explain that too. An example maybe?

Trees with true history tend to follow development practices more closely, so they tend to be more fine-grained. Those kinds of trees are easier to bisect - even though they will also have 'broken' commits in them, with live bugs.

Also, another problem we saw with rebases in the Linux kernel was trees rebased to some new base - triggering new, not-thought-of-before interactions - or basing on a buggy new base.

On the other hand, real-history trees tend to be based on something stable that works fine for a group of developers for a longer period of time. That's a pretty good practical guarantee.

To put it in a different way: true-history trees tend to be done 'defensively', with every commit having a real meaning and having some real testing - because this is the tree that the developers worked on for a long time. (I don't pull from people who don't have clean maintenance practices)

A rebased/linearized tree tends to be done with the knowledge that the end-result tested out fine - and often the intermediate steps are not reliable or suffer bit-rot due to the rebase. We've had many problems with that in the Linux kernel. A rebase is a 'risks every single commit' kind of global operation, with associated global risks.

So, in theory you are right, a linear tree can be better than a true-history tree - simply because any problem of a true-history tree can be eliminated via a rebase.

In practice though, after years of experience with them in the Linux kernel context, they are markedly worse.

YMMV. With a small enough project you can use just about any workflow with Git and not feel any pain really. But if you want your project to grow (and i'm sure most of us want to see PostgreSQL grow) then sooner or later 'Git best practices' need to be considered.

The IMO two best Git workflows in existence are the Git project itself, and the Linux kernel.

Merge Commits

Posted Oct 12, 2010 20:58 UTC (Tue) by SEJeff (guest, #51588)

All of git.gnome.org bans merge commits and we have no problems.

http://git.gnome.org/browse/gitadmin-bin/tree/pre-receive...

Actually, I took those commit hooks and adopted them where I just set up an internal git repo. It makes the history look much cleaner.

Merge Commits

Posted Oct 12, 2010 23:06 UTC (Tue) by smurf (subscriber, #17840)

Wrong. That filter just blocks commits with auto-generated merge comments at their tip.

It does nothing to prevent one from "properly" merging a feature branch into mainline. That's a LOT less restrictive than PostgreSQL.

Merge Commits

Posted Oct 13, 2010 12:12 UTC (Wed) by SEJeff (guest, #51588)

Ah, I was incorrect. That's a very good point you make there. Using git without the amazing merge capabilities just seems wrong.

It seems like if they want to force git to behave like cvs, they should just use svn.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 19:41 UTC (Tue) by ernstp (guest, #13694)

"Yes, that's correct. No merge commits. To submit a patch, extract it as a context diff and e-mail it. "

Sounds like a perfect use case for Gerrit and a cherry-pick policy!

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 20:31 UTC (Tue) by jberkus (guest, #55561)

Yeah, you can see a lot of discussion around cherry-pick on the mailing list archives. If anyone has advice on how to use these properly, it's welcome.

For the build farm...

Posted Oct 12, 2010 20:16 UTC (Tue) by ejr (subscriber, #51652)

A URL that retrieves a specific head as a tar.gz should suffice. No need for any client other than an HTTP fetcher. The git web interfaces I know already include the facility.

For the build farm...

Posted Oct 12, 2010 21:04 UTC (Tue) by gevaerts (subscriber, #21521)

That's a huge difference in bandwidth used

For the build farm...

Posted Oct 12, 2010 23:03 UTC (Tue) by ewen (subscriber, #4772)

The workaround that comes to mind is rsync and a wrapper that logs into a system that can run git and updates to the right version before the rsync. rsync will take an argument (-e) to log into the remote system and spawn rsync in daemon mode, and it's possible to use this to invoke a wrapper that logs into the remote system, does some prep, then spawns rsync in daemon mode. (--rsync-path is another means of inserting such a wrapper.) It'll be slightly fiddly to work out the right details, but once it's done it should look after itself (assuming enough space on the proxy system to pull the whole git tree and keep it up to date). (And of course rsync should be at least as bandwidth efficient as CVS.)

Ewen

For the build farm...

Posted Oct 15, 2010 17:03 UTC (Fri) by chad.netzer (subscriber, #4257)

For that to work well, remember to use the '--rsyncable' option when creating the .tar.gz file, or simply don't compress the tarfile at all.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 20:18 UTC (Tue) by malefic (guest, #37306)

> We will not use the author field in git to tag it with the patches original author ... we will require that author and committer are always set to the same thing, and we will then credit the author(s) (along with the reviewer(s)) in the commit message

What was the reasoning behind this? Why not retain the original author in the Author field? It's there for a reason. A number of repo analysis tools rely on the field to provide a proper picture contribution-wise.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 12:09 UTC (Wed) by petereisentraut (guest, #59453)

Because usually a patch has more than one author or contributor, so it's not clear how to map that without skewing the picture that you are in fact pretending to unskew.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 14:32 UTC (Wed) by malefic (guest, #37306)

I see. In that case it would probably make sense to standardize the way people are listed in the commit message, e.g. by making use of tags: "Reviewed-by:", "Authored-by:" or "Reported-by:" in case of bugs.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 15:44 UTC (Wed) by petereisentraut (guest, #59453)

Yeah, I think something like that will happen down the line.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 14:40 UTC (Wed) by sbohrer (guest, #61058)

Multiple contributors to a patch? Surely if there were multiple authors they each contributed small stand alone changes that could be sent as separate patches. Or is there a lot of pair programming going on?

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 15:51 UTC (Wed) by petereisentraut (guest, #59453)

There is a lot of peer review happening before a patch is proposed for inclusion in the mainline. (Look for "commit fest" to learn about the details.)

Of course, you could separate those into chunks that can each be attributed to exactly one person. But why should you? I'd rather have one good patch by author A, based on an idea by B, prototype by C, reviewed by D, with additional documentation editing by E instead of a dozen crappy commits mixed with another dozen small cleanups "to serve the pure git workflow".

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 19:45 UTC (Wed) by Cyberax (✭ supporter ✭, #52523)

Git allows you to do both. Just review the diff in its entirety, but commit it 'as-is'.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 19:55 UTC (Wed) by dlang (guest, #313)

the git best practices method would be to have the work on this feature done in a separate branch.

the initial commit would be by author C (the prototype), with a follow-up patch(s) by author A, with credit to author B in the commit message or a comment in the code, a reviewed by: tag crediting author D (possibly in the merge commit), and an additional patch for the documentation by author E.

this then gets merged in and everyone can see what was done on this, these aren't mixed in with other commits for unrelated things (until after it's in the mainline, at which point fixes are going to be separate patches anyway)

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 22:42 UTC (Wed) by marcH (subscriber, #57642)

> ... instead of a dozen crappy commits mixed with another dozen small cleanups "to serve the pure git workflow".

The ability to rewrite and clean up history is where git shines more than any other similar tool. Developers with some basic git experience never let crappy commits survive long enough to escape their local hard drives.

Lessons from PostgreSQL's Git transition

Posted Oct 14, 2010 1:15 UTC (Thu) by dmarti (subscriber, #11625) [Link]

...and some of the "crappy" stuff committed locally turns out to have actually worked, so you're more likely to get in the habit of committing small changes. (I don't like the Subversion model where "commit it to revision control" and "everyone else on the project can point at your code and laugh" are the same.)

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 20:30 UTC (Tue) by fhuberts (subscriber, #64683) [Link]

Regarding 'no merge commits'...

We also don't allow them in our projects: everything that you want integrated into the main repository has to be rebased on the main repository branch on which you want it integrated.
This is to keep conflict resolution with the developer of the code, the one who actually knows how to fix the conflict, rather than with the integrator.
It keeps integration cost very low and also keeps a linear history, which is again perfect for keeping features (consisting of multiple commits) together, which is again good for all kinds of other things like bisecting and reverting.

Not so special this one thing :-)
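
The difference between merging a feature and rebasing it before integration, as described in the comment above, can be illustrated with a toy model of a commit graph. This is a minimal sketch and not git itself: the commit names (`A`, `B`, `C`, `F1`, `F2`, `M`) and the helpers `merge`, `rebase`, `is_linear`, and `history` are hypothetical, and conflict resolution is not modeled at all.

```python
from typing import Dict, List, Tuple

# A commit graph: commit name -> tuple of parent names (hypothetical model).
Graph = Dict[str, Tuple[str, ...]]


def base_graph() -> Graph:
    """Main line A-B-C, with a feature F1-F2 branched off at B."""
    return {"A": (), "B": ("A",), "C": ("B",), "F1": ("B",), "F2": ("F1",)}


def merge(graph: Graph, ours: str, theirs: str, name: str) -> str:
    """Record a merge commit with two parents and return the new tip."""
    graph[name] = (ours, theirs)
    return name


def rebase(graph: Graph, commits: List[str], new_base: str) -> str:
    """Replay commits (oldest first) on top of new_base; return the new tip."""
    tip = new_base
    for old in commits:
        new = old + "'"  # a rebased commit is a new commit, not the old one
        graph[new] = (tip,)
        tip = new
    return tip


def is_linear(graph: Graph, tip: str) -> bool:
    """True if no commit on the way from tip to the root has two parents."""
    current = tip
    while True:
        parents = graph[current]
        if len(parents) > 1:
            return False
        if not parents:
            return True
        current = parents[0]


def history(graph: Graph, tip: str) -> List[str]:
    """First-parent history, newest commit first."""
    out = [tip]
    while graph[out[-1]]:
        out.append(graph[out[-1]][0])
    return out


merged = base_graph()
merge_tip = merge(merged, "C", "F2", "M")

rebased = base_graph()
rebased_tip = rebase(rebased, ["F1", "F2"], "C")

print(is_linear(merged, merge_tip))      # merge commit M has two parents
print(is_linear(rebased, rebased_tip))   # every commit has at most one parent
print(history(rebased, rebased_tip))
```

In the rebased case the feature's commits sit together directly on top of the main line, which is the property the comment credits with making bisecting and reverting easier.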

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 20:34 UTC (Tue) by daglwn (guest, #65432) [Link]

You're describing something different. The PostgreSQL folks aren't using the merge capability of git at all. They're doing manual diff+patch. This makes sense in a CVS environment where merging back to trunk was incredibly, outrageously, ludicrously painful. Not so for git.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 12:12 UTC (Wed) by petereisentraut (guest, #59453) [Link]

The PostgreSQL folks may or may not be using the merge capability in their personal workflow. I certainly use it. We just require that what you apply to the master branch is rebased.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 20:41 UTC (Tue) by fhuberts (subscriber, #64683) [Link]

very nice read BTW.

Thanks for the writeup

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 4:24 UTC (Wed) by pdmccormick (subscriber, #69601) [Link]

I'd like to second that. I appreciate LWN.net's unique technical focus. Many thanks for the fascinating read.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 22:57 UTC (Wed) by marcH (subscriber, #57642) [Link]

> This is to keep conflict resolution with the developer of the code, the one who actually knows how to fix the conflict, rather than with the integrator. It keeps integration cost very low...

I do not see how it makes a difference; could you please elaborate? I assume the "integrator" is a guy pulling from developers' repositories, correct?

> ... and also keeps a linear history, which is again perfect for keeping features (consisting of multiple commits) together, which is again good for all kinds of other things like bisecting and reverting.

See above the interesting post from Ingo explaining... the exact opposite.

Lessons from PostgreSQL's Git transition

Posted Oct 14, 2010 5:23 UTC (Thu) by fhuberts (subscriber, #64683) [Link]

Yes the integrator is responsible for pushing to the central ('release') repo. We have only a few integrators per repo (we have a corporate setup).

I also agree with most of what Ingo says, but AFAIC there's not much difference. The thing is, though, that we have just switched over to Git and the devs have to get used to many things. Many devs are not of Linux kernel quality and do not develop their features in an orderly and structured fashion. Rebasing gives them the tool to clean up their mess before submitting it for review and integration.
Note that we still require small commits that build towards the feature, etc etc.

Once everybody is properly accustomed to Git and what it can do we can start to think about other policies :-)

Hope that clarifies.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 21:23 UTC (Tue) by ballombe (subscriber, #9523) [Link]

> The community verdict was to wait for one version control system to become the clear leader. In retrospect, this turned out to have been a good decision.

This article rather shows the opposite. If they had switched to SVN in 2007, they would have been in a far better position to switch to git now, both technically and socially.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 21:43 UTC (Tue) by knan (subscriber, #3940) [Link]

And you base that opinion on what, exactly?

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 21:45 UTC (Tue) by jberkus (guest, #55561) [Link]

On the contrary:

a) If we had migrated to SVN in 2007, we probably wouldn't migrate to git or anything else until 2012, or later.

b) We'd as likely have gone to Monotone or Arch. And then we'd be in trouble now ... we'd end up supporting that project just so we could continue to use it.

c) I don't see any way in which we'd be "in a far better position to switch to git now, both technically and socially." SVN is just CVS++.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 23:08 UTC (Tue) by brouhaha (subscriber, #1698) [Link]

Subversion has a much better designed repository ("filesystem") that doesn't tend to accumulate a bunch of bizarre cruft the way CVS does. Exporting a repository from Subversion and importing it into another source control system is *much* easier than doing it from CVS.

That said, I don't think that using Subversion as an intermediate step for a few years would have helped that much.

It does sound, though, like you're trying to force Git to act like a centralized VCS. If that's really what you want, I don't think Git was the right choice.

Lessons from PostgreSQL's Git transition

Posted Oct 12, 2010 23:16 UTC (Tue) by daglwn (guest, #65432) [Link]

Git is a fine choice for a centralized VCS. One can easily use it with the CVS/SVN model. The mistake is throwing away use of git's automated merging tools and doing it by hand. That's no better than CVS.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 2:42 UTC (Wed) by yarikoptic (subscriber, #36795) [Link]

just my blunt 0.1 cents: SVN imho rests on an analogy which, at first sight, is user-friendly: 'a branch is just a directory, you can copy it'. But that analogy makes it harder to absorb the main notion of a DVCS such as git or hg later on. CVS is closer to the git/hg notion of branches, since there they still live in a "hyperspace" and are appreciated as such.

on the other hand, git-svn is such a handy tool with its bidirectional flow that it makes the "transition" much easier, since people can develop entirely in git and commit back to SVN.

Altogether, I think that the direct jump CVS -> GIT was the right route; it could only have been better if done earlier ;)

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 11:50 UTC (Wed) by marcH (subscriber, #57642) [Link]

Subversion tagging and branching (or rather, the lack thereof) is indeed a regression compared to CVS:

<http://en.wikipedia.org/wiki/Apache_Subversion#Subversion...>

With Subversion you basically end up having to manage tags manually in some external document. For fun, search the Subversion mailing lists: you will find numerous people explaining how convenient the lack of tags is!

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 9:34 UTC (Wed) by cmot (guest, #53097) [Link]

| That said, I don't think that using Subversion as an intermediate step for a
| few years would have helped that much.

Remember that a big chunk of the conversion was in the support infrastructure. Having to revamp the buildfarm etc. infrastructure twice wouldn't have helped.

(I don't know the situation in the pg universe, but in Debian, quite a few administrators of support infrastructure are not intimately involved in development, so they may not see the ideas behind a change that means quite a lot of work to them...)

Lessons from PostgreSQL's Git transition

Posted Oct 21, 2010 14:15 UTC (Thu) by jnareb (subscriber, #46500) [Link]

> Subversion has a much better designed repository ("filesystem") that
> doesn't tend to accumulate a bunch of bizarre cruft the way CVS does.
> Exporting a repository from Subversion and importing it into another source
> control system is *much* easier than doing it from CVS.

Unfortunately, the "branches are folders" idea, with its non-enforced
convention for where they sit in the repository hierarchy, makes it easy to
screw up a repository in bizarre ways... differently than CVS, but as badly.

Those mishandled SVN repositories are a PITA to import to Git.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 11:54 UTC (Wed) by lmb (subscriber, #39048) [Link]

So, Postgres switched, but it took them longer than the slightly larger Linux kernel project. I'm not sure that is something to be proud of.

Beyond CVS conversion oddities (sure, those suck), their install seems to do without the most powerful features of git - no merges, no author/submitter history, etc. In fact, it appears they are using it as a centralized and linearized CVS replacement, for which they might as well have used SVN (except for storing full local history).

In particular the "no merges" bit is really weird. For review, reviewing the history of a feature/bugfix in an offered pull tree would appear to be much better. I could understand an "I'll not merge if there are conflicts to resolve" policy, but that?

Yes, of course, it takes a while for people to adjust, but I think this is self-inflicted to a degree; a number of the problems (short of the conversion madness, but that's a one-time effort) seem to stem from the desire to not fully adopt git. (Which explains why specialized documentation for the project is needed, etc.)

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 12:03 UTC (Wed) by marcH (subscriber, #57642) [Link]

> So, Postgres switched, but it took them longer than the slightly larger Linux kernel project. I'm not sure that is something to be proud of.

I do not recall Linux ever using CVS.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 12:32 UTC (Wed) by mpr22 (subscriber, #60784) [Link]

You are correct. Linus went so far as to state - on the basis of his experience using CVS at Transmeta - that passing around tarballs+patches was actually superior to CVS.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 13:53 UTC (Wed) by marcH (subscriber, #57642) [Link]

Exaggeration to get a point across? No, not that type of guy...

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 14:07 UTC (Wed) by rahulsundaram (subscriber, #21946) [Link]

Well, he stood by what he said and never used CVS.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 12:45 UTC (Wed) by njd27 (subscriber, #5770) [Link]

When Linux was imported into git all history was thrown away so it's not a comparable process.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 14:36 UTC (Wed) by lmb (subscriber, #39048) [Link]

Point taken, though that too is a (possibly) viable process.

But from what I read here, the main point of contention was not the cvs2git conversion (which always is trouble, since CVS just doesn't have all the data one would like - if it did, it wouldn't suck so much), but the processes etc surrounding it afterward.

Switching to a distributed VCS, and then rejecting one or two of the main aspects - distributed development and merge capabilities - to restrict it to CVS-like work flows seems somewhat weird, and I'm not surprised that such an attitude carries with it a lot of self-fulfilling prophecies.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 15:00 UTC (Wed) by alvherre (subscriber, #18730) [Link]

The processes were difficult, but the actual CVS-to-git conversion was pretty problematic too, and required quite a bit of tweaking of the CVS repository (see this mailing list message for the details) and help from the cvs2svn developers.

We've already switched. We're not going back. As people said above repeatedly, changes in workflow will come in time.

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 18:41 UTC (Wed) by iabervon (subscriber, #722) [Link]

If you look more closely at what they're doing, they've actually been doing distributed development for a while; they just do it without VCS support, and admit changes to the project state in a centralized fashion. Leaving aside how the published project state gets updated, they've been emailing context diffs to the mailing list and revising them. It's a small step here to use git to keep track of what you're doing with a patch set (especially with the base state published in git), and to use git to communicate changes to reviewers. And then you find that you've got your unaccepted changes in a form that your build farm is (technically) capable of testing, and maybe you could switch to regression testing commits before updating the public repository instead of after. And then maybe you start having the reviewers see what version the author has based the patch on, and think about whether any of the features accepted since matter to that, instead of thinking that either "patch" or people will find conflicts in functionality.

In any case, they're very much not coming from a centralized development background, where people don't communicate code to each other except through the central repository.

Lessons from PostgreSQL's Git transition

Posted Oct 20, 2010 7:24 UTC (Wed) by augustz (guest, #37348) [Link]

CVS has a significant number of weaknesses, especially for software like PostgreSQL that would benefit from reasonable version control, beginning with the lack of atomic commits; I won't bore you with the other CVS oddities.

So git, WITHOUT changing from a CVS style workflow, is going to be much nicer for the project to use as their authoritative vc. So switching to git while keeping CVS workflow is not weird at all. It will improve results for the project.

Not adopting all gitisms at once is also not weird. Any larger organization moves slowly, rightly I think personally. They have an approach that works well. Let's wait a year for things to shake out before worrying about going with all the gitisms.

Credit also goes to the project for respecting the consensus that developed, which was to switch to git without abandoning work styles. If you read the discussions, this was important to a number of committers and I think a good thing to respect. The value of the committers, I think, outweighs adopting a specific ism of a specific tool.

Thanks for a good write-up Josh!

Lessons from PostgreSQL's Git transition

Posted Oct 13, 2010 19:19 UTC (Wed) by rgmoore (✭ supporter ✭, #75) [Link]

> So, Postgres switched, but it took them longer than the slightly larger Linux kernel project. I'm not sure that is something to be proud of.

That seems unreasonable. Git was custom built to complement the way the kernel developers wanted to work, not the way the Postgres project wants to work. That has a huge impact on how easy the transition is.

How the Postgres transition issues could have been avoided

Posted Oct 13, 2010 14:45 UTC (Wed) by daniel (guest, #3181) [Link]

1. Go back in time
2. Invent Mercurial
3. Avoid CVS
4. Relax

How the Postgres transition issues could have been avoided

Posted Oct 13, 2010 14:56 UTC (Wed) by flewellyn (subscriber, #5047) [Link]

Step 1 left as an exercise to the reader, I take it?

How the Postgres transition issues could have been avoided

Posted Oct 13, 2010 15:37 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]

That's what happens when your transactions are not properly serialized...

How the Postgres transition issues could have been avoided

Posted Oct 14, 2010 1:05 UTC (Thu) by emk (subscriber, #1128) [Link]

*snort* OK, that's the first comment I've laughed at in a while. Thank you.