Revision control

Published on 2007-11-17. Modified on 2010-10-10.

In connection with some corrupted files in CVS I decided that it was time to take a look at some other revision control systems. This article is a summary of my three-day journey into the land of revision control.

Revision control

A revision control system, also known as a version control system (VCS), source control, or source code management (SCM), is a set of tools that allows you to keep track of the development of files. It is the management of multiple revisions of the same unit of information. Such files could be software source code, articles, or even images.

The best way to understand a revision control system is to think of a log or a diary. Let's say you are developing software and you want to keep track of the changes you make over time to the files. Each time you change something, or each time you add or remove a piece of code, you make a note in the diary about the changes, and you write about why you made them.

Let's imagine that your project grew and began to involve more people. To still keep track of changes, you make an online version of the diary and have everybody who is working on the project commit changes to your files and write about it in the diary.

A revision control system is like a log or diary, but it's more than that: a revision control system takes charge of the files in a project, and the system can merge different changes made by different people.

The biggest problem when a lot of people are working on the same files, whether it's software or text in a book, is to keep the different changes that people make from getting messed up. If four people are working on the same file, how do you keep that file intact? And how do you fit the four pieces they each create together? You do that with the help of a revision control system.

A revision control system can help you keep track of things, and it can help you merge different commits into a single file, but the system can't decide on your behalf, so you often have to figure out how to deal with what's called a "merge conflict".
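When that happens, the system typically marks the clashing region of the file with conflict markers and leaves the decision to you. A made-up illustration of what that looks like:

<<<<<<< your version
The quick brown fox jumps over the lazy dog.
=======
The quick brown fox leaps over the sleeping dog.
>>>>>>> their version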

A revision control system can also revert changes and it can almost work like a nice backup system.

My story

I have always loved programming and I have developed a lot of different applications, both professionally and as a hobby. In both cases I used to keep my files backed up on CDs (yes, this was written a long time ago), and I used to keep small notes about important stuff. After doing that for many years I stumbled upon the revision control system called CVS. I learned about CVS via OpenBSD, and after setting the CVS system up, I never looked back. With CVS I could easily keep a log of every single change I made to a file, and at the same time CVS worked as a backup system.

Another great advantage was that I could keep all my important files on a single server and then pull those files down onto both my laptop and my desktop. If I made a change to a file using my laptop, and I committed that change to CVS, I could run an update on my desktop and it would grab those changes from the CVS server, and voilà, everything was in sync.
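In CVS terms that routine boiled down to two commands; a minimal sketch (the commit message is made up). On the laptop:

$ cvs commit -m "Update chapter 2."

And then on the desktop:

$ cvs update -d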

After running with CVS for about three years, I one day needed to pull down a set of files that I had added to CVS back when I first started using it. The files were compressed with tar and gzip. Somehow I had missed the information that CVS needs to be told not to mess with binary files. My lack of knowledge resulted in corrupted files. After the incident I decided to take a deeper look into CVS, and to also take a look at other solutions.

I had always been happy with CVS, but that was because I didn't know about other systems at the time. One thing that did annoy me was CVS's lack of support for renaming files. If you need to rename a file that has been committed to CVS, you have to make a copy of that file, delete the original, and then commit the renamed version as a new file, hence losing all the commit history related to that file. CVS's handling of moving files around is also a big mess, and in most cases the easiest way to deal with that is to manually remove files on the server and then recommit from the client.
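The rename workaround looks something like this (file names are made up), and the history of old.c stops dead at the removal:

$ cp old.c new.c
$ cvs remove -f old.c
$ cvs add new.c
$ cvs commit -m "Renamed old.c to new.c."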

Because CVS is a centralized system, it keeps small pieces of administrative information in each directory of the working copy. When your project has a lot of subdirectories, CVS keeps its own CVS directory in each of those as well. Sometimes you need to move a lot of stuff around, but you don't want to move the CVS directories with it.

All in all I have always been happy with CVS, but I have also been annoyed with the above mentioned problems.

I finally decided to take a good and hard look at other solutions and this article is about my experiences with those systems, the tests I made, and the final decision about which system I would end up using and why.

One thing is for sure, after running the tests, I will never return to CVS.

Different revision control systems

My main concern about a new revision control system was that it had to be open source. Besides that, I was pretty open to new ways of doing things.

First things first. I quickly discovered that there are different ways to run a revision control system.

A centralized system uses a model where all the revision control functions are performed on a shared server. If two developers try to change the same file at the same time, without some method of managing access, the developers may end up overwriting each other's work. Centralized revision control systems solve this problem with one of two source management models: file locking or version merging.

Decentralized and distributed systems are more or less the same; they both take a peer-to-peer approach, as opposed to the client-server approach of a centralized system. Rather than a single central repository with which clients synchronize, each peer's working copy of the codebase is a complete repository, and synchronization is conducted by exchanging patches (change-sets) and pull requests from peer to peer.

A centralized system requires that you are able to connect to the server whenever you want to do version control work. This can be a problem if your server is on some other machine on the Internet and you're not. Or, even worse, you are on the Internet but the server is not operational. A decentralized revision control system deals with these problems by keeping branches on the same machine as the client. This allows the user to save his changes (commits) whenever he wants, even if he is offline. The user only needs Internet access when he wants to access the changes in someone else's branch located somewhere else.

In a decentralized system each developer works on his own local repository. Repositories can be cloned by anyone and are often cloned many times. There may be many "central" repositories. Access control lists are not employed; instead, code from disparate repositories is merged based on a web of trust, i.e., historical merit or quality of changes. Lieutenants are project members who have the power to decide which branches to merge. The network is not involved in most operations.

Advantages of the decentralized model

  1. You can commit whenever you want, even when you are offline.
  2. The network is not involved in most operations, so everyday work is fast.
  3. Every clone is a complete repository, which effectively gives you a lot of backups.

Disadvantages of the decentralized model

There really are none.

The "disadvantages" that some people describe are from a personal habitual point of view rather than from a technical and practical point of view. Most people are used to a centralized system and out of habit they don't want things to change.

NOTE: This was originally written about a year before GitHub was launched.

One argument is that a distributed system can end up with a person as the central point of control, rather than a server, but in my opinion that would only be because of a lack of proper organization.

Even for my own personal use of a revision control system, I find a decentralized system a joy to work with. The ability to make local commits even when you are offline is great, and I often find myself making use of that option.

The tests

I must start by pointing out that the tests I made can in no way compare to the real day-to-day work done by many of the big open source projects like the Linux kernel or the FreeBSD project. I am only a single person, and my tests reflect my daily usage of a revision control system. I develop small pieces of code and I write articles and books, but I don't have to cooperate with a lot of people doing that.

The main advantages that I currently gain from using a revision control system are having extra copies of files, documentation of the commits (the log), the ability to revert back to prior releases of software that I have made, and, last but not least, synchronization between files on different computers without having to copy data back and forth.

First I made a list of systems to test; next I divided the different systems up into groups. All non open source systems were removed from the list. Next I removed those systems that aren't in active development. Then I looked at each system's documentation and online community. I also made sure that the system could run over OpenSSH. Looking at the documentation, I then decided which system looked the most user friendly regarding the commands. Last but not least, I asked some questions on the different IRC channels related to each system.

I ended up with the following systems to test:

  1. Bazaar (BZR)
  2. Git
  3. Darcs
  4. Subversion (SVN)
  5. Mercurial

Subversion is the only centralized system in my test.

On each system I did, among other things, the following tests:

  1. Setting up a repository and making an initial commit.
  2. Checking the project out (or cloning it) via ssh.
  3. Committing a lot of files to gauge the speed.
  4. Handling of binary files.
  5. Merging two non-related repositories into one.

BZR

Bazaar (formerly Bazaar-NG) is a distributed revision control system sponsored by Canonical Ltd., designed to make it easier for anyone to contribute to open source software projects. As of 2007 the best known user of Bazaar is the Ubuntu project.

The development team's focus is on ease of use, accuracy and flexibility. Branching and merging upstream code is designed to be very easy, with focus on users being productive with just a few commands. Bazaar can be used by a single developer working on multiple branches of local content, or by teams collaborating across a network.

Bazaar is written in the Python programming language, with packages for major Linux distributions, Mac OS X and Windows. Released under the GNU General Public License, Bazaar is free software.

Installing BZR and getting it up and running was very easy:

# apt-get install bzr

Then set up a repository:

$ bzr whoami "John Doe <john@example.com>"
$ mkdir /my_project
$ cd /my_project
$ bzr init
$ touch my_file.txt
$ bzr add
$ bzr commit -m "Initial commit."

Checkout of a project via ssh:

$ bzr checkout bzr+ssh://example.com/bzr

I liked BZR right away and I found it well documented. The system is user friendly, but compared to both Git and Mercurial it is very slow; committing a lot of files took forever. I didn't find any compelling reason to choose BZR over the other systems, except that, unlike SVN, it is decentralized.

BZR failed my test of merging two different repositories into one. Perhaps I did something wrong, but I tried several times, and I tried with different merge options, but I ended up with the same result each time.

I like the fact that BZR only has one directory in the root of the repository: .bzr

Merging two non-related repositories on BZR

Merging on BZR wasn't possible and I got the following error:

bzr: ERROR: Branches have no common ancestor, and no merge base revision was specified.

I am pretty sure that I made some mistake, but nonetheless this should be quite easy to achieve with a decentralized system. I read the documentation on merging, but maybe I missed something.
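For the record, the attempt looked roughly like this (paths are made up, and the exact merge options I tried varied):

$ cd my_bzr_repo
$ bzr merge ../another_bzr_repo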

Git

Git is a distributed revision control system created by Linus Torvalds. Git's design was inspired by BitKeeper and Monotone, and it was originally designed only as a low-level engine that others could use to write front-ends for. However, the core Git project has since become a complete revision control system that is usable directly. Several high-profile software projects now use Git for revision control, most notably the Linux kernel.

The installation is really easy on Debian:

# apt-get install git

Then for some usage:

$ git config --global user.name "Your Name Comes Here"
$ git config --global user.email you@yourdomain.example.com
$ mkdir /my_project
$ cd /my_project
$ git init
$ touch my_file.txt
$ git add .
$ git commit -m "My initial commit"

Clone a project via ssh:

$ git clone ssh://user@example.com[:port]/path/to/repo.git/

I find Git very cool. It's almost as fast as Mercurial and it is very user friendly. The documentation, however, is not that great, and several times I had to get help on the Git IRC channel to get the information that I needed.

Git's manpages are not much better. There are a few Git commands (such as log) that accept arguments that other Git commands also accept. Sometimes this fact isn't documented, and you are left guessing what the full range of accepted arguments is.
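For example, git log accepts the diff options that git diff does, such as -p and --stat, even where the manpage doesn't spell that out:

$ git log -p --stat -- my_file.txt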

Update September 2010: The Git documentation and the man pages are really great now.

I like the fact that Git only has one directory in the root of the repository: .git

Git handles binary files very well.

A great feature of Git is that you can force other file types to be binary by adding a .gitattributes file to your repository. This file contains a list of patterns, followed by attributes to be applied to files matching those patterns. By adding .gitattributes to the repository all cloned repositories will pick this up as well.

For example, if you want all pdf files to be treated as binary files you can have this line in .gitattributes:

*.pdf -crlf -diff -merge

This means that all files with the .pdf extension will not have carriage return/line feed translations done, won't be diffed and merges will result in conflicts leaving the original file untouched.
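If your Git version has it, you can verify which attributes apply to a given file with git check-attr; a quick sketch (the file name is made up):

$ git check-attr crlf diff merge -- manual.pdf
manual.pdf: crlf: unset
manual.pdf: diff: unset
manual.pdf: merge: unset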

A great annoyance with Git is that the repository requires periodic optimization with git-gc. If something like this needs to be done periodically, the tool should just find a way to do it automatically. Otherwise, I am going to forget to do it and get frustrated when things are slow.

Linus Torvalds started a thread about people being unaware of the importance of optimizing a Git repository: http://kerneltrap.org/mailarchive/git/2007/9/5/257021; notice the answers too.

I don't like having to optimize a repository periodically, and I find it time consuming (but that's just me).
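For the record, the optimization itself is just one command, and git count-objects -v gives a hint about how many loose objects have piled up:

$ git count-objects -v
$ git gc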

As someone stated on Stack Overflow:

Git and Mercurial both turn in good numbers but make an interesting trade-off between speed and repository size. Mercurial is fast with both adds and modifications and keeps repository growth under control at the same time. Git is also fast, but its repository grows very quickly with modified files until you repack - and those repacks can be very slow. But the packed repository is much smaller than Mercurial's.

Personally I don't need my repository to be repacked, and if I forget to do so on my old laptop, I can easily imagine getting into trouble running a command that relies on the repacking.

Update September 2010: Several git commands now automatically run git-gc and it is no longer a problem.

Merging two non-related repositories on Git

Merging on Git was as easy as with Mercurial, but finding out how to do it was difficult.

Update September 2010: That was due to the lack of documentation at the time of writing.

On the client:

$ cd git
$ git pull ssh://example.com/another_git_repo
...
$ git push

I expected a merge command, but the merging process automatically follows the pull command.
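If you prefer an explicit merge step, pull is equivalent to a fetch followed by a merge of FETCH_HEAD:

$ git fetch ssh://example.com/another_git_repo
$ git merge FETCH_HEAD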

Darcs

Darcs is a distributed revision control system developed by David Roundy, and it was designed to replace traditional centralized source control systems such as CVS and Subversion. Darcs is written in Haskell and, among other tools, it uses QuickCheck. Many of its commands are interactive, allowing users to commit changes or pull specific files selectively. This feature is designed to encourage more specificity in patches. As a result of this interactivity, Darcs has fewer distinct commands than many comparable revision control systems have.

I must admit that I didn't like Darcs, but that wasn't because I found the system bad. I just didn't feel at home. Darcs uses a completely different command structure than CVS and it takes some getting used to.

However, there are several benefits to the way Darcs works. Each developer essentially has his own private branch and can check in anything to that branch without affecting others. Developers can also send each other patches without affecting the main repository. Darcs supports sending patches via email, which eliminates the need for a publicly accessible server.

Update September 2010: When I wrote this I had not yet discovered the power of Git branches.
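The patches-by-email workflow mentioned above is a single command; a minimal sketch (the address is made up):

$ darcs send --to=patches@example.com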

Darcs used to have some significant bugs, but I don't know if it still has them. The most severe of them was "the Conflict bug": an exponential blowup in the time needed to perform conflict resolution during merges, reaching into hours and days for "large" repositories. A redesign of the repository format and wide-ranging changes to the code base were planned in order to fix this bug, and work on this was scheduled to start in spring 2007.

Installation of Darcs is also easy on Debian:

# apt-get install darcs

Then some initial setup:

$ mkdir /my_project
$ cd /my_project
$ darcs init
$ touch my_file.txt
$ darcs add -r *
$ darcs record -am "My initial commit."

Checkout a project via ssh:

$ darcs get sftp://user@example.com:10022/foo/bar

Darcs may be a great revision control system, but I just didn't like it.

Subversion (SVN)

Subversion (SVN) is a version control system (VCS) initiated in 2000 by CollabNet Inc.

Projects using Subversion include the Apache Software Foundation, KDE, GNOME, Free Pascal, GCC, Python, Ruby, Sakai, Samba, and Mono. SourceForge.net and Tigris.org also provide Subversion hosting for their open source projects, and the Google Code and BountySource systems use it exclusively. Subversion is also finding adoption in the corporate world.

Update September 2010: Most of the projects mentioned have since migrated to Git or Mercurial.

The goal of the Subversion project is to build a version control system that is a compelling replacement for CVS in the open source community. Subversion is meant to be a better CVS, so it has most of CVS's features. Generally, Subversion's interface to a particular feature is similar to CVS's, except where there's a compelling reason to do otherwise.

Installation of SVN requires multiple packages on both the server and the client.

On the server:

# apt-get install subversion subversion-tools
# svnadmin create /svn

On the client:

# apt-get install subversion subversion-tools
# exit
$ cd ~
$ svn checkout svn+ssh://example.com/svn

I really liked SVN, and because it is a centralized system like CVS, I felt right at home. In my personal opinion SVN is like a better, improved version of CVS. It's CVS with the good stuff, without the bad stuff, and with some better stuff. I had no problems in any of my tests, and merging two completely different repositories was very easy, though not as easy as with Mercurial.

A thing I like about both SVN and Mercurial is their support for keyword expansion, and the way SVN handles that on a per file basis is really cool. Another cool thing is the possibility of checking out single subdirectories of a repository.
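Both features are plain SVN commands; a minimal sketch (the file name and paths are made up):

$ svn propset svn:keywords "Id Date Author" main.c
$ svn commit -m "Expand keywords in main.c."
$ svn checkout svn+ssh://example.com/svn/project/docs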

SVN has really good documentation in the Subversion book.

Because of the centralized nature of SVN I dislike the system in general, but if I had to use a centralized system, my choice would definitely be SVN.

Merging two non-related repositories on SVN

Merging on SVN involves the server, which I don't like.

On the server:

# svnadmin dump /subversion2 > /subversion2_dump.txt
# svnadmin load /subversion1 < /subversion2_dump.txt

On the client:

$ rm -rf subversion*
$ svn checkout svn+ssh://example.com/subversion1

Very simple and easy, but I just don't like having to deal with the server. The need for server interaction comes from the centralized design of the system.

Mercurial

Mercurial is written in Python, with a binary diff implementation written in C. All Mercurial commands begin with hg, a reference to the chemical symbol for mercury. Mercurial's major goals include high performance and scalability; serverless, fully distributed collaborative development; robust handling of both plain text and binary files; and advanced branching and merging capabilities, while remaining conceptually simple. It includes an integrated web interface.

The creator and lead developer of Mercurial is Matt Mackall. The full source code is available under the terms of the GNU General Public License, making Mercurial free software. Like Git and Monotone, Mercurial uses SHA-1 hashes to identify revisions.

A lot of great projects use Mercurial, amongst them Sun, the Mozilla Foundation, Google Code, and many more.

Installation is again easy on Debian:

# apt-get install mercurial

Then some initial setup:

$ mkdir /my_project
$ cd /my_project
$ hg init
$ touch my_file.txt
$ hg addremove
$ hg commit

Cloning a repository via ssh:

$ hg clone ssh://example.com//hg

Mercurial is a charm to work with. It is very fast, it is very well documented, and it is very user friendly.

Update September 2010: I ended up using Mercurial for several years before I eventually changed to Git. Both Git and Mercurial are really great revision control systems, but Git is more powerful.

With all the different tests I did, including merging two completely independent repositories, Mercurial was the most easy and intuitive system to use.

The only thing I don't like about Mercurial is that it keeps your personal configuration file, ~/.hgrc, outside of the repository. The file doesn't contain any important stuff, and it's perfectly possible to work without ever touching it, but once you have set up the file, you don't want to lose it, especially if you work with keyword expansion on a per file basis, since each filename has to go into that file. Contrary to that, Git keeps the setup file .git/config inside the repository.

Mercurial supports personal setup of keywords, which means that I can have both the keyword "Date" and "Dato" do the same thing, where "Dato" is Danish for "Date". If you need keyword expansion in Mercurial, you have to download a file called keyword.py and set things up right. It's quite easy, but I still think that this function should be natively supported.

Mercurial generally makes no assumptions about file contents. Thus, most things in Mercurial work fine with any type of file. The exceptions are commands like diff, export, and annotate, which work well on files intended to be read by humans, and merge, where processing binary files makes very little sense at all.

The question naturally arises: what is a binary file anyway? It turns out there's really no good answer to this question, so Mercurial uses the same heuristic that diff uses. The test is simply whether there are any NUL bytes in the file. For diff, export, and annotate this gets things right almost all of the time, and Mercurial will not attempt to process files it thinks are binary. If necessary, you can force these commands to treat files as text with the -a option:

$ hg diff -a

Merging is another matter. The actual merging of individual files in Mercurial is handled entirely by external programs, and Mercurial doesn't pretend to tell these programs what files they can and cannot merge. The example mergetools.hgrc currently makes no attempt to do anything special for various file types, but it could easily be extended to do so. Precisely what you would want to do with these files will depend on the specific file type and your project needs. If you need to merge binaries, you need a tool that can manage binary merges. Joachim Eibl's kdiff3 ships a Qt4 version that recognizes binary files; pressing "cancel" and "do not save" leaves you with the version of the file currently in the file system.
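For what it's worth, hooking up an external merge tool is done in the [merge-tools] section of an hgrc; a minimal sketch for kdiff3 (the exact arguments may need tuning for your setup):

[merge-tools]
kdiff3.args = $base $local $other -o $output
kdiff3.gui = True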

I haven't found anything to complain about with Mercurial.

I ended up with a .hgrc that looked something like this:

[ui]
username = Foo
[extensions]
hgk = /usr/share/python-support/mercurial/hgext/hgk.py
keyword = /usr/share/python-support/mercurial/hgext/keyword.py
[keyword]
**.php =
**.txt =
**.cpp =
**.java =
**.css =
[keywordmaps]
RCSFile = {file|basename},v
Author = {author|email}
Forfatter = {author|email}
Header = {root}/{file},v {node|short} {date|utcdate} {author|user}
Source = {root}/{file},v
Date = {date|utcdate}
Dato = {date|utcdate}
Id = {file|basename},v {node|short} {date|utcdate} {author|user}
Revision = {node|short}

Merging two non-related repositories on Mercurial

On the client:

$ cd mercurial-first
$ hg pull --force ssh://example.com//mercurial-second

Then:

pulling from ssh://example.com//mercurial-second
searching for changes
warning: repository is unrelated
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 2 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)

Then:

$ hg merge
2 files updated, 0 files merged, 0 files removed, 0 files unresolved

Success! If you run the hg log command, the log contains both repositories' entries, as if they had belonged together from the beginning.

Conclusion

The winners list looks like this:

  1. Mercurial
  2. Git
  3. SVN
  4. BZR
  5. Darcs

Update September 2010: As mentioned, I eventually switched over to Git.

I know this article hasn't been an in-depth analysis of every aspect of the different systems, and I haven't really elaborated on the details of each test I did, but I still wanted to write about it, at least to be able to come back and remind myself why I chose Mercurial.

My choice of Mercurial was more a personal matter of "taste" than a matter of technicality. Any of the systems could fulfill my needs perfectly, especially since I am used to working with CVS.

Every single one of the systems is better than CVS.

I liked Mercurial, Git, and SVN, but Mercurial is just nicer to work with.

Have a nice one!