Revision control

Outdated content

Published on 2007-11-17. Modified on 2010-10-10.

In connection with some corrupted files in CVS I decided that it was time to take a look at some other revision control systems. This article is a summary of my three-day "journey" into the land of revision control.

Revision control

Revision control (also known as version control system (VCS), source control or source code management (SCM)) is the management of multiple revisions of the same unit of information.

A revision control system is a set of tools that allows you to keep track of the development of files. The files could be software source code, articles, or even images.

The best way to understand a revision control system is to think of a log or a diary. Let's say you are developing software and you want to keep track of the changes you make to the files over time. Each time you change something, or each time you add or remove a piece of code, you make a note in the diary about the changes, and you write about why you made them.
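To make the diary analogy concrete, here is what such a log entry looks like in practice. Git is used purely for illustration (it is one of the systems tested later in this article); the repository path, file name, and commit message are made up:

```shell
# Every commit is a diary entry: it records who changed what, when, and why
rm -rf /tmp/diary-demo
git init --quiet /tmp/diary-demo && cd /tmp/diary-demo
git config user.name "Demo User"
git config user.email demo@example.org
echo "v1" > app.c
git add app.c
git commit --quiet -m "Add first version of app.c"
git log --format='%an: %s'    # prints "Demo User: Add first version of app.c"
```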

Let's imagine that your project grew and began to involve more people. To still keep track of changes, you make an online version of the diary and have everybody working on the project commit their changes to your files and write about them in the diary.

That's basically how a revision control system works, except that it's more complicated and it does the job much better.

A revision control system is like a log or diary, but it's more than that: a revision control system takes charge of the files in a project, and the system can merge different changes made by different people.

The biggest problem when a lot of people are working on the same files, whether it's software, a book, or documentation, is to keep the different changes that people make from messing things up. If four people are working on the same file, how do you keep that file intact, and how do you fit those pieces together? You do that by using a revision control system!

A revision control system can help you keep track of things, and it can help you merge different commits into a single file, but the system can't decide on your behalf, so you often have to figure out how to deal with what's called a "merging conflict".
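A merging conflict is easy to reproduce. The sketch below uses Git (one of the systems tested later) with made-up file contents; any of the systems in this article produces the same kind of situation:

```shell
# Two branches change the same line; the merge stops and asks a human to decide
rm -rf /tmp/conflict-demo
git init --quiet /tmp/conflict-demo && cd /tmp/conflict-demo
git config user.name Demo
git config user.email demo@example.org
echo "original line" > file.txt
git add file.txt
git commit --quiet -m "base version"
base=$(git symbolic-ref --short HEAD)   # default branch name varies (master/main)
git checkout --quiet -b feature
echo "edit from feature" > file.txt
git commit --quiet -am "feature edit"
git checkout --quiet "$base"
echo "edit from the base branch" > file.txt
git commit --quiet -am "other edit"
git merge feature || true               # reports CONFLICT (content) and stops
grep '<<<<<<<' file.txt                 # conflict markers left for you to resolve
```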

A revision control system can also revert changes and it can work like a nice backup system.
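Reverting is equally simple in practice. Again, Git is used only as an illustration and the file contents are made up:

```shell
# Throw away an uncommitted change and restore the last committed version
rm -rf /tmp/revert-demo
git init --quiet /tmp/revert-demo && cd /tmp/revert-demo
git config user.name Demo
git config user.email demo@example.org
echo "known good version" > file.txt
git add file.txt
git commit --quiet -m "known good"
echo "accidental mess" > file.txt       # oops
git checkout -- file.txt                # restore the file from the repository
cat file.txt                            # prints "known good version"
```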

My story

I have always loved programming and I have developed a lot of different applications, both professionally and as a hobby. In both cases I used to keep my files backed up on CDs (yes, this was written a long time ago), and I used to keep small notes about important stuff. After doing that for many years I suddenly stumbled upon the revision control system called CVS. I learned about CVS via OpenBSD, and after setting the CVS system up, I never looked back. With CVS I could easily keep a log of every single detail of change I made to a file, and at the same time CVS worked as a backup system.

Another great advantage was that I could keep all my important files on a single server and then pull those files down onto both my laptop and my desktop. If I made a change to a file using my laptop and committed that change to CVS, I could run an update on my desktop and it would grab those changes from the CVS server and voilà, everything was in sync.

After running with CVS for about three years, I one day needed to pull down a set of files that I had added to CVS when I first started using it. The files were compressed using tar and gzip. Somehow I had missed the information that CVS needs to be told not to mess with binary files, and my lack of knowledge resulted in corrupted files. After the incident I decided to take a deeper look into CVS, and I also decided to take a look at other solutions.

I have always been happy with CVS, but that was because I didn't know about other systems at the time. One thing that did annoy me was CVS's lack of support for renaming files. If you need to rename a file that has been committed to CVS, you have to make a copy of that file, delete the original, and then commit the renamed version as a new file, hence losing all the commit history related to that file. CVS's handling of moving files around is also a big mess, and in most cases the easiest way to deal with that is to manually remove files on the server and then recommit from the client.

Because CVS is a centralized system, it keeps small pieces of information in each directory it manages. When your project has a lot of subdirectories, CVS keeps its own directory in each of those as well. Sometimes you need to move a lot of stuff around, but you don't want to move the CVS directories with it.

All in all I have always been happy with CVS, but I have also been annoyed with the above mentioned problems.

I finally decided to take a good and hard look at other solutions and this article is about my experiences with those systems, the tests I made, and the final decision about which system I would end up using and why.

One thing is for sure, after running the tests, I will never return to CVS.

Different systems

My main requirement for a new revision control system was that it had to be open source. Apart from that I was pretty open to new ways of doing things.

First things first. I quickly discovered that there exist different ways to run a revision control system.

A centralized system uses a model where all the revision control functions are performed on a shared server. If two developers try to change the same file at the same time, without some method of managing access, the developers may end up overwriting each other's work. Centralized revision control systems solve this problem with one of two different source management models: file locking or version merging.

A decentralized system and a distributed system are more or less the same: they both take a peer-to-peer approach, as opposed to the client-server approach of a centralized system. Rather than a single central repository on which clients synchronize, each peer has a working copy of the code base, and synchronization is conducted by exchanging patches (change-sets) and pull requests from peer to peer.

A centralized system requires that you are able to connect to the server whenever you want to do version control work. This can be a bit of a problem if your server is on some other machine on the Internet and you are not. Or, even worse, you are on the Internet but the server is not operational. A decentralized revision control system deals with these problems by keeping branches on the same machine as the client. This allows the user to save his changes (commits) whenever he wants - even if he is offline. The user only needs Internet access when he wants to access the changes in someone else's branch located somewhere else.
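This offline workflow can be sketched with any distributed system; here it is with Git, using made-up paths and file names:

```shell
# Commits in a distributed system are purely local; no server is contacted
rm -rf /tmp/offline-demo
git init --quiet /tmp/offline-demo && cd /tmp/offline-demo
git config user.name Demo
git config user.email demo@example.org
echo "work done on the train" > notes.txt
git add notes.txt
git commit --quiet -m "Commit while offline"   # succeeds without any network
git log --oneline | wc -l                      # prints 1: the commit is stored locally
```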

The main differences from centralized systems

Each developer works on his own local repository. Repositories can be cloned by anyone and are often cloned many times. There may be many "central" repositories. Access control lists are not employed; instead, code from disparate repositories is merged based on a web of trust, i.e., historical merit or quality of changes. Lieutenants are project members who have the power to decide dynamically which branches to merge. The network is not involved in most operations.

Advantages of the decentralized model

Disadvantages of the decentralized model

There really are none. The "disadvantages" that some people describe come from a personal, habitual point of view rather than from a technical and practical point of view. Most people are used to a centralized system, and out of habit they don't want things to change (Note: remember, this was written about a year before GitHub was launched).

One argument is that a distributed system can end up with a person as the central point of control, rather than a server, but in my opinion that would only be because of a lack of proper organization.

Even for my own personal use of a revision control system I find a decentralized or distributed system a joy to work with. The ability to make local commits even when you are offline is great, and I often find myself making use of that option.

The test

I must start by pointing out that the tests I made can in no way compare to the real day-to-day work done by many of the big open source projects like the Linux kernel or the FreeBSD project. I am only a single person and my tests reflect my daily usage of a revision control system. I develop small pieces of code and I write articles and books, but I don't have to cooperate with a lot of people doing that.

The main advantages that I gain from using a revision control system are backup, documentation (the log), the ability to revert to prior releases of software that I have made, and using the exact same files on different computers without having to copy data back and forth.

First I made a list of systems to test; next I did a screening process in which I divided the different systems up into groups. All non open source systems were removed from the list. Next I removed those systems that aren't in active development. Then I looked at each system's documentation and online community. I also required a simple setup running over OpenSSH. From the documentation I then decided which system looked the most user friendly with regard to the commands. Last but not least, I asked a lot of questions on the different IRC channels related to each system.

When I was done sorting I ended up with the following systems to test: Bazaar (BZR), Git, Darcs, Subversion (SVN), and Mercurial (hg).

I installed each system on my Debian GNU/Linux revision server and tested the system using one of my desktops, which also runs Debian GNU/Linux.

On each system I performed the same series of tests, including, but not limited to, merging two unrelated repositories into one.

On each system I had to mess around a bit with file permissions on the initial directory; I haven't shown that in the test.

Bazaar (BZR)

Bazaar (formerly Bazaar-NG) is a distributed revision control system sponsored by Canonical Ltd., designed to make it easier for anyone to contribute to open source software projects. As of 2007 the best known user of Bazaar is the Ubuntu project.

The development team's focus is on ease of use, accuracy and flexibility. Branching and merging upstream code is designed to be very easy, with focus on users being productive with just a few commands. Bazaar can be used by a single developer working on multiple branches of local content, or by teams collaborating across a network.

Bazaar is written in the Python programming language, with packages for major Linux distributions, Mac OS X and Windows. Released under the GNU General Public License, Bazaar is free software.

Installing BZR and getting it up and running was very easy:

# apt-get install bzr

Then setup a repository:

$ bzr whoami "John Doe"
$ mkdir /my_project
$ cd /my_project
$ bzr init
$ touch my_file.txt
$ bzr add
$ bzr commit -m "Initial commit."

Checkout of a project via ssh:

$ bzr checkout bzr+ssh://myserver/bzr

I liked BZR right away and I found it well documented. The system is user friendly, but compared to both Git and Mercurial it is very slow. Committing a lot of files took forever. I didn't find any compelling reason to use BZR compared to the other systems, except that, unlike SVN, it's decentralized.

BZR failed my test on merging the two different repositories into one. Perhaps I did something wrong, but I tried several times, and I tried with different merge options, but I ended up with the same result each time.

I like the fact that BZR only has one directory in the root of the repository: .bzr.

Merging two non-related repositories on BZR

Merging on BZR wasn't possible and I got the following error:

bzr: ERROR: Branches have no common ancestor, and no merge base revision was specified.

I am pretty sure that I made some mistake, but nonetheless it should be quite easy to achieve with a decentralized system. I read the documentation on merging, but maybe I missed some points.

Git

Git is a distributed revision control project created by Linus Torvalds. Git's design was inspired by BitKeeper and Monotone and it was originally designed only as a low-level engine that others could use to write front ends to. However, the core Git project has since become a complete revision control system that is usable directly. Several high-profile software projects now use Git for revision control, most notably the Linux kernel.

The installation is really easy on Debian:

# apt-get install git

Then for some usage:

$ git config --global user.name "Your Name Comes Here"
$ git config --global user.email you@yourdomain.example.com
$ mkdir /my_project
$ cd /my_project
$ git init
$ touch my_file.txt
$ git add .
$ git commit

Clone a project via ssh:

$ git clone ssh://[user@]host.xz[:port]/path/to/repo.git/

I find Git very cool. It's almost as fast as Mercurial and it is very user friendly. The documentation is not that great, and several times I had to get help on the Git IRC channel to get the information that I needed.

Update September 2010: This is no longer the case. Git documentation is really great now.

Git's manpages are not much better. There are a few Git commands (such as log) that take arguments that other Git commands accept. Sometimes this fact isn't documented, and a person is left guessing what the full range of accepted arguments is.

I like the fact that Git only has one directory in the root of the repository: .git.

Git handles binary files very well.

A great feature of Git is that you can force other file types to be binary by adding a .gitattributes file to your repository. This file contains a list of patterns, followed by attributes to be applied to files matching those patterns. By adding .gitattributes to the repository all cloned repositories will pick this up as well.

For example, if you want all pdf files to be treated as binary files you can have this line in .gitattributes:

*.pdf -crlf -diff -merge

This means that all files with the .pdf extension will not have carriage return/line feed translations done, won't be diffed and merges will result in conflicts leaving the original file untouched.
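You can verify that such attributes take effect with git check-attr, a standard Git command; the repository path and file name below are made up:

```shell
# Check which attributes Git applies to a hypothetical PDF file
rm -rf /tmp/attr-demo
git init --quiet /tmp/attr-demo && cd /tmp/attr-demo
printf '*.pdf -crlf -diff -merge\n' > .gitattributes
git check-attr crlf diff merge -- manual.pdf   # each attribute is reported as "unset"
```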

A great annoyance with Git is that the repository requires periodic optimization with git-gc. If something like this needs to be done periodically, the tool should just find a way to do it automatically. Otherwise, I am going to forget to do it and get frustrated when things are slow.
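For reference, the manual housekeeping step looks like this (repository path and file made up for the sketch):

```shell
# Run git-gc by hand: loose objects get packed into a pack file
rm -rf /tmp/gc-demo
git init --quiet /tmp/gc-demo && cd /tmp/gc-demo
git config user.name Demo
git config user.email demo@example.org
echo "content" > file.txt
git add file.txt
git commit --quiet -m "one commit"
git gc --quiet                            # the periodic chore complained about above
git count-objects -v | grep '^in-pack:'   # objects now live in a pack
```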

Linus Torvalds started a thread about people being unaware of the importance of optimizing a Git repository; notice the answers too.

I don't like having to optimize a repository periodically, and I find it time consuming (but that's just me).

Update September 2010: Some Git commands now automatically run git-gc.

As someone stated on Stack Overflow:

Git and Mercurial both turn in good numbers but make an interesting trade-off between speed and repository size. Mercurial is fast with both adds and modifications and keeps repository growth under control at the same time. Git is also fast, but its repository grows very quickly with modified files until you repack - and those repacks can be very slow. But the packed repository is much smaller than Mercurial's.

Personally I don't need my repository to be repacked, and if I forget to do so on my old laptop, I could easily imagine getting into trouble running a command that requires caching.

Update September 2010: I only use Git now, but both Git and Mercurial are really great revision control systems. Git does however have some advantages IMHO. One such advantage is that it is much more powerful.

Merging two non-related repositories on Git

Merging on Git was super easy, as with Mercurial, but finding out how to do it was difficult (due to the lack of documentation at the time of writing - not so anymore).

On the client:

$ cd git
$ git pull ssh://webserver/another_git_repo
$ git push

I expected a merge command, but the merging process automatically follows the pull command.
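The same unrelated-history merge can be reproduced locally; the repository paths and files below are made up, and local paths are used instead of ssh to keep the sketch self-contained. Note one assumption about newer Git: since version 2.9 a plain pull refuses to merge unrelated histories, so the --allow-unrelated-histories flag is needed; at the time this article was written, a plain pull sufficed.

```shell
# Merge two repositories that share no common history
rm -rf /tmp/repo-a /tmp/repo-b
git init --quiet /tmp/repo-a && cd /tmp/repo-a
git config user.name Demo
git config user.email demo@example.org
echo "from A" > a.txt && git add a.txt && git commit --quiet -m "repo A history"
git init --quiet /tmp/repo-b && cd /tmp/repo-b
git config user.name Demo
git config user.email demo@example.org
echo "from B" > b.txt && git add b.txt && git commit --quiet -m "repo B history"
cd /tmp/repo-a
# Pull performs the merge automatically, just as described above
git pull --no-edit --allow-unrelated-histories /tmp/repo-b
ls   # a.txt and b.txt now sit in one repository with both histories
```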

Darcs

Darcs is a distributed revision control system developed by David Roundy, designed to replace traditional centralized source control systems such as CVS and Subversion. Darcs is written in Haskell and, among other tools, it uses QuickCheck. Many of its commands are interactive, allowing users to commit changes or pull specific files selectively. This feature is designed to encourage more specificity in patches. As a result of this interactivity, Darcs has fewer distinct commands than many comparable revision control systems have.

I must admit that I didn't like Darcs, but that wasn't because I found the system bad. I just didn't feel at home. Darcs uses a completely different command structure than CVS and it takes some getting used to.

However, there are several benefits to the way Darcs works. Each developer essentially has his own private branch and can check in anything to that branch without affecting others. Developers can also send each other patches without affecting the main repository. Darcs supports sending patches via email, which eliminates the need for a publicly accessible server.

Update September 2010: When I wrote this I had not yet discovered the power of Git branches.

Darcs used to have some significant bugs, but I don't know if it still has these. The most severe of them was "the Conflict bug" - an exponential blowup in time needed to perform conflict resolution during merges, reaching into the hours and days for "large" repositories. A redesign of the repository format and wide-ranging changes in the code base are planned in order to fix this bug, and work on this was planned to start in Spring 2007.

Installation of Darcs is also easy on Debian:

# apt-get install darcs

Then some initial setup:

$ mkdir /my_project
$ cd /my_project
$ darcs init
$ touch my_file.txt
$ darcs add -r *
$ darcs record -am "Initial commit."

Checkout a project via ssh:

$ darcs get s

Darcs may be a great revision control system, but I just didn't like it.

Subversion (SVN)

Subversion (SVN) is a version control system (VCS) initiated in 2000 by CollabNet Inc.

Projects using Subversion include the Apache Software Foundation, KDE, GNOME, Free Pascal, GCC, Python, Ruby, Sakai, Samba, and Mono. Several hosting services provide Subversion hosting for open source projects; the Google Code and BountySource systems use it exclusively. Subversion is also finding adoption in the corporate world.

Update September 2010: Most of the projects mentioned have migrated to Git or Mercurial.

The goal of the Subversion project is to build a version control system that is a compelling replacement for CVS in the open source community. Subversion is meant to be a better CVS, so it has most of CVS's features. Generally, Subversion's interface to a particular feature is similar to CVS's, except where there's a compelling reason to do otherwise.

Installation of SVN requires multiple packages on both the server and the client.

On the server:

# apt-get install subversion subversion-tools
# svnadmin create /svn

On the client:

# apt-get install subversion subversion-tools
# exit
$ cd ~
$ svn checkout svn+ssh://myserver/svn

I really liked SVN, and because it is a centralized system like CVS, I felt right at home. In my personal opinion SVN is like a better, improved version of CVS. It's CVS with the good stuff, without the bad stuff, and with some better stuff. I had no problems in any of my tests, and merging two completely different repositories was very easy, though not as easy as with Mercurial.

A thing I liked about both SVN and Mercurial is their support for keyword expansion, and the way SVN handles that on a per-file basis is really cool. Another cool thing is the possibility of checking out single subdirectories of a repository.

SVN has really good documentation in the Subversion book.

Because of the centralized nature of SVN I dislike the system in general, but if I had to use a centralized system my choice would definitely be SVN.

Merging two non-related repositories on SVN

Merging on SVN involves the server, which I don't like.

On the server:

# svnadmin dump /subversion2 > /subversion2_dump.txt
# svnadmin load /subversion1 < /subversion2_dump.txt

On the client:

$ rm -rf subversion*
$ svn checkout svn+ssh://webserver/subversion1

Very simple and easy, but I just don't like having to deal with the server. The need for server interaction stems from the centralized system design.

Mercurial (hg)

Mercurial is written in Python, with a binary diff implementation written in C. All Mercurial commands begin with hg, a reference to the chemical symbol for mercury. Mercurial's major goals include high performance and scalability; serverless, fully distributed collaborative development; robust handling of both plain text and binary files; and advanced branching and merging capabilities, while remaining conceptually simple. It includes an integrated web interface.

The creator and lead developer of Mercurial is Matt Mackall. The full source code is available under the terms of the GNU General Public License, making Mercurial free software. Like Git and Monotone, Mercurial uses SHA-1 hashes to identify revisions.

A lot of great projects use Mercurial, among them Sun, FreeBSD kernel development, the Mozilla Foundation, Google Code and many more.

Installation is again easy on Debian:

# apt-get install mercurial

Then some initial setup:

$ mkdir /my_project
$ cd /my_project
$ hg init
$ touch my_file.txt
$ hg addremove
$ hg commit

Cloning a repository via ssh:

$ hg clone ssh://myserver//hg

Mercurial is a charm to work with. It is very fast, it is very well documented, and it is very user friendly.

Update September 2010: I ended up using Mercurial for several years before I eventually changed to Git. Both Git and Mercurial are really great revision control systems, but Git does have some advantages in my opinion. One such advantage is that it is much more powerful.

In all the different tests I did, including merging two completely independent repositories, Mercurial was the easiest and most logical system to use.

The only thing I don't like about Mercurial is that it keeps your personal configuration file ~/.hgrc outside of the repository. The file doesn't contain any vital stuff and it's perfectly possible to work without ever touching it, but once you have set up the file, you don't want to lose it, especially if you work with keyword expansion on a per-file basis, because each filename has to go into that file. By contrast, Git keeps the setup file .git/config inside the repository.
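The difference is easy to see on the Git side: repository-local settings land in .git/config, which travels with the repository. The path and value below are made up:

```shell
# Repository-local Git configuration is stored inside the repository itself
rm -rf /tmp/cfg-demo
git init --quiet /tmp/cfg-demo && cd /tmp/cfg-demo
git config user.name "Foo"          # no --global: written to .git/config
grep -A1 '\[user\]' .git/config     # shows the name setting inside the repo
```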

Mercurial supports personal setup of keywords, which means that I can have both the keyword "Date" and "Dato" do the same thing ("Dato" is Danish for "Date"). If you need keyword expansion in Mercurial you have to enable the keyword extension and set things up right. It's quite easy, but I still think that this function should be natively supported.

Mercurial generally makes no assumptions about file contents, so most things in Mercurial work fine with any type of file. The exceptions are commands like diff, export, and annotate, which work well on files intended to be read by humans, and merge, where processing binary files makes very little sense at all.

The question naturally arises: what is a binary file anyway? It turns out there's really no good answer to this question, so Mercurial uses the same heuristic that programs like diff use: the test is simply whether there are any NUL bytes in the file. For diff, export, and annotate, this gets things right almost all of the time, and Mercurial will not attempt to process files it thinks are binary. If necessary, you can force these commands to treat files as text with:

$ hg diff -a
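The NUL-byte heuristic itself is trivial to sketch in shell; the file names and contents below are made up:

```shell
# A file is treated as binary if it contains at least one NUL (0x00) byte
printf 'plain text\n' > /tmp/sample.txt
printf 'BIN\000DATA' > /tmp/sample.bin       # embed a NUL byte
for f in /tmp/sample.txt /tmp/sample.bin; do
  # od prints each byte in hex; a NUL shows up as " 00"
  if od -An -tx1 "$f" | grep -q ' 00'; then
    echo "$f: binary"
  else
    echo "$f: text"
  fi
done
```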

Merging is another matter. The actual merging of individual files in Mercurial is handled entirely by external programs, and Mercurial doesn't pretend to tell these programs what files they can and cannot merge. The example mergetools.hgrc currently makes no attempt to do anything special for various file types, but it could easily be extended to do so. Precisely what you would want to do with these files will depend on the specific file type and your project's needs. If you need to merge binaries, you need a tool which can manage binary merges. Joachim Eibl's kdiff3 ships a Qt4 version that recognizes binary files; pressing "cancel" and "do not save" leaves you with the version of the file currently in the file system.

I haven't found anything to complain about with Mercurial.

I ended up with a .hgrc that looked something like this:

[ui]
username = Foo

[extensions]
hgk = /usr/share/python-support/mercurial/hgext/
keyword = /usr/share/python-support/mercurial/hgext/

[keyword]
**.php =
**.txt =
**.cpp =
**.java =
**.css =

[keywordmaps]
RCSFile = {file|basename},v
Author = {author|email}
Forfatter = {author|email}
Header = {root}/{file},v {node|short} {date|utcdate} {author|user}
Source = {root}/{file},v
Date = {date|utcdate}
Dato = {date|utcdate}
Id = {file|basename},v {node|short} {date|utcdate} {author|user}
Revision = {node|short}

Merging two non-related repositories on Mercurial

On the client:

$ cd mercurial-first
$ hg pull --force ssh://webserver//mercurial-second


pulling from ssh://webserver//mercurial-second
searching for changes
warning: repository is unrelated
adding changesets
adding manifests
adding file changes
added 1 changesets with 2 changes to 2 files (+1 heads)
(run 'hg heads' to see heads, 'hg merge' to merge)


$ hg merge
2 files updated, 0 files merged, 0 files removed, 0 files unresolved

Success! If you run the hg log command, the log contains both repositories' entries, as if they had belonged together from the beginning.

Conclusion

The winners list looks like this:

  1. Mercurial
  2. Git
  3. SVN
  4. BZR
  5. Darcs

Update September 2010: As mentioned, I eventually switched over to Git.

I know this article hasn't been an in-depth analysis of every aspect of the different systems, and I haven't really elaborated on the details of each test I did, but I still wanted to write about it, at least to be able to come back and read this once I (maybe) have forgotten why I chose Mercurial.

My choice of Mercurial was more a personal matter of "taste" than a matter of technicality. Any of the systems could fulfill my needs perfectly, especially since I am used to working with CVS. Every single one of the systems is better than CVS.

I liked Mercurial, Git, and SVN, but Mercurial is just nicer to work with - IMHO.