
2007-06-10

Distributed versus centralized version control systems

Filed under: Programming,Software — rg3 @ 21:55

Version control systems, sometimes called revision control systems or source code management systems, are programs that let you track changes made to a set of files, usually plain text. They are mainly used to track changes in the source code of programs, but they may be used for other purposes. They’re very useful and, if you’re a programmer and don’t use one, you should consider starting. It doesn’t matter if you work alone or with more people, or if your project is very small or very big. A version control system will be helpful in the vast majority of cases, because it works like a time machine: you can go back to any earlier state of your files.

Nowadays there are several version control systems to choose from. Most of them fall into one of two categories, distributed or centralized, which differ in how you are expected to work, make changes and publish those changes. Still, they have some points in common.

Common concepts

There’s usually a repository, a place in which your changes are recorded. By accessing it, you can view a log of the changes or recover previous versions, or revisions, of the source code. New revisions are created when you commit a group of changes you’ve made. Let’s suppose you’re adding a function to a C program. You may add the prototype to a header file and the function definition to another source file. After that, you commit your changes and a new revision is created. There’s a revision before the function was added, and there’s a revision after the function was added. Together with the changes, most systems store the current time and date, the author of those changes and a log message you provide, which typically has a short, descriptive summary line and, optionally, a body explaining the changes in more detail.
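To make these concepts concrete, here is a minimal sketch in Python of what a repository records on each commit. It is not how any real system stores its data: the Revision and Repository classes, the file names and the author name are all made up for illustration, and each revision simply keeps a full copy of the files.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional, Tuple


@dataclass
class Revision:
    """One recorded state of the project (hypothetical structure)."""
    number: int
    author: str
    date: datetime
    message: str
    files: Dict[str, str]  # path -> full file contents at this revision

    @property
    def summary(self) -> str:
        # The first line of the log message is the short summary.
        lines = self.message.splitlines()
        return lines[0] if lines else ""


class Repository:
    """A toy repository that stores a full snapshot per revision."""

    def __init__(self) -> None:
        self.revisions: List[Revision] = []

    def commit(self, files: Dict[str, str], author: str, message: str,
               date: Optional[datetime] = None) -> Revision:
        revision = Revision(
            number=len(self.revisions) + 1,
            author=author,
            date=date or datetime.now(timezone.utc),
            message=message,
            files=dict(files),  # copy, so later edits don't alter history
        )
        self.revisions.append(revision)
        return revision

    def log(self) -> List[Tuple[int, str, str]]:
        return [(r.number, r.author, r.summary) for r in self.revisions]

    def checkout(self, number: int) -> Dict[str, str]:
        # Recover the files exactly as they were at a given revision.
        return dict(self.revisions[number - 1].files)


repo = Repository()
repo.commit({"util.h": "", "util.c": ""}, "alice", "Initial import")
repo.commit(
    {"util.h": "int add(int a, int b);\n",
     "util.c": "int add(int a, int b) { return a + b; }\n"},
    "alice",
    "Add the add() function\n\nPrototype in util.h, definition in util.c.",
)

for number, author, summary in repo.log():
    print(number, author, summary)
```

Checking out revision 1 gives back the files as they were before the function was added, which is the “time machine” part.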

Centralized

The two most famous centralized version control systems are CVS and Subversion. They are called centralized because all the collaborators of a project work against one central repository. I have only used Subversion, because I’m relatively new to version control and, when I started using these systems, Subversion was already receiving positive reviews and was beginning to replace CVS in many projects, most notably KDE. Recently, Subversion replaced CVS as the version control system offered at SourceForge.net.

Like I said, everybody works against a central repository, and this repository is located in a different directory from the one holding the files you work with, which is called the “working directory”. The repository can be located on the same machine or on a different machine accessed by SSH, by a system-specific protocol or by other means. The repository must be available every time you want to commit your changes. Who can commit changes to a repository? In the simplest case, anyone who has write access to the repository directories, either because they own them or because their user group can write to them. Or maybe there’s an authentication mechanism in place and you need to provide a username and password. For example, Subversion can use its own server (svnserve) to give access to a repository over the network and it’s easily configured, using a text file, to require authentication to commit changes (file conf/passwd inside the repository).

You are expected to make changes and commit them to the repository, solving any conflicts that arise. Conflicts happen when, while you were changing a file, someone else committed a change to that same file and the program can’t automatically combine both sets of changes. After every commit or group of commits, you must run a command to update your working directory and bring in the changes made by others.
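The situation described above can be sketched in Python as a toy central repository that refuses a commit when the file was changed by someone else after your last update. This is a simplified model under my own assumptions, not Subversion’s real behaviour or API: the CentralRepo class, the OutOfDate exception and the file contents are hypothetical, and no automatic merging is attempted.

```python
from typing import Dict, List, Tuple


class OutOfDate(Exception):
    """Raised when a commit is based on an outdated copy of a file."""


class CentralRepo:
    def __init__(self) -> None:
        self.head = 0                          # latest revision number
        self.contents: Dict[str, str] = {}     # path -> current contents
        self.last_changed: Dict[str, int] = {} # path -> revision of last change

    def commit(self, base_revision: int, changes: Dict[str, str]) -> int:
        # Reject the commit if any touched file changed after base_revision.
        stale: List[str] = sorted(
            path for path in changes
            if self.last_changed.get(path, 0) > base_revision
        )
        if stale:
            raise OutOfDate(", ".join(stale))
        self.head += 1
        for path, text in changes.items():
            self.contents[path] = text
            self.last_changed[path] = self.head
        return self.head

    def update(self) -> Tuple[int, Dict[str, str]]:
        # Bring a working directory up to date with the latest revision.
        return self.head, dict(self.contents)


repo = CentralRepo()
repo.commit(0, {"main.c": "original"})         # revision 1

# Two developers both start from revision 1.
repo.commit(1, {"main.c": "first change"})     # revision 2 is accepted

try:
    repo.commit(1, {"main.c": "second change"})
    rejected = False
except OutOfDate:
    rejected = True                            # the copy was out of date

# The second developer updates, resolves the conflict by hand and retries.
base, files = repo.update()
final = repo.commit(base, {"main.c": "first change + second change"})
print(rejected, base, final)
```

The important detail is that the conflict is detected against the central repository, which is why it has to be reachable whenever you commit.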

That’s the working routine: modify-commit-update. In my humble opinion, this adapts very well to enterprise-like situations, where a group of developers in a flat hierarchy are working on a project and, with some exceptions, each developer is working on a different thing. In these situations, it’s unusual to have different project branches so there’s almost no need for branch merges (a weak point of Subversion according to many experts) and there’s almost no bureaucracy because everybody trusts everybody and, by having commit privileges to the repository, nobody needs to approve your changes. Everybody is committing changes and receiving changes all the time, via the central repository.

Distributed

There are several distributed version control systems. Probably the most famous ones are git, Mercurial and Monotone, but there are others. Some time ago I had a look at git and Mercurial and chose Mercurial. I’m no expert, but git seemed more complicated and less well documented and, among other factors, the OpenSolaris developers had chosen Mercurial over git.

With these systems there’s no central repository. Everybody has one or more, and there’s almost no distinction between working copies and repositories. To be more specific, each working copy has its own repository. The directory you’re working in at a given moment holds both the source files and the repository. You make changes and commit them, creating a new revision or, as Mercurial calls it, a changeset. Usually, there’s somebody who manages and owns the “official” repository. For example, Linus Torvalds manages Linux kernel development and everybody tries to get their changes into his repository, because that’s the official kernel.

The mechanism to distribute changes is different from that of centralized systems. Changesets are pushed to or pulled from repositories. If somebody has given you push privileges, you can push your changesets into their tree. More frequently, I think, you ask somebody to pull changes from your repository. This is what Torvalds does. He frequently pulls changes from people he trusts.

Let’s suppose you are working with a repository in which the most recent changeset or revision is number 1. You start working on a new feature and commit changesets A, B and C. Your project’s history is 1-A-B-C. Somebody you work with started that day from revision 1, like you did, and committed changes D and E. Their project history is 1-D-E. Then, they ask you to pull their changes. The common practice is to clone or copy your repository with 1-A-B-C (just in case problems arise) and pull from them into the copy. When doing this, you end up with two branches, both starting at revision 1. You then merge both branches, maybe resolving conflicts along the way, and end up with revision 2, which joins the two branches and combines their changes. If everything goes well, you can push everything, including the merge, to the “official” repository and everybody should pull from it. That’s the working routine: modify-commit-etc-pull-merge.
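The history in this example can be sketched as a small graph in Python. This is only an illustration under simple assumptions: each changeset is just a name mapped to the names of its parents, and the pull, heads, common_ancestors and merge functions are hypothetical helpers, not Mercurial’s actual commands or API. Conflict resolution in file contents is left out entirely.

```python
from typing import Dict, List, Set

History = Dict[str, List[str]]  # changeset name -> names of its parents


def pull(mine: History, theirs: History) -> History:
    """Add every changeset we don't have yet; existing ones are untouched."""
    result = {name: list(parents) for name, parents in mine.items()}
    for name, parents in theirs.items():
        result.setdefault(name, list(parents))
    return result


def heads(history: History) -> List[str]:
    """Changesets with no children, i.e. the tip of each branch."""
    with_children = {p for parents in history.values() for p in parents}
    return sorted(name for name in history if name not in with_children)


def ancestors(history: History, name: str) -> Set[str]:
    """The changeset itself plus everything reachable through its parents."""
    seen: Set[str] = set()
    pending = [name]
    while pending:
        current = pending.pop()
        if current not in seen:
            seen.add(current)
            pending.extend(history.get(current, []))
    return seen


def common_ancestors(history: History, a: str, b: str) -> Set[str]:
    return ancestors(history, a) & ancestors(history, b)


def merge(history: History, a: str, b: str, name: str) -> History:
    """Record a merge changeset that has both branch tips as parents."""
    result = {n: list(p) for n, p in history.items()}
    result[name] = [a, b]
    return result


mine = {"1": [], "A": ["1"], "B": ["A"], "C": ["B"]}    # 1-A-B-C
theirs = {"1": [], "D": ["1"], "E": ["D"]}              # 1-D-E

pulled = pull(mine, theirs)
branch_tips = heads(pulled)                 # two branches after the pull
shared = common_ancestors(pulled, "C", "E") # where the branches diverged
merged = merge(pulled, "C", "E", "2")

print(branch_tips, shared, heads(merged))
```

After the pull there are two heads, C and E, whose only common ancestor is revision 1. The merge changeset has both of them as parents, so the history has a single head again.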

The advantages of this scheme should be obvious. First, this system scales much better with the number of people. It should also work better when there’s a hierarchy. Branching and merging fit much better into this model. Finally, I think it’s simpler when you work alone or with only a few people and setting up a shared repository is complicated or not possible. I remember working at college with a good friend of mine (who I think reads this blog. Hi, Álvaro!). We had to create a PHP website, and sending each other our changes was a somewhat chaotic process. After the initial days we got used to it and coped with the situation, but it would have been much easier if Mercurial had existed back then. One of us would have had the “official” version and we would have emailed each other the changesets. I remember that changes to the same files were exasperating because we had to do the merges by hand. Tools like KDiff3 that exist nowadays would have automated most of this process, and Mercurial has a nice feature to bundle one or more changesets in a platform-independent file that can be sent by email. Each of us worked at home and we had no spare machines, time or energy to set up a CVS repository. A distributed version control system would have been the ideal solution in our situation.

Torvalds’ controversy

Recently there was some controversy over this topic because Linus Torvalds talked about it in a meeting and expressed his preference for distributed version control over centralized systems, dismissing Subversion as a dumb idea. I think he always creates controversies because he has strong opinions and uses strong words to voice them. Still, some people were very upset. He argued two things. First, that a distributed version control system scales much better with the number of people. I won’t dispute that. I think it’s clearly true. However, he also argued that a distributed solution removes a lot of bureaucracy, and I think that’s not true in many situations. It’s true that, in a project with a moderately high number of people or contributors, who gets commit privileges to a central repository is always a matter of discussion, causes problems and involves too much politics and bureaucracy. On the other hand, in an enterprise-like situation it’s obvious who has commit privileges: the ones working on the project. You set up privileges at the beginning and never discuss them again. Bureaucracy over. So I don’t think we should discard centralized systems at all. In an open source project with an enterprise-like organization, with no clear leader and in which 99% of the contributions come from a core group of developers, those developers are the ones with commit privileges. For the other 1% of changes, let the contributors mail you patches. In my opinion, it depends on the project, but many times you can use a centralized system and have less bureaucracy, because you don’t have to be pushing or pulling all the time, you don’t need to review other people’s changes if you don’t want to, etc.

Finally, I think people can adapt to situations. If you foresee that Subversion is going to be a better option, use Subversion. People will adapt to it with no problems. If you foresee a distributed solution is going to be better, use it. The safest and most flexible option is a distributed system, in my humble opinion. It’s the safe bet, it’s the one I use more nowadays, but not the only one and not always the best one.

2007-06-02

SlackRoll has been released

Filed under: Software — rg3 @ 12:13

Yesterday I published version 1 of SlackRoll, a package or update manager for Slackware Linux. For those who don’t know Slackware very well, Slackware is maintained mainly by a single individual, a man called Patrick Volkerding. It stands out for being a very simple distribution (internally, not simple to use), quite stable and having a very simple package manager (called pkgtools) that doesn’t track any sort of dependencies and doesn’t know about remote repositories. The user needs to download packages by hand as patches or new versions are published, and use the pkgtools to install or upgrade those packages. This simple system is robust in the sense that it’s very hard to break and is unlikely to fail and leave your system in an unusable state. However, the lack of a way to automatically download updates and new packages (among other things) has prompted the appearance of some tools to do this job, which sit on top of the classic pkgtools. The most famous ones are swaret, slapt-get and slackpkg, the last one being distributed as part of the official Slackware, inside the extra directory. I have used all three at different points in time.

Slackware is a very flexible distribution and leaves room for the user to decide how to manage their system. Some users think the limited number of official packages and the lack of dependency checking are a problem, and like to download packages from the user-driven site linuxpackages.net, which sometimes provides dependency information that slapt-get can use. Other users think the official package selection is quite complete and compile for themselves the few packages they need that are not present in the official tree. Sites like slackbuilds.org let you download many so-called SlackBuilds, shell scripts that extract, compile and create a Slackware package from official package sources if you don’t know how to create them yourself. For this second group of users, the semi-official slackpkg is a good option: it lets you download packages automatically and detects new packages, packages that have been removed and updates. It also has a mechanism to blacklist packages so they are not upgraded normally. That’s very useful to avoid disasters when upgrading important packages and to let you have custom versions of official packages.

I consider myself to be in that second group. My system mainly has official packages and a handful of unofficial ones I compile myself. However, while I think slackpkg is a very good tool, I was not fully satisfied with it. In particular, I thought it was a bit slow (it’s a big shell script), I didn’t like its interface (it uses dialog for some operations, lets you see the output of wget when it downloads… in other words, it’s not very uniform) and I also didn’t really like the blacklisting mechanism because removing items from the blacklist had to be done by hand. In my state of partial dissatisfaction I started thinking about creating my own package manager, along the lines of slackpkg but fixing its “problems”. One day I had so many ideas in my head that I decided to get them out and put them on paper. I grabbed a pen and decided how everything could be done. I decided which package states I needed, I came to the conclusion that I needed three package lists (local, remote and persistent) and immediately saw that the main problem was creating the algorithm to update the persistent list/database. I thought about it and put the algorithm on paper. It hasn’t been modified since then. The main work took two afternoons and evenings. Since then I’ve fixed some bugs, added some commands and changed small aspects of it from time to time, but without adding significant development time. It can still be considered a two-day job.

The end result is a Python script that reflects my view on Slackware package management. It has a uniform interface, works fast enough for me (local operations take less than one second on my system), lets you download and install packages and updates automatically and is, I think, quite good at showing you which packages have been added to or removed from the official tree. Unlike slapt-get and sometimes swaret, it’s able to detect reverts to previous versions properly. Such reverts happen from time to time in the rolling tree when a new version introduces too many problems.
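As a rough illustration of the kind of comparison involved, here is a minimal Python sketch that compares a local package list with a remote one. It is not slackroll’s actual algorithm, which also involves a persistent list and is not reproduced here: the compare_trees function, the package names and the version numbers are all hypothetical. The one point it tries to show is that treating any version difference as a change, instead of only a “newer” version, is enough to catch reverts.

```python
from typing import Dict, List, Tuple


def compare_trees(local: Dict[str, str],
                  remote: Dict[str, str]) -> Tuple[List[str], List[str], List[str]]:
    """Compare installed packages with the remote tree.

    Both arguments map a package name to its version string.
    Returns three sorted lists: added, removed and changed packages.
    """
    added = sorted(name for name in remote if name not in local)
    removed = sorted(name for name in local if name not in remote)
    # Any difference counts, so a revert to an older version is detected
    # without having to decide which version string is "greater".
    changed = sorted(
        name for name in remote
        if name in local and remote[name] != local[name]
    )
    return added, removed, changed


# Made-up package names and versions, for illustration only.
local = {"libfoo": "2.0.1", "bartool": "1.4", "oldlib": "1.0"}
remote = {"libfoo": "2.0.0", "bartool": "1.4", "newtool": "0.5"}

added, removed, changed = compare_trees(local, remote)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```

Here libfoo shows up as changed even though the remote version is older than the installed one, which is exactly the revert case.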

Its name stems from the fact that I initially designed the program to work with the rolling Slackware tree (slackware-current), but after it was finished I thought it could be easily adapted to work with the stable tree with some trivial changes. I kept the name, however. I also want to thank Patrick Volkerding for maintaining Slackware and for kindly answering a batch of questions I sent him. Remember, if your Slackware system has mainly official packages and a few unofficial ones, please give slackroll a try. You may like the way it works. For me, more users mean more eyes to spot bugs and more suggestions to drive slackroll to perfection. All the relevant information and instructions can be found on its web page.

Happy Slackin’!
