Martin Pool's blog

Tom Lord interview, and related things

Interview with Tom Lord, designer of Arch. Slashdot, LWN coverage.

To be brief and a bit brutal: Arch is very clever in many ways. However, Tom is way too aggressive as an advocate. Arch might scale up to large projects, but it doesn't scale down very well to beginning users on small projects. It's complex to get started, and I'm worried by signs that work is going into adding more complex features rather than reducing that complexity. Although you can make it very fast, that's not the default.

Earlier versions were very much bound into projects being run the way Tom wanted them: weird file conventions, only committing from clean trees, and so on. It's fine to suggest them, but trying to force them on people at the same time as they learn a new system is not a good idea. Tool designers need to know where they want to force change, and where they want comfortable familiarity.

I hope these issues are fixed. Arch is probably the most promising large-project version control system at the moment, but it really needs to get over the usability hump to realize its full potential. I feel they have about a 75% chance of getting there in the next one or two years.

One remarkable thing about the LWN page is that Larry McVoy confirms that BitMover refused to sell a BitKeeper licence to the employer of a person involved with free version control products. It's his right to refuse to sell, or to revoke a revocable licence, but this is a risk that needs to be considered.

Arch vs tla

Google asked me: what is the difference between arch and tla?

The short answer is that they are two names for the same thing. The project was originally conceived of as arch: I suppose the idea of an arch connotes elegance, and it has an r-c sound to suggest revision control.

However arch is already in use as a command on Unix: it prints the machine architecture (e.g. i686). It's kind of a waste of a word, but nevertheless it exists and is depended upon by some scripts. So the program can't be actually called arch. For a while it was called larch, and there were forks with different command names. Some people say that Arch is the design and tla is the implementation.

Now it has settled on tla, which is either Tom Lord's Arch, three letter acronym, or doesn't stand for anything at all.

The short story is that Arch and tla are interchangeable when talking to people, but for computers you need to spell it tla.

Loss of a server in Arch and Darcs

I wrote a while ago on some things I think are less than perfect in Arch.

I think the one that bugs me most is that branches are bound to a particular location, rather than being purely distributed. (I use the word branch here for the comfort of a general audience; in Arch they would strictly be called versions which I think is a bit misleading.) I want to try to explain this a bit more.

The machine hosting sourcefrog.net crashed because of hardware problems the other week and was offline for a couple of days. I wanted to work on two projects which are hosted there, librsync and distcc. Because I am a version-control gourmet, distcc is in Arch and librsync is stored in Darcs.

Because sourcefrog is quite close to where I live, I normally work directly against its repository from Arch. I would have the choice of making downstream repositories on each machine I work on, but that would introduce a lot of "noise" merges every time I moved code from those machines onto sourcefrog. Since there's only one distcc branch, and I'm the only person who commits, I'd rather just work directly to that branch.

A consequence of this is that when sourcefrog is down, I can't commit or update at all. I am stuck.

Or almost stuck. In fact, I can cheat: make an archive on my laptop and a new branch in that archive, and commit from my working copy onto that branch. When the main machine is back up, I can merge from my branch back to sourcefrog.

This is pretty neat. I don't think I could easily do it in either Subversion or CVS. With those systems, I'd probably keep hacking and just make one big commit at the end. (Which is not really such a bad thing, but not ideal.) At best, I could keep snapshots of the tree at different points and commit each one by hand as a separate patch.

On the other hand, what I did is not documented, and I'm not sure it's entirely kosher. It does require a certain amount of understanding Arch internals and fiddling to get the merge to work back. It is a testament to the elegance and flexibility of the Arch design that it's possible to use it in this unintended way.

By contrast, in Darcs having your server go down makes no difference at all, except that you can't publish to that particular server. Because everything is always committed locally and then pushed up, the natural way of working means there's little dependency on anything but the local machine. All of this doesn't leave any major permanent record, because revision names don't depend on the machine to which they were originally committed. With the server offline you can make changes, record them, roll them back, and make branches. If the machine's going to be down for a while you can start committing to a different server, or email your changesets to someone else.

You can do this in Arch but it's more natural in Darcs.

I think at the moment I would compare them like this:

Arch has a lot of structure and metadata to let you see the history of every changeset and to organize large trees. That might be good for very large projects. It's good for small projects, though the sheer complexity can be a disincentive.

Darcs is much simpler. I think you can show someone all they need to know in ten minutes. It's naturally very distributed. I rarely or never need to wait for network traffic.

What's wrong with Arch?

[comments welcome]

I gave a talk about new version-control systems the other week at our LUG. Tridge challenged me with "ok, so what's wrong with Arch?" I think it's important to see the bad points in whatever you're advocating. Distributed version control is pretty new, and even the stable systems are themselves an experiment. The differences between competing systems are not just accidents of implementation, but also fundamentally different ideas about what software version control means, and how it should be done.

So: what's wrong with Arch? I like it quite a lot, but I'm going to put that aside and, just for this article, look for problems.

There is an elegant underlying simplicity to Arch, but it is expressed in a complex way: there are many commands printed by tla --help and that can confuse the novice user. It's actually possible to get by with a reasonably small subset, but the tutorial does not make that very clear.

Many of the commands expose lower-level operations that might be useful in writing scripts or fixing problems. For example, tla sync-tree lets you tell arch "pretend I've merged these patches", without actually merging the text, a little cheat which can be useful in resolving some merges. Exposing atomic operations is an admirable goal; more programs should do it. But perhaps splitting them out into separate programs would make it easier to understand. I think to some extent this is driven by Tom's expressed desire for Arch to become a platform for consulting work, rather than primarily something people can just install and use. (Perhaps the project is moving back from that position now.)

Perhaps this will make Arch a more desirable option for larger projects which want to do more complex operations.

It seems bizarre that despite all these commands there are some glaring gaps. For example, there is no single command to revert a file to its previous state. It is suggested instead that one get the diff and apply it through patch --reverse, or that one copy it from the pristine previous version. Both of these work, certainly, and they can be scripted, but it's puzzling that they're not built in.
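The second workaround, copying from the pristine previous version, is simple enough to script. Here is a minimal sketch in Python, assuming you already have a pristine copy of the tree at some known path; the function name and the directory layout are my own illustration, not anything tla provides.

```python
import shutil
from pathlib import Path


def revert_file(working_tree, pristine_tree, relative_path):
    """Overwrite one working file with its copy from a pristine tree.

    Both tree paths are assumptions for this sketch: the caller has to
    know where the unmodified copy of the last revision lives.
    """
    source = Path(pristine_tree) / relative_path
    target = Path(working_tree) / relative_path
    if not source.is_file():
        raise FileNotFoundError(f"no pristine copy of {relative_path}")
    # Recreate the parent directory in case it was deleted locally.
    target.parent.mkdir(parents=True, exist_ok=True)
    # copy2 also restores the timestamps and permission bits.
    shutil.copy2(source, target)
    return target
```

Unlike tla undo, this throws the local changes away rather than saving them, so it is only a stand-in for a real revert command.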

Another gaping hole is that there is no command to find just the changesets that touched a particular file. Accommodating renames makes this slightly harder than in CVS, but only very slightly, since the file has a persistent ID. I often do svn log CPU.cpp, but on Arch I have to do without. Darcs can do this too, with darcs changes CPU.cpp.
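To show why the persistent ID makes this only slightly harder, here is a rough sketch in Python. The Changeset class and its map from file id to path are a hypothetical model made up for illustration; this is not Arch's on-disk format or any tla interface.

```python
from dataclasses import dataclass


@dataclass
class Changeset:
    name: str
    # Hypothetical model: file id -> path the file had in this changeset.
    touched: dict


def changesets_touching(history, current_path):
    """Return names of changesets that touched the file now at current_path.

    The path is resolved to a file id using the newest changeset that
    mentions it; matching is then done by id, so renames don't hide
    earlier history.
    """
    file_id = None
    for changeset in reversed(history):
        for fid, path in changeset.touched.items():
            if path == current_path:
                file_id = fid
                break
        if file_id is not None:
            break
    if file_id is None:
        return []
    return [cs.name for cs in history if file_id in cs.touched]


history = [
    Changeset("patch-1", {"id-cpu": "cpu.cpp", "id-main": "main.cpp"}),
    Changeset("patch-2", {"id-main": "main.cpp"}),
    Changeset("patch-3", {"id-cpu": "CPU.cpp"}),  # renamed, same id
]
print(changesets_touching(history, "CPU.cpp"))  # ['patch-1', 'patch-3']
```

The only extra step compared with a name-based log is the lookup from current path to id.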

On the other hand there is an excellent multi-level tla undo, which saves the removed changes so that they can be put back with tla redo if you change your mind.

In general Arch is prone to "there's more than one way to do it", which can be both good and bad. For example it handles renamed files very well, by associating a file id that remains constant for the life of the file even if it is renamed. This allows Arch to correctly merge changes across renames, something notably lacking (last I looked) from Subversion. File ids are a fine and elegant design. However, the implementation is complex and confusing: the id can be stored in an external file, can be derived from the name, or can be stored in the file in either of two different syntaxes. You can mix these methods within a single tree, and can customize to some extent the rules on which one is used. I guess you can make a case for any particular case being useful, but the end result is complex and hard to understand. Choosing only one method might not have hurt too much, and might have simplified the system.

Another area where Arch can be criticized for too much choice is in handling non-versioned files. Most vc systems have to accommodate files which exist in the source directory but that should not be versioned. The classic example is *.o files. CVS handles this fairly simply with a list of patterns in .cvsignore. Fine.

Arch allows you to classify files using regexps into Source, Junk, Precious and Backup. Each class is treated slightly differently, but personally I am never sure if my .o files are more accurately Junk or Precious. I suppose there are cases where the distinction would be useful, but again I wonder if it would not have been simpler to simply follow CVS in saying *.o is ignored. Leave it up to the user to decide which files ought to be automatically deleted and when. Being able to customize it to have simple behaviour is not as good as just being simple.
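To make the comparison concrete, here is a small Python sketch of the two approaches side by side. The regexps and their order are hypothetical, chosen only to illustrate the idea; they are not Arch's actual default rules, and the real classification has more to it than first-match-wins.

```python
import fnmatch
import re

# Hypothetical regexps for the four classes, checked in order.
RULES = [
    ("backup", re.compile(r".*(~|\.bak|\.orig|\.rej)$")),
    ("junk", re.compile(r"^,.*")),
    ("precious", re.compile(r".*\.(o|a|so)$")),
    ("source", re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.+-]*$")),
]


def classify(filename):
    """Arch-style: put a file name into one of several classes."""
    for label, pattern in RULES:
        if pattern.match(filename):
            return label
    return "unrecognized"


def ignored(filename, patterns=("*.o",)):
    """CVS-style: a file is either ignored or it is not."""
    return any(fnmatch.fnmatchcase(filename, p) for p in patterns)
```

With these made-up rules classify("lift.o") comes out as "precious", but moving one line would make it "junk", which is exactly the decision I am never sure about. The CVS-style function has no such decision to make.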

Some people think it uses too much disk: in some configurations you will have four inodes per source file. (Source, its id, pristine source and pristine source id.) This is pretty much constrained to people working on very large trees on very old hardware, and I don't think it is a general argument against arch. In arch's favour, it can intelligently manage hardlinked trees so that additional working copies are very cheap.

To share your source, Arch depends on having a read-only web server. This is an enormous advance over CVS, which requires a special cvs pserver. On the other hand, it is substantially harder than the current standard method of mailing a patch. I asked a while ago if this could be added, and despite some confusion about how it would be done it looks like it might go in eventually. Darcs has this already, which I count as a major feature.

Florian Weimer collected a long list of design issues, which caused a lively discussion.

Arch has a bit of a fetish about long names: one regularly has to type identifiers like mbp@sourcefrog.net--2004/librsync--callback--0.11. This would be less painful if it were possible to use relative names more often: if I type an incomplete name it could be interpreted relative to wherever I'm standing at the moment. Unfortunately common operations like merging between a local and remote archive require giving a full name. (OK, it's not all that bad if you can copy&paste, or go back through command history. But it's a bit gross that it is necessary.)
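Here is a sketch in Python of what I mean by relative names. It assumes the archive/category--branch--version layout seen in the name above, with an optional trailing revision. The resolve function is purely hypothetical: it shows one way an incomplete name could be filled in from the current location, not something tla does.

```python
from typing import NamedTuple, Optional


class ArchName(NamedTuple):
    archive: str
    category: str
    branch: str
    version: str
    revision: Optional[str] = None


def parse(full_name):
    """Split archive/category--branch--version[--revision] into parts."""
    archive, _, rest = full_name.partition("/")
    if not rest:
        raise ValueError("not a fully qualified name: " + full_name)
    parts = rest.split("--")
    if len(parts) not in (3, 4):
        raise ValueError("expected category--branch--version[--revision]")
    return ArchName(archive, *parts)


def resolve(name, here):
    """Hypothetical: interpret an incomplete name relative to `here`."""
    if "/" in name:
        return parse(name)
    parts = name.split("--")
    if len(parts) > 3:
        raise ValueError("too many components in relative name: " + name)
    # Missing leading components are taken from the current location.
    base = [here.category, here.branch, here.version]
    filled = base[: 3 - len(parts)] + parts
    return ArchName(here.archive, *filled)


here = parse("mbp@sourcefrog.net--2004/librsync--callback--0.11")
print(resolve("0.12", here))
print(resolve("mainline--0.11", here))
```

Standing in librsync--callback--0.11, typing just 0.12 would then mean the next version of the same branch, and mainline--0.11 a sibling branch in the same archive.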

By contrast, Darcs has barely any naming at all: branches are filesystem directories, and identified by their directory name (and hostname, if remote.) You can arrange directories in whatever organization most makes sense to you, and of course give commands like darcs pull ../upstream to move between them.

Finally I have one issue which I think has not been mentioned before, which is a kind of meaning/mechanism mismatch in the way distributed operation works. Arch has excellent support for maintaining and merging between multiple branches. It also has good disconnected support: I can take my laptop to a desert island for a month, hack away, and come back and import all my changes, along with their history. Importantly I can also integrate those changes with whatever has happened while I've been away. So far, so wonderful.

The way I set up to do work on my laptop is to create a new branch, stored in an archive on my laptop. Suppose the main branch is mbp@foo.org--2004/foo--main--0, and on my laptop jolly I have mbp@foo.org--jolly/foo--main--0. This is pretty clean: I can commit to the branch stored on my laptop when I'm offline, and I can merge back into the main branch when I'm online.

The problem is that this mixes mechanism with meaning. I don't want changes done on my laptop to look any different from those done online. I want to only create different branches for different streams of development, not for changes that happen to occur on disconnected machines.

Once the changes have been merged upstream you can still see what was done, but only indirectly: all the individual commits get wrapped up in a single change called something like merge from jolly, unless I manually go through and commit them.

I think this is a bit of a problem. I like the ability to zip up changes from a downstream branch when applying them as a single unit to an upstream branch. But I want to be able to do disconnected work completely orthogonal to which branch I'm working on, and without needing to create new branches.

Darcs never wraps up commits into larger commits, as far as I can tell. All of my commits, once merged upstream, appear as part of the same branch because it doesn't really remember which branch a change was originally made on. That solves the immediate problem. But it does seem like in some projects you really would want to remember the way patches got bundled up...

I don't know if there is a perfect solution. Maybe either of them is good enough. What do you think?

Arch reference card

tla reference card in various formats, plus code to produce it.

Some thoughts on arch security

GNU Arch has some pretty powerful and novel security properties for a version control system.

I have been helping maintain cvs.samba.org for a few years, so perhaps I have a pretty good idea, at least from the perspective of people doing open source or distributed development.

The word "security" means different things to different people. Some organizations, for example, would like to make sure that unauthorized persons don't see the source code, or even that developers who are allowed to see one part can't see another part. Others might want to make sure that any changes which are committed pass all the appropriate reviews and quality checks. I think Arch could probably do a pretty good job in helping with that, but they're not really the facet of security that I want to write about tonight.

What I am concerned with is that in recent years there have been quite a few criminal intrusions into development systems. Somebody tried to get a change into the Linux kernel source through the bitkeeper-cvs gateway. Somebody had a trojan installed on the machine of a senior developer at Valve software. Someone else got part of the Windows source code through compromising a developer's machine. Even if the source is not confidential, unauthorized changes can be enormously disruptive.

CVS and Subversion are both commonly operated by free software projects in this mode: developers have SSH access to the server, and everyone else has anonymous read access.

It was originally planned that Subversion would run as an Apache 2 module using SSL and Apache authentication, so that there would be no need for developers to have local accounts. For various reasons this turned out to be pretty unpopular, and my impression is that almost all free software projects are using svn+ssh.

CVS, Subversion and co require a special server process both for committers and anonymous users. Arch does not: archives can be published just as read-only directories on a web or ftp server. This is a good thing: one less program to worry about, one less listening port. You can use whatever web server you think is least likely to be compromised.

Using SSH as a transport is one of my favourite Unix design patterns, and it is certainly much better than each SCM system inventing its own authentication protocol. But it does have several problems. Firstly, you need to be able to create Unix accounts for contributors. This has been a method of entry for attackers on open source projects before. Administrators can try to limit which commands can run, but there is a risk that contributors might break out of such a jail. On an older project, many dormant contributors may still have shell accounts, and these remain a possibility of intrusion.

Arch doesn't require that any two developers have access to a single system. This isn't just a theoretical possibility; it is the standard way of using it.

One good consequence is that there doesn't need to be any assessment of whether a contributor is "good enough" or "trusted enough" to have commit access. For CVS this is a big deal: someone who has commit access could destroy the whole repository or rewrite history, but people without commit access can't really work well. With Arch, there is no such decision point: anyone can work comfortably without needing to be specially trusted, and every change can be considered on its merits.

Most version control systems, including Arch, present the user model that once revisions are committed, they cannot be changed. The archive is conceptually read-only. However, as far as I know, only Arch makes the revisions physically read-only: each one is a directory containing a few files, such as this one. There is little chance of a later update changing or corrupting any previous work. (I have seen svn need to have its database rebuilt from time to time, but arch never has.) This seems to have several really good properties against either accidental or intentional damage. On a machine that is shared by several developers, one might have a cron script that chowned and chmodded committed revisions as extra protection. Tools like tripwire would immediately pick up any new additions or changes (although changeset signing would probably trap that already.)
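The cron script could be as small as the following Python sketch. It assumes that everything below the given directory is a committed revision that should never change again; the function name and that layout are assumptions of mine, and the chown half is left out because it needs root.

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(root):
    """Clear the write bits on everything below root.

    root itself is left alone so that new revision directories can
    still be added next to the existing ones. Returns the number of
    paths whose mode was changed.
    """
    changed = 0
    # Walk bottom-up so a directory is locked only after its contents.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames + dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # never follow or chmod through symlinks
            mode = stat.S_IMODE(os.lstat(path).st_mode)
            if mode & WRITE_BITS:
                os.chmod(path, mode & ~WRITE_BITS)
                changed += 1
    return changed
```

Running it repeatedly is harmless: the second run finds nothing left to change. It is extra protection against accidents, not against someone who owns the files and can chmod them back.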

Arch stores checksums for each commit, which should trap accidental or hardware damage. These can then be gpg-signed, which should give pretty good assurance that, at least, the changeset came from a developer's machine.
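As an illustration of the idea, here is a minimal Python sketch that records a digest for every file in a revision directory and later reports any that no longer match. The file name "checksums", the line format and the choice of SHA-256 are all assumptions for this sketch; Arch's own checksum files are laid out differently, and the GPG signing step is not shown.

```python
import hashlib
from pathlib import Path


def digest(path):
    """Hex SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def write_checksums(revision_dir):
    """Record a digest for every file in the revision directory."""
    revision_dir = Path(revision_dir)
    lines = []
    for path in sorted(revision_dir.iterdir()):
        if path.is_file() and path.name != "checksums":
            lines.append(f"{digest(path)}  {path.name}\n")
    (revision_dir / "checksums").write_text("".join(lines))


def verify_checksums(revision_dir):
    """Return names of files that are missing or no longer match."""
    revision_dir = Path(revision_dir)
    damaged = []
    for line in (revision_dir / "checksums").read_text().splitlines():
        expected, name = line.split("  ", 1)
        path = revision_dir / name
        if not path.is_file() or digest(path) != expected:
            damaged.append(name)
    return damaged
```

On its own this only traps accidents, since anyone who can alter a file can also rewrite the checksum list. Signing that one small list is what ties the whole revision to a developer's key.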

If the worst happens and a machine is compromised, Arch's distributed design makes it likely that the affected or destroyed archive will be widely mirrored, so the changes can be detected and an older version can be restored.

[draft, more to come]

How Arch Works


Photo from jdub.

Tom Lord wrote a good whitepaper on How Arch Works. I think this answers a lot of questions about why it is the way it is that might be troubling people who have just read the tutorial.

Doing powerful distributed version control with *no* server-side computation is just brilliant, with great results for scalability, security, reliability and simplicity. One idea that good makes a good year.

I should really write something about arch security vs CVS.

arch rocks: mirroring (updated)

There are plenty of good free software developers in the world who don't have big machines on good pipes where they can put their CVS server or downloads. People outside of the US often don't realize just how slow the combination of a modem and intercontinental latency can be. A former Prime Minister called Australia the "arse end of the world" for a reason.

Anyhow, generally what has happened until now is that these developers either find a bigger project like gnome.org or samba.org to host them, or they sign up for something like Sourceforge. Now Sourceforge is a pretty valuable thing, but it has been patchy recently. If that's where your CVS is hosted, you don't have much choice but to just not commit while it's offline.

Another drawback is that if you put CVS on sourceforge, then every time you diff or commit it needs to go all the way to California and back. This is pretty slow. When I did this over a modem, it would take a good fraction of a minute just to diff a reasonably small tree. It is annoying. It grinds you down.

What I really wanted was to have my working repository close by: either on my own disk, or at least in the same city. At the same time, I wanted my public tree to be on a fast machine on a fat pipe.

I suppose I could have kludged it up using cvsup or rsync but they're not completely satisfying solutions.

Finally, GNU Arch solves this, in a truly elegant way. Anyone can mirror a public archive. ("Archive" in Arch ~= "repository", it holds the history of changes.) In fact, several sites such as sourcecontrol.net have set up to just mirror all the open source software they can find.
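Part of why mirroring is so easy is that committed revisions never change, so a mirror only ever has to add what it doesn't yet have. A rough Python sketch of that idea follows, treating an archive as a plain directory with one subdirectory per revision. That layout and the function name are simplifying assumptions; this is not how tla itself does mirroring.

```python
import shutil
from pathlib import Path


def mirror(source, target):
    """Copy revision directories that target lacks; return their names.

    Revisions already present are assumed identical, which only holds
    because committed revisions are never modified.
    """
    source, target = Path(source), Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for revision in sorted(p for p in source.iterdir() if p.is_dir()):
        destination = target / revision.name
        if not destination.exists():
            shutil.copytree(revision, destination)
            copied.append(revision.name)
    return copied
```

Because nothing is ever rewritten, the same function works for pushing a local archive to a public machine and for pulling someone else's archive onto a laptop, and it can be interrupted and rerun.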

(If you want to follow another developer's work closely, you can mirror their archive onto your own machine, and their entire history is available for quick consultation, even when you're offline. Conversely, and unlike Bitkeeper, you are not *required* to keep their full history if you don't want it. If you merely want to download their most recent tree, or the patches to update to the most recent tree, that's what you get.)

Other people mirror just intermittently, as a backup in case a primary archive is lost. Even the humblest programmer can now adopt Linus's backup technique: write good software, and the world will do your backups for you!

What's more, because changesets are strongly GPG-signed, people using the archive can feel sure that they're getting the changes as the original author wrote them, without any accidental or intentional modifications.

Microsoft wrote a while ago that free software development scales up to the size of the internet better than Microsoft's own processes. Arch removes the scalability bottleneck of a single CVS server.

This is a really cool thing.

Arch rocks: working on untrusted machines

Sometimes you need to do some development work on a machine you don't really trust. I seem to most often find myself in this situation when I'm trying to fix a portability problem by working on an exotic machine lent by some kind person. Another case might be that you're at someone else's house or office while travelling and want to do a little work or fix a bug they're experiencing.

One way you can do this for CVS is to copy your SSH key on, or forward SSH authorization from some other machine. Now, from this machine where you're doing development, you can get checkouts and commit your changes back.

But a problem is that this requires giving access to my SSH key to a machine which I don't really trust to not be compromised or have keystroke loggers. I don't really think it's hostile but at the same time I am not completely happy to let it see a key which allows changes to all my repositories, reading my mail, etc.

Under CVS you can get an anonymous checkout, and then perhaps mail yourself the diffs. It's OK. It's tolerable for small changes, but if you want to do more than one commit then it's pretty annoying. I suppose you can work around that by using Quilt along with CVS.

Under Arch, you can just treat yourself on that machine as another untrusted committer with their own archive. Make your changes, including renames and permission changes, and commit them to a local archive. When you're done, put that archive somewhere where it can be read later, or mail the changesets to yourself. Later on, review them and merge them in. At no point does your SSH or GPG key have to go onto the untrusted machine for you to be able to do regular development. I think that's pretty damn cool.

arch wins

Colin, you were right: arch really is beautiful. I think the simplicity of the design borders on genius.

more on arch

more on arch.

Generating complete patches from Arch

One thing that would be good for arch is being able to generate complete diffs that can be sent upstream to maintainers who aren't using Arch. At the moment you can send a changeset as a tarball, which is both the native format and something that can reasonably easily be read by somebody else:

,patch-10/
|-- mod-dirs-index
|-- mod-files-index
|-- modified-only-dir-metadata
|-- new-files-archive
|   |-- src
|   |   |-- lift-conn.h
|   |   `-- sftp-const.h
|   `-- {arch}
|       `-- superlifter
|           `-- superlifter--mainline
|               `-- superlifter--mainline--0.1
|                   `-- mbp@sourcefrog.net--2004-happy
|                       `-- patch-log
|                           `-- patch-10
|-- orig-dirs-index
|-- orig-files-index
|-- original-only-dir-metadata
|-- patches
|   `-- src
|       `-- lift.c.patch
`-- removed-files-archive

Pulling this apart is slightly too much work to ask of somebody who doesn't really care about using arch. If you post a tarball containing patches and newfiles to most development lists, you're likely to be knocked back and asked to just post plain diffs.

It would be nice if you could say tla show-changeset --diffs -N and get pseudo-diffs for added and removed files as well as changed files. Obviously this can't handle renamed files, but renaming files is not something that non-core developers do on this kind of project. (Actually I think, but I haven't checked, that arch would understand moved files in this case if the files had arch-tag unique IDs.) Adding new files is pretty common, so that really needs to be supported.
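To show what such pseudo-diffs look like, here is a small Python sketch using the standard difflib module. It works on two in-memory trees (dicts from path to text), which is an assumption made to keep the example short; it does not read an Arch changeset, and it assumes every file ends with a newline.

```python
import difflib


def complete_diff(old_tree, new_tree):
    """Unified diff between two trees, including added and removed files.

    Added files are diffed against /dev/null and removed files to
    /dev/null, in the style of diff -N. Renames show up as a removal
    plus an addition.
    """
    out = []
    for path in sorted(set(old_tree) | set(new_tree)):
        old = old_tree.get(path)
        new = new_tree.get(path)
        if old == new:
            continue
        out.extend(difflib.unified_diff(
            [] if old is None else old.splitlines(keepends=True),
            [] if new is None else new.splitlines(keepends=True),
            fromfile="/dev/null" if old is None else "a/" + path,
            tofile="/dev/null" if new is None else "b/" + path,
        ))
    return "".join(out)


old = {"src/lift.c": "int main() {\n    return 0;\n}\n"}
new = {
    "src/lift.c": "int main() {\n    return 1;\n}\n",
    "src/lift-conn.h": "#define LIFT_CONN 1\n",
}
print(complete_diff(old, new))
```

The result is one plain text patch that carries the new header as well as the change to lift.c, which is the form most lists expect.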

This would also make the submitter's patch log comments visible as files added under the {arch} directory. I'm not sure if that would be so popular with upstreams not using arch. Perhaps there needs to be an option to exclude them.

I think that if the generated patch was applied to an arch directory, it would correctly update everything including the patch log. This isn't as good, in a sense, as getting the changeset as a tarball -- you might not get permission changes, for example.

But I think being able to push around changes over email is a worthwhile model, and it's good if arch can support it. In particular many of the people currently doing loose distributed development are doing so by mailing patches. Those people are closest to using arch so it's good if the transition can be easier.

advocacy

One of the best bits of lca has been Robert Collins explaining difficult bits of arch. I'm much more convinced that it's a cool thing.

How to get diffs in Arch

Robert Collins replied to my grumbles about arch:

"what changed since patch-32" are presumably supported but I can't work out a good way to get it out.

assuming we are working with example@example.org/example--devo--1.0

tla get example@example.org/example--devo--1.0 example
cd example
tla changes example@example.org/example--devo--1.0--patch-32

For example, tla what-changed --diffs leaves temporary file droppings in the current directory

That was a bug - AFAIK it no longer occurs (using latest tla).

There is no built-in command to find out what changed between two revisions, even though this is probably one of the most-used commands for a version control system.

tla get example@example.org/example--devo--1.0 example
cd example
tla changes <revision to compare against>

Of course, you may already have figured this out - if you have, sorry for telling ya to suck eggs ;)

I hadn't figured it out. In my defense I think some of those were not there, or at least not reasonably documented, in the version that I tried.

Colin Walters on Arch

Colin writes approvingly of Arch.

I recently switched to using the Arch revision control system for Rhythmbox, and I am incredibly happy with it now. I've already had one person who had no account on the GNOME CVS make a local branch of my tree (without write access to it either), start hacking on it, and then I was able to star-merge their changes back in easily. We've actually merged since several times as he made more bugfixes. There have been no conflicts from repeatedly applied patches. Great stuff.

It looks good to me. I just wish there were more documentation, but I suppose that will come in time. In the meantime, he has an IRC log with some description of how to use it.

aj on arch

aj is also looking at Arch.

To clarify my earlier comments: I think some of the problems with Arch at the moment are to do with the documentation not being up to date, but some are real shortcomings in the program. For example, tla what-changed --diffs leaves temporary file droppings in the current directory. Being able to get the changeset in structured form is nice, but leaving mess behind when I just look at the diffs is not. There is no built-in command to find out what changed between two revisions, even though this is probably one of the most-used commands for a version control system. (It's fine that this is not a fundamental operation and it's very clean that it's built on top of diffing trees, but there ought to be an easy interface for common operations.) I'm sure they'll fix it (or explain why I'm mistaken) in the fullness of time.

Subversion, meanwhile, is still answering the question “Is Subversion stable enough for me to use for my own projects?” with a less than reassuring “We think so!”.

I think that is more to do with modesty than it is with Subversion really being unstable. The Subversion developers on the whole seem less pugnacious than some other developers. I don't think that's a bad thing.

Conclusions on arch

Looks very promising; still too early to actually use unless you really like dogfood. Some of this might be just due to problems with documentation: simple operations like "what changed since patch-32" are presumably supported but I can't work out a good way to get it out.

First thoughts on arch/tla

Tom Lord's arch version control system seems to be coming along quite nicely.

I first looked at it some months ago but was scared off by documentation that made it a bit hard to understand what was really happening inside. A lot of the design concepts were not really stated, and this can give a worrying impression that there is no strong design.

(I think in tools for technical users it's really important to make the documentation expose enough of the internal model that they can feel comfortable with it and make predictions about cases that are not explicitly described. I think R5RS does this moderately well, for example, but I haven't seen anything that does such a good job for Common Lisp.)

Subversion's manual, by contrast, begins with a clear high-level description of what's happening inside Subversion — not the bits and bytes, but rather the concept of how trees are versioned.

This now seems to be addressed and a clear design can now be seen under the arch tutorial. They do this in a nice way: peeking under the covers at each stage of the tutorial, so you can see what effect it's having on the archive files. And arch is particularly suited to this, because it stores history in an excellent Unixy way: directories containing gzipped patches plus some metadata.

This is very reassuring: you can imagine recovering all your history even if all arch implementations went away, or writing a new compatible implementation (which has happened a few times.) Few version control systems have these desirable properties. It also shows excellent Unix taste to use plain files when possible.

Some particularly nice features of arch from WhyArch
