Krazy Glew's Blog: Friday, July 02, 2010

I want more from a version control tool.
I'm not particularly happy with the present generation of distributed version control tools:
Mercurial (hg) or git. Nor Bazaar (bzr).
Nor their predecessor BitKeeper.
And certainly not CVS, SCCS, or RCS.

= To Do List =

At the top, to put it in my face:

* I'd like to extract code from git-pack-objects (or mercurial) that takes a list of not-hash-named files, and compacts them into delta storage. And then wrap my own version control concepts, in layered user level filesystem scripts, around it.
* Leverage existing merge tools.

= Blog Wiki Crosslinks =

http://wiki.andy.glew.ca/wiki/The_Version_Control_Tool_I_Want

= My Main Complaint: Subprojects, Lack Of =

My main complaint:
I want a distributed version control system that naturally supports single file or sub-directory checkouts, subprojects, etc.
Ironically, this is one area where CVS (and CVS's immediate predecessors, various RCS-based tools, some of which I wrote at various companies)
was better than its successors SVN, BitKeeper, Git, Mercurial.

== My VC'ed Home Directory as a Tree of Subprojects ==

For many years - at least since 1985, and I think even a bit earlier - I have been a heavy user of version control.
In particular, I have long VC'ed my home directory.

Now, my home directory is not monolithic. I might check out different parts of it on different systems.
But I have not, at least historically, needed to divide it into separate repositories and logically merge those together.
I have tried, and have not liked, various tools that allow multiple repositories to be logically merged into a single workspace.
(I think such tools are great for when you have to merge different repositories, from different organizations and/or under different VC tools. But I think they are overkill when I can get away with everything being in a single repository.)

E.g. under glew-home, my repository, I have subdirectories like glew-home/src, glew-home/src/include, glew-home/src/libag, glew-home/src/libag/number, glew-home/src/libag/builtin_initialization, etc. Currently roughly 50 of these libraries.
Each in a separate directory. Each designed to be used in isolation, although there are occasional interdependencies. Often header only.

When I started doing this, long, long ago, I might create a library libag.a that somebody would have to link into a binary.
This was an impediment to people using my libraries. Projects would not want my entire libag; they would only want the minimum, the stuff that they were using, and nothing else. Hence my evolution: a module in a directory, check out that directory, place it anywhere, include its header and, usually, that was all you needed.

If I were to structure these as independent projects, that would be roughly 50 independent repositories. Much more of a hassle to manage.

Fortunately, CVS made it possible to check out an arbitrary subdirectory tree. (This is one reason why I evolved to a directory per library module - one directory, to get the source file (typically a header) and any test files associated with it.) That subdirectory could be placed anywhere in the target workspace. But it could always get checked back into my master tree.

Of course, projects using my libraries would want to keep their own version control history for their local copy of my libraries. I therefore created a version of CVS where the "CVS" subdirectory name that contained or pointed to metadata could be specified on the command line. So the default CVS metadir might be for the project using my library, whereas CVS.libag would point to the master tree for my library.
This is not too unlike having a local repository that you check in to, and a parent repository that you push to.
Which is not to say that the other good things of distributed version control are not desired.

My main point: I want to be able to check out an arbitrary subdirectory. And then check it back in.
And not have to do manual merges to all of the projects that include that subdirectory.
And certainly not to have to do stupid things like cloning the entire tree, and then building a workspace with only a subtree.
I understand the desire for atomic checkins of an entire project.
I think I have figured out [[how to reconcile multifile atomicity and subset checkins]], see that section.

I say again: it is NOT good enough to have to structure each separate module as a separate project with separate repositories.
If nothing else, library modules often begin as tools in some other module,
and then get refactored into independence. I want to preserve the entire history, including the history from when the not-yet-independent module was part of some other module.

Like I said, CVS could handle this well enough, since it just allowed subtrees to be checked out.
But it had all of the other stupidities of centralized version control.
SVN didn't handle it so well as CVS: its attempts at atomic checkin got in the way.
Git and Hg just seem to have given up on this.
If I am wrong: PLEASE TELL ME. I would rather not have to write a new VC tool.

I was actually starting to write a DVCS in my copious spare time when BitKeeper cut off Linux. Git and Mercurial got written far quicker than I could write mine. But, after years of trying to cope with their limitations, I am about to give up. (Also, especially, way back then Intel owned everything I did, with no guarantee of it being open-sourceable. Whereas with my new employer, IV, I am allowed to open source this.)

= Other Good Stuff I would Like in a VC Tool =

The idea of using cryptographic checksums is a good idea. Hg and Git do this, based on Monotone's initial work, which in turn was probably based on rsync. However, I dislike ignoring the possibility of hashes colliding. In my original work on this sort of "content based hashing", I deliberately chose to use a poor hash (UNIX sum or cksum) to cause collisions to occur, with the option of using a stronger hash later.
I like the idea of having an unsafe mode, where hashes are assumed to be unique, and a safe mode that ensures that hash collisions do not occur.
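
The safe/unsafe distinction can be sketched roughly as follows. This is only an illustration, not code from any existing tool: the class name `ContentStore`, the `.N` suffix used to tell colliding objects apart, and the 8-bit truncated CRC used as the default "poor hash" are all assumptions made for the example.

```python
import zlib


class ContentStore:
    """Toy content-addressed store keyed by a deliberately weak hash."""

    def __init__(self, safe=True, hash_fn=None):
        self.safe = safe
        # Weak default (CRC-32 truncated to 8 bits) so collisions are easy to provoke.
        self.hash_fn = hash_fn or (lambda data: format(zlib.crc32(data) & 0xFF, "02x"))
        self.objects = {}  # key -> bytes

    def put(self, data):
        """Store data and return the key under which it can be fetched."""
        base = self.hash_fn(data)
        if not self.safe:
            # Unsafe mode: trust the hash. A colliding object is assumed identical.
            self.objects.setdefault(base, data)
            return base
        # Safe mode: compare contents; colliding objects get a numeric suffix.
        n = 0
        while True:
            key = base if n == 0 else f"{base}.{n}"
            existing = self.objects.get(key)
            if existing is None:
                self.objects[key] = data
                return key
            if existing == data:
                return key
            n += 1

    def get(self, key):
        return self.objects[key]
```

In safe mode two different contents with the same hash end up under distinct keys, at the cost of a byte comparison on every hit. In unsafe mode the second content is silently treated as a duplicate of the first, which is the risk being accepted when hashes are assumed unique.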

I used to go to great lengths to try to keep the version data for a file near to the file itself. E.g. I used to keep the versions of a file in a subdirectory like CVS or RCS. However, this is not robust across renames and source tree reorganizations, so I have fallen back to the now standard technique of placing the versions in a centralized directory somewhere, typically something like .git or .hg off the root of the project.
I like the idea of automatically finding the .git directory by walking up the directory tree; but I also like the possibility of having pointer metadata like CVS subdirs, since I have occasionally had to work on systems where the source tree did not fit on one disk partition. Or environment variables. Heck, this dates back to my 1986 era publication, "Boxes, Links, and Parallel Trees"; I just no longer think the parallel trees are necessary, although they are nice to have.
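
A minimal sketch of how those three lookup mechanisms might be combined. The metadir name `.vc`, the environment variable `VC_METADIR`, and the convention that a plain file named `.vc` holds a pointer to the real metadir are all hypothetical, chosen only for illustration.

```python
import os
from pathlib import Path


def find_metadir(start, name=".vc", env_var="VC_METADIR"):
    """Locate the metadata directory for a working file or directory.

    Order of precedence (all hypothetical conventions):
      1. an environment variable naming the metadir directly;
      2. walking up from `start` looking for a directory called `name`;
      3. if `name` is a regular file instead, its contents are treated as a
         pointer to the real metadir (possibly on another partition).
    Returns None when nothing is found.
    """
    override = os.environ.get(env_var)
    if override:
        return Path(override)

    here = Path(start).resolve()
    for directory in [here, *here.parents]:
        candidate = directory / name
        if candidate.is_dir():
            return candidate
        if candidate.is_file():
            target = Path(candidate.read_text().strip())
            # Relative pointers are resolved against the directory holding them.
            return target if target.is_absolute() else (directory / target).resolve()
    return None
```

The nearest match wins, so a pointer file in a subdirectory overrides a metadir further up the tree, much as a per-directory CVS subdirectory would.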

I prefer files to databases, since databases nearly always need sysadmin infrastructure such as backups. However, atomicity is good, even though it appears that the requirements of atomicity in git/hg's filesystem based implementation have resulted in some of the limitations I am railing against. I am thinking more and more about using an sqlite database as an option to make it easier to maintain certain invariants atomically. However, I am also thinking about making it possible to reconstruct all of the information from the filesystem, assuming you did not crash in the middle of a checkin. At least, file data in the filesystem.

I want portability: my home directory migrates from various UNIXes to Linux to FreeBSD to Cygwin, and parts migrate to Windows.
My VC tool must run on Windows.
I place portability above performance.
I'm willing for my VC tool to be written entirely in a scripting language like Perl or Python,
and hence to need minimal installation.
(I can probably code it faster in Perl, since I have many existing pieces, but may want to do Python.)
Although I would be happy if the performance critical parts could optionally be in C or C++.

Git's approach of a user level filesystem is a good idea.
I am seriously tempted by the idea of structuring this as several different layers of user level filesystem tools.
Unfortunately, I am having trouble disentangling the layers of git that I want from the layers that I do not.
See [[User Level Filesystem]].
I want to leverage as much existing code as possible.

As explained above, I want my VC tool to be usable for my home directory, with all of its subdirectories/subprojects.

But I also want to be able to use my VC tool for my wiki, to allow me to
(a) work while disconnected on subsets of my wiki, and then merge,
and
(b) use it as a Poor Man's file replication subsystem.
(Another usage model that worked well on CVS, but works much less well on Git or Mercurial.)

In my dreams, I could create repositories at different points in a tree, and then merge them.
E.g. do a checkin at a/x and a/y, and then merge into a single repository for a that includes both x and y.
(This can be done with existing tools, but is a pain.)

Even further in dream-space, I would like to be able to do a filesystem cp -R of a subtree, and then propagate the history.
I must beware - the desire to do this wasted time; the current BKM is a user level filesystem, svn cp or git mv.
But, it can *almost* be done,
if you have per directory CVS metadir.
Some metadirs could be deep, with actual VC files, some shallow. One can imagine an operation that copies version data from the centralized .hg or .git directory at the root of a repo, to the CVS metadir-like files, and vice versa.

The usual good stuff: renaming. Both manual, and git's implicit.

== User Level Filesystem ==

Linus' treatment of git as a filesystem is a good idea.

It would be nice to have several different user level filesystem layers
* Basic functionality
** Names
*** Store as files, with original names, parallel tree structure
*** Store as numeric names
*** Store as content hash names
**** handling hash collisions
*** Store as human friendly, but not strictly original names
*** Store human friendly original tree structure pointing into obscure numeric space.
** Metadata
*** Store file data and metadata, e.g. in filename/.data and filename/.metadata
*** Allow user to collect arbitrary metadata, potentially non-adjacent
*** Nice ls for metadata
** Storage
*** Store as files (or files in directories)
*** Store in a tar archive
*** Store in a database

None of the above is really VC tool specific. Some may be useful in other contexts.

More VC specific:
* Storage
** Store as unpacked file/objects
** Store as packed file/objects - delta storage

While this is "more VC specific", it is not necessarily completely so. One can imagine wanting tools that accomplish similar compression. It's VC specific mainly in that the VC DAG is a guide to how things should be packed.

Git's pack-objects is almost exactly what I want. However, git-pack-objects seems not to be able to take ordinary filenames as the things to be compressed; it requires SHA hashes.
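
To make the goal concrete, here is a rough sketch of packing objects that are identified by ordinary names rather than hashes. It is not git's pack format and not a real delta encoder: as a stand-in it uses zlib's preset-dictionary feature, compressing each version against the previous one. The function names and the `name@version` labels in the usage note are made up for the example, and zlib only looks back about 32 KB into the dictionary, so this only helps for small files.

```python
import zlib


def pack_versions(named_versions):
    """Pack [(name, bytes), ...], compressing each entry against its predecessor.

    The order of the list plays the role of the VC DAG: it says which earlier
    object is the base for the next one.
    """
    packed, base = [], b""
    for name, data in named_versions:
        comp = zlib.compressobj(zdict=base) if base else zlib.compressobj()
        packed.append((name, comp.compress(data) + comp.flush()))
        base = data
    return packed


def unpack_versions(packed):
    """Inverse of pack_versions: rebuild every version, in order."""
    out, base = [], b""
    for name, blob in packed:
        decomp = zlib.decompressobj(zdict=base) if base else zlib.decompressobj()
        data = decomp.decompress(blob) + decomp.flush()
        out.append((name, data))
        base = data
    return out
```

Calling `pack_versions([("number.h@1", v1), ("number.h@2", v2)])` stores the second version mostly as references into the first. A real implementation would choose bases from the version graph rather than from list order.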

TBD: figure out if this can be disentangled from the rest of git.

If we get this far, we have the ability to manage user level storage. Heck, we could move on to the next step before the delta storage was implemented - it should be that independent.

Finally, we get to the point that is really VC specific

* Managing versions, branches, etc., on a whole project, sub-project, subdirectory tree, and even individual file basis.

See the next section, [[how to reconcile multifile atomicity and subset checkins]]

= [[How to reconcile multifile atomicity and subset checkins]] =

I think that the fundamental problem that I am trying to solve,
that leads to poor support for subprojects in tools like git and hg,
is related to atomicity.

If you have a project in a repository, you want to make sure that changes to the project are atomic.
At least, changes to the main branch are atomic. All done, all tested, all checked in together.

But if a user checks out and modifies a subset of the files, then it cannot be checked back into the main branch.
You really need to start a branch for all projects that those files belong to, including all those files in those projects and those branches, including files that the user is not working on. Edit the files or module that the user is working on.
Check back in - to the branch (or branches).
Test.
Merge.

I.e. every subset checkout/checkin corresponds to creating a branch ON ALL PROJECT BRANCHES THAT MAY BE AFFECTED BY THE CHANGE.
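
As a small illustration of what "all project branches that may be affected" means, assume (hypothetically) that we can enumerate, for each project branch, the set of paths it contains. The function name and the data layout below are invented for the example.

```python
def branches_needing_task_branch(project_branches, changed_paths):
    """Return the (project, branch) pairs whose files overlap the changed subset.

    project_branches maps (project, branch) -> iterable of paths on that branch.
    Each pair returned would need its own task branch for the subset checkin.
    """
    changed = set(changed_paths)
    return sorted(key for key, paths in project_branches.items()
                  if changed.intersection(paths))
```

With many dynamically defined subsets this enumeration is exactly the part that does not scale if done by hand.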

Note that I am saying "subset" rather than "subproject". I am trying to emphasize that the subset may be defined dynamically, ad-hoc,
not necessarily a priori.

Note also that there is not necessarily a single superset or superproject.

But, imagine that the subset or subproject is in a separate organization from the superset or superprojects. In different repositories. You can't allow the subset/subproject to create a branch in the superset/superproject's repository, and vice versa. You have to provide the ability to track contours, tags, and branches without access to the other's repository.

I encountered this ability with AMD's RCS based build lists, BOM files that are essentially out of repository tags. By now obsolete, but an interesting insight, although we don't need to go there yet.

So here's where I think that we need to go. If I make changes to a subset/subproject, in particular to a branch of the subset that is depended on by a superset/superproject:
# I do not want to automatically merge those changes back into the superset/superproject; you have to allow the chance for the merge to be tested,
# but I don't want all of the possibly very large number of superset/superprojects to have to know of the possibly very large number of dynamically created subset/subprojects.

The existing methodologies, the existing flows, seem to work when there is a small number of branches known in advance. They break down when we get a large number of implicit, dynamic branches.

So, I think that we want to make these implicit relationships explicit. Instead of the main branch of SuperProject1 being manually merged from the main branches of SubProjectA and SubProjectB, I think that we want to record or deduce that SuperProject1 contains SubProjectA and SubProjectB, specifically, the main branches thereof. And when A and B get modified (let's assume A and B do not overlap), I want SuperProject1's main branch to get told "Hey, you are at Vxyz, but your subcomponents A and B have advanced, in two separate non-overlapping checkins. These are candidates for merger into you, SuperProject1." And now the SuperProject1 maintainer can go and do the merges, together or separately as he may wish, directly into SuperProject1's main branch or into task branches for the merge as he may wish.

When I first started thinking about this, I thought that the checkin of SubProjectA might add itself as a candidate to SuperProject1. I now realize this is bogus. Only a project should manipulate itself.

However, the idea of candidates is a good one. It is not that the checkin of A creates a candidate for SuperProject1; it is that SuperProject1 knows how to go look for candidates that should be merged into it.

I.e. a project, or more precisely a branch of a project, should have rules about what stuff should be considered as candidates for merge. These rules could be something like "Automatically consider as a candidate for merge into the release branch stuff on the test branch that has been labelled TESTS_PASSED". But it may also be something like "Automatically consider as candidates for merge any modification or addition to any subdirectory under such and such a place in the hierarchy, which is marked as being MAIN branch for that subobject."
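
One way to picture such rules is as predicates owned by the receiving branch, which it evaluates against a log of checkins it can see. Everything below (the `Checkin` and `Branch` classes, the `labelled_on` and `under` helpers, the field names) is a hypothetical sketch, not a design that exists anywhere.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Checkin:
    version: str
    branch: str
    paths: frozenset
    labels: frozenset = frozenset()


@dataclass
class Branch:
    name: str
    rules: list = field(default_factory=list)  # predicates: Checkin -> bool
    merged: set = field(default_factory=set)   # versions already merged

    def candidates(self, checkins):
        """Checkins selected by some rule that have not been merged yet."""
        return [c for c in checkins
                if c.version not in self.merged
                and any(rule(c) for rule in self.rules)]


def labelled_on(branch, label):
    """Rule: anything on `branch` carrying `label`, e.g. TESTS_PASSED."""
    return lambda c: c.branch == branch and label in c.labels


def under(prefix, branch="MAIN"):
    """Rule: any change below `prefix` made on that subobject's `branch`."""
    return lambda c: c.branch == branch and any(p.startswith(prefix) for p in c.paths)
```

Note that the checkins never register themselves anywhere. The receiving branch does the looking, which matches the point that only a project should manipulate itself.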

Now, it might be easiest to fall back to per-file or per-pathname history: a checkin to a subset or superset that contains a file automatically makes changes to the history objects for such a file. I'm not sure that this is required. It may make things easier. But it gets a bit confusing across renames, etc. But it means that we do not NECESSARILY have to scan subsets, etc., that have been dynamically created. A superproject may be composed or related to a set of known other projects or other branches, or to an enumerable set of locations in an abstract filesystem tree.

I think it is best to start with file objects, and then try eliminating them.

However, file objects, or even non-file subset objects, raise issues of atomicity. In git, you are only modifying a single file, a manifest, at any time. If we have superprojects and subsets and individual file history objects, do we need to manipulate these atomically?
* Not necessarily. We can imagine modifying the individual file history objects, and then modifying the supersets one at a time. The latest versions of the file objects may be inconsistent, but the versions of the set objects would be consistent. Heck, if you wanted, you could backlabel the file history objects with a tag saying that they have been consistently checked into some superset.
* However, we might want to prevent even those minor inconsistencies - even though, AFAICT, git and hg have them.
** Being willing to use a transactional database, like sqlite, as part of the metadata system may get us around even this.
*** I'm grudgingly willing to go along with sqlite because it is a database in an ordinary file.
*** I would hope the different user level filesystem instances can be created, both with and without it.
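
A rough sketch of what the sqlite option buys, using Python's standard `sqlite3` module. The two-table schema (`file_history`, `set_history`) and the function names are assumptions made for the example, and a set version here records only the files touched by that checkin, which is a simplification.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS file_history (
    path TEXT NOT NULL, version INTEGER NOT NULL, content_key TEXT NOT NULL,
    PRIMARY KEY (path, version));
CREATE TABLE IF NOT EXISTS set_history (
    set_name TEXT NOT NULL, set_version INTEGER NOT NULL,
    path TEXT NOT NULL, file_version INTEGER NOT NULL,
    PRIMARY KEY (set_name, set_version, path));
"""


def open_metadata(path=":memory:"):
    """Open (or create) the metadata database, which is one ordinary file."""
    db = sqlite3.connect(path)
    db.executescript(SCHEMA)
    return db


def checkin(db, set_name, files):
    """Record new file versions and a new set version in one transaction.

    files maps path -> content key. Either every row is committed or none is.
    """
    with db:  # commits on success, rolls back if an exception escapes
        (set_version,) = db.execute(
            "SELECT COALESCE(MAX(set_version), 0) + 1 FROM set_history"
            " WHERE set_name = ?", (set_name,)).fetchone()
        for path, content_key in files.items():
            (file_version,) = db.execute(
                "SELECT COALESCE(MAX(version), 0) + 1 FROM file_history"
                " WHERE path = ?", (path,)).fetchone()
            db.execute("INSERT INTO file_history VALUES (?, ?, ?)",
                       (path, file_version, content_key))
            db.execute("INSERT INTO set_history VALUES (?, ?, ?, ?)",
                       (set_name, set_version, path, file_version))
    return set_version
```

If any insert fails partway through, the file history objects and the set object are left exactly as they were, so the minor inconsistencies described above cannot be observed.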

The Version Control Tool I Want