Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, April 24, 2008

Version Control Checkout and Sharing of *Partially Populated* Subtrees

I'm on the version control tool warpath again.

I'm a damned good example of why VC tools need subprojects, nested trees, whatever... Never you mind Keith Packard (Mr X Consortium) saying that I shouldn't bother.

At the moment I am messing around with
a) at least 2 diverged CVS trees for my home directory, one on my laptop, one on Linux.
b) a simulator, which uses BitKeeper
c) my own main project, which uses CVS (fortunately, one site only)
d) various other tools at work, using CVS and SVN
e) my TWiki site at work, which uses RCS (although I can make it look like CVS if necessary)
f) friends and coworkers who use GIT

My contributions to b), c), d), and e) make extensive use of stuff from my personal library. But, these other projects do not want to import my whole personal library - rather, they want to import just the modules they need, and their dependences.

Part of my personal source tree is structured as follows:

~/src:

include
" debug.h
" lprintf.h
" MathConst.h
" ag.h
" debug.h
" lfile.h
" libag.h
" simplelock.h
" strfamily.h*
libag
" assert
" bitpattern
" bitvector
" builtin_initialization
" class_name
" code_location
" compile
" component
" " SimpleScalar-libag-component
" error
" error_stream
" File_Slurp
" file_slurp
" fmt
" getopt
" getter_setter
" indented_ostream
" Inherit
" interval
" IO_converter
" libuarch
" " cache-array
" " " examples
" " memory_image
" " multilevel_cache
" " SimpleScalar-libag-libuarch
" Memory_Object
" misc
" new-libag-ideas
" " quantiles
" number
" old-stuff-trying-to-reuse
" regd_ptr
" smart_ptr
" stack_trace
" stdio-stream
" test
" thisprintf
" to_bitstring
" to_string
" typename
" types
" unindent
" xml

You get the picture: lots of little modules, each in their own directories or directory trees.

Way back when I started, I put multiple header files in the same directory, like include/debug.h, include/lfile.h. I also created "utility" files like ag.h and libag.h. Over time I learned that it was best to put each module, each header file, in a directory of its own. Such a directory gives you a place to put tests, etc. The resulting pattern repeats the directory name in the include name: #include "libag/test/test.h"

Above I have excerpted "header-only" libraries. I also have some C/C++ libraries that need to be compiled, but I have learned that "header-only" libraries, that only need to be #included, are a lot more convenient to use.

Some of my libraries have interdependencies. For example, libag/test/test.hh includes ../code_location/codelocation.hh.
First thing to note: the use of includer-relative paths. Because test.hh names its dependency relative to its own location, if I simply copy (cp -R) libag/{test,code_location}, and #include "import/libag/test/test.hh", everything works.

Second, and most important thing to note: a typical user of some (but not all) of my libraries might want a pruned subtree:

import/libag
" test
" code_location

Importing all of libag is just clutter.

I.e. they do not want to import just complete subtrees.

They want to import a pruned subtree - stuff that is in the parent directory, libag, as well as subdirectories such as libag/test and libag/code_location.

I no longer put actual library functions at such parent-root directories. (Alright, I still have some files there, libag/stralloc.h, mainly dating back before I decided to use the "directory per module" pattern.) But I still have metadata there - libag/README.libag, explaining library organization. I also have tools like SConstruct files and files that guide test jig runners, so that I can simply say "make test" or "scons test" in libag, and all the tests run. (My test runner does NOT require a list of all the subdirectories.)
I would like to keep some of that metadata in libag when it is checked out for use by other projects.
I.e. I want all of the file contents of libag, but only a subset of the subdirectories.
I say again: I want a PRUNED SUBTREE of the source.
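
Until a version control tool does this for me, the pruned checkout can at least be faked with a copy. Here is a minimal Python sketch, not anything my tools actually do today: the function name copy_pruned_subtree and its arguments are made up for illustration. It copies every plain file that sits directly in the parent directory (the README, the SConstruct, and so on), but only the subdirectories you name.

```python
import shutil
from pathlib import Path


def copy_pruned_subtree(src_root, dst_root, wanted_subdirs):
    """Copy the files directly in src_root, plus only the named subdirectories.

    Hypothetical helper: the name and signature are illustrative only.
    """
    src_root, dst_root = Path(src_root), Path(dst_root)
    dst_root.mkdir(parents=True, exist_ok=True)

    # Top-level files (README.libag, SConstruct, ...) always come along.
    for entry in src_root.iterdir():
        if entry.is_file():
            shutil.copy2(entry, dst_root / entry.name)

    # Only the requested module directories are copied, each as a full subtree.
    # dirs_exist_ok needs Python 3.8 or later.
    for name in wanted_subdirs:
        shutil.copytree(src_root / name, dst_root / name, dirs_exist_ok=True)
```

For the example above that would be something like copy_pruned_subtree("src/libag", "import/libag", ["test", "code_location"]). Of course a copy throws away exactly what I care about, namely the link back to the repository, which is why I want the VC tool to do it.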

---

Really, what I want is a version control that makes it easy to share small modules. Not heavyweight subprojects, a la git. Not projects that I have to define in advance, like git or CVSROOT/modules. I want really lightweight modules. Directory subtrees seem to be a very natural way of specifying submodules.
I.e. at the very least I want to be able to do "vcs co libag/test libag/code_location".
Unfortunately, even this least thing that I want to do is barely supported by any of the modern version control tools (git, hg, bzr). cvs, ironically, supports it fairly well.

But I want to do better than this least.

Ultimately, I want automatic dependency tracking. I want to be able to say "vcs co libag/test", and have something automatically recognize that I need to check out libag/code_location as well. Perl CPAN does it - I want it everywhere.
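
For header-only modules that use includer-relative paths, the dependency information is already sitting in the #include lines. A rough Python sketch of how a tool might recover it follows. The assumptions are mine: each top-level directory under the library root counts as one module, only quoted #includes are followed, and the name module_closure is invented for this example.

```python
import re
from pathlib import Path

# Matches quoted includes such as: #include "../code_location/codelocation.hh"
INCLUDE_RE = re.compile(r'^\s*#\s*include\s*"([^"]+)"', re.MULTILINE)
SOURCE_SUFFIXES = {".h", ".hh", ".hpp", ".c", ".cc", ".cpp"}


def module_closure(lib_root, start_modules):
    """Return the set of top-level module directories reachable from
    start_modules by following quoted, includer-relative #includes.

    Hypothetical helper: the name and the one-directory-per-module rule
    are assumptions of this sketch.
    """
    lib_root = Path(lib_root).resolve()
    needed, todo = set(), list(start_modules)
    while todo:
        module = todo.pop()
        if module in needed:
            continue
        needed.add(module)
        for source in (lib_root / module).rglob("*"):
            if not source.is_file() or source.suffix not in SOURCE_SUFFIXES:
                continue
            for inc in INCLUDE_RE.findall(source.read_text(errors="replace")):
                # Resolve the include relative to the including file.
                target = (source.parent / inc).resolve()
                try:
                    rel = target.relative_to(lib_root)
                except ValueError:
                    continue  # points outside the library; ignore it
                if len(rel.parts) > 1:
                    todo.append(rel.parts[0])
    return needed
```

With a tree like mine, module_closure("libag", ["test"]) should come back with both test and code_location, and that set is what would get handed to the checkout.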

But I also want to be able to check out libag, with local files, and a pruned set of subdirectories.

---

OK, say I have gotten that far: I can check out just the modules I want.

Next, I want to be able to work in a single workspace - and check things into multiple version control repositories.
E.g. I want to be able to use CVS to check changes made to my libraries back into my personal repositories.
But the other projects that are importing my stuff also want to check things into their repositories. E.g. keiko/import/libag/test/test.hh gets BitKeepered, as well as CVS'ed.

Years ago, I gave the CVS community patches to allow this: I allowed there to be multiple CVS metadata directories, e.g. CVS1 and CVS2, and I allowed the user to specify which to use, CVS1 to check into repository 1, CVS2 for repository 2.

I've done similar kluges with BitKeeper and Subversion. E.g. I have checked CVS subdirectories, containing CVS/Root, CVS/Repository, and CVS/Entries files, into BitKeeper - so that a bk clone or bk pull can always be kept in synch with CVS.
It is hard to go the other way: distributed version control systems such as git and BitKeeper, that clone an entire repository, *can* be checked into other VC systems, but don't behave nicely, wrt disk space and other issues. The problem: they do not clearly separate the concept of repository and workspace and links between the two.
AMD's "buildlist" methodology - basically a list of all files, and their RCS/CVS versions - lent itself well to this.
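
I can't reproduce the real buildlist format here, so take the following as a sketch of the general idea only: walk a CVS workspace and produce a flat list of files and revisions from the CVS/Entries files. The one thing it relies on is the usual CVS/Entries layout, where a file entry looks like /name/revision/timestamp/options/tagdate and a directory entry starts with D. The function name buildlist_from_cvs is made up.

```python
from pathlib import Path


def buildlist_from_cvs(workspace):
    """Return sorted (relative path, revision) pairs read from CVS/Entries.

    Hypothetical helper illustrating a file-plus-revision list; it is not
    the actual buildlist format.
    """
    workspace = Path(workspace)
    result = []
    for entries in workspace.rglob("CVS/Entries"):
        # The directory that this CVS/ metadata subdirectory describes.
        directory = entries.parent.parent
        for line in entries.read_text().splitlines():
            fields = line.split("/")
            # File entries start with "/"; directory entries start with "D"
            # and are skipped, since each has its own CVS/Entries file.
            if line.startswith("/") and len(fields) >= 3:
                name, revision = fields[1], fields[2]
                path = (directory / name).relative_to(workspace).as_posix()
                result.append((path, revision))
    return sorted(result)
```

A list like this is plain text, so it can be checked into any other version control system without dragging a whole repository along with it.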


---

Reading through some of the interminable "Bzr versus Git versus Hg" comparisons, I see people asking "Who cares about history? You only need the last few years of changes."
Well, I care about history. I have been accumulating code snippets, library functions, and tools for more than 20 years. Some may date back to 1980. I probably have SCCS (!) files dating back to 1985-1987, copied and/or checked into RCS, and then CVS. There are some big gaps in the record, corresponding to periods where changes I made to my libraries were owned by companies, not me --- but I have a long history. Not very dense - some of these tools may sit unaffected for years, e.g. I am just now brushing off PerlSQL - but long lived.
But why care about this history? Well, I *have* on several occasions had to back up to 10 year old versions of some of my library functions, when the flavors of UNIX/Linux changed under me.
