The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, May 19, 2009

Git submodules

Git submodules appear to be like CVS's modules.

Long ago I gave up on CVS's modules. I found that checking out subtrees worked reasonably well.

I want to figure out how to check out and check in just particular objects or directory subtrees from a git repo. Checking out should be easy enough, particularly if just exporting: scan whatever head of branch or contour you want, and check out the objects that match a criterion such as "under this tree".
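A minimal sketch of that export step in Python, not tied to git's actual object model. It assumes a snapshot is just a mapping from path to some content id; the names `under_subtree`, `export_subtree`, and the sample paths and blob ids are hypothetical.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def under_subtree(path: str, subtree: str) -> bool:
    """True if path is the subtree root itself or lies somewhere beneath it."""
    p, root = PurePosixPath(path), PurePosixPath(subtree)
    return p == root or root in p.parents


def export_subtree(snapshot: dict[str, str], subtree: str) -> dict[str, str]:
    """Keep only the entries of a snapshot that match "under this tree"."""
    return {path: blob for path, blob in snapshot.items()
            if under_subtree(path, subtree)}


# Hypothetical head-of-branch snapshot: path -> content id.
head = {
    "src/lib/a.c": "blob-a1",
    "src/lib/b.c": "blob-b7",
    "src/library/x.c": "blob-x2",
    "docs/readme.txt": "blob-r3",
}

selected = export_subtree(head, "src/lib")
```

The test compares path components rather than string prefixes, so `src/library/x.c` is not mistaken for something under `src/lib`.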

Checking out the history is a bit more challenging. If file objects have always been at the same position in the tree, no worries. If file objects have moved within the sub-tree that is being checked out, again no worries. But if they have moved into or out of the sub-tree being checked out, what should you do?

I lean towards checking out the history objects, but somehow preventing checking out of file version objects that are outside the current sub-tree. You could read about them in the log, diff against them, do the equivalent of cvs update -p on them to look at the contents ... but the nice default of checking out a version of such a file object into its historical position in the filesystem would not apply to such out-of-bounds file objects.
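One way to picture that policy, as a rough Python sketch: every version in the history stays visible, but only versions whose historical path falls inside the sub-tree may be written into the working directory. The `FileVersion` record and the labels "materialize" and "view-only" are my own hypothetical names, not anything git provides.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class FileVersion:
    file_id: str   # identity of the file object across moves
    version: int
    path: str      # where this version lived in the tree at that time


def in_bounds(path: str, subtree: str) -> bool:
    """True if path is the sub-tree root or lies beneath it."""
    p, root = PurePosixPath(path), PurePosixPath(subtree)
    return p == root or root in p.parents


def checkout_plan(history: list[FileVersion], subtree: str) -> list[str]:
    """Label each historical version: in-bounds versions can be checked out
    into their historical position; out-of-bounds ones can only be viewed
    (log, diff, print contents)."""
    return ["materialize" if in_bounds(fv.path, subtree) else "view-only"
            for fv in history]


# Hypothetical file that moved out of lib/ and later back in.
history = [
    FileVersion("F1", 1, "lib/a.c"),
    FileVersion("F1", 2, "tools/a.c"),
    FileVersion("F1", 3, "lib/util/a.c"),
]

plan = checkout_plan(history, "lib")
```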

Checking in is more of a challenge, mainly because of assumptions wrt atomicity of multi-file, whole-project, checkins. Clearly this atomicity cannot be supported at all times if my goal of checking in/out individual file objects and subtrees is to be supported.

However, I think that we should support such atomicity as much as possible, since so many people like it. The lack of it was one of the biggest complaints about CVS.

I think that it boils down to questions such as "What is a branch?" and "What is the head of a branch?" Whole-project commits correspond to a set of file objects, with a guarantee that the next checkin on the branch will be linked backwards to the current set.

E.g. imagine a small multi-file system, all on the same branch:

F1/v1 -> F1/v2 -> F1/v3
F2/v1 -> F2/v2 -> F2/v3
F3/v1 -> F3/v2 -> F3/v3

Checking in/out the whole project conceptually advances the versions of all file objects in lockstep, or, equivalently:

Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}

Checking in some, but not all, of the files corresponds to something like this:

F1/v1 -> F1/v2 -> F1/v3
F2/v1 -> F2/v2 -> F2/v3 -> F2/v4
F3/v1 -> F3/v2 -> F3/v3 -> F3/v4

or almost equivalently:

Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}
-> IncompleteProject/v4{F2,F3}

On this branch, you could ask for the most recent whole-project atomic commit, or the latest version of all files.
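Those two queries can be sketched in Python over the example branch above. This is only an illustration under my own assumed representation: each commit is a dict naming the files it touched and their new version numbers, and `latest_atomic` and `latest_versions` are hypothetical helper names.

```python
from __future__ import annotations

# The example branch: three whole-project commits, then a partial one.
commits = [
    {"name": "Project/v1", "files": {"F1": 1, "F2": 1, "F3": 1}},
    {"name": "Project/v2", "files": {"F1": 2, "F2": 2, "F3": 2}},
    {"name": "Project/v3", "files": {"F1": 3, "F2": 3, "F3": 3}},
    {"name": "IncompleteProject/v4", "files": {"F2": 4, "F3": 4}},
]
ALL_FILES = {"F1", "F2", "F3"}


def latest_atomic(commits: list[dict], all_files: set[str]) -> dict | None:
    """Most recent commit that covers every file in the project."""
    for commit in reversed(commits):
        if set(commit["files"]) == all_files:
            return commit
    return None


def latest_versions(commits: list[dict]) -> dict[str, int]:
    """Newest version of each file, regardless of which commit produced it."""
    latest: dict[str, int] = {}
    for commit in commits:  # oldest first, so later commits overwrite
        latest.update(commit["files"])
    return latest


atomic_head = latest_atomic(commits, ALL_FILES)
file_heads = latest_versions(commits)
```

Here `atomic_head` is the Project/v3 commit, while `file_heads` mixes F1 at v3 with F2 and F3 at v4.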

Part of the problem is that the "filesystem" consists of the logical equivalent of directory files that list all filenames in a given version of the whole project, and point to the version (content) of the corresponding file objects. The filesystem does not really have a representation for a version of an individual file, except its content. Having per-file-object version tracking is conceptually easy, but would be more work, and would be liable to inconsistencies.

This is not a conceptual problem, just an implementation artifact. If we had database views...

We could keep the whole project viewpoint by doing something like

Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}
-> Project/v4{F1/v3,IncompleteProject/v4{F2,F3}}
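A small Python sketch of that last form, under the assumption that a snapshot is a mapping from file name to version number. The helpers `wrap_partial` and `flatten` are hypothetical; the point is only that a partial checkin can be wrapped into a whole-project node by carrying forward the untouched files.

```python
from __future__ import annotations


def wrap_partial(previous: dict[str, int],
                 partial_name: str,
                 partial_files: dict[str, int]) -> dict:
    """Build a whole-project node from a partial checkin: files the checkin
    did not touch are carried forward from the previous whole-project
    version, and the partial checkin is kept as a nested unit."""
    carried = {f: v for f, v in previous.items() if f not in partial_files}
    return {"carried": carried, "partial": (partial_name, dict(partial_files))}


def flatten(node: dict) -> dict[str, int]:
    """Resolve a wrapped node into a plain file -> version listing."""
    _, files = node["partial"]
    return {**node["carried"], **files}


project_v3 = {"F1": 3, "F2": 3, "F3": 3}
project_v4 = wrap_partial(project_v3, "IncompleteProject/v4", {"F2": 4, "F3": 4})
listing = flatten(project_v4)
```

Keeping the partial checkin nested, rather than flattening it at commit time, preserves the fact that F2 and F3 went in together while F1 was merely carried along.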