Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, May 31, 2009

Towards my own git-like VC tool - user level filesystem in a directory tree, filesystem in a file.

I am not happy with git. I want to improve on it. It's depressing to think of how much good stuff would need to be duplicated, but anyway...

I want to explore my ideas on how branches, lines of development, tags, and subprojects work.

I think the filesystem oriented design is the way to go. Strictly speaking, I first saw this in svn, but Linus certainly took it further.

Think of it as an abstract interface to a filesystem - a filesystem embedded in a directory tree. Or possibly in a file. E.g. an XML file. I'll concentrate on the filesystem in a directory tree idea first, but want to always keep filesystem-in-a-file, filesystem-in-XML, filesystem-in-tar-archive, etc., around at the back of my mind.

The API to such a user level filesystem is not UNIX kernel level open/read/write. In particular, we cannot assume file handles. No state apart from the filesystem itself.

The API must interface to an existing filesystem.

The API should be (a) in a scripting language, and (b) at the *IX command level. The mapping between them should be 1:1 as much as possible - i.e. I am lazy, and do not want to have to define both. I want to be able to automatically generate the command line tools from the scripting modules.
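As a sketch of how that 1:1 mapping might be auto-generated (all names here are hypothetical, and Python stands in for whatever scripting language is chosen), the command-line subcommands could be derived from the scripting-level functions by reflection:

```python
import argparse
import inspect

# Hypothetical scripting-level module: each public function is one FSinF operation.
def copy(fileSrc, fileDst):
    """Copy fileSrc (in FSinF) to fileDst (in FSinF)."""
    return ("copy", fileSrc, fileDst)

def get(fileSrc, fileDst):
    """Get contents of fileSrc (in FSinF), write to fileDst (native filesystem)."""
    return ("get", fileSrc, fileDst)

def build_cli(funcs):
    """Derive the *IX command-line interface from the scripting functions,
    so the two levels stay 1:1 without defining both by hand."""
    parser = argparse.ArgumentParser(prog="FSinF")
    subs = parser.add_subparsers(dest="command", required=True)
    for f in funcs:
        sub = subs.add_parser(f.__name__, help=f.__doc__)
        for name in inspect.signature(f).parameters:
            sub.add_argument(name)   # each function parameter becomes a CLI argument
        sub.set_defaults(func=f)
    return parser

def run(argv, funcs=(copy, get)):
    args = build_cli(funcs).parse_args(argv)
    kwargs = {k: v for k, v in vars(args).items() if k not in ("command", "func")}
    return args.func(**kwargs)
```

The laziness pays off: adding a new operation to the scripting module automatically grows the command line.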

The filesystem should be layerable. E.g. for version control, I want a layer that handles my VC concepts - branches, tags, lines, etc. I want this layered on top of a content based, deduplicated, filesystem. And I want that layered on top of a compression layer, i.e. a packing layer. I want to be able to design these layers separately, and mix and match.

I want this layering so that, ideally, I can steal other people's implementations.
E.g. I might want to steal Linus' git packfiles, or bzr's packfiles.
E.g. I will probably write my own content based layer, deliberately using a hash that has lots of collisions, to prove that I can handle collisions. But I would like to be able to use any other existing implementation.
E.g. I may want to implement things in a directory, or in some single file. If it is properly layered, that should just involve switching a layer. Although the performance considerations might be extreme.

I want the UNIX command line interface to support such layering. E.g. I may want to look at a filesystem in something like a .git directory tree as a VC level - or at the content based level, or at the raw pack level.

I want the API in the scripting language so that I can have the layering in the script language, somewhat efficiently. But if layering modules are written in different scripting languages, say Perl or Python or Ruby, or even in C/C++ like git, I want to be able to layer through the command line as well.
I.e. command line layering required. Inside-script-language layering nice.
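A minimal sketch of what such layering might look like, assuming each layer exposes the same stateless get/put interface (all class names are hypothetical, and a dict stands in for the underlying directory tree or file):

```python
import zlib

class RawStore:
    """Bottom layer: filesystem in a dict (stands in for a directory tree or single file)."""
    def __init__(self):
        self.blobs = {}
    def put(self, name, data):
        self.blobs[name] = data
    def get(self, name):
        return self.blobs[name]

class PackLayer:
    """Packing/compression layer: same get/put interface, transparently compresses."""
    def __init__(self, below):
        self.below = below
    def put(self, name, data):
        self.below.put(name, zlib.compress(data))
    def get(self, name):
        return zlib.decompress(self.below.get(name))

class ContentLayer:
    """Content-based layer: names map to content-derived keys; same interface again.
    (A real one would use a stronger hash, or handle collisions.)"""
    def __init__(self, below):
        self.below = below
        self.names = {}
    def put(self, name, data):
        key = "blob-%08x" % (zlib.crc32(data),)
        self.names[name] = key
        self.below.put(key, data)
    def get(self, name):
        return self.below.get(self.names[name])

# Mix and match: content-addressing over packing over a raw store.
fs = ContentLayer(PackLayer(RawStore()))
```

Because every layer speaks the same interface, swapping git packfiles for bzr packfiles, or a directory for a single file, is just replacing one class in the stack.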

Given my history, I will probably start coding in object-oriented Perl. Hope to leverage lots of code out of CPAN. Call it PerlFS? No, that name is taken. How about FSinF or FSinD - FileSystem in a File and FileSystem in a Directory (Tree). Perl_FSinF, Perl_FSinD. FSinF being the generic name, since directories are just filesystems.

I'm not scared of using a scripting language. Bzr argues they have achieved good performance in Python. I like portability.
As usual, ultimately I would like the system to install as a single file. Like scons.py. Or at least a directory.

Thoughts about the FSinF (FileSystem in File/Directory) Interface

Can't have primitives like UNIX open/read/write/close. These assume state - file descriptors.

Interface must specify both files on the FSinF filesystem, as well as data on the native filesystem.

Operations such as:

Copying

FSinF copy fileSrc fileDst
Copy fileSrc (in FSinF) to fileDst (in FSinF)

FSinF get fileSrc fileDst
get contents of fileSrc (in FSinF) and write to fileDst (in native filesystem)

FSinF put fileSrc fileDst
put contents of fileSrc (in native filesystem) to fileDst (in FSinF)

If we have syntax to distinguish native and FSinF filesystem names, this might be a single command "copy", although get/put is nicely documenting.
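A sketch of these three operations, assuming a stateless whole-file API (the FSinF side is modeled as an in-memory table; the names are hypothetical):

```python
from pathlib import Path

class FSinF:
    """Stateless whole-file interface: no open/read/write/close, no file handles.
    Every operation names its files and completes in a single call."""
    def __init__(self):
        self.files = {}            # stands in for the filesystem-in-a-file/dir

    def copy(self, fileSrc, fileDst):
        """Copy fileSrc (in FSinF) to fileDst (in FSinF)."""
        self.files[fileDst] = self.files[fileSrc]

    def get(self, fileSrc, fileDst):
        """Get contents of fileSrc (in FSinF), write to fileDst (native filesystem)."""
        Path(fileDst).write_bytes(self.files[fileSrc])

    def put(self, fileSrc, fileDst):
        """Put contents of fileSrc (native filesystem) into fileDst (in FSinF)."""
        self.files[fileDst] = Path(fileSrc).read_bytes()
```

Note that get and put are the same operation with the filesystems swapped, which is the argument for a single "copy" command with syntax distinguishing the two namespaces.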

Moving and Renaming

FSinF move fileSrc fileDst
move or rename fileSrc (in FSinF) to fileDst (in FSinF)

Again, syntax could allow moving and renaming between FSinF and the native filesystem.

Inter-filesystem operations

get and put are inter-filesystem operations. We don't need to limit ourselves to getting from the FSinF to the native filesystem. We could transfer between two different FSinFs, FSinF1 and FSinF2. Possibly might stage through the native filesystem, but not necessarily always.

Parts of Files

Don't want always to operate on whole files. Although that is natural.

May want to specify parts of files:
  • byte offset regions
  • line regions
  • XML clauses
  • record numbers
I don't think that we need to define these at the level of the FSinF API. They could be FSinF filesystem specific.

At the command line level, these would be optional parameters.

E.g.

FSinF get -range "line 40:line 100" file
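One way such a filesystem-specific range parameter might be parsed and applied, purely as an illustration (the "line N:line M" syntax is taken from the example above; function names are hypothetical):

```python
def parse_range(spec):
    """Parse a '-range "line 40:line 100"' argument into (start, end) line numbers."""
    start, end = spec.split(":")
    assert start.startswith("line ") and end.startswith("line ")
    return int(start[len("line "):]), int(end[len("line "):])

def get_lines(data, spec):
    """Return only the requested line region of a file's contents (1-based, inclusive)."""
    start, end = parse_range(spec)
    lines = data.splitlines(keepends=True)
    return "".join(lines[start - 1:end])
```

A byte-offset or XML-clause range would be a different parser behind the same optional parameter, which is why these can stay FSinF-specific rather than part of the core API.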


Hints and Requirements

Different filesystems have different semantics. E.g. "remove" may mean "remove completely", or "remove from current version, but keep earlier versions around".

Such semantics are FSinF specific. Optional arguments.

May want to indicate if the semantics are mandatory or optional. E.g. "remove from current version, but keep earlier versions around" motivated by desire to remove copyright infringing stuff is mandatory - and must be done. Or must receive an error if cannot be done.

E.g.

FSinF remove -require hard-removal file ...

FSinF remove -optional soft-removal file ...
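A sketch of how mandatory versus optional semantics might behave, assuming a hypothetical versioning filesystem that only knows how to do soft removal:

```python
class RemovalNotSupported(Exception):
    pass

class VersionedFS:
    """Hypothetical versioning FSinF whose only removal semantic is soft removal."""
    CAPABILITIES = {"soft-removal"}

    def __init__(self):
        self.current = {"file1": b"data"}   # current version
        self.history = dict(self.current)   # earlier versions kept around

    def remove(self, name, require=None, optional=None):
        if require and require not in self.CAPABILITIES:
            # Mandatory semantics the filesystem cannot honor must produce an error.
            raise RemovalNotSupported(require)
        wanted = require or optional
        if wanted == "hard-removal" and wanted in self.CAPABILITIES:
            self.history.pop(name, None)    # remove completely
        self.current.pop(name, None)        # otherwise: remove from current version only
```

The same request, marked -require, errors out; marked -optional, it silently degrades to what the filesystem can do.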

Other Filesystem Operations

FSinF remove files ...

FSinF create files ...
Empty files created.
Possibly specify types.

FSinF mkdir dirname

FSinF mkfs fsinf-file-or-dirname

FSinF fsck fsinf-file-or-dirname

FSinF chperm files...
Changes permissions
Also used to change owner, group name, etc.
Expect to extend:
E.g.
May start off with UNIX permissions, or ACLs
May create my own extra permission classes and ACLs, more flexible than native OS.

FSinF metadata add file metadata...
FSinF metadata remove file metadata...
Manipulate extra metadata associated with files

Synchronization

FSinF get file1 file2 ...
atomically get consistent file versions

FSinF atomic compare-and-swap old-file oldcmpdata newfile

FSinF atomic compare-and-put old-file1 cmpdata1 newfile2

Have the "server" do the comparison, and then put/swap.
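A sketch of server-side compare-and-swap, with the comparison data (here a SHA-1 of the expected contents) passed in the request, so no handle or lock state survives between calls (names hypothetical):

```python
import hashlib

class CASError(Exception):
    pass

class AtomicFS:
    """The 'server' holds the files; clients ship comparison data with each request."""
    def __init__(self):
        self.files = {}

    def compare_and_swap(self, name, expected_sha, new_data):
        """Replace name's contents only if their current hash matches expected_sha."""
        current = hashlib.sha1(self.files.get(name, b"")).hexdigest()
        if current != expected_sha:
            raise CASError("file changed since it was read")
        self.files[name] = new_data
```

A client that reads a file, edits it, and compare-and-swaps it back gets optimistic concurrency without locks: a conflicting intervening write surfaces as a CASError to retry.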

Possibly

FSinF lock file ...
FSinF unlock file ...
Although I dislike the state implied by locking.

Layering, Access to


E.g. in a directory ~/myproject
with a subdirectory .git

I would want to be able to say

FSinF -type git move file1 file2

But I might also want to say, at the level of the content-based layer:

FSinF -type git-sha-content-based-fs remove 0x6416fec17...

I might also have a .svn link, so I may want to say

FSinF -type svn get "-r4 file1 "
Note how "-r4 file1" almost looks like a filename / specification to the svn filesystem.

I.e. I may want to have several different filesystems under a tree, such as .git / .svn. Or even my old CVS / CVS.other-repo stuff.

But I also want to be able to look at the layers in any of them.

Layers I want

Raw filesystem. I.e. basically a NOP.

FSinF pack:
Compacts the above.
Probably needs some form of delta specification, for delta compression.

FSinF content-based
Uses hashes to deduplicate.
Me, I want to be able to handle hash collisions. Want both stupid hashes, like UNIX compress, and fancier hashes almost guaranteed to be collision free.
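To illustrate the deliberately-collision-prone idea: a content-based store keyed by an intentionally weak hash, with a collision chain per bucket, so the collision-handling path gets exercised constantly (entirely hypothetical code):

```python
class ContentStore:
    """Deduplicating store keyed by a deliberately weak hash (byte sum mod 256),
    so collisions happen all the time and must be handled correctly."""
    def __init__(self):
        self.buckets = {}   # weak hash -> list of distinct contents

    @staticmethod
    def weak_hash(data):
        return sum(data) % 256

    def put(self, data):
        """Store data, deduplicating identical content; return (hash, index) as key."""
        bucket = self.buckets.setdefault(self.weak_hash(data), [])
        if data not in bucket:          # collision chain: compare actual contents
            bucket.append(data)
        return (self.weak_hash(data), bucket.index(data))

    def get(self, key):
        h, i = key
        return self.buckets[h][i]
```

Swapping weak_hash for SHA-1 would make the chains almost always length one, without changing any other code - which is the point of proving the collision path works first.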

FSinF checksummed
Adds checksums, verifies integrity
Maybe even ECC

How about:
FSinF mirrored?

FSinF encrypted.

FSinF signed
but not encrypted

FSinF version-control
version controlled
in my dreams, with my extended semantics.

FSinF dirtree, file, xml, tar ...

Friday, May 29, 2009

Git slowness

After attempting to use a separate git repo per subproject, I switched to using a single git repo for a superproject consisting of many subprojects.

Actually, even the "per subproject" was really "per subproject that happens to contain several sub-sub-projects". I.e. I didn't go to the really fine granularity of subprojects that I want. Because even this intermediate level of granularity was annoying to deal with. Kept sticking things in one repo.

So, I'm trying to find out if git can do the "multiple subprojects in same repo" style that I used on CVS.

Apart from the usual problems - you don't want to tag the whole tree for just a subproject, e.g. a library - I am finding a surprising issue:

Git is slow.

Checking in a single file is slow. Perhaps because it scans the whole tree? Or perhaps because it has to walk up the tree to find the .git directory? In any case, its slowness can be felt, compared to the separate repository for each subproject usage model. But, unfortunately, the separate repository for each subproject usage model seems to be error prone.

I find this surprising because I had heard that git was really fast. Certainly, it is fast for some operations, such as tag.

But one of my most common operations is checking in one or two files, in a small subproject, that is part of a super-project, that is part of a super-super-project. And it hurts.

In my copious spare time: benchmarking git.

I suspect that my usage patterns for a version control tool are different than many other folks'. I am definitely a "checkin early and often" sort of guy. If I was a "checkin seldom" guy, I wouldn't mind so much.

Tuesday, May 26, 2009

git fetch and tag conflicts

As mentioned in earlier posts, I may merge two unrelated repositories.

Warning: the two unrelated repositories to be merged may both have tags of the same names. I.e. there may be tag conflicts.

By default, git fetch leaves conflicting tags pointing where they were originally in the destination. By default, git fetch only fetches non-conflicting tags, in the range fetched.

git fetch --tags ... will fetch other tags. However, in the case of conflicting tags, the source tag, the new one, will completely override the existing tag in the destination.

This can result in loss of the metainformation represented by the tags. E.g. if both of the repositories involved have the same tag, something like 'git tag Okay'. The Okay tag for one of the repositories will be lost.

Note that even if you embed the date within the tag, there may be lossage. E.g. 'git tag Tests-passed-2009-05-29'. Even if you embed the time ... but the more fine grained the tag timestamping, the less likelihood of a tag conflict producing loss of information.

Fortunately (?), git fetch / merge / pull do not seem to merge tags on a file by file basis. That would be bad, indicating possibly inconsistent sets.

There appears to be no way - i.e. I have not found a way - to signal an error if a tag is being lost as a result of such a fetch/merge.

MORAL AND COMMENTARY:

Beware of the possibility of losing information via this mechanism.

Embedding the date and time in tags may be a good way to reduce the possibility of losing tag information.

You can embed time a priori.

A posteriori, you may wish to rename tags on a repository or branch that is about to be fetch/merge/pulled, from something like tagFooBar to branch1-tagFooBar. This is a posteriori, because you don't know at the time you create a tag what uniqifying branch prefix it would be merged as.

... Or do you ...? If tags implicitly had (a) timestamp, (b) hostname, (c) pathname, (d) user, the probability of a tag conflict would be diminishingly low. If furthermore you had the contents (checksum thereof) of the files involved ... Then this would be effectively unique. After all, if two tags have the same name, refer to the same file versions, then for all intents and purposes they ARE identical. Involving time, user, etc. is just icing on the cake.

This tends to imply to me that tags should be first class objects, much as files. They should have an arbitrarily long unique name, specifiable by any set of coordinates that is uniqifying. However, when coordinates are not uniqifying, i.e. when two tags have the same name but differ in other coordinates such as file contents, then (a) both should be maintained in the repository, but (b) the shorthand using the tagname only should not be allowed.
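A sketch of such first-class tags: the full identity is the coordinate tuple, records that are identical in every coordinate deduplicate, and the name-only shorthand errors out once it becomes ambiguous (the names and the particular coordinate set are hypothetical):

```python
import time

class AmbiguousTag(Exception):
    pass

class TagStore:
    """Tags as first-class objects: identity = (name, time, user, host, content)."""
    def __init__(self):
        self.tags = []

    def tag(self, name, content_id, user="glew", host="localhost", when=None):
        record = {"name": name, "content": content_id, "user": user,
                  "host": host, "time": when if when is not None else time.time()}
        if record not in self.tags:   # same name AND same coordinates: already identical
            self.tags.append(record)
        return record

    def lookup(self, name):
        hits = [t for t in self.tags if t["name"] == name]
        if len(hits) > 1:
            # Shorthand disallowed; caller must supply more coordinates.
            raise AmbiguousTag(name)
        return hits[0]
```

Under this model, a fetch that imports a same-named tag loses nothing: both records survive, and only the convenient shorthand degrades.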


This is relevant to "floating tags", such as "git tag tests-passed". The uniqifying coordinates described above would automatically be applied. This way, you could do "git tag tests-passed" as many times as you wanted.

Perhaps for floating tags you would want the name tests-passed to select the most recent. Or perhaps there should be operators such as "Most recent tag tests-passed on branch branchBelongToSomebodyElse".


Since git does not support such tag uniqification, it behooves one to check for tag conflicts manually.


Git tag philosophy - don't share tags



The 'git tag' manpage,
e.g.
http://www.kernel.org/pub/software/scm/git/docs/git-tag.html

explains git's philosophy, that tags are really not meta-data that should be shared.

This happens because of the "multiple users" mindset.

Since I have a "single user" mindset, and also a "subprojects" mindset, it appears that I am probably more wont to want to preserve tags than the original git authors, Linus et al.

My approach to making tags be first class, implicitly uniqified, seems to solve this nicely: it preserves the information, without requiring extra work. Whereas the present git strategy seems to require considerable extra work to preserve tag history.

Tags on branches



Some tags want to be branch specific. E.g. tests-passed is a floating tag that should float independently on many different branches.

Whereas other tags may want to be repository wide. E.g. tag-the-only-version-on-any-branch-that-looks-good.

Uniqification may be the appropriate thing here. Tags may want to be implicitly made on branches, as one of the uniqifying coordinates.

What about different files in different branches having the same tag? Again, think of it as a query: "SELECT files WHERE branch=* AND tag_name=foo"

Something for me to do?



Ahhhh.... maybe there is some way I can improve the state of the art in version control. Subprojects, tags, directories.

Tags and subprojects



In CVS I grew into the habit of having subprojects live in separate directory trees of a great big mother source tree.

However, one might want to consider using tags, on a per-file basis, as an indication of subprojects.

E.g. in that big source tree I would typically have a project-skeleton - README, bin, etc. I could tag those files 'skeleton'. If I then checked out just those files marked 'skeleton', I would get those files.

A subproject might be checked out embedded in the whole tree. It might have the skeleton files, as well as its own contributions to shared directories such as bin.

I would prefer to use non-overlapping subdirectories, but, tags used in this manner would be useful, given how traditional UNIX distributes subproject files all over the standard directory tree (bin, lib, etc).

Such tags would need to float automatically. I.e. they would need to be associated with file object names, not particular file object versions.
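The idea can be sketched as a simple query over per-file-object tags (the 'skeleton' example from above; the data here is made up):

```python
def checkout_subproject(repo_tags, tag):
    """Treat a per-file floating tag as a subproject definition:
    checking out tag 'skeleton' yields exactly the files so marked."""
    return sorted(f for f, tags in repo_tags.items() if tag in tags)

# repo_tags maps file object *names* (not particular versions) to their tags,
# so the tags float automatically with the latest version of each file.
repo = {
    "README":      {"skeleton"},
    "bin/setup":   {"skeleton", "toolA"},
    "src/toolA.c": {"toolA"},
}
```

A file like bin/setup can belong to several subprojects at once, which is exactly what non-overlapping subdirectories cannot express.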

git directories not first class?

git apparently does not treat directories as first class citizens.

E.g. there apparently is no way to checkin an empty directory. Or, at least, no way to provide a log entry when creating a directory.


SomeSecretHostname /users/glew/hack/git-hacking/ 441 : mkdir git-dir-example
SomeSecretHostname /users/glew/hack/git-hacking/ 442 : cd git-dir-example
Directory: /users/glew/hack/git-hacking/git-dir-example
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 443 : got init
got: Command not found.
# my usual typo: got instead of git
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 444 : git init
Initialized empty Git repository in /fs30/home.directory.11/glew/hack/git-hacking/git-dir-example/.git/
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 445 : git log
fatal: bad default revision 'HEAD'
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 446 : echo hi > there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 447 : git add there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 448 : git commit -m'there'
[master (root-commit) 2849a27] there
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 100644 there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 449 : git log
commit 2849a27602d10ec8192e931c5410491521f3fe73
Author: Andy Glew Linux
Date: Tue May 26 11:48:49 2009 -0700

there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 450 : git mkdir foo
git: 'mkdir' is not a git-command. See 'git --help'.
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 451 : mkdir foo
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 452 : git commit
# On branch master
nothing to commit (working directory clean)
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 453 : git add foo
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 454 : git commit
# On branch master
nothing to commit (working directory clean)
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 455 : echo hi > foo/there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 456 : git add foo/there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 457 : git commit
Waiting for Emacs...
[master aa4fd7d] foo/there
1 files changed, 1 insertions(+), 0 deletions(-)
create mode 100644 foo/there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 458 : git log
commit aa4fd7d713584ca2a3781f264fc9d48b1169f032
Author: Andy Glew Linux
Date: Tue May 26 11:50:01 2009 -0700

foo/there

commit 2849a27602d10ec8192e931c5410491521f3fe73
Author: Andy Glew Linux
Date: Tue May 26 11:48:49 2009 -0700

there
SomeSecretHostname /users/glew/hack/git-hacking/git-dir-example/ 459 :

Merging unrelated repositories

As I start up with git, I am also starting up work with several codebases that I am unfamiliar with.

The codebases are mostly under SVN, but not all; some are under CVS, some Perforce. I have decided to just suck everything into git for purposes of tracking any changes I am making myself. I am not linking to svn repos or anything else; I am just sucking in the SVN and CVS metadata, with the expectation that I can work under git, and then do a svn checkin.

The codebases are structured as overlapping projects, subprojects, and libraries. I was not aware of the relationship between the subprojects and libraries. E.g. I checked out a library, placed it under git, made changes - and now have checked out another tool that has this library as a subcomponent. I want the new tool to lie in git as well, and I want to checkin the vanilla source code - but I also, eventually, want to use my edits to the library with the new tool.

Basically, I am doing a lot of work with subprojects. I am doing a lot of work with initially unrelated repositories.


Here's a link to a procedure on "Merging two unrelated repositories":
http://www.simplicidade.org/notes/archives/2009/04/merging_two_unr.html

Sunday, May 24, 2009

What does it mean to temporarily converge branches?

In previous blog posts I have talked about how I would like to merge two diverged version control branches of my glew-home personal directory version control tree. Incrementally, however, not all in one go. Others have responded that git's data model does not support such file at a time merger.

True enough. But it inspires me to think about how such branch merger should behave.

I have these two branches. They are separate and distinct. I want them to converge, a file at a time. I don't really want to create a merged copy of a file that lives on two different branches. What I really want is for certain files, on a file by file basis, to belong to two different branches.

What does it mean to "belong to two different branches?"

If the branches remain independent, and just happen to have the same file version, then checkins on one branch do NOT automatically affect the other branch. The first such checkin will cause the branches to diverge. A manual merger may ensure that the two branches stay converged - but it is a manual action. If three or more branches are "converged" but not eliminated, the merging work increases correspondingly.

That's the key: merging branches, but not intending to eliminate the merged branches. At least not immediately. "Converging" branches, with the intention that changes made to one branch be made to the other branch(es), as automatically as possible.

When converging branches that contain multiple files, some files may be converged as described above, whereas others may remain distinct. Thus, the multifile branches that are to be merged, even if the intent is to eliminate one or both of them eventually, may pass through this phase where certain files are unmerged, but others are merged or converged - as I am trying to describe here.

OK, so what does this "converged" state mean? I think what it means is that the intent is that the branches shall remain identical. It changes the default from a new checkin causing the branches to diverge, with a manual re-merger required, to where a new checkin by default is made to both branches simultaneously. As correctly as possible.

Let's talk about a single file first.


File1:
v0 -> branch1.v1 -> branch1.v2
v0 -> branch2.v1' -> branch2.v2
where
branch1.v1 != branch2.v1'
branch1.v2 == branch2.v2
which we might indicate as
converged(branch1,branch2).v2


I would like to draw a diagram of diverging branches reconverging at this point, but I hate ascii art. Perhaps I can do a real drawing and attach it.

Anyway, if we have separate branches that have temporarily converged
v0 -> branch1.v1 -> branch1.v2
v0 -> branch2.v1' -> branch2.v2
then new checkins will cause the branches to diverge:
v0 -> branch1.v1 -> branch1.v2 -> branch1.v3
v0 -> branch2.v1' -> branch2.v2 -> branch2.v3'

whereas if they are intended to be converged we might see
v0 -> branch1.v1 -> converged(branch1,branch2).v2
v0 -> branch2.v1' -> converged(branch1,branch2).v2
after the first checkin of v3 on branch1,
v0 -> branch1.v1 -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3
v0 -> branch2.v1' -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3

A new checkin v4 on branch1 may cause it to look like

v0 -> branch1.v1 -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4
v0 -> branch2.v1' -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

which I prefer to indicate as a single history subsequent to the merge point:
-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

The point of the ?branch2 notation is that, although the checkins of v3 and v4 were actually made on branch1, they are intended to also be made on branch2. Underlying this is an assumption that there needs to be some testing or validation that these checkins are also good for branch2. If we did not require such testing, we would mark them as converged(branch1,branch2).v4.

But I have drunk deep of the koolaid of testing and agile methodologies, so I will not make such marking the default. I will assume the "tested on branch1, but not tested on branch2" state. And I will assume that there is a way of eventually removing the ? marking.

So, what does it mean for a file object to be in the converged, but tested only on one branch, state?
-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

What it means is that branch1 can continue to check in, but branch2 must merge the branch1 changes marked as ?branch2, before advancing. E.g. if somebody wants to do a checkin of vx on branch2, they will have to remove the ?branch2 marking on v4.

-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,branch2).v4 -> converged(?branch1,branch2).vx

If necessary, it may be required to merge v4 and vx

-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,branch2).v4 -> converged(?branch1,branch2).v4x

and since I believe in checking everything in, vx would have to be checked in on sort of a merge/task branch. This time a real, temporary, branch.

I.e. "converged" file object branches just affect the defaults. They require that the ?branch2 branch be updated before advancing the merged branch.
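The default-changing rule just described might be modeled like this (a toy sketch; the actual merge and test steps are elided, and all names are hypothetical):

```python
class MustMergeFirst(Exception):
    pass

class ConvergedFile:
    """Each version carries confirmed and pending ('?') branch markings.
    A checkin on a branch with a pending marking must clear it first."""
    def __init__(self, branches):
        self.branches = set(branches)
        self.history = [{"confirmed": set(branches), "pending": set()}]

    def checkin(self, branch):
        tip = self.history[-1]
        if branch in tip["pending"]:
            # e.g. must merge/test v4 on branch2 before checking in vx there
            raise MustMergeFirst(branch)
        self.history.append({"confirmed": {branch},
                             "pending": self.branches - {branch}})

    def confirm(self, branch):
        """Record that the tip has been tested on this branch, clearing the '?'."""
        tip = self.history[-1]
        tip["pending"].discard(branch)
        tip["confirmed"].add(branch)
```

Branch1 can keep checking in freely; branch2 is blocked only until it confirms (merges and tests) the pending tip, which is exactly the changed default.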

In some ways this is like a CVS workspace versus a shared repository that has been updated elsewhere. The CVS workspace is in some ways a temporary branch, although not always recorded in permanent history. (Sometimes regrettably so - who has not update-merged a working workspace, had it break, and not been able to recover the unmerged files?)

It should always be possible to have branch2 diverge away earlier - say right after v2. But the intention is for branch1 and branch2 to stay converged.

As for multifile branches:

Back to my earlier post: to incrementally merge two multifile branches, we try to converge file objects a few at a time, until the whole branches are converged. At that point we can choose to eliminate or combine the branches, or leave them independent but converged.

Actually, I really should not be talking about converged branches. These are better called streams, or lines of development, that occasionally converge, in whole or in part, but which remain independent and which may subsequently diverge.

Note also that multifile branches or streams may be converged in some files but not others. This could be a way of using streams to basically conditionalize code, at file granularity - e.g. distinct .logins used at different systems. I've tried to use conventional branches to track this sort of thing, and it is a real pain. Perhaps this convergence idea would make it easier.

Finally, it is interesting that this is really quite relevant to my recent hardware interest in SIMT processors. Threads may drift into or out of coherence, aka convergence. Version control systems continue to be a good way to work out ideas for microarchitecture such as the maintenance of speculative state.

Tuesday, May 19, 2009

Git submodules

Appear to be like CVS's modules.

Long ago I gave up on CVS's modules. I found that checking out subtrees worked reasonably well.

I want to figure out how to check out and in just particular objects or directory subtrees from a git repo. Checking out should be easy enough, particularly if just exporting: scan whatever head of branch or contour you want, and check out the objects that match a criterion such as "under this tree".

Checking out the history is a bit more challenging. If file objects have always been at the same position in the tree, no worries. If file objects have moved ... moving within the sub-tree that is being checked out, again no worries. But if they have moved into or out of the sub-tree being checked out, what should you do?

I lean towards checking out the history objects, but somehow preventing checking out of file version objects that are outside the current sub-tree. You could read about them in the log, diff against them, do the equivalent of cvs update -p on them to look at the contents ... but the nice default of checking out a version of such a file object into its historical position in the filesystem would not apply to such out-of-bounds file objects.

Checking in is more of a challenge, mainly because of assumptions wrt atomicity of multi-file, whole-project, checkins. Clearly this atomicity cannot be supported at all times if my goal of checking in/out individual file objects and subtrees is to be supported.

However, I think that we should support such atomicity as much as possible, since so many people like it. The lack was one of the biggest complaints about CVS.

I think that it boils down to questions such as "What is a branch?" and "What is the head of a branch?" Whole project commits correspond to a set of file objects, with a guarantee that the next checkin on the branch will be linked backwards to the current set.

E.g. imagine a small multi-file system, all on the same branch:

F1/v1 -> F1/v2 -> F1/v3
F2/v1 -> F2/v2 -> F2/v3
F3/v1 -> F3/v2 -> F3/v3


Checking in/out the whole project conceptually advances the versions of all file objects in lockstep, or, equivalently

Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}

Checking in some, but not all, of the files corresponds to something like this:


F1/v1 -> F1/v2 -> F1/v3
F2/v1 -> F2/v2 -> F2/v3 -> F2/v4
F3/v1 -> F3/v2 -> F3/v3 -> F3/v4


or almost equivalently:


Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}
-> IncompleteProject/v4{F2,F3}


On this branch, you could ask for the most recent whole-project atomic commit, or the latest version of all files.

Part of the problem is that the "filesystem" consists of the logical equivalent of directory files that list all filenames in a given version of the whole project, and point to the version (content) of the corresponding file objects. The filesystem does not really have a representation for a version of an individual file, except its content. Having per-file-object version tracking is conceptually easy, but would be more work, and would be liable to inconsistencies.

This is not a conceptual problem, just an implementation artifact. If we had database views...

We could keep the whole project viewpoint by doing something like


Project/v1{F1,F2,F3} -> Project/v2{F1,F2,F3} -> Project/v3{F1,F2,F3}
-> Project/v4{F1/v3,IncompleteProject/v4{F2,F3}}
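The whole-project-snapshot view, including partial commits that carry old file versions forward, can be sketched as (hypothetical code, mirroring the F1/F2/F3 example above):

```python
class Repo:
    """Project versions are snapshots mapping file name -> per-file version.
    A partial commit advances some files and carries the rest forward."""
    def __init__(self, files):
        self.heads = {f: 1 for f in files}
        self.project = [dict(self.heads)]   # Project/v1{F1,F2,F3}

    def commit(self, changed):
        for f in changed:
            self.heads[f] += 1
        # Unchanged files keep their old versions in the new project snapshot.
        self.project.append(dict(self.heads))

    def latest_atomic(self):
        """Most recent whole-project atomic commit: every file advanced together."""
        for i in range(len(self.project) - 1, 0, -1):
            prev, cur = self.project[i - 1], self.project[i]
            if all(cur[f] == prev[f] + 1 for f in cur):
                return self.project[i]
        return self.project[0]
```

This supports both of the queries suggested above on the same branch: the most recent whole-project atomic commit, or the latest version of all files (the last snapshot).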

Monday, May 18, 2009

git usage model - individual files, subdirs

My attempts to use git are running into usage model problems.

As mentioned before, I have checked two diverged CVS trees into different branches of the same git repository. I want to merge the branches - but not all at once. I want to merge on a file by file, directory by directory, subproject by subproject, basis, testing as I go.

Unfortunately, apparently git-merge is all or nothing. It merges all of the files in a branch, but does not have the option of merging one at a time.

I.e. it does not seem to have the ability to do what in CVS might look like

cvs update -rbranch1
cvs update -jbranch2 foo.c
cvs ci

merging only foo.c

This is the rub. Git seems only capable of treating the entire repo, the entire branch, as an object, whereas I look on files, directories, etc. as potential subobjects.

I am falling back to checking out two different git repos for the branches, and merging "manually" by copying files from one to the other. But this sucks.