The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, August 11, 2014

Version control branches are not branches - need "merging" of unrelated version control objects

Most version control systems have the concept of branching: a versioned object (e.g. a ,v file in RCS or CVS, an entire repository in most DVCSes) starts off with a single line of versions, but at some point the lines of development may diverge, and be developed almost independently.

"Almost independently", since it is common to synchronize two diverged branches - sometimes a complete synchronization, making them identical, sometimes an incomplete one, e.g. copying changes from a master branch to a maintenance release branch.

The term "branch" is a bad term, at least if you are thinking in terms of trees - unless your concept of branching includes re-merging, with tissue fusion where branches overlap. This often happens with vines like ivy, and occasionally happens with rubbing branches in trees.

The term "branch" is a bad term - but unfortunately I do not know of a better one.

Source code version control corresponds more closely to gene flow diagrams or "family trees" - but again the terminology is inaccurate.

I will no longer obsess about improving the terminology - but I do think that "branches" <=> trees has warped or limited our thinking.


The idea that branches reflect divergence followed, possibly, by convergence is also misleading.   The versioned objects may start off independent, and then converge first, before having the chance to diverge again.

Small real world example:  We recently changed VCS, from CVS (and HG) to Perforce.   All of the CVS files were converted en masse. Evolution then continued in the Perforce depot.

Later it was discovered that some edits had continued to be made (in neither CVS nor P4 nor HG).  These files were checked into a separate place in Perforce.

One may argue that what should have been done was to have created a branch off some revision along the CVS-converted-to-P4-history, and then checked those diverged versions in on that branch.  But that was not done.  Too late.

One may argue that these files are logically related to a common ancestor.  True - but that ancestor may not be represented in the Perforce depot.

What I argue is that it should be possible in a VCS to take separately versioned objects, and then merge them into a single versioned object.  Or, you may prefer "connect two independent graphs of versions into a graph with no supremum, no common ancestor".
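A minimal Python sketch of this idea, assuming a history is modelled simply as a mapping from each version to the set of its parents. The class `VersionGraph`, its method names, and the version labels are hypothetical, chosen for illustration; they are not the data model of Perforce or any other real VCS.

```python
class VersionGraph:
    """Version history as a DAG: each version maps to the set of its parents."""

    def __init__(self):
        self.parents = {}

    def commit(self, version, *parent_versions):
        """Record a new version; zero parents starts a new, independent history."""
        if version in self.parents:
            raise ValueError(f"duplicate version {version!r}")
        for p in parent_versions:
            if p not in self.parents:
                raise KeyError(f"unknown parent {p!r}")
        self.parents[version] = set(parent_versions)
        return version

    def ancestors(self, version):
        """Every version reachable through parent edges, including `version`."""
        seen, stack = set(), [version]
        while stack:
            v = stack.pop()
            if v not in seen:
                seen.add(v)
                stack.extend(self.parents[v])
        return seen

    def common_ancestors(self, a, b):
        """Versions that both `a` and `b` descend from (or are)."""
        return self.ancestors(a) & self.ancestors(b)

    def roots(self):
        """Versions with no recorded parent."""
        return {v for v, ps in self.parents.items() if not ps}


g = VersionGraph()
# History converted from CVS and continued in the new depot (hypothetical labels).
g.commit("cvs_1")
g.commit("p4_2", "cvs_1")
# Stray edits checked in separately: a second, unrelated root.
g.commit("stray_1")
g.commit("stray_2", "stray_1")
# Before merging, the two lines share nothing.
shared_before = g.common_ancestors("p4_2", "stray_2")
# The merge is just a version with parents drawn from both graphs.
g.commit("merged", "p4_2", "stray_2")
```

After the merge the graph is connected but still has two roots, and the two parent lines still have no common ancestor. Git, for what it is worth, permits a merge of this shape via `git merge --allow-unrelated-histories`.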

Similarly, it should be possible in a VCS to create new ancestor versions at any time.  Not just to start off with a base or original version, and then move forwards in time - but also to go backwards in time.   Imagine, for example, that one is doing literary research, say into versions of ancient Greek literature that were copied and recopied by scribes in Alexandria, during the Islamic Golden Age, and also in monasteries in Ireland during the Middle Ages.  Then a new scroll is discovered preserved in a library in Pompeii - and it is obvious that it is an ancestor to some but not all of the later versions.  It should be possible to retroactively edit the history graph, inserting this new version.
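Retroactive insertion can be sketched in a few lines of Python, again assuming the history is a plain mapping from version to parent set. The function `insert_ancestor` and the manuscript labels are hypothetical, and the sample graph is invented purely to show the mechanics.

```python
def ancestors(parents, version):
    """All versions reachable through parent edges, excluding `version` itself."""
    seen, stack = set(), list(parents[version])
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(parents[v])
    return seen


def insert_ancestor(parents, new_version, children):
    """Add `new_version` as a parentless version and make it a parent of `children`."""
    if new_version in parents:
        raise ValueError(f"{new_version!r} is already in the graph")
    missing = [c for c in children if c not in parents]
    if missing:
        raise KeyError(f"unknown versions: {missing}")
    parents[new_version] = set()
    for c in children:
        parents[c].add(new_version)


# Hypothetical history as known before the discovery.
manuscripts = {
    "alexandria": set(),
    "baghdad": {"alexandria"},
    "ireland": set(),
}

# The newly found scroll is judged an ancestor of the Alexandrian line only.
insert_ancestor(manuscripts, "pompeii", ["alexandria"])
```

Here `baghdad` acquires the new ancestor transitively, while `ireland` is untouched. In a system such as Git, where a commit's identifier is a hash that covers its parents' identifiers, the same edit would change the identifier of every descendant, which is part of why it feels unnatural there.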

Now, in this example the versions may be imagined as being descended from a common ancestor.  But perhaps not - perhaps two independent works were created, and then borrowed from each other until they became very similar, possibly identical after a certain point.

Linus has argued against explicitly recording file renamings in git - saying that comparing file content should suffice.  This is true... but imagine the literature research problem.  Here we may want to record the opinions of antiquities experts as to which version of the document is related to which, as a matter of opinion rather than incontrovertible fact.  Those experts may have used content comparisons to make their inferences, but it may well be impossible for an automated tool to repeatedly infer those connections, unless they are recorded explicitly.
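One way to picture explicitly recorded, opinion-based connections is as edges that carry their own provenance. The sketch below is only an illustration: the `Relation` fields, the confidence scale, and the sample data are all assumptions, not something any existing VCS provides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    """A claimed ancestor/descendant link, recorded as an attributed opinion."""
    ancestor: str
    descendant: str
    asserted_by: str   # who holds this opinion
    confidence: float  # 0.0 to 1.0, on a scale assumed for this example
    basis: str         # free-text justification


# Invented sample opinions.
relations = [
    Relation("scroll_A", "codex_B", "expert_1", 0.9, "shared scribal errors"),
    Relation("scroll_A", "codex_C", "expert_2", 0.4, "similar phrasing"),
    Relation("scroll_D", "codex_C", "expert_1", 0.7, "matching marginal notes"),
]


def claimed_ancestors(relations, version, min_confidence=0.0):
    """Ancestors asserted for `version` by anyone, at or above a confidence floor."""
    return sorted({
        r.ancestor
        for r in relations
        if r.descendant == version and r.confidence >= min_confidence
    })


everything = claimed_ancestors(relations, "codex_C")
well_supported = claimed_ancestors(relations, "codex_C", min_confidence=0.5)
```

Because each edge is stored rather than re-derived, a later query does not depend on a tool being able to reproduce the expert's inference from content alone.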

Another example, beyond individual files:  I have version controlled my ~glew home directory for decades (yes, literally).  But I have occasionally switched version control systems without preserving history.  And the versioned tree has diverged on some OSes.  I should like to merge these graphs.

I need to improve my terminology wrt "versioned objects". Were there two distinct versioned objects prior to connecting their version graphs, but one thereafter? The concept "versioned object" is mainly used so that I can talk about versioned sets of versioned objects (and versioned sets of versioned sets of ...).

There are really default queries - there are relations between versions that are essentially or conceptually 1:1, such as just adding a few lines to a 1000 line file, but leaving it in the same place in the filesystem. Similarly, moving a file from place to place in the filesystem. There are relations that are "into", such as taking a file containing a function and making it just part of a larger text file. This is much the same as including a file in a directory, except that it is easier to track the evolution of a file in a directory than it is to track a chunk of text in a larger text file. In my dreams, tracking a chunk of text that happens to be a function is easy, even across renamings - but functions may replicate, etc.

Plus there are graph edges that correspond to creating new versioned objects - such as splitting a file into pieces.
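The kinds of relation described above could be told apart, very roughly, by how many objects sit on each side of a transformation. This Python sketch rests on a loud assumption: that an "into" relation can be modelled as several sources feeding one destination (the larger file's previous version plus the absorbed file). The names `EdgeKind` and `classify` are hypothetical.

```python
from enum import Enum


class EdgeKind(Enum):
    ONE_TO_ONE = "1:1"   # edit in place, or a plain rename/move
    INTO = "into"        # an object absorbed into a larger one
    SPLIT = "split"      # one object broken into several new ones
    OTHER = "other"      # anything this simple count cannot classify


def classify(sources, destinations):
    """Classify a transformation purely by how many objects are on each side."""
    n_src, n_dst = len(sources), len(destinations)
    if n_src == 1 and n_dst == 1:
        return EdgeKind.ONE_TO_ONE
    if n_src > 1 and n_dst == 1:
        return EdgeKind.INTO
    if n_src == 1 and n_dst > 1:
        return EdgeKind.SPLIT
    return EdgeKind.OTHER


rename = classify(["a.c"], ["b.c"])
absorb = classify(["util.c", "big.c"], ["big.c"])
split = classify(["big.c"], ["part1.c", "part2.c"])
```

Counting alone cannot see a function that is copied rather than moved, so a real tool would presumably need the edge kind recorded explicitly rather than inferred this way.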

What I am trying to say is that it is easy to track the evolution of sets of files or other versioned objects if the transformations are 1:1, or into, or...

Overall, if a set is identified by a rule, it is easy to track if the converged or merged objects all satisfy the rule for a set.  E.g. "all files under dir/subdir" is not affected if a file is renamed but lives in the same directory.

But if a transformation means that some of the participants are no longer covered by the rule defining a set, then one may need to query.   E.g. if you have defined the set for a tool as "Module A = all files under tools/toolA", but tools/toolA/foo.c has had a library function extracted into tools/lib/libAA/lib1.c, then Module A may no longer be freestanding.   Or we may want to modify its dependencies.
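The Module A example can be sketched as a rule (a predicate on paths) plus a check for transformations whose results fall outside the rule. This is a sketch under stated assumptions: transformations are represented as (source path, list of destination paths) pairs, the helper names `under` and `needs_review` are hypothetical, and `tools/toolA/bar.c` and `tools/toolA/baz.c` are invented to show a harmless rename.

```python
from pathlib import PurePosixPath


def under(directory):
    """Build a rule: does a path lie somewhere beneath `directory`?"""
    base = PurePosixPath(directory)

    def rule(path):
        return base in PurePosixPath(path).parents

    return rule


def needs_review(rule, transformations):
    """Return (source, destination) pairs that started inside the set and left it."""
    return [
        (src, dst)
        for src, dests in transformations
        for dst in dests
        if rule(src) and not rule(dst)
    ]


module_a = under("tools/toolA")

transformations = [
    # foo.c stays put, but a library function is extracted elsewhere.
    ("tools/toolA/foo.c", ["tools/toolA/foo.c", "tools/lib/libAA/lib1.c"]),
    # A rename within the same directory (invented file names).
    ("tools/toolA/bar.c", ["tools/toolA/baz.c"]),
]

flagged = needs_review(module_a, transformations)
```

Only the extraction is flagged; the in-directory rename still satisfies the rule, so no query is needed for it. What to do with a flagged pair (widen the rule, or record a dependency) is left to a human.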