Sunday, May 24, 2009

What does it mean to temporarily converge branches?

In previous blog posts I have talked about how I would like to merge two diverged version control branches of my glew-home personal directory version control tree. Incrementally, however, not all in one go. Others have responded that git's data model does not support such a file-at-a-time merge.

True enough. But it inspires me to think about how such branch merger should behave.

I have these two branches. They are separate and distinct. I want them to converge, a file at a time. I don't really want to create a merged copy of a file that lives on two different branches. What I really want is for certain files, on a file by file basis, to belong to two different branches.

What does it mean to "belong to two different branches?"

If the branches remain independent, and just happen to have the same file version, then checkins on one branch do NOT automatically affect the other branch. The first such checkin will cause the branches to diverge. A manual merger may ensure that the two branches stay converged - but it is a manual action. If three or more branches are "converged" but not eliminated, the merging work increases correspondingly.
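A minimal Python sketch of this default behavior. It assumes a toy model where each branch is just a list of file contents; the names `branches`, `tips_equal`, and `checkin` are hypothetical and do not correspond to any real version control tool.

```python
# Toy model (an assumption for illustration): a branch is a list of
# file contents, oldest first.
branches = {
    "branch1": ["v0", "v1", "v2"],
    "branch2": ["v0", "v1'", "v2"],
}

def tips_equal(branches):
    """True when every branch currently ends at the same file content."""
    return len({history[-1] for history in branches.values()}) == 1

def checkin(branches, branch, content):
    """A plain checkin touches only the named branch."""
    branches[branch].append(content)

before = tips_equal(branches)        # both tips hold "v2"
checkin(branches, "branch1", "v3")   # nothing propagates to branch2
after = tips_equal(branches)         # branch1 is at "v3", branch2 still at "v2"
print(before, after)
```

The two branches only happened to agree; nothing in the model records an intent to keep them in step, so the first checkin splits them.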

That's the key: merging branches, but not intending to eliminate the merged branches. At least not immediately. "Converging" branches, with the intention that changes made to one branch be made to the other branch(es), as automatically as possible.

When converging branches that contain multiple files, some files may be converged as described above, whereas others may remain distinct. Thus, the multifile branches that are to be merged, even if the intent is to eliminate one or both of them eventually, may pass through this phase where certain files are unmerged, but others are merged or converged - as I am trying to describe here.

OK, so what does this "converged" state mean? I think what it means is that the intent is that the branches shall remain identical. It changes the default from a new checkin causing the branches to diverge, with a manual re-merger required, to where a new checkin by default is made to both branches simultaneously. As correctly as possible.

Let's talk about a single file first.


File1:
v0 -> branch1.v1 -> branch1.v2
v0 -> branch2.v1' -> branch2.v2
where
branch1.v1 != branch2.v1'
branch1.v2 == branch2.v2
which we might indicate as
converged(branch1,branch2).v2


I would like to draw a diagram of diverging branches reconverging at this point, but I hate ascii art. Perhaps I can do a real drawing and attach it.

Anyway, if we have separate branches that have temporarily converged
v0 -> branch1.v1 -> branch1.v2
v0 -> branch2.v1' -> branch2.v2
then new checkins will cause the branches to diverge:
v0 -> branch1.v1 -> branch1.v2 -> branch1.v3
v0 -> branch2.v1' -> branch2.v2 -> branch2.v3'

whereas if they are intended to be converged we might see
v0 -> branch1.v1 -> converged(branch1,branch2).v2
v0 -> branch2.v1' -> converged(branch1,branch2).v2
after the first checkin of v3 on branch1,
v0 -> branch1.v1 -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3
v0 -> branch2.v1' -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3

A new checkin v4 on branch1 may cause it to look like

v0 -> branch1.v1 -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4
v0 -> branch2.v1' -> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

which I prefer to indicate as a single history subsequent to the merge point:
-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

The point of the ?branch2 notation is that, although the checkins of v3 and v4 were actually made on branch1, they are intended to also be made on branch2. Underlying this is an assumption that there needs to be some testing or validation that these checkins are also good for branch2. If we did not require such testing, we would mark them as converged(branch1,branch2).v4.

But I have drunk deep of the koolaid of testing and agile methodologies, so I will not make such marking the default. I will assume the "tested on branch1, but not tested on branch2" state. And I will assume that there is a way of eventually removing the ? marking.
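One way to picture the ? marking is as per-version bookkeeping of which branches have validated a checkin and which are still pending. The Python sketch below is only an illustration under that assumption; the `Version` class and the `validated` and `pending` fields are invented names, not part of any existing tool.

```python
from dataclasses import dataclass, field

@dataclass
class Version:
    label: str
    validated: set = field(default_factory=set)  # branches where the checkin is tested
    pending: set = field(default_factory=set)    # branches still carrying the '?' mark

    def render(self):
        """Format the version in the converged(...) notation used in this post."""
        names = sorted(self.validated | self.pending)
        marked = ["?" + n if n in self.pending else n for n in names]
        return "converged(" + ",".join(marked) + ")." + self.label

def checkin(history, label, on_branch, all_branches):
    """Record a checkin made on one branch; every other branch starts as pending."""
    version = Version(label, {on_branch}, set(all_branches) - {on_branch})
    history.append(version)
    return version

streams = ["branch1", "branch2"]
history = [Version("v2", validated=set(streams))]  # the convergence point
checkin(history, "v3", "branch1", streams)
checkin(history, "v4", "branch1", streams)
line = " -> ".join(v.render() for v in history)
print(line)
```

Making "pending" the default for every branch other than the one checked in on is the design choice argued for above: a change counts as untested on a branch until someone says otherwise.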

So, what does it mean for a file object to be in the converged, but tested only on one branch, state?
-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,?branch2).v4

What it means is that branch1 can continue to check in, but branch2 must merge the branch1 changes marked as ?branch2, before advancing. E.g. if somebody wants to do a checkin of vx on branch2, they will have to remove the ?branch2 marking on v4.

-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,branch2).v4 -> converged(?branch1,branch2).vx

If necessary, v4 and vx may have to be merged:

-> converged(branch1,branch2).v2 -> converged(branch1,?branch2).v3 -> converged(branch1,branch2).v4 -> converged(?branch1,branch2).v4x

and since I believe in checking everything in, vx would have to be checked in on sort of a merge/task branch. This time a real, temporary, branch.

I.e. "converged" file object branches just affect the defaults. They require that the ?branch2 branch be updated before advancing the merged branch.
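The rule can be sketched in Python as a check performed at checkin time. This is one reading of the histories shown earlier, in which clearing the mark on the newest version is enough to let the lagging branch advance; the class and method names (`ConvergedFile`, `accept_tip`, `NeedsMerge`) are hypothetical.

```python
class NeedsMerge(Exception):
    """Raised when a branch tries to advance past changes it has not accepted."""

class ConvergedFile:
    def __init__(self, branches, first_label):
        self.branches = set(branches)
        # Each entry is [label, set of branches still carrying the '?' mark].
        self.history = [[first_label, set()]]

    def checkin(self, branch, label):
        """Advance the shared history, unless `branch` is behind the tip."""
        tip_label, tip_pending = self.history[-1]
        if branch in tip_pending:
            raise NeedsMerge(f"{branch} must accept {tip_label} first")
        # The new version is pending on every branch except the one that made it.
        self.history.append([label, self.branches - {branch}])

    def accept_tip(self, branch):
        """Remove the '?' mark for `branch` on the newest version."""
        self.history[-1][1].discard(branch)

f = ConvergedFile(["branch1", "branch2"], "v2")
f.checkin("branch1", "v3")
f.checkin("branch1", "v4")
try:
    f.checkin("branch2", "vx")   # branch2 has not yet accepted v4
    blocked = False
except NeedsMerge:
    blocked = True
f.accept_tip("branch2")          # after testing or merging v4 on branch2
f.checkin("branch2", "vx")       # now allowed; vx is pending on branch1
```

In this sketch v3 keeps its ?branch2 mark after v4 is accepted, matching the history drawn above. A real tool would also have to decide where the unmerged vx lives in the meantime, for example on the temporary merge/task branch mentioned earlier.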

In some ways this is like a CVS workspace versus a shared repository that has been updated elsewhere. The CVS workspace is in some ways a temporary branch, although not always recorded in permanent history. (Sometimes regrettably so - who has not done an update merge into a working workspace, had it break, and then been unable to recover the unmerged files?)

It should always be possible to have branch2 diverge away earlier - say right after v2. But the intention is for branch1 and branch2 to stay converged.

As for multifile branches:

Back to my earlier post: to incrementally merge two multifile branches, we try to converge file objects a few at a time, until the whole branches are converged. At that point we can choose to eliminate or combine the branches, or leave them independent but converged.
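As a rough Python sketch of that incremental process, assume each multifile branch is modeled as a mapping from path to content. The file names and the functions `converged_files`, `converge`, and `fully_converged` are made up for illustration.

```python
def converged_files(a, b):
    """Paths present on both branches with identical content."""
    return {path for path in a.keys() & b.keys() if a[path] == b[path]}

def converge(a, b, path, merged_content):
    """Converge one file: both branches now hold the same merged content."""
    a[path] = merged_content
    b[path] = merged_content

def fully_converged(a, b):
    """True once every file on either branch is identical on both."""
    return a == b

branch1 = {".login": "home settings", ".emacs": "A", "notes.txt": "same"}
branch2 = {".login": "work settings", ".emacs": "B", "notes.txt": "same"}

step0 = converged_files(branch1, branch2)       # only notes.txt so far
converge(branch1, branch2, ".emacs", "A+B")     # merge one file by hand
step1 = converged_files(branch1, branch2)
partly = fully_converged(branch1, branch2)      # .login still differs
converge(branch1, branch2, ".login", "merged settings")
done = fully_converged(branch1, branch2)
```

Between the first and last step the two branches sit in exactly the mixed state described above: some files converged, others still distinct.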

Actually, I really should not be talking about converged branches. These are better called streams, or lines of development, that occasionally converge, in whole or in part, but which remain independent and which may subsequently diverge.

Note also that multifile branches or streams may be converged in some files but not others. This could be a way of using streams to basically conditionalize code, at file granularity - e.g. distinct .logins used at different systems. I've tried to use conventional branches to track this sort of thing, and it is a real pain. Perhaps this convergence idea would make it easier.

Finally, it is interesting that this is really quite relevant to my recent hardware interest in SIMT processors. Threads may drift into or out of coherence, aka convergence. Version control systems continue to be a good way to work out ideas for microarchitecture such as the maintenance of speculative state.