The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Thursday, October 25, 2012

Multiple paths of history

I have blogged, whined, ranted and complained about systems that allow you to, nay, encourage you to, edit history.  But I also whine about systems that do not allow you to do something as trivial as fix a typo in a checkin message, let alone systems like Mercurial that do not really allow you to add properties like "passes the days-long QA test suite" to a version of the code in the repository.

Moreover, I really do understand why you want to rearrange history. My classoc example is of interleaved changes

  V0---(a1)--->Va1---(b1)--->Va1b1---(a2)--->Va2b1---(b2)--->Va2b2 = Vab

I.e. starting off at version V0, apply changeset (a1) to get version Va1, then changeset (b1) to get version Va1b1, and so on until you get final version Va2b2, which we will call simply version Vab.

I.e. there are really two independent changes going on, (a)=(a1)+(a2) and (b)=(b1)+(b2).  But they got interleaved.

Try as you might, this happens.

Using history rewriting tools like git rebase and hg graft you can rearrange the histories.

  V0---(a1)--->Va1--->(a2')--->Va --->(b1')--->Vab1'---(b2')--->Vab

or in the other order

  V0---(b1")--->Vb1"--->(b2")--->Vb --->(a1")--->Va1"b---(a2")--->Vab

Note that I have tried to indicated by priming, ' and ", that the changesets applied in different order may not be identical. Note: we are not talking about applying the SAME changesets in a different order, which might give different final results Vab, but instead calculating MODIFIED changesets (a1"), (b1'), etc. to give the same final result Vab.

Compactly, letting (a)=(a1)+(a2) and (b)=(b1)+(b2)


or in the other order


I know how to do this with standard history editing tools.


But this is problematic when others have pulled and cloned and made modifications to the intermediate pre-history edited versions like Va2b1.  In Mercurial, if that version is deleted from the repository, you may not be able to merge back in - or, when you merge, it may revive the deleted version from the dead.

This is bad.

Moreover, this "you must edit history before pushing" philosophy leads to people delaying sharing code.  Which is bad.


I have been thinking more and more about recording multiple paths of history in the repo.

I.e. pushing first

    V0---(a1)--->Va1--->(a2')--->Va --->(b1')--->Vab1'---(b2')--->Vab

and then later, at your leisure, going and rearranging the history so that it makes sense:

    V0---+---(a1)--->Va1--->(a2')--->Va --->(b1')--->Vab1'---(b2')--->Vab
               \                                                                                   ||

I.e. the final nodes Vab are equivalent.

Or, pushing my ascii art skills

    V0---+---(a1)--->Va1--->(a2')--->Va --->(b1')--->Vab1'---(b2')---+--->Vab
               \                                                                                       /

(I think I like the equivalence || drawing better.)

Still later you may create a second alternate  history:

    V0---+---(a1)--->Va1--->(a2')--->Va --->(b1')--->Vab1'---(b2')--->Vab
               \                                                                                   ||
                  \                                                                                ||
The basic idea is that when you establish that you are creating alternate history paths, the operations are constrained to create the same final results.

And then, for those who want a simple history, you hide the paths, the changesets and the versions, that you do not want to appear by default.  But they still are recorded in the history, so that if somebody made modifications based off an intermediate version, you can still merge meaningfully.   You get to decide if that merge appears as if based of a version that is currently hidden, or if you want to roll up the visibility of its ancestors.


Andy "Krazy" Glew said...

One nice thing about this approach is that you don't have to be in a hurry to do the right thing. You can edit in a quick and dirty manner, share, but also clean up the history after the fact. You don't have to decide between integrating early and often, and having a nice history.

Andy "Krazy" Glew said...

So what is the name for a version?

Mercurial creates a hash of all of the history leading up to a version.

This can work with my multiple paths approach. A new path can be constrained to point to a version determined by an old path.

Except when you may delete versions, delete paths. Then it may not be possible to calculate the name from the history any more.

But why do we need to calculate the name? Why not store? We may lose an arbitrary integrity check.

I mean, basically the reason that we do not arbitrarily assign numbers is to handle collisions when disconnected. Content based names are very unlikely to collide. Even if part of the content has been deleted, still unlikely to collide.

However, I suppose that one of the advantages of content based names when all content is available is that you can in theory detect collisions. I don't know of anyone who does this in a DVCS (except me in my old RCS merge). If the content that went into the computation of a hash for a name is deleted, you can't verify.

But you can do perfunctory checks. For example, you can compare the current file contents (assuming you have them). This will detect different files having same history hashes. And of course all hash collisions are hopefully unlikely.

Hash the file contents.

Hash the file contents and current metadata.

Hash the file contents and metadata and history.