The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Tuesday, October 09, 2012

Why rewrite history? Why disentangle?

Why ever rewrite history in a version control system?

Linus Torvalds does not need or want to see a thousand different branches, one for each contributor.
Linus has a longer post somewhere, along the lines of "the history should show how the code should have been written, not how it actually was written".
Basically, to make things understandable.  

Because in real life, in the actual development history, changes get entangled. Not nicely

   A -> B -> C

but instead

    A1 -> B1 -> C1 -> B2 -> A2 -> B3 -> C2 -> ...

which we might like to see reordered as

    A1 -> A2 -> B1 -> B2 -> B3 -> C1 -> C2

and smashed to 

     A -> B -> C
          A = A1 -> A2, etc.

If the changes, patches, are operators that commute Darcs-style, all well and good.  But many patches don't commute - logically they should, but in actuality do not.
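
To make "commute" concrete, here is a minimal sketch under a toy model, not Darcs's actual patch theory. A file is a list of lines, a patch is a function from file to file, and two patches are said to commute on a given starting file if either order of application gives the same result. The names File, Patch, replace_line, commute_on and example are all made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Toy model: a file is a list of lines, a patch maps one file to the next.
using File = std::vector<std::string>;
using Patch = std::function<File(File)>;

// Hypothetical patch: change line `index` from `from` to `to`.
// It does nothing if the line does not currently read `from`.
Patch replace_line(std::size_t index, std::string from, std::string to) {
    return [index, from, to](File f) {
        if (index < f.size() && f[index] == from) {
            f[index] = to;
        }
        return f;
    };
}

// True if applying p then q gives the same file as applying q then p.
bool commute_on(const File& start, const Patch& p, const Patch& q) {
    return q(p(start)) == p(q(start));
}

// Usage: edits to different lines commute, edits to the same line do not.
void example() {
    File start = {"a", "b"};
    Patch a1 = replace_line(0, "a", "A");   // touches line 0
    Patch b1 = replace_line(1, "b", "B");   // touches line 1
    Patch a2 = replace_line(0, "A", "AA");  // builds on a1, same line
    assert(commute_on(start, a1, b1));
    assert(!commute_on(start, a1, a2));
}
```

The second assertion is the entangled case: a2 only makes sense after a1, so swapping them changes the outcome.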


Note that this entangling can be at file granularity, but is more painful when within the same file.  Worse still if in the same line of code.

Hey - "entanglement".  Darcs is inspired by physics, right?


OK, so we want to rewrite history.  But the danger in rewriting history is that we might not end up where we want to.

More and more I am thinking that it would be a good idea to rewrite history in a lattice defined by the actual history.

E.g. if we have (using dA1B1 etc. to indicate the differences between states)

   0 -d0A1->  A1 -dA1B1-> B1 -dB1C1-> C1 -dC1B2-> B2 -dB2A2-> A2 -dA2B3-> B3  = FINAL

and we want

   0 -d0A-> A -dAB-> B -dBC-> C = FINAL

then we should constrain the rewritten history to arrive at the same final state.

I.e. the actual history and the rewritten, virtual history (the one meant for understanding) should be considered alternate paths to the same final state.

Imposing this constraint may be helpful in history rewriting tools.
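
A rough sketch of what such a check could look like, assuming a toy model in which a repository state maps file names to contents and each difference (each "d..." arrow) is a function from one state to the next. State, Patch, append_to, apply_history, reaches_same_final and example are hypothetical names, not part of Mercurial, Git or any real tool.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy model: a repository state maps file names to file contents.
using State = std::map<std::string, std::string>;
// A patch (one "d..." arrow in the diagrams) turns one state into the next.
using Patch = std::function<State(const State&)>;

// Hypothetical patch: append some text to one file.
Patch append_to(const std::string& file, const std::string& text) {
    return [file, text](const State& s) {
        State next = s;
        next[file] += text;
        return next;
    };
}

// Apply every patch of a history in order, starting from `state`.
State apply_history(State state, const std::vector<Patch>& history) {
    for (const Patch& patch : history) {
        state = patch(state);
    }
    return state;
}

// The constraint: both paths must arrive at the same FINAL state.
bool reaches_same_final(const State& initial,
                        const std::vector<Patch>& actual,
                        const std::vector<Patch>& rewritten) {
    return apply_history(initial, actual) == apply_history(initial, rewritten);
}

// Usage: the tangled path A1, B1, A2, B2 versus the tidy path A, B.
void example() {
    State zero;
    std::vector<Patch> actual = {append_to("a.txt", "A1"), append_to("b.txt", "B1"),
                                 append_to("a.txt", "A2"), append_to("b.txt", "B2")};
    std::vector<Patch> tidy = {append_to("a.txt", "A1A2"), append_to("b.txt", "B1B2")};
    std::vector<Patch> lossy = {append_to("a.txt", "A1"), append_to("b.txt", "B1B2")};
    assert(reaches_same_final(zero, actual, tidy));    // acceptable rewrite
    assert(!reaches_same_final(zero, actual, lossy));  // A2 was dropped: reject
}
```

A history rewriting tool could run a check like this before accepting the rewritten path.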

Representing these alternate paths may also be helpful. Sure, present the elegant history that Linus wants, but also preserve the grotty history that... well, historians like me want.


When rewriting history manually I find that the FINAL state keeps changing.  Often rewriting history exposes issues that were not seen in the original path.

   0 -d0A1-> A1 ... -> B3 = FINAL -dPostFinal-> FINAL'
    \
     ---d0A-> A ---dAB-> B ---dBC-> C = FINAL'

A moving target, perhaps, but still alternate paths to the same ultimate final state.

How to do partial merges and cherrypick in Mercurial

Officially, you cannot. Mercurial has no formal support for partial merges and cherrypicking.

Actually, there are a few history editors that let you select hunks to apply.  Plus, I have not yet really tried out mq, the quilt extension. So far my experience with Mercurial history editing tools is hit and miss, mainly miss. They mostly fail in the presence of merges, or anything other than a simple linear divergence.

But you can do much by hand.

You can "cherry pick" individual files by using hg revert:

   hg diff -r branch1 -r branch2
   # look at diffs, figure out files that can be moved independently
   hg update -r branch1
   hg revert -r branch2 file1 file2
   # test and then
   hg ci

I find "hg revert" very strangely named.  I think of revert as going back to an earlier version, going backwards, not making progress.  But once you get past this, ok.

Merging changes in the middle of a file, interleaved with other changes, is more of a pain.  Basically, edit the diff to create a patch file.  Various tools can help.

The biggest problem with doing this by hand is that Mercurial does not really understand that you have done a partial merge. It is not recorded in the history.  Revset expressions cannot be used to look at the ancestors or merges.

Rewriting history means rerunning tests - for all versions in new history?

Something that annoys me about rewriting history in a version control system like Git or Hg, e.g. rebase, is that strictly speaking one should rerun all tests on all of the new versions.

E.g. if you had

        trunk:     A->B->C

and made your edits

        trunk:     A->B->C->X->Y->Z

and eventually wanted to merge into history from elsewhere

        trunk:     A->B->C->D->E

by rebasing

        trunk:     A->B->C->D->E->X'->Y'->Z'

then you *might* want to rerun tests on all of X', Y', and Z' -- not just on the final Z'.

Because while X' *should* be similar to X, except with D and E's changes, hopefully non-conflicting - sometimes there will be conflicts that are not detected by the merge, only by running tests.  Interferences.

I.e. sometimes X' will break tests, although X does not, and although Z' does not either.
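
A small sketch of the "test every rewritten version" idea, under a made-up model: each version is just a name plus the list of changes it contains, and the test suite is a predicate on a version. Version, TestSuite, has_change, failing_versions and example are hypothetical. The interference is invented for illustration: D renames something that X still uses under the old name, and Z happens to fix that up.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Toy model: a version is a name plus the changes it contains.
struct Version {
    std::string name;
    std::vector<std::string> changes;
};

// Hypothetical test suite: returns true when the version passes.
using TestSuite = std::function<bool(const Version&)>;

bool has_change(const Version& v, const std::string& change) {
    return std::find(v.changes.begin(), v.changes.end(), change) != v.changes.end();
}

// Run the suite on every version of the rewritten history, not only the
// last one, and collect the names of the versions that fail.
std::vector<std::string> failing_versions(const std::vector<Version>& history,
                                          const TestSuite& passes) {
    std::vector<std::string> failures;
    for (const Version& v : history) {
        if (!passes(v)) {
            failures.push_back(v.name);
        }
    }
    return failures;
}

// Usage: D and X interfere until Z arrives (an invented interference).
void example() {
    TestSuite suite = [](const Version& v) {
        bool broken = has_change(v, "D") && has_change(v, "X") && !has_change(v, "Z");
        return !broken;
    };
    Version x{"X", {"A", "B", "C", "X"}};  // before the rebase
    std::vector<Version> rebased = {
        {"X'", {"A", "B", "C", "D", "E", "X"}},
        {"Y'", {"A", "B", "C", "D", "E", "X", "Y"}},
        {"Z'", {"A", "B", "C", "D", "E", "X", "Y", "Z"}},
    };
    assert(suite(x));                 // X passed
    assert(suite(rebased.back()));    // Z' passes
    std::vector<std::string> bad = failing_versions(rebased, suite);
    assert(bad.size() == 2 && bad[0] == "X'" && bad[1] == "Y'");  // but X', Y' do not
}
```

Testing only the final Z' would miss both broken intermediate versions, which is exactly what trips up a simplistic bisect later.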


Of course, this only matters if your project takes the "All checkins should pass all tests" approach.  The approach that simplistic bisect uses.

It doesn't matter so much if you go back and retroactively label versions according to what tests have passed, etc.  (Which, of course, Mercurial cannot do worth a darn.)

Gosh darn, I want hg push messages

Gosh darn I want hg push messages.

I want to be able to push, and say something like "Everything in this changegroup is on a branch.  Don't panic.  It is being cleaned up before it gets merged into the default trunk."

Exceptions that pop the stack versus occurring WITHIN the stack frame

In a high reliability situation where you might want to throw typed errors that are easily parsed so that you can mitigate the problem ... in such a situation, would you not at least like to throw the error WITHIN the context so that you can return to the erroring instructions?  E.g. open a file if necessary, fix a permission problem by asking the user, etc.

I.e. exceptions that can fix up, and return, as if the exception never occurred.

The other sort of exception, the C++ style exception, pops/unwinds the stack.  And can't return.  Hmm... might call this PUSH versus POP exception handling.

With POP, the only way that you can handle such errors is to have the calling functions loop:

             void outer() {
                     bool failed;
                     do {
                             failed = false;
                             try {
                                     inner();   // placeholder for the operation that may throw
                             } catch(...) {
                                     // attempt fixup and repeat
                                     failed = true;
                             }
                     } while( failed );
             }

Imagine if you had to handle page faults in this manner?    Actually, you don't need to imagine - just look at how stack probes in old UNIX shells used to work.

This seems to be the criterion:

* if exception handling can be transparent to the calling code, you want WITHIN handling

* if exception handling requires the cooperation of the calling code, you want POP handling.

Examples of possibly transparent handling:
* page faults, where the VM system can bring in the missing page
* removable disk or tape not mounted
* OS asking the user "are you sure?" before doing something that might be a security issue
* emulating unimplemented operations or instructions

Examples of non-transparent:
* stack overflow - you can't PUSH onto the present context, because there is no room left

There's a case for both.   And, certainly, if PUSH is provided, there must be a way so that it can POP.
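
Since C++ has no resumable exceptions, the closest one can get to PUSH handling in C++ is a handler callback that runs on top of the failing frame. The following is only a sketch of that imitation, with POP as the fallback. Resource, read_pop, read_push, caller_pop, caller_push and example are names invented here.

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>

// Hypothetical resource that starts out unavailable (think: unmounted disk).
struct Resource {
    bool available = false;
    int reads = 0;
};

// POP style: the error unwinds the stack; the failing frame is gone.
int read_pop(Resource& r) {
    if (!r.available) throw std::runtime_error("resource not available");
    ++r.reads;
    return 42;
}

// PUSH / WITHIN style: the handler runs on top of the failing frame.
// If it fixes the problem, execution continues as if nothing had happened.
int read_push(Resource& r, const std::function<void(Resource&)>& handler) {
    if (!r.available) {
        handler(r);
        if (!r.available) {
            // The handler could not fix it: fall back to POP.
            throw std::runtime_error("resource not available");
        }
    }
    ++r.reads;
    return 42;
}

// With POP handling the caller has to cooperate by looping.
int caller_pop(Resource& r) {
    for (;;) {
        try {
            return read_pop(r);
        } catch (const std::runtime_error&) {
            r.available = true;  // attempt fixup, then repeat the whole call
        }
    }
}

// With PUSH handling the calling code contains no retry logic at all.
int caller_push(Resource& r) {
    return read_push(r, [](Resource& res) { res.available = true; });
}

void example() {
    Resource a, b;
    assert(caller_pop(a) == 42 && a.reads == 1);
    assert(caller_push(b) == 42 && b.reads == 1);
}
```

Note that read_push still needs a way to POP when the handler gives up.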


Older OSes like DEC VMS(?) reportedly provided both.

Obviously OSes provided PUSH exception handling for stuff like virtual memory.

But modern languages like C++ have definitely tended towards POP or UNWIND exception handling.   If you can call C++ modern.     (Q: Java? Javascript?)


It almost seems that PUSH is associated with change of privilege, while POP is not.  Perhaps.  But, I would like to be able to provide things like user mode page fault handlers, or user mode integer overflow handlers.   That is only a change of privilege if you have fine grain privileges, finer than modern OSes provide.

Garbage collection makes it more practical to return good error messages

Elsewhere I have discussed how I buy into Andrei Alexandrescu and D's contention that throwing C++ style exceptions is the best way to signal errors, since a programmer cannot accidentally or deliberately forget to handle an error.  My preference is to throw generic string error messages, unless I am in a high reliability situation where errors will be parsed and, where possible, cured, since you can then at least provide a context, or stack, of error messages.

But... languages with garbage collection do remove one objection to signalling errors by a return code.  One problem with return codes is that they are so cryptic.  Integers.  Ugh.   But if you can construct a meaningful string and return that, then the user at least has the option of printing a meaningful message.


             FUNCTION foo(int bar, char* file) RETURNS char* error_msg;

             if( char* error_msg = foo(42,"bazz") ) {
                     std::cerr << "call to foo got error message: " << error_msg << "\n";
             }

             foo(42,"bazz");    // error message is ignored

At least with GC this pattern does not cause a memory leak.
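
For comparison, a sketch of the same pattern in C++17 without a garbage collector. This swaps the technique: instead of relying on GC, the message is returned by value as a std::optional<std::string>, and std::string frees its own buffer, so an ignored message does not leak either. The function bodies, the load wrapper and example are invented for illustration; only the name foo and its arguments come from the pseudocode above.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Returns std::nullopt on success, or a human-readable message on failure.
// The failure condition (an empty file name) is made up for this sketch.
std::optional<std::string> foo(int bar, const std::string& file) {
    if (file.empty()) {
        return "foo(" + std::to_string(bar) + "): no file name given";
    }
    return std::nullopt;
}

// Hypothetical caller that adds its own context to the callee's message,
// building up a small "stack" of error messages.
std::optional<std::string> load(const std::string& file) {
    if (auto error = foo(42, file)) {
        return "load failed: " + *error;
    }
    return std::nullopt;
}

void example() {
    assert(!foo(42, "bazz"));  // success: no message
    auto error = load("");
    assert(error && *error == "load failed: foo(42): no file name given");
    foo(42, "");  // message is ignored, but its memory is still released
}
```

The last line of example shows the remaining weakness: nothing forces the caller to look at the result.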


I still prefer throwing C++ style exceptions, though.