Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, August 27, 2014

Version control clean: unknown / ignored / skipped over

The linked reference is just an example; what I am talking about is more general.



     General - hg clean is pure evil:






Tools such as "hg purge" (or "git clean") have options such as

"remove all unknown files" (hg purge)

"remove unknown and ignored files" (hg purge --all)



Methinks there is a third option needed - not files that you have ignored because they are generated, but files that are skipped over, e.g. because they are controlled by a different version control system.



E.g. "hg purge -X directory" -- often I have a different VCS in that subdirectory, .bzr rather than .hg.



It is dangerous to type "hg purge --all" in such a situation.  E.g. it may delete the nested .bzr or .git subdirectories, which hg sees as just more unknown or ignored files.
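
To make the hazard concrete, here is a sketch - the directory names are hypothetical, the hg options are real:

     # Layout: an hg working copy with a nested bzr tree.
     #   project/.hg/              -- the outer VCS
     #   project/vendor/lib/.bzr/  -- managed by bzr, unknown to hg
     cd project
     hg purge           # removes unknown files only
     hg purge --all     # removes unknown AND ignored files -- this can
                        # wipe out vendor/lib/.bzr, which hg sees as
                        # just another unknown directory
     # The workaround today: exclude the foreign tree by hand.
     hg purge --all -X vendor/lib
     # What I want: a third category, "skipped over", so that foreign
     # subtrees are never purge candidates at all.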



This is an example of "splitting": what is a single action, "exclude", from the point of view of "hg add" is actually two flavors from the point of view of "hg purge".


Thursday, August 21, 2014

Perforce Anti-Patterns --- labels

Perforce Anti-Patterns | Perforce:






While I agree with encouraging the use of things other than labels (changesets, automatic labels, and branches can all be equivalent to labels), I think that calling this an anti-pattern is specious.



There are anti-patterns that are fundamentally bad.



But there are also anti-patterns that are bad just because of a poor implementation - and apparently Perforce's implementation of labels is poor.



Conceptually, a changeset is very much like a label - a set of files and versions, with comment metadata.  (Again, using the "files at the bottom" mindset, which I don't like, but which I find easier to think about.)



Similarly, a branch is very much like a label - or, rather, the place on a parent where a branch starts.



These are similar conceptually, and probably should be interconvertible.



It is foolish to have to create a rather heavyweight concept like a branch when a lightweight label, a contour, will do.



Perforce's slowness probably arises because it has a table that looks like



filename : version : label-name



which grows as #Files*#Labels.



Even worse if the metadata is stored RCS-style, replicated per file.  I can't imagine Perforce being that silly ... now ... although I suspect it may have been historically.
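
By contrast, the lighter-weight alternative - an automatic label - stores a single changelist number instead of a per-file table.  A sketch, with a hypothetical label name and changelist number:

     # In the label spec, pin a changelist in the Revision field:
     p4 label build-ok          # opens the spec in your editor
     #    Revision: @123456
     # Syncing by label is then just a changelist lookup -- O(1)
     # metadata instead of O(#Files):
     p4 sync //depot/project/...@build-ok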

Friday, August 15, 2014

Workflow is (almost) expressible as a Perforce branch mapping

Recently I have been working on a painful manual merge of some diverged, forked documents.  Painful because there are no automatic merge tools for FrameMaker - I use its document comparison, and then manually investigate the diffs.  (I remember doing this to the BSD 4.x kernel at Gould Real Time UNIX in the 1980s... gah!!!)



The key is developing a workflow: I assemble the files to work on in XXX.merge-needed.

I do my stuff, and then move the files to XXX.merge-pending.



In XXX.merge-pending, I wait to get approval, and then distribute stuff from there to the unfortunately still replicated and forked destinations, call them XXX.fork1 and XXX.fork2.  I also move my work item files to XXX.merge-done, once the distribution is done.



Editing of stuff in XXX.merge-needed is manual, as is moving it to XXX.merge-pending.



But thereafter, it can be automated:  when I have a batch of stuff in XXX.merge-pending I can do



     p4 copy XXX.merge-pending/... XXX.fork1/...

     p4 copy XXX.merge-pending/... XXX.fork2/...

     p4 edit XXX.merge-pending/...     # p4 move needs files opened for edit
     p4 move XXX.merge-pending/... XXX.merge-done/...

Yes: I could be tracking the workflow in Perforce - even though IT discourages that.

The copy-copy-move script is *almost* a Perforce branch-mapping:

     XXX.merge-pending/... --cp--> XXX.fork1/...

     XXX.merge-pending/... --cp--> XXX.fork2/...

     XXX.merge-pending/... --mv--> XXX.merge-done/...



Which would be nice - I could save it as a branch mapping, and reuse it.



Almost but not quite - as far as I can tell, Perforce branch mappings really amount to copies ("p4 copy"), but do not have the ability to do moves.
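
For what it is worth, the copy half can be written down today as ordinary branch specs - a sketch, assuming the XXX trees live directly under //depot:

     Branch: pending-to-fork1
     View:
         //depot/XXX.merge-pending/... //depot/XXX.fork1/...

     Branch: pending-to-fork2
     View:
         //depot/XXX.merge-pending/... //depot/XXX.fork2/...

     # The copies become one command each; the move still has to be
     # scripted by hand:
     p4 copy -b pending-to-fork1
     p4 copy -b pending-to-fork2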



Q: what is the difference between a copy-copy-move script and a branch mapping (extended to also do moves)?



The branch mapping is conceptually used to create a script of commands - but it also does various integrity checks.  Plus it is transactional (at least I think that Perforce is transactional, ACID) - all the operations succeed or none do.



Integrity checks, like ensuring the files being overwritten are not locked.  It might be nice if it went further - in this case, ensuring that the destination files have not been changed since their common ancestor in the workflow loop.  Which is basically a very simple merge or integration algorithm.



It is nice to have that sort of integrity wrapper around a script.  It is a pain to code a script with all the necessary error handling.
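
A taste of that pain: even minimal error handling in a hand-rolled copy-copy-move stops at the first failure, but cannot undo the steps that already succeeded - which is exactly the transactional gap.  A sketch:

     #!/bin/sh
     set -e    # abort on the first failed command
     p4 copy XXX.merge-pending/... XXX.fork1/...
     p4 copy XXX.merge-pending/... XXX.fork2/...
     p4 edit XXX.merge-pending/...    # move requires opened files
     p4 move XXX.merge-pending/... XXX.merge-done/...
     p4 submit -d "distribute merged files to forks"
     # If the submit fails, the copies and the move are left pending
     # in the workspace; rolling back (p4 revert) is up to you.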



I have long suggested that the folks who want transactional memory start by creating a transactional filesystem based scripting environment - for Perl, Python, whatever.



A DVCS is, in some ways, a mechanism that allows the user to create quite complicated transactions while disconnected, operating in a clone - and then later pushing, committing, rebasing or grafting.   It is interactive, because when a merge/integration fails, the user has to get involved.


Thursday, August 14, 2014

Versioned Label Sets

I like labels in version control systems. Like "Compiles". "Passes Tests".  "Passes all tests except Test#44".   Status, if you will.

Of course, such status must be applied to a set of files+versions.  Or a repo-project-version. Whatever.  (I will not use the whole-repo-project viewpoint here, since I am trying to think about partial checkouts. Whole-repo trivially fits - it is just a single entry, the repo version.)

You can think of a label as a file itself, containing a list of files and their version numbers.   Such a label-file might also contain branch info, etc. - i.e. more metadata.

Generalize to an arbitrary package of metadata associated with a set of files+versions.  "Labels" may be such that their name is the only metadata that matters.

Such a label-or-metadata-file can itself be versioned.  Must be, should be.
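
Concretely, such a label-file might look like this - format and file names entirely hypothetical:

     # PassesTests.label -- itself checked in, so the label has history
     src/parser.c    #42
     src/lexer.c     #17
     doc/spec.fm     #9
     # extra metadata:
     # status: passes all tests except Test#44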

In fact, just about everything we care about can be considered a set of objects+versions, in a file, itself versioned.  

Branches may be defined by rules such as a list of filenames or patterns.  Possibly with versions that are frozen in the branch. 

OK, there is a difference: branch histories are graphs.  Steps along the history are the sets of objects+versions that most closely correspond to a label set.

I.e. there are graphs whose nodes are objects+histories.  

Anyway... : the default action is where the difference arises.

When a workspace is checked out from a branch head, when trying to check in, the default is to extend the branch.

When a workspace is checked out from a label, the default is not to extend the label.

We can imagine interconversion: forcing a checkin to a label, making the label into a branch.

---

Who stores the linkages?

Label-sets may be marked inside a branch-graph file, or outside.

Outside allows non-privileged access.  Library users can label library versions.

Inside may be faster and more convenient.

It is important to be able to track the configuration of stuff whose "home" VCS you are not allowed to write into.


The DVCS people say "just clone", but that may not always be possible.

I may want to have a local repo, linked to a master repo without incorporating all of it, and be able to define cross-repo actions.

Click-through IP licensing

The p4ideax forum's terms of use have some interesting details: http://p4ideax.com/terms-of-use

Starts off mild:

User Submissions and License By using P4IdeaX, you agree that any information you send to Perforce via P4IdeaX, including suggestions, ideas, materials, and comments, (collectively referred to as the "Materials") is non-confidential.

But then gets stronger:

 Furthermore, by submitting the Materials using IdeaX, you grant Perforce and its designees an irrevocable, unrestricted, perpetual, non-exclusive, fully-paid up and royalty free worldwide license to make, use, sell, import, modify, reproduce, transmit, display, perform, create derivative works, combine with other works, and distribute such Materials for any purpose whatsoever to the extent permitted by law. This license to Perforce includes the right for Perforce to sublicense these rights to third parties.
Perforce may be working on a same or similar idea at the time of your submission. You understand that we may continue to develop our own idea independent of your submission without acknowledging your Materials.
As part of its license to your Materials, Perforce may make modifications to, derivative works of, or improvements to your Materials. These modified or improved versions shall be owned exclusively by Perforce.
Submission under a Patent or Patent Application You agree to disclose to Perforce if your Materials are protected by a patent or subject to a pending patent application. If your Materials are not yet patented, but you wish to patent your idea in the future, you also agree to disclose this information to Perforce.

Now, I think that recent updates to US patent law mean that there is no grace period here. If you post to a pretty-much-public website like p4ideax, then you have made a public disclosure and may not patent.

If your Materials are patented, subject to a pending patent application, or you intend to file for patent protection, these Terms of Use will automatically grant Perforce a license under the terms of the previous section entitled User Submission and License. Such license may be superseded only by a separate written license or assignment agreement between you and Perforce.

This is interesting. What if the materials are not yours to license?  What if you are posting GPL'ed materials? I can imagine some lawyer arguing that because you did not specify the GPL when you posted, the GPL would not apply.
Posting your idea to P4IdeaX may impact your ability to protect your idea under patent laws. If your goal is to patent your idea, we suggest you consult with an attorney before posting your idea on IdeaX. You agree not to hold Perforce liable for any loss of patent protection.

This is the other side of ARM's "click-through licensing": to view ARM materials you have to promise not to use them to detect patent infringement.

---

As for p4ideax:  I haven't registered yet.

What about posting a link to a blog on my own site?  The link is licensed, but is the content I linked to licensed? (I doubt it.)

---

I guess my interest is left over from working at IV.


Perforce Software p4ideax | Intelligent symbolic links in the depot

Perforce Software p4ideax | Intelligent symbolic links in the depot:






I have also been looking for this "symlinks in depot".



It is possible that streams may do this - I may not totally grok streams yet (not helped by our IT forbidding us from using streams in P4, and highly discouraging branching (p4 branching support is, of course, primitive)).  But based on what I have seen so far, streams are much more complicated than what I want to do with symlinks.



Here is one of the use cases where I want to use depot side symlinks:



I want to merge two directories that have diverged versions of files.



Unfortunately, they are NOT branches.   The user who created them did not understand branching.   Instead, she copied the files outside Perforce, and then added the copy as a separate set of files that, from Perforce's point of view, are totally independent, unrelated. (Fixing that is a separate topic.)  Call this a "fake branch".  (E.g. think of "cp -R from to" creating a fake branch of a directory tree - logically a branch, just one that your version control tool may not be able to figure out.)



Unfortunately^2 they are binary files that I can merge, but must do so by hand.  Painful. Slow.  I can't get the merge done all in one day.



So here is what I want to do: as I merge the several hundred files in the fake branch directory

- let's call the original

       DEPOT/a/b/GoodDir

and the "fake branch"

       DEPOT/c/d/FakeBranchDir

I must leave the two directories GoodDir and FakeBranchDir around.



But as I merge files GoodDir/file1#666 and FakeBranchDir/file1#1 into GoodDir/file1#667,

I want to make FakeBranchDir/file1#2 into a "depot symlink" to GoodDir/file1

so thereafter anyone attempting to work with FakeBranchDir/file1 will get whatever the latest version of GoodDir/file1 is.



And I will do this one by one for all of the files.
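
If the feature existed, each per-file step might look like this - entirely hypothetical syntax, since this is the feature being requested, not a real p4 command:

     # HYPOTHETICAL -- no such p4 command exists today:
     p4 symlink //DEPOT/a/b/GoodDir/file1 //DEPOT/c/d/FakeBranchDir/file1
     # Thereafter, a sync of FakeBranchDir/file1 would deliver the
     # latest GoodDir/file1.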



(By the way, I can do this because I know the dependencies.  I.e. I can do continuous partial integration (merging, reconciliation).

Sometimes I have to do several files together atomically, but not the entire directory.)



When all of the files are merged, so that every file in FakeBranchDir/fileN is a "depot symlink" to GoodDir/fileN,

I can do the following:



* remove all FakeBranchDir/fileN depot symlinks, and make DEPOT/c/d/FakeBranchDir a depot symlink to  DEPOT/a/b/GoodDir

* potentially just plain remove FakeBranchDir completely, and stop the insanity of having unnecessary fake branches in the depot



Anyway... streams may do this, but they seem like overkill, plus IT has forbidden p4 streams. Heck, my team barely knows how to use branches - actually, I am strongly discouraged from using branches (but I am so used to branching...)



Lacking depot symlinks or other support, here is what I am doing:



+ Merging the files

+ Once merged, copying the files into BOTH GoodDir/file1 and FakeBranchDir/file1, etc.

+ hoping that nobody modifies the merged files separately, causing them to re-diverge.

   + unfortunately, I am not allowed to create a long-lived lock. Folks still want to edit in their diverged directories
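
Per merged file, that workaround looks something like this - a sketch, assuming a client view that maps both depot directories into the workspace, with /tmp/file1.merged as the hypothetical hand-merged result:

     p4 edit GoodDir/file1 FakeBranchDir/file1
     cp /tmp/file1.merged GoodDir/file1
     cp /tmp/file1.merged FakeBranchDir/file1
     p4 submit -d "merge file1: GoodDir and FakeBranchDir now identical"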



I have thought about using p4 branch mappings to accomplish the same thing as a "depot symlink", but that is a pain - I would have to edit the branch mapping every time files GoodDir/fileK and FakeBranchDir/fileK were merged.



Basically, "depot symlinks" are just a way of allowing you to edit the branch mapping, without actually having to edit the mapping in a central place.   They are a "distributed" view of the branch mappings.


---



Now, yes, I know: this creates a "fragile base class" problem.   Somebody checking something into GoodDir/fileM might break FakeBranchDir/fileM (if it is a depot symlink), because the "context", the surrounding files, may break it in the FakeBranchDir context.   Yes, I realize that we really need to be using branches here (not p4's primitive branching, but some sort of branching for a partial subset of the depot - which may be what p4 streams are trying to do).  So that when somebody checks into GoodDir/fileM, FakeBranchDir/fileM can detect that it needs to be updated, but is not automatically updated until you have tested it in the FakeBranchDir context.



(Hmm, what this really means is that FakeBranchDir/fileM#2 may be a depot symlink to GoodDir/fileM (after some base revision)

FakeBranchDir/fileM#2-->GoodDir/fileM(#latest,validated=1011) - using this notation to indicate that we are supposed to link to the latest, but that at the last time of checkin that value was GoodDir/fileM#1011; as opposed to linking to FakeBranchDir/fileM#2-->GoodDir/fileM#1011, which would also be a depot symlink, but one that is not updated by default.

     I.e., a depot symlink really wants to be a branch.  But it is a branch that you normally want to be encouraged to update as quickly as possible, perhaps by default, as opposed to having to do an explicit branch merge.)

     But, these are dreams for my own VCS.

     Just plain old depot symlinks, though, are a darn good first step.


Monday, August 11, 2014

Version control branches are not branches - need "merging" of unrelated version control objects

Most version control systems have the concept of branching: a versioned object (e.g. a ,v file in RCS or CVS, an entire repository in most DVCSes) starts off with a single line of versions, but at some point the lines of development may diverge, and be developed almost independently.



"Almost independently", since it is common to synchronize two diverged branches - sometimes a complete synchronization, making then identical, sometimes incomplete. e.g. copying changes from a master branch to a maintenance release branch.



The term "branch" is a bad term, at least if you are thinking in terms of trees - unless your concept of branching includes re-merging, with tissue fusion where branches overlap. This often happens with vines like ivy, and occasionally happens with rubbing branches in trees.



The term "branch" is a bad term = but unfortunately I do not know of a better one.



Source code version control corresponds more closely to gene flow diagrams or "family trees" - but again the terminology is inaccurate.



I will no longer obsess about improving the terminology - but I do think that "branches" <=> trees has warped or limited our thinking.



Anyway...



The idea that branches reflect divergence followed, possibly, by convergence is also misleading.   The versioned objects may start off independent, and then converge first, before having the chance to diverge again.



Small real world example:  We recently changed VCS, from CVS (and HG) to Perforce.   All of the CVS files were converted en masse. Evolution then continued in the Perforce depot.



Later it was discovered that some edits had continued to be made outside version control (in neither CVS nor P4 nor HG).  These files were checked into a separate place in Perforce.



One may argue that what should have been done was to have created a branch off some revision along the CVS-converted-to-P4-history, and then checked those diverged versions in on that branch.  But that was not done.  Too late.



One may argue that these files are logically related to a common ancestor.  True - but that ancestor may not be represented in the Perforce depot.



What I argue is that it should be possible in a VCS to take separately versioned objects, and then merge them into a single versioned object.  Or, you may prefer "connect two independent graphs of versions into a graph with no supremum, no common ancestor".



Similarly, it should be possible in a VCS to create new ancestor versions at any time.  Not just to start off with a base or original version, and then move forwards in time - but also to go backwards in time.   Imagine, for example, that one is doing literary research, say into versions of ancient Greek literature that were copied and recopied by scribes in Alexandria, during the Islamic Golden Age, and also in monasteries in Ireland during the Middle Ages.  Then a new scroll is discovered preserved in a library in Pompeii - and it is obvious that it is an ancestor to some but not all of the later versions.  It should be possible to retroactively edit the history graph, inserting this new version.
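
For what it is worth, git can already approximate both halves of this after the fact - a sketch, assuming a reasonably modern git, with placeholder names:

     # Converge first: join two repositories that share no ancestor.
     git remote add other ../other-repo
     git fetch other
     git merge --allow-unrelated-histories other/master

     # Go backwards in time: record a newly discovered ancestor
     # (the "Pompeii scroll") as a parent of an existing commit.
     git replace --graft <later-version-commit> <pompeii-scroll-commit>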



Now, in this example the versions may be imagined as being descended from a common ancestor.  But perhaps not - perhaps two independent works were created, and then borrowed from each other until they became very similar, possibly identical after a certain point.



Linus has argued against explicitly recording file renamings in git - saying that comparing file content should suffice.  This is true... but imagine the literature research problem.  Here we may want to record the opinions of antiquities experts as to which version of the document is related to which, as a matter of opinion rather than incontrovertible fact.  Those experts may have used content comparisons to make their inferences, but it may well be impossible for an automated tool to repeatedly infer those connections, unless they are recorded explicitly.



Another example, beyond individual files:  I have version controlled my ~glew home directory for decades (yes, literally).  But I have occasionally switched version control systems without preserving history.  And the versioned tree has diverged on some OSes.  I should like to merge these graphs.



I need to improve my terminology wrt "versioned objects".  Were there two distinct versioned objects prior to connecting their version graphs, but one thereafter?    The concept "versioned object" is mainly used so that I can talk about versioned sets of versioned objects (and versioned sets of versioned sets of ...).

These are really default queries: there are relations between versions that are essentially or conceptually 1:1, such as just adding a few lines to a 1000 line file, but leaving it in the same place in the filesystem.   Similarly, moving from place to place in the filesystem.  There are relations that are "into", such as taking a file containing a function and making it just part of a larger text file.   This is much the same as including a file in a directory, except that it is easier to track the evolution of a file in a directory than it is a chunk of text in a larger text file. In my dreams, tracking a chunk of text that happens to be a function is easy, even across renamings - but functions may replicate, etc.



Plus there are graph edges that correspond to creating new versioned objects - such as splitting a file into pieces.



What I am trying to say is that it is easy to track the evolution of sets of files or other versioned objects if the transformations are 1:1, or into, or...





Overall, if a set is identified by a rule, it is easy to track if the converged or merged objects all satisfy the rule for the set.  E.g. the set "all files under dir/subdir" is not affected if a file is renamed but lives in the same directory.



But if a transformation means that some of the participants are no longer covered by the rule defining a set, then one may need to query.   E.g. if you have defined the set for a tool as "Module A = all files under tools/toolA", but tools/toolA/foo.c has had a library function extracted into tools/lib/libAA/lib1.c, then Module A may no longer be freestanding.   Or we may want to modify its dependencies.