The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, July 18, 2014

How to determine what may have changed (figuring out how to work around perforce slowness)

Determine what has changed using full content comparison

- slow (especially transferring whole content across net)

- completely accurate

Determine what has probably changed using content checksums

- transfer only checksums across net => fast if no change

- may be inaccurate if checksums collide (yeah, I am paranoid)

- computation (of checksum) on both sides

   - or, a VC tool may cache checksums of original checkout => local change

- false negatives - no change detected - if checksums collide - but unlikely

- no false positives
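A minimal sketch of the checksum approach, in Python. This is only an illustration, not how p4 or rsync work internally: the names `checksum_manifest` and `probably_changed` are made up for this example, SHA-256 is an arbitrary choice of checksum, and whole files are read into memory for brevity.

```python
import hashlib
from pathlib import Path

def checksum_manifest(root):
    """Map each file path (relative to root) to the SHA-256 of its content."""
    root = Path(root)
    manifest = {}
    for path in root.rglob("*"):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[path.relative_to(root).as_posix()] = digest
    return manifest

def probably_changed(local, remote):
    """Paths whose checksums differ, or that exist on only one side."""
    return sorted(p for p in local.keys() | remote.keys()
                  if local.get(p) != remote.get(p))
```

Each side would compute its own manifest, and only the manifests would need to cross the net. If they match, nothing else is transferred.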

Determine what may have changed using heuristics and metadata

- e.g. file modification dates

- e.g. whether user has checked a file out for editing or not

- false positives and false negatives

- false negatives - undetected changes - may be common, e.g. if not using modification dates, or if modification times (mtimes) can be changed
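For contrast, a rough sketch of a metadata heuristic, using file size plus modification time as the "signature" of a file. The names `snapshot` and `maybe_changed` are hypothetical, and this is not a claim about which metadata p4v actually consults.

```python
from pathlib import Path

def snapshot(root):
    """Record (size, mtime in ns) for every file under root. No content is read."""
    root = Path(root)
    snap = {}
    for path in root.rglob("*"):
        if path.is_file():
            st = path.stat()
            snap[path.relative_to(root).as_posix()] = (st.st_size, st.st_mtime_ns)
    return snap

def maybe_changed(before, after):
    """Paths whose metadata differs, or that exist in only one snapshot."""
    return sorted(p for p in before.keys() | after.keys()
                  if before.get(p) != after.get(p))
```

Both failure modes are easy to produce with this sketch: a same-size edit whose mtime is reset afterwards goes unreported (false negative), and a file that is merely touched is reported even though its content is identical (false positive).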

p4v "get latest" seems to use the third approach, heuristics and metadata.   I got bitten by false negatives - true changes not reported - almost immediately.

p4v "get revision / latest / force" uses the full content transfer.

I realized that I was implicitly assuming the second approach, content checksums. Fast, very low chance of false negatives (checksum collision).   I.e. I have grown used to rsync and its descendants.  It does not appear that p4/p4v have this ability.

It is not clear where "p4v reconcile" falls.   It is possible that p4v reconcile uses local metadata heuristics, and that p4v reconcile in combination with p4v get latest is high confidence.   But having been burned once, I am using p4v-get-revision/latest/force far too often.  And it is very slow.

Perhaps what I need to do is keep a clean workspace, use p4v get latest on that, and then diff using local tools.   Avoids the slow net transfers of p4v-get-revision/latest/force.
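A sketch of what "diff using local tools" might look like, assuming the clean workspace and the working workspace are two local directory trees. The function name `diff_workspaces` is made up here. It uses Python's `filecmp.cmpfiles` with `shallow=False`, so file contents are compared rather than just size and mtime.

```python
import filecmp
from pathlib import Path

def list_files(root):
    """Set of file paths under root, relative to root, with '/' separators."""
    root = Path(root)
    return {p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()}

def diff_workspaces(clean, work):
    """Full-content comparison of a clean tree against a working tree."""
    clean_files, work_files = list_files(clean), list_files(work)
    common = sorted(clean_files & work_files)
    # shallow=False forces a byte-by-byte comparison of each common file.
    _, mismatch, errors = filecmp.cmpfiles(clean, work, common, shallow=False)
    return {
        "modified": sorted(mismatch + errors),  # unreadable files are treated as modified
        "added": sorted(work_files - clean_files),
        "deleted": sorted(clean_files - work_files),
    }
```

This is the first approach (full content comparison), but with both copies on local disk, so nothing crosses the net.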

p4 partial checkouts, file sets, spatial and temporal

I like Perforce's partial checkouts - workspaces that do not need the whole depot - but it comes at a cost in speed.

With a whole-repo DVCS like hg or git, diffing can be quite fast:  you check the whole-repo version object.  If unchanged, that's it - you have essentially checked only a single file.  Whereas p4, because it seems to operate by assembling versions of individual file objects, has to check each.  It complains when I try to reconcile a p4 workspace with >3000 files, saying "prune your workspace".

Something similar applies to diffs when files have changed.
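To illustrate why a whole-repo version object makes this cheap, here is a toy hash tree in Python. It is a sketch of the general idea only and does not reproduce git's or hg's actual object formats. Directories are modeled as dicts and file contents as bytes, and the names `tree_hash` and `changed_paths` are invented for this example. A real tool would store the hashes rather than recompute them on every comparison.

```python
import hashlib

def tree_hash(node):
    """Hash a tree: bytes are file contents, dicts are directories."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob:" + node).hexdigest()
    h = hashlib.sha256(b"tree:")
    for name in sorted(node):
        h.update(name.encode() + b"\0" + tree_hash(node[name]).encode() + b"\n")
    return h.hexdigest()

def changed_paths(old, new, prefix=""):
    """Descend only into subtrees whose hashes differ."""
    if tree_hash(old) == tree_hash(new):
        return []  # one comparison covers the entire subtree
    if isinstance(old, bytes) or isinstance(new, bytes):
        return [prefix or "."]
    out = []
    for name in sorted(old.keys() | new.keys()):
        path = f"{prefix}/{name}" if prefix else name
        if name not in old or name not in new:
            out.append(path)  # added or deleted entry
        else:
            out.extend(changed_paths(old[name], new[name], path))
    return out
```

If the top-level hashes match, the comparison stops after a single check. If they differ, only the differing subtrees are walked, which is the "something similar" for diffs.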


I think my concept of "file-sets" or "versioned-object-sets" can help here.

A versioned-object-set can be the whole repo, or an individual file.  And possibly stuff in between, subsets, like directory trees.

Let's imagine that at the lowest level we have individual file-versioned-objects.   (This is not necessarily true - we might have versioned-objects corresponding to parts of a file, or even parts of several different files, e.g. a function definition and its declaration.  But it's nice to have an atomic level to think about.)

A spatial set describes which file versioned objects are considered.  It might be a list of disjoint files, or it might have predicates such as "the subtree under /path/subdir".

The spatial set's rules, which define what files are considered in the spatial set, may themselves be versioned.  E.g. you may add or delete a subdirectory from a spatial set.
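One possible way to model a spatial set whose rules are versioned, as a Python sketch. Everything here (`Rules`, `SpatialSet`, `add_subtree`, `remove_subtree`, `select`) is a hypothetical name for illustration, not an existing tool's API. The rules are an explicit file list plus "subtree under this directory" predicates, and each rule change appends a new version to the history.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Rules:
    files: frozenset = frozenset()     # explicit, possibly disjoint files
    subtrees: frozenset = frozenset()  # predicates: "everything under this directory"

    def matches(self, path):
        return path in self.files or any(
            path.startswith(d.rstrip("/") + "/") for d in self.subtrees)

@dataclass
class SpatialSet:
    # Versioned rules, oldest first; version 0 is the empty set.
    history: list = field(default_factory=lambda: [Rules()])

    def add_subtree(self, subdir):
        cur = self.history[-1]
        self.history.append(Rules(cur.files, cur.subtrees | {subdir}))

    def remove_subtree(self, subdir):
        cur = self.history[-1]
        self.history.append(Rules(cur.files, cur.subtrees - {subdir}))

    def select(self, repo_paths, version=-1):
        """Files from repo_paths that the given version of the rules admits."""
        rules = self.history[version]
        return sorted(p for p in repo_paths if rules.matches(p))
```

Note that `remove_subtree` only changes what this set looks at. The files themselves remain in the repo and in any other set that includes them.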

Note difference: adding or deleting a subtree from the whole repo (or from some other spatial set), versus adding or deleting it completely from the repo.

      "I don't want to see this subtree in this subset any more" (doesn't propagate to other sets overlapping the subset)

      Versus "remove from this set, and all other sets, going forward." (propagates to other spatial sets and subsets (when they decide to merge), but doesn't affect saved history)

      Versus "remove content completely from the history"  (e.g. licensing problems)

Apart from versioning a spatial set's rules, the spatial set's contents, the list of files inside it, may be versioned.

Call that a versioned spatial set instance.

Partial checkins do not necessarily immediately affect a spatial set's contents when next updated.   But the spatial set may be directed to merge candidates.

I.e. a spatial set may be constructed from the latest trunk version of files specified by the spatial set's rules.

For that matter, a spatial set may be constructed from the latest version of files on different versioning branches - e.g. the latest trunk version of files under main/...,  and a development branch version of files under library/...

In so doing, we are transitioning from spatial sets being "pure" spatial descriptions, to spatial-temporal sets, combinations of spatial and branch versioning descriptions.    Operations such as "all files spatially under subdir/..."  and "latest files on branch bbb..." and intersections and unions and other set operations thereof.

I dislike the word "spatial-temporal", since temporal seems to imply versions as of a precise time.

Better?:  "spatial" for file position, and "lineage", for things like "the latest of a branch".
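A small Python sketch of combining a spatial predicate with a lineage predicate. The repo here is a toy dict mapping (branch, path) to an ordered list of revision ids. The branch names, revision ids, and the helper `latest` are all made up for illustration.

```python
def latest(repo, branch, subtree):
    """{path: newest revision} for files under subtree on the given branch."""
    prefix = subtree.rstrip("/") + "/"
    return {path: revs[-1]
            for (b, path), revs in repo.items()
            if b == branch and path.startswith(prefix)}

# Toy repo: revisions are listed oldest first.
repo = {
    ("trunk", "main/app.c"): ["r1", "r4"],
    ("trunk", "library/util.c"): ["r2"],
    ("dev", "library/util.c"): ["r2", "r5"],
}

# Union of: latest trunk under main/..., latest dev branch under library/...
workspace = {**latest(repo, "trunk", "main"), **latest(repo, "dev", "library")}
```

Here `workspace` picks up `main/app.c` from trunk and `library/util.c` from the dev branch. Intersections and other set operations could be built the same way on the path keys.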


This concept of versioning sets enables us to have simple tracking branches that do not fully propagate changes, at least on file granularity.

E.g. if you have Branch1 and Branch2, each with corresponding READMEs, and you do not want to propagate README.Branch1 to Branch2, or vice versa.

Branch1 = common-spatial-set + README.Branch1
Branch2 = common-spatial-set + README.Branch2

checking stuff in on Branch1 may affect common-spatial-set and README.Branch1.

updating Branch2 receives changes to the common-spatial-set but not changes to README.Branch1.
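The Branch1/Branch2 example can be written down directly with Python sets. This is a toy model at file granularity only. The file names under `src/` and the function `propagate` are hypothetical.

```python
# Files shared by both branches (hypothetical names).
common = {"src/main.c", "src/util.c"}

branch1 = common | {"README.Branch1"}
branch2 = common | {"README.Branch2"}

def propagate(changed, target):
    """Of the files changed by a checkin, those the target branch receives on update."""
    return sorted(set(changed) & target)

# A checkin on Branch1 touches a shared file and Branch1's own README.
checkin = ["src/util.c", "README.Branch1"]
received = propagate(checkin, branch2)
```

Updating Branch2 receives only `src/util.c`, because `README.Branch1` is not a member of Branch2's set.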

Flip-side, we may clone Branch2 from Branch1, which will give us README.Branch1 in Branch2.   When we prune README.Branch1 from Branch2, we have two options - making it a deletion that propagates, or not.
 (Q: what does "propagation" look like?
     "Propagate a property to any new checkin - like branch."
     "Do not propagate property - tag that applies to a version"
     "Propagate across merges by default."
     "Do not propagate across merges by default - branch specific file".
     "Propagate to child branches, but not to parent branches...")