Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, July 18, 2014

How to determine what may have changed (figuring out how to work around perforce slowness)

Determine what has changed using full content comparison

- slow (especially transferring whole content across net)

- completely accurate



Determine what has probably changed using content checksums

- transfer only checksums across net => fast if no change

- may be inaccurate if checksums collide (yeah, I am paranoid)

- computation (of checksum) on both sides

   - or, a VC tool may cache checksums of the original checkout => local change detection without a network round trip

- false negatives - no change detected - if checksums collide - but unlikely

- no false positives
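
For illustration, the checksum approach can be sketched in a few lines of Python. Everything here is hypothetical - p4 exposes no such interface - but it shows why the scheme has no false positives and only collision-level false negatives:

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()

def probably_changed(local_digests: dict, server_digests: dict) -> list:
    """Names whose local digest differs from the server's.
    Equal digests are reported unchanged, so the only possible false
    negative is an outright hash collision; there are no false positives."""
    return sorted(name for name, digest in server_digests.items()
                  if local_digests.get(name) != digest)
```

Only the digests cross the network. Files missing locally show up as changed; a reverse pass would be needed to catch files that exist only locally.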



Determine what may have changed using heuristics and metadata

- e.g. file modification dates

- e.g. whether user has checked a file out for editing or not

- false positives and false negatives

- false negatives - undetected changes - may be common, e.g. if not using modification dates, or if modification times can be changed

p4v "get latest" seems to use the third approach, heuristics and metadata.   I got bitten by false negatives - true changes not reported - almost immediately.

p4v "get revision / latest / force" uses the full content transfer.



I realized that I was implicitly assuming the second approach, content checksums. Fast, very low chance of false negatives (checksum collision).   I.e. I have grown used to rsync and its descendants.  It does not appear that p4/p4v have this ability.

It is not clear where "p4v reconcile" falls.   It is possible that p4v reconcile uses local metadata heuristics, and that p4v reconcile in combination with p4v get latest is high confidence.   But having been burned once, I am using p4v-get-revision/latest/force far too often.  And it is very slow.





Perhaps what I need to do is keep a clean workspace, use p4v get latest on that, and then diff using local tools.   Avoids the slow net transfers of p4v-get-revision/latest/force.
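
A sketch of that clean-workspace idea in Python (the function name is my own; `filecmp` is stdlib). The point is that every comparison is local disk I/O, not a depot round trip:

```python
import filecmp
import os

def diff_against_clean(clean_root: str, work_root: str) -> list:
    """Compare a working tree against a pristine 'get latest' mirror,
    byte for byte, using only local I/O."""
    changed = []
    for dirpath, _dirs, files in os.walk(clean_root):
        for name in files:
            clean_path = os.path.join(dirpath, name)
            rel = os.path.relpath(clean_path, clean_root)
            work_path = os.path.join(work_root, rel)
            # shallow=False forces a full content comparison instead of
            # trusting os.stat() signatures - the mtime heuristic problem.
            if (not os.path.exists(work_path)
                    or not filecmp.cmp(clean_path, work_path, shallow=False)):
                changed.append(rel)
    return sorted(changed)
```

Files that exist only in the working tree would need a second walk in the other direction; omitted for brevity.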



p4 partial checkouts, file sets, spatial and temporal

I like Perforce's partial checkouts - workspaces that do not need the whole depot - but they come at a cost in speed.



With a whole-repo DVCS like hg or git, diffing can be quite fast:  you check the whole-repo version object. If unchanged, that's it - you have essentially checked only a single file.  Whereas p4, because it seems to operate by assembling versions of individual file objects, has to check each.  It complains when I try to reconcile a p4 workspace with >3000 files, saying "prune your workspace".
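
The underlying trick is Merkle-style hashing: a repo version is a tree object whose hash covers everything beneath it, so "unchanged?" is one comparison at the root. A toy sketch with snapshots as nested dicts (not git's actual object format):

```python
import hashlib

def tree_hash(node) -> str:
    """Hash of a snapshot: bytes are file contents, dicts are directories.
    Any change anywhere below changes the root hash."""
    if isinstance(node, bytes):
        return hashlib.sha256(b"blob:" + node).hexdigest()
    entries = "".join(f"{name}:{tree_hash(child)};"
                      for name, child in sorted(node.items()))
    return hashlib.sha256(("tree:" + entries).encode()).hexdigest()
```

Two snapshots with equal root hashes need no further inspection; unequal hashes let you recurse only into the differing subtrees.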



Something similar applies to diffs when files have changed.



---



I think my concept of "file-sets" or "versioned-object-sets" can help here.



A versioned-object-set can be the whole repo, or an individual file.  And possibly stuff in between, subsets, like directory trees.



Let's imagine that at the lowest level we have individual file-versioned-objects.   (This is not necessarily true - we might have versioned-objects corresponding to parts of a file, or even parts of several different files (e.g. a function definition and declaration). But it's nice to have an atomic level to think about.)



A spatial set describes which file-versioned-objects are considered.  It might be a list of disjoint files,

or it might have predicates such as "the subtree under /path/subdir".
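
A spatial set could be modeled as a predicate over depot paths, built from glob-style include/exclude rules (illustrative only - this is not Perforce view syntax):

```python
from fnmatch import fnmatch

def make_spatial_set(includes, excludes=()):
    """Predicate: is this path in the spatial set?  Rules are glob
    patterns; note that fnmatch's '*' matches across '/' separators."""
    def contains(path: str) -> bool:
        return (any(fnmatch(path, pat) for pat in includes)
                and not any(fnmatch(path, pat) for pat in excludes))
    return contains
```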



The spatial set's rules, which define what files are considered in the spatial set, may themselves be versioned.  E.g. you may add or delete a subdirectory from a spatial set.



Note difference: adding or deleting a subtree from the whole repo (or from some other spatial set), versus adding or deleting it completely from the repo.

      "I don't want to see this subtree in this subset any more" (doesn't propagate to other sets overlapping the subset)

      Versus "remove from this set, and all other sets, going forward." (propagates to other spatial sets and subsets (when they decide to merge), but doesn't affect saved history)

      Versus "remove content completely from the history"  (e.g. licensing problems)



Apart from versioning a spatial set's rules, the spatial set's contents, the list of files inside it, may be versioned.



Call that a versioned spatial set instance.



Partial checkins do not necessarily immediately affect a spatial set's contents when next updated.   But the spatial set may be directed to merge candidates.



I.e. a spatial set may be constructed from the latest trunk version of files specified by the spatial set's rules.



For that matter, a spatial set may be constructed from the last version of files on different versioning branches - e.g. the latest trunk version of files under main/...,  and a development branch version of files under library/...



In so doing, we are transitioning from spatial sets being "pure" spatial descriptions, to spatial-temporal sets, combinations of spatial and branch versioning descriptions.    Operations such as "all files spatially under subdir/..."  and "latest files on branch bbb..." and intersections and unions and other set operations thereof.
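
Those set operations compose naturally if files are modeled as records carrying both a spatial position and a lineage; all the names below are my own:

```python
def under(prefix):
    """Spatial predicate: the file lives under prefix/..."""
    return lambda f: f["path"].startswith(prefix + "/")

def on_branch(name):
    """Lineage predicate: this version comes from the named branch."""
    return lambda f: f["branch"] == name

def union(p, q):
    return lambda f: p(f) or q(f)

def intersect(p, q):
    return lambda f: p(f) and q(f)
```

The "latest trunk under main/..., dev branch under library/..." example is then `union(intersect(under("main"), on_branch("trunk")), intersect(under("library"), on_branch("dev")))`.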



I dislike the word "spatial-temporal", since temporal seems to imply versions as of a precise time.



Better?:  "spatial" for file position, and "lineage", for things like "the latest of a branch".



---


This concept of versioning sets enables us to have simple tracking branches that do not fully propagate changes, at least on file granularity.

E.g. if you have Branch1 and Branch2, each with corresponding READMEs, and you do not want to propagate README.Branch1 to Branch2, or vice versa.

Branch1 = common-spatial-set + README.Branch1
Branch2 = common-spatial-set + README.Branch2

checking stuff in on Branch1 may affect common-spatial-set and README.Branch1.

updating Branch2 receives changes to the common-spatial-set but not changes to README.Branch1.
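
A toy model of that composition - a branch as a shared spatial set plus branch-private files (hypothetical names; whole files stand in as dict values):

```python
def branch_view(common: dict, private: dict) -> dict:
    """A branch is the union of a shared file set and its private files.
    Updating the shared set propagates; private files never cross over."""
    view = dict(common)
    view.update(private)
    return view
```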



Flip-side, we may clone Branch2 from Branch1, which will give us README.Branch1 in Branch2.   When we prune README.Branch1 from Branch2, we have two options - making it a deletion that propagates, or not.
 (Q: what does "propagation" look like?
     "Propagate a property to any new checkin - like branch."
     "Do not propagate property - tag that applies to a version"
     "Propagate across merges by default."
     "Do not propagate across merges by default - branch specific file".
     "Propagate to child branches, but not to parent branches..."
)



Monday, July 14, 2014

Transformations when moving changes between branches

I often want to have transformations automatically applied when I perform operations between branches.



Very simple example: I have occasionally had readmes for specific branches, that I want to live only in that branch. E.g. README.vcs-branch-name1, README.vcs-branch-name2



Therefore, when merging from branch1 to branch2, I do NOT want to transfer README.vcs-branch1.



But when doing a reverse merge from branch2 to branch1, I do not want to transfer README.vcs-branch2, and I especially do NOT want to delete README.vcs-branch1.



Mercurial's merge tracking will arrange to delete the README.vcs-branch1 file on the reverse merge.  Bad, mercurial.



You can think of this as a patch that is implicitly applied whenever there is a cross branch operation.  Patch may be too specific: possibly a programmed transformation expressed as code.
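
As a sketch, the README example reduces to a filter applied to every outgoing cross-branch change set (hypothetical; a real transformation could be arbitrary code, not just a path filter):

```python
def cross_branch_transfer(changes: dict, source_branch: str) -> dict:
    """Drop the source branch's private README from an outgoing change
    set, so it is neither propagated nor deleted on the far side."""
    private = f"README.vcs-{source_branch}"
    return {path: change for path, change in changes.items()
            if path != private}
```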



(Would also want to notify on cross-branch diffs about such transformations.)



===



A contrived example: if tracking Linux installations, may want to change text in some control files.



E.g. some file may contain  a user name, like "UserThatRunsFooBar"



On one machine it may be FooBarUser.   On another it may be SamJones.



All of the rest of the diffs to the file may transfer, just not that variable name.



May want a different branch for the two systems.



Hence, a desire for a transformation applied whenever such a file is moved between the branches for the two systems.



===



Partial checkouts can then be considered to be branches with such transformations based on filesystem structure.



A partial checkout of a subtree may have the transformation rules:



* include all stuff under trees T1, T2, ...



* exclude all stuff not under those trees.
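
Those two rules collapse to a single keep-predicate (a sketch, my own naming):

```python
def partial_checkout_filter(trees):
    """Keep exactly the paths that are under one of the listed trees;
    everything else is excluded by construction."""
    def keep(path: str) -> bool:
        return any(path == t or path.startswith(t + "/") for t in trees)
    return keep
```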



DVCS branches = sets++

It is a good idea to be able to identify "sets" of revisions. Both by predicate functions, and by tagging with names.



Branches are sets that automatically extend: when you do a checkin from a workspace with a parent set that is a branch, the checkin automatically gets added to the branch set.
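
A toy model of self-extending branch sets (the class and method names are mine, not any real VCS's API):

```python
class Repo:
    """Branches as sets of version ids that grow automatically: a checkin
    whose parent set is a branch joins that branch's set."""
    def __init__(self):
        self.branch_sets = {}  # branch name -> set of version ids

    def checkin(self, version, parent_branches):
        for name in parent_branches:
            self.branch_sets.setdefault(name, set()).add(version)

    def converged(self, a, b):
        """Versions where the two branches coincide: their set overlap."""
        return self.branch_sets.get(a, set()) & self.branch_sets.get(b, set())
```

A checkin tagged with both branches is a convergence point; later checkins on only one branch diverge again without disturbing it.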



This allows branches to converge and then diverge:



of course, a version can be tagged as being in multiple sets



similarly, a version can be tagged as being in multiple branches at the same time.



Two versions on different branches can merge, and the branches can be converged for a while.  But then later diverge.



This can be done on a file by file basis:  not just whole repo versions, but individual file versions.



--



repo-versions

repo-version-sets



file-versions

file-version-sets



file-sets - most meaningful when file-version-sets



named-file-sets => these are objects that can be versioned



named-file-set-versions



set operations on named-file-sets

=> partial, union, difference



BTW, parse this as named--file-sets or named(file-sets)



Doesn't need to be named(file-sets).  Can be anonymous.  Perhaps better called explicit(file-sets)

or identified(file-sets)



---



I have elsewhere figured out that



partial checkouts are easy,



while partial checkins correspond to creating a branch, at least temporarily, from which changes can be propagated to larger filesets.



Probably with some sort of nagging system:



Partial checkin doesn't automatically check into containing filesets,

but does automatically check into candidate filesets for enclosing branches.



This might be a good place to exploit file versioning as opposed to whole repo version - candidate-filesets or candidate-branches on a per file basis.

UNIX tools and special characters in filenames

See, for example:  bash - Is there a grep equivalent for find's -print0 and xargs's -0 switches? - Stack Overflow:








UNIX tools are great, with their composability - find | grep | xargs | etc.



But UNIX tools have problems handling entities or objects, such as filenames, that have special characters such as blank spaces or newlines within them.



UNIX tools typically operate on lines (grep, xargs' input), or on words separated by whitespace (e.g. backtick expansion, xargs' invocation of other tools).



Some UNIX tools provide the option of using null separated strings, such as find -print0 or xargs -0.



But as the stackoverflow page shows, people want such flexibility in other tools, like grep. Of course, GNU grep has provided it - --null - but there are probably other such tools.   ... cat?  but of course tr '\n' '\0' ...   still, the list continues.  Mercurial?  Git?
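
On the consuming side, null-separated input is trivial to handle - e.g. in Python (a sketch):

```python
def split_null_separated(data: bytes) -> list:
    """Split e.g. `find . -print0` output: records end at NUL, so
    filenames containing spaces or newlines survive intact."""
    return [record for record in data.split(b"\0") if record]
```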



Moreover, null separated is by no means the last word.   What if nulls are allowed in the strings that you are manipulating?  Need either a quotation system, such as XML (and then we get into the issue of quotes upon quotes), or a strings-with-length system.



I have elsewhere talked about making all UNIX tools work with XML.  This is a generalization.



Strings-with-length is most general.  Possibly fragile.  Possibly XML clauses wrapped around simple "obvious" quoting.
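
A strings-with-length scheme might look like length-prefixed framing (an illustrative encoding - the 4-byte big-endian length is my own arbitrary choice):

```python
import struct

def pack_strings(items) -> bytes:
    """Frame each item as a 4-byte big-endian length plus raw bytes.
    Any byte value - NUL and newline included - may appear in an item."""
    return b"".join(struct.pack(">I", len(item)) + item for item in items)

def unpack_strings(buf: bytes) -> list:
    """Invert pack_strings: read a length, slice out that many bytes."""
    items, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">I", buf, i)
        items.append(buf[i + 4:i + 4 + n])
        i += 4 + n
    return items
```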






Saturday, July 05, 2014

I wish that EverNote / OneNote had 1990s era Infocentral's linking

Why InfoCentral?



For the umpteenth time, I am trying to use EverNote to collect shopping research.  And it sucks because Evernote doesn't really have hierarchy.



Evernote has notebooks. And stacks. And tags.



OneNote is slightly, moderately, better than EverNote.  It has books, folders, groups of folders, and notes can have subnotes.  But that's it.  Oh, yes, it has tags.



Gmail has tags, aka labels.  Or are they folders?  Really, folders implemented by constraining the labels system.

 Better, but the tree-structured folder constraints make non-tree-structured labels harder to use.   Some labels want to be tree structured, some do not.



I think the problem is that developers are trying to maintain a paper mindset, using "abstractions" that behave somewhat like real objects.  Real paper manila folders cannot be arbitrarily recursively nested, and hence EverNote // OneNote should not. Bzzt!!! Wrong!!! I want to take advantage of what a computer can do that paper cannot do.



  And, yes, tags in theory can be used to implement everything that a folder hierarchy has - but only in theory.  Because to really accomplish this you have to create a really ugly tag naming system.



I have elsewhere posted about how I even want my tags to be organized, possibly in a hierarchy.  Because just plain searching through the approved list of tags can be a pain, when you have a lot of tags.





--



Gnashing my teeth about this, I reminisced about InfoCentral.  The very first note-organizing software that I used on a tablet PC - way back in 1996-7.



InfoCentral was by no means perfect, but it was better than tags, better than hierarchy.  InfoCentral was all about links between objects. Links that were reversible, unlike in a hierarchy.  But where you could use hierarchical browsing up to the point where it failed, and then "shake the tree".



So you could look at a family as



       Father - John

           Son - William

               Grandson - Simon

               Granddaughter - Evelyn

           Daughter - Sonia

                 Granddaughter - Mildred

Or shake the tree to look at it from somebody else's point of view

       William
           Father - John

           Son - Simon

           Daughter - Evelyn

           Sister - Sonia

               Niece - Mildred

and then continue browsing.

OK, so InfoCentral wasn't smart enough to know that son's son = grandson.

Or to group sons and daughters as children.  Or sisters and brothers as siblings.

And Infocentral wasn't smart enough to do the classic pivoting:
          Sales/Year/Month
          Sales/Month/Year   for month comparisons between different years

But Infocentral allowed me to do a lot of what I wanted.





I wish something like InfoCentral were available on the web, in Evernote or OneNote.



I'd love to have the time to extend the approach.





Thursday, July 03, 2014

Hidden Files in Perforce — Encodo Systems AG

Hidden Files in Perforce — Encodo Systems AG:



Security model:

  • user can see everything - file names, file contents
  • user can see file names, but not file contents
    • with an error indication if trying to access forbidden file contents
  • user can see neither file names nor file contents
    • with an error indication "some information was forbidden for you to see"
    • with no error indication
Different error models may apply when scanning / listing directory trees / enumerating, versus probing for a single filename / identifier.


Any query that might potentially return multiple file objects - e.g. opening by "filename", on a system where there can be multiple file objects with the same name, disambiguated by extra metadata (keywords, version numbers) - can have the above apply.


Filenames are just one form of metadata that can apply to file objects.   Other metadata may apply: keywords, version numbers, cryptographic signatures.    Should be able to handle the situation where some but not all metadata is accessible:

     e.g. filename is allowed, file contents access is allowed, but access to certain crypto signatures is not allowed - may not even be allowed to see who has signed things.

     Each such metadata instance should have any of the above properties: visible, forbidden with error notification, forbidden silently.
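
The three visibility levels might be modeled as an enum plus a listing function (a sketch of the model, not any real Perforce API):

```python
from enum import Enum

class Visibility(Enum):
    VISIBLE = "visible"
    FORBIDDEN_WITH_ERROR = "forbidden, error reported"
    FORBIDDEN_SILENT = "forbidden, silently omitted"

def list_directory(entries: dict):
    """Apply the model to a directory listing: visible names are shown,
    error-forbidden entries surface only as a count ('some information
    was forbidden for you to see'), and silent ones simply vanish."""
    shown = sorted(n for n, v in entries.items() if v is Visibility.VISIBLE)
    hidden_with_error = sum(v is Visibility.FORBIDDEN_WITH_ERROR
                            for v in entries.values())
    return shown, hidden_with_error
```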


This extends past visibility to permissions such as writeable, appendable.


Similar treatment for "obliterate" - completely removing an object from repository.  E.g. removing proprietary code erroneously checked in to an open source project, or vice versa:

Such removal is just like a permissions failure, with no possibility of getting around it (except possibly for backups...).