Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Saturday, November 13, 2010

Content Based Query Install

Throwing documents into my document archive, I realize one of the forms of semantic exposure of deduplication that I want:

When I say "put(file)", if file is already in the archive, I want to get told "Hey, you, the file is already in the archive!!!"

Perhaps it is already there, just not under the name I want to put it under.

put(Archive,nameInArchive,file) => warn if already there

put(Archive,nameInArchive,file,force=1) => force install, even if already there


Interactively, I want the default to be "don't install if already there".
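A minimal sketch of that put() behavior, assuming a toy in-memory archive keyed by SHA-256 content hash. The Archive class, its fields, and the (installed, existing_names) return value are hypothetical, made up for illustration; only the put(Archive, nameInArchive, file, force) shape comes from the notes above.

```python
import hashlib


class Archive:
    """Toy in-memory archive; a hypothetical stand-in for a real document store."""

    def __init__(self):
        self.by_hash = {}  # content hash -> bytes
        self.names = {}    # name in archive -> content hash

    def names_for(self, digest):
        """All archive names that currently point at this content hash."""
        return sorted(n for n, h in self.names.items() if h == digest)


def put(archive, name_in_archive, data, force=False):
    """Install data under name_in_archive unless the same content is already stored.

    Returns (installed, existing_names), where existing_names lists the names
    the content was already filed under before this call.
    """
    digest = hashlib.sha256(data).hexdigest()
    existing = archive.names_for(digest)
    if existing and not force:
        # Already in the archive, possibly under a different name: warn, don't install.
        return False, existing
    archive.by_hash[digest] = data
    archive.names[name_in_archive] = digest
    return True, existing
```

An interactive front end could print the returned names as the "Hey, you, the file is already in the archive" warning, and retry with force=True only if asked to.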

Editing History

Messing around with Mercurial to keep my stuff in sync, while working on ulfs.

While I am reasonably happy with my scheme for synchronizing partial checkins and checkouts - not merging, but making candidate for merge - it's not yet ready for prime time. Hence my use of Mercurial.

Can't say continued use of Mercurial, because I have only lately switched to using Mercurial from Git.

Anyway: historically, the main reason for wanting to be able to edit history has been legal. E.g. if a court settlement or a sale meant that you were not supposed to have versions of a file that was determined not to belong to you.

Today, another reason: Mercurial on Cygwin kept dying. Looks like it was related to large files. I really wanted to remove those large files from the history, but the "Thou shalt not edit history" attitude got in the way. Used hg strip, but that was too much; I really just wanted to delete a few entries. Not enter a counteracting entry, but delete the original.

Really, I am using DVCS not just for revision control of a single project, but for history tracking of an archive of largely unrelated files.

This should not be that hard. Mercurial docs make noise about names incorporating all content checksums. ... Well, while that might be a good integrity check, it suggests that such content based naming is not so great an idea if it makes history editing hard.

Realistically, all versions of all objects should have names separate from content hashes.

Scratch that: s/name/key/

All versions of all objects should have (a) content hash names, and (b) non-content hash names.

This applies just as much to the "indexes" that list all file versions in a contour, as to individual files.
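A minimal sketch of a version record carrying both kinds of key, assuming SHA-256 for the content hash and a random UUID for the non-content key. The Version class and its field names are hypothetical; an index listing the versions in a contour could be stored as just another such record.

```python
import hashlib
import uuid
from dataclasses import dataclass, field


@dataclass
class Version:
    """One version of one object (a file, or an index of file versions)."""

    data: bytes
    # (b) Non-content key: assigned once, independent of the bytes, so it
    # stays valid if the content is later redacted or removed.
    key: str = field(default_factory=lambda: uuid.uuid4().hex)

    @property
    def content_hash(self):
        # (a) Content hash key: derived from the bytes, useful as an
        # integrity check and for spotting duplicates.
        return hashlib.sha256(self.data).hexdigest()
```

Two versions with identical bytes share a content_hash but get distinct key values, so history links can refer to key and are not forced to change when content does.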

The history data structure is the interesting issue: should version-sets point to parent version sets, and, if so, by what name? If V3->V2->V1, what happens if V2 gets deleted? How do we convert V3->...->V1 to V3->V1? Especially if V3->V2 lives outside the current repository?

This suggests we may want to keep the version-hashes of deleted objects, if only to act as placeholders, and to prevent them from getting recreated.
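A minimal sketch of the placeholder idea, assuming parents are referenced by non-content key and that a deleted version leaves its parent link and content hash behind. The History class and its method names are hypothetical, not anything Mercurial or ulfs provides.

```python
class History:
    """Toy version graph with tombstones for deleted versions."""

    def __init__(self):
        self.parent = {}         # version key -> parent key (None for a root)
        self.hash_of = {}        # live version key -> content hash
        self.tombstones = set()  # content hashes of deleted versions

    def add(self, key, content_hash, parent=None):
        if content_hash in self.tombstones:
            raise ValueError("content was deleted from history; refusing to recreate it")
        self.parent[key] = parent
        self.hash_of[key] = content_hash

    def delete(self, key):
        # Drop the live entry but keep the parent link as a placeholder,
        # and remember the content hash so it cannot be re-added.
        self.tombstones.add(self.hash_of.pop(key))

    def live_parent(self, key):
        """Nearest ancestor that has not been deleted, or None."""
        p = self.parent[key]
        while p is not None and p not in self.hash_of:
            p = self.parent[p]
        return p
```

With V3->V2->V1, deleting V2 makes live_parent("V3") return "V1" without rewriting V3's own record, which matters if the V3->V2 link lives in a repository we cannot touch.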