The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Sunday, May 31, 2009

Towards my own git-like VC tool - user level filesystem in a directory tree, filesystem in a file.

I am not happy with git. I want to improve on it. It's depressing to think of how much good stuff would need to be duplicated, but anyway...

I want to explore my ideas on how branches, lines of development, tags, and subprojects work.

I think the filesystem oriented design is the way to go. Strictly speaking, I first saw this in svn, but Linus certainly took it further.

Think of it as an abstract interface to a filesystem - a filesystem embedded in a directory tree. Or possibly in a file. E.g. an XML file. I'll concentrate on the filesystem in a directory tree idea first, but want to always keep filesystem-in-a-file, filesystem-in-XML, filesystem-in-tar-archive, etc., around at the back of my mind.

The API to such a user level filesystem is not UNIX kernel level open/read/write. In particular, we cannot assume file handles. No state apart from the filesystem itself.

The API must interface to an existing filesystem.

The API should be (a) in a scripting language, and (b) at the *IX command level. The mapping between them should be 1:1 as much as possible - i.e. I am lazy, and do not want to have to define both. I want to be able to automatically generate the command line tools from the scripting modules.
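A minimal sketch of the 1:1 mapping, in Python rather than Perl and under stated assumptions: the class `FSinF`, its `put` and `copy` methods, and the helpers `build_parser` and `run` are all hypothetical names invented for illustration. The idea is that each public method becomes a subcommand and each parameter becomes a positional argument, so the command line is derived from the module instead of being defined twice.

```python
import argparse
import inspect


class FSinF:
    """Hypothetical in-memory stand-in for an FSinF module (not a real library)."""

    def __init__(self):
        self.files = {}

    def put(self, data, file_dst):
        """Store data under file_dst."""
        self.files[file_dst] = data
        return file_dst

    def copy(self, file_src, file_dst):
        """Copy file_src to file_dst inside the filesystem."""
        self.files[file_dst] = self.files[file_src]
        return file_dst


def build_parser(cls):
    """Derive one subcommand per public method, one positional per parameter."""
    parser = argparse.ArgumentParser(prog=cls.__name__)
    subcommands = parser.add_subparsers(dest="command", required=True)
    for name, func in inspect.getmembers(cls, inspect.isfunction):
        if name.startswith("_"):
            continue
        sub = subcommands.add_parser(name, help=inspect.getdoc(func))
        for param in list(inspect.signature(func).parameters)[1:]:  # skip self
            sub.add_argument(param)
    return parser


def run(obj, argv):
    """Parse argv and call the method with the same name as the subcommand."""
    args = vars(build_parser(type(obj)).parse_args(argv))
    method = getattr(obj, args.pop("command"))
    return method(**args)


fs = FSinF()
run(fs, ["put", "hello", "a.txt"])
run(fs, ["copy", "a.txt", "b.txt"])
print(fs.files)  # {'a.txt': 'hello', 'b.txt': 'hello'}
```

A real command line tool would pass `sys.argv[1:]` to `run` and would need a persistent store behind the object, since each command invocation is a separate process.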

The filesystem should be layerable. E.g. for version control, I want a layer that handles my VC concepts - branches, tags, lines, etc. I want this layered on top of a content based, deduplicated, filesystem. And I want that layered on top of a compression layer, i.e. a packing layer. I want to be able to design these layers separately, and mix and match.

I want this layering so that, ideally, I can steal other people's implementations.
E.g. I might want to steal Linus' git packfiles, or bzr's packfiles.
E.g. I will probably write my own content based layer, deliberately using a hash that has lots of collisions, to prove that I can handle collisions. But I would like to be able to use any other existing implementation.
E.g. I may want to implement things in a directory, or in some single file. If it is properly layered, that should just involve switching a layer. Although the performance considerations might be extreme.
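One way the mix-and-match layering might look, sketched in Python under stated assumptions: every layer exposes the same two calls (`put` and `get`, names chosen here for illustration) and wraps the layer below it. `RawLayer` and `PackLayer` are hypothetical; the dict in `RawLayer` stands in for a directory tree or a single file, and zlib stands in for a real packing format.

```python
import zlib


class RawLayer:
    """Bottom layer: a plain dict standing in for a directory tree or a single file."""

    def __init__(self):
        self.blobs = {}

    def put(self, name, data):
        self.blobs[name] = data

    def get(self, name):
        return self.blobs[name]


class PackLayer:
    """Compression layer: same put/get interface, delegates to the layer below."""

    def __init__(self, lower):
        self.lower = lower

    def put(self, name, data):
        self.lower.put(name, zlib.compress(data))

    def get(self, name):
        return zlib.decompress(self.lower.get(name))


raw = RawLayer()
packed = PackLayer(raw)
packed.put("notes.txt", b"hello " * 100)

print(packed.get("notes.txt") == b"hello " * 100)  # round trip through the pack layer
print(len(raw.get("notes.txt")) < 600)             # the raw layer holds compressed bytes
```

Because both classes share one interface, swapping the bottom layer (directory tree versus single file) would not require touching the layers above it.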

I want the UNIX command line interface to support such layering. E.g. I may want to look at a filesystem in something like a .git directory tree as a VC level - or at the content based level, or at the raw pack level.

I want the API in the scripting language so that I can have the layering in the script language, somewhat efficiently. But if layering modules are written in different scripting languages, say Perl or Python or Ruby, or even in C/C++ like git, I want to be able to layer through the command line as well.
I.e. command line layering required. Inside-script-language layering nice.

Given my history, I will probably start coding in object oriented Perl. Hope to leverage lots of code out of CPAN. Call it PerlFS? No, that name is taken. How about FSinF or FSinD - FileSystem in a File and FileSystem in a Directory (Tree). Perl_FSinF, Perl_FSinD. FSinF being the generic name, since directories are just filesystems.

I'm not scared of using a scripting language. Bzr argues they have achieved good performance in Python. I like portability.
As usual, ultimately I would like the system to install as a single file. Like scons.py. Or at least a directory.

Thoughts about the FSinF (FileSystem in File/Directory) Interface

Can't have primitives like UNIX open/read/write/close. These assume state - file descriptors.

Interface must specify both files on the FSinF filesystem, as well as data on the native filesystem.

Operations such as:


FSinF copy fileSrc fileDst
Copy fileSrc (in FSinF) to fileDst (in FSinF)

FSinF get fileSrc fileDst
get contents of fileSrc (in FSinF) and write to fileDst (in native filesystem)

FSinF put fileSrc fileDst
put contents of fileSrc (in native filesystem) to fileDst (in FSinF)

If we have syntax to distinguish native and FSinF filesystem names, this might be a single command "copy", although get/put is nicely documenting.
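A minimal sketch of the single-command variant, assuming a made-up `fsinf:` prefix to mark names inside the FSinF (the post does not specify any syntax). The dict `store` stands in for the FSinF filesystem. Each call opens, transfers, and closes in one step, so no file handle survives between operations.

```python
import os
import tempfile

FSINF_PREFIX = "fsinf:"  # hypothetical naming convention, not from the post

store = {}  # stand-in for the FSinF filesystem: name -> bytes


def read_any(name):
    """Read whole contents from either the FSinF store or the native filesystem."""
    if name.startswith(FSINF_PREFIX):
        return store[name[len(FSINF_PREFIX):]]
    with open(name, "rb") as f:
        return f.read()


def write_any(name, data):
    """Write whole contents to either the FSinF store or the native filesystem."""
    if name.startswith(FSINF_PREFIX):
        store[name[len(FSINF_PREFIX):]] = data
    else:
        with open(name, "wb") as f:
            f.write(data)


def copy(file_src, file_dst):
    """One command covering copy, get, and put, depending on the prefixes."""
    write_any(file_dst, read_any(file_src))


with tempfile.TemporaryDirectory() as tmp:
    native_in = os.path.join(tmp, "in.txt")
    native_out = os.path.join(tmp, "out.txt")
    with open(native_in, "wb") as f:
        f.write(b"data")
    copy(native_in, "fsinf:a")    # acts as put
    copy("fsinf:a", "fsinf:b")    # acts as copy
    copy("fsinf:b", native_out)   # acts as get
    with open(native_out, "rb") as f:
        result = f.read()

print(result, sorted(store))  # b'data' ['a', 'b']
```

Separate `get` and `put` commands would just be thin wrappers that add the prefix to one side.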

Moving and Renaming

FSinF move fileSrc fileDst
move or rename fileSrc (in FSinF) to fileDst (in FSinF)

Again, syntax could allow moving and renaming between FSinF and the native filesystem.

Inter-filesystem operations

get and put are inter-filesystem operations. We don't need to limit ourselves to getting from the FSinF to the native filesystem. We could transfer between two different FSinFs, FSinF1 and FSinF2. Possibly might stage through the native filesystem, but not necessarily always.

Parts of Files

Don't want always to operate on whole files. Although that is natural.

May want to specify parts of files:
  • byte offset regions
  • line regions
  • XML clauses
  • record numbers
I don't think that we need to define these at the level of the FSinF API. They could be FSinF filesystem specific.

At the command line level, these would be optional parameters.


FSinF get -range "line 40:line 100" file
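A sketch of how such a range argument might be interpreted, assuming the `unit N:unit M` form shown above. The functions `parse_range` and `get_range` are hypothetical, and the choices of 1-based inclusive line numbers and a `byte` unit with an exclusive end are assumptions made here, not something the post specifies.

```python
def parse_range(spec):
    """Parse 'line 40:line 100' into ('line', 40, 100). Units must match."""
    start_text, end_text = spec.split(":")
    start_unit, start = start_text.split()
    end_unit, end = end_text.split()
    if start_unit != end_unit:
        raise ValueError("mixed units in range: " + spec)
    return start_unit, int(start), int(end)


def get_range(data, spec):
    """Return the part of data (bytes) selected by spec."""
    unit, start, end = parse_range(spec)
    if unit == "line":
        lines = data.splitlines(keepends=True)
        return b"".join(lines[start - 1:end])  # assumed 1-based, inclusive
    if unit == "byte":
        return data[start:end]                 # assumed 0-based, end exclusive
    raise ValueError("unsupported unit: " + unit)


text = b"one\ntwo\nthree\nfour\n"
print(get_range(text, "line 2:line 3"))  # b'two\nthree\n'
print(get_range(text, "byte 0:byte 3"))  # b'one'
```

Units such as XML clauses or record numbers would be additional branches that only some FSinF types support.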

Hints and Requirements

Different filesystems have different semantics. E.g. "remove" may mean "remove completely", or "remove from current version, but keep earlier versions around".

Such semantics are FSinF specific. Optional arguments.

May want to indicate whether the semantics are mandatory or optional. E.g. "remove completely", motivated by a desire to remove copyright-infringing material, is mandatory: it must be done, or an error must be returned if it cannot be done.


FSinF remove -require hard-removal file ...

FSinF remove -optional soft-removal file ...
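A sketch of the require/optional distinction, under stated assumptions: `VersionedStore` is a hypothetical store that keeps history and therefore cannot do hard removal. Required semantics it cannot honour produce an error before anything is changed; optional ones are treated as hints and reported back if ignored.

```python
class UnsupportedSemantics(Exception):
    """Raised when a required semantic cannot be honoured."""


class VersionedStore:
    """Hypothetical store that can only do soft removal (history is kept)."""

    supported = {"soft-removal"}

    def __init__(self):
        self.current = {}
        self.history = []

    def put(self, name, data):
        self.current[name] = data
        self.history.append((name, data))

    def remove(self, name, require=(), optional=()):
        missing = set(require) - self.supported
        if missing:
            # Fail before touching anything: a required semantic is all-or-nothing.
            raise UnsupportedSemantics(", ".join(sorted(missing)))
        # Optional semantics are hints: honoured when supported, ignored otherwise.
        ignored = set(optional) - self.supported
        del self.current[name]
        return sorted(ignored)


vs = VersionedStore()
vs.put("a.txt", b"x")
print(vs.remove("a.txt", optional=["hard-removal"]))  # ['hard-removal'] was ignored

vs.put("b.txt", b"y")
try:
    vs.remove("b.txt", require=["hard-removal"])
except UnsupportedSemantics as err:
    print("refused:", err)
```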

Other Filesystem Operations

FSinF remove files ...

FSinF create files ....
Empty files created.
Possibly specify types.

FSinF mkdir dirname

FSinF mkfs fsinf-file-or-dirname

FSinF fsck fsinf-file-or-dirname

FSinF chperm files...
Changes permissions
Also used to change owner, group name, etc.
Expect to extend:
May start off with UNIX permissions, or ACLs
May create my own extra permission classes and ACLs, more flexible than native OS.

FSinF metadata add file metadata...
FSinF metadata remove file metadata...
Manipulate extra metadata associated with files

FSinF get file1 file2 ...
atomically get consistent file versions

FSinF atomic compare-and-swap old-file oldcmpdata newfile

FSinF atomic compare-and-put old-file1 cmpdata1 newfile2

Have the "server" do the comparison, and then put/swap.
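A sketch of compare-and-put with the comparison done on the "server" side. `Store` is a hypothetical in-memory stand-in. The lock lives entirely inside one call, so the caller holds no state between operations, unlike explicit lock/unlock.

```python
import threading


class Store:
    """Hypothetical in-memory 'server' side of an FSinF."""

    def __init__(self):
        self._files = {}
        self._lock = threading.Lock()

    def put(self, name, data):
        with self._lock:
            self._files[name] = data

    def get(self, name):
        with self._lock:
            return self._files.get(name)

    def compare_and_put(self, name, expected, new_data):
        """Write new_data only if the current contents equal expected."""
        with self._lock:
            if self._files.get(name) != expected:
                return False
            self._files[name] = new_data
            return True


s = Store()
s.put("counter", b"1")
print(s.compare_and_put("counter", b"1", b"2"))  # True: contents matched
print(s.compare_and_put("counter", b"1", b"3"))  # False: contents are now b"2"
print(s.get("counter"))                          # b'2'
```

A client that gets `False` would re-read the file and retry, which is the usual pattern for optimistic updates.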


FSinF lock file ...
FSinF unlock file ...
Although I dislike the state implied by locking.

Layering, Access to

E.g. in a directory ~/myproject
with a subdirectory .git

I would want to be able to say

FSinF -type git move file1 file2

But, I might also want to say, at the level of the content based layer

FSinF -type git-sha-content-based-fs remove 0x6416fec17...

I might also have a .svn link, so I may want to say

FSinF -type svn get "-r4 file1 "
note how "-r4 file1" almost looks like a filename / specification to the svn filesystem.

I.e. I may want to have several different filesystems under a tree, such as .git / .svn. Or even my old CVS / CVS.other-repo stuff.

But I also want to be able to look at the layers in any of them.

Layers I want

Raw filesystem. I.e. basically a NOP.

FSinF pack:
Compacts the above.
Probably needs some form of delta specification, for delta compression.

FSinF content-based
Uses hashes to deduplicate.
Me, I want to be able to handle hash collisions. Want both stupid hashes, like UNIX compress, and fancier hashes almost guaranteed to be collision free.
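A sketch of a content-based store that tolerates collisions, under stated assumptions: `weak_hash` is a deliberately bad hash invented for illustration, and `ContentStore` is hypothetical. Each hash value maps to a bucket of distinct contents, and a full byte comparison decides whether data is already present, so the key is the pair (hash, index in bucket) rather than the hash alone.

```python
def weak_hash(data):
    """Deliberately poor hash: only 4 possible values, so collisions are common."""
    return sum(data) % 4


class ContentStore:
    """Hypothetical deduplicating store keyed by (hash, index within bucket)."""

    def __init__(self, hash_func=weak_hash):
        self.hash_func = hash_func
        self.buckets = {}  # hash value -> list of distinct contents

    def put(self, data):
        """Store data once and return its key."""
        h = self.hash_func(data)
        bucket = self.buckets.setdefault(h, [])
        for i, existing in enumerate(bucket):
            if existing == data:  # full comparison, not just the hash
                return (h, i)
        bucket.append(data)
        return (h, len(bucket) - 1)

    def get(self, key):
        h, i = key
        return self.buckets[h][i]


cs = ContentStore()
k1 = cs.put(b"a")   # 97 % 4 == 1
k2 = cs.put(b"e")   # 101 % 4 == 1, collides with b"a"
k3 = cs.put(b"a")   # duplicate content, deduplicated
print(k1, k2, k3)   # (1, 0) (1, 1) (1, 0)
print(cs.get(k2))   # b'e'
```

Passing a stronger function as `hash_func` (for example one built on `hashlib.sha1`) leaves the collision handling in place but makes it very unlikely to trigger.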

FSinF checksummed
Adds checksums, verifies integrity
Maybe even ECC
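A sketch of the checksummed layer as a pair of functions, assuming a CRC-32 prefix stored as 4 big-endian bytes (a format chosen here for illustration). `wrap`, `unwrap`, and `ChecksumError` are hypothetical names. This only detects corruption; ECC, which could also repair it, is not shown.

```python
import zlib


class ChecksumError(Exception):
    """Raised when stored data does not match its checksum."""


def wrap(data):
    """Prefix data with its CRC-32 as 4 big-endian bytes."""
    return zlib.crc32(data).to_bytes(4, "big") + data


def unwrap(stored):
    """Verify the CRC-32 prefix and return the original data."""
    expected, data = int.from_bytes(stored[:4], "big"), stored[4:]
    if zlib.crc32(data) != expected:
        raise ChecksumError("checksum mismatch")
    return data


stored = wrap(b"hello")
print(unwrap(stored))  # b'hello'

corrupted = stored[:-1] + b"x"  # flip the last byte
try:
    unwrap(corrupted)
except ChecksumError as err:
    print("detected:", err)
```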

How about:
FSinF mirrored?

FSinF encrypted.

FSinF signed
but not encrypted

FSinF version-control
version controlled
in my dreams, with my extended semantics.

FSinF dirtree, file, xml, tar ...