Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Iagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, June 12, 2009

Implicit vs. Explicit Filenames

In an earlier post I talked about creating a UNIX command line tool for user level filesystems.

I described primitives such as
    FSinF get fileSrc fileDst
    get contents of fileSrc (in FSinF) and write to fileDst (in native filesystem)

    FSinF put fileSrc fileDst
    put contents of fileSrc (in native filesystem) to fileDst (in FSinF)
I must admit that I was thinking in terms of UNIX like filesystems, where the user specifies the filename. However, there is a more basic sort of filesystem: one where the storage system assigns the name.

My first real encounter with such a system was the MH Mail Handling system. Other mail systems are similar: when you save or file an email, you don't have to give it a name. MH gave it a number, sequentially incrementing. Outlook doesn't even do that, although there is probably a message ID somewhere. Apple Itunes' music manager is similar. The user typically looks at listing of the files in a folder, using metadata that may be extracted from the file, or otherwise associated. In other words, there is a browse/search/select interface. It is harder to automate than a simple filename interface, because uniqueness is harder to guarantee for a query selection.

Content based filesystems are similar. E.g. git, where the name is the hash.

Some content based filesystems cannot handle conflicts - two different files with the same hash. Git seems to have little provision for conflicts, although I have been told by Linus that it handls them. My own work in content based filesystems handles hash conflicts - the filename is the hash, with a version number to handle conflicts. The comparison to handle conflicts can be avoided, at the risk of data loss (such as occurs in other content based filesystes).

In any case: for a strictly hash based filesystem that does not handle conflicts, the user could calculate the hash outside and install it using a filename interface. However, for a content based filesystem such as mine, only the filesystem can return the filename.

This suggests an interface:
    fileDst := FSinF put fileSrc
    put contents of fileSrc (in native filesystem)
    into fileDst (in FSinF). FSinF specifies the filename.
where the storage system (FSinF) returns the filename.

Call these two types of filesystem implicitly versus explicitly name filesystem.

Obviously, an implicit filesystem can be built on top of an explicit filesystem. The wrapper calculates the names.

An explicit filesystem can be built on top of an implicit filesystem, but it needs bootstrapping: one file that can be located without knowing the implicit name, that contains the mappings of explicit to implicit names.

Version Control - File Branches and Directory Tree Branches

You have file version objects. These are the contents of a file at some particular point in time.

You have file objects.

I know that file objects are not necessarily the right thing. E.g. Linus' objection, that is is better to have tools that can track pieces of code around, and automatically figure out, e.g. when a file has been renamed or a function has been moved between files. E.g. the old Cray and IBM version control systems that treated entire projects as card decks, and could version particular regions or intervals of cards. But these just say that there should be provision for subobjects at finer granularity than file objects - line objects, function objects. expression objects? Character objects? When I start thinking of XML as source code, such things become natural. Nevertheless, for now, file objects are a natural unit.

There are file name objects.

File name objects are grouped into directory tree objects. (We'll probably just say tree objects.) There obviously should be no conflicts - no two file objects should have the same pathname within a tree object. If a given file object appears at two dfferent places in a sdirectory tree, they are logicaly linked. We may want these links to be hard links a change to one fileame is reflected in a change in the others), or copy-on-write links (chanes break the links). Soft links, such as UNIX symlinks, are different sorts of file objects.

It s possibly to traverse from file version objects to file objects, and hece to filename objects and directory tree objects. And vice versa.

File objects may be versioned. Obviously.

File objects may have branches or lines. Lists of file version objects. Wth description of intent: if "on" branch1 for fileA, a new checkin of fileA is made to branchA, extending branch1.

It is possible to go from any particular file version object to file objects, and to all file object branches or lines associated with the file object.

File object branches or lines are associated with individual file objects. (Often, file name objects.) Usually they are not visible to users, since users normally care about multiple files, i.e. directory tree objects.

Directory tree objects may have branches or lines.

Directory tree branches or lines may refer to particular file object branches or lines. This is an indication of *intent*. It says that any new checkins on those file object branches or lines should be included in the directory tree branch or line. After appropriate testing.

Directory tree branches or lines may refer to particular directory tree file version sets. Which are, of course, objects themselves. A directory tree version set is a set of filename objects, file objects, and file version objects. I.e. a directory tree file version set object is a particular checkpoint of file versions.

Directory tree branches refer to sets or sequences of directory tree file version set objects. Typically aranged in a directed graph. However, where in git branches are really just these tree objects, which point to each other, in my ideal version control system directory tree branch objecs are separate from and conain additional information beyond the directory tree file version set objects.

The directory tree file version set objects are just particular points or contours of file objects. The file objects themselves have their own lines or branches. File object versions that belong to file object branches that are part of a directory tree object, but which have not yet been incorporated into a directory tree file version set object for that branch, can be said to be EXTRAPOLATIONS of the branch. Similarly, file versions (say file.v2) that belong to file branches that are between other versions file.v1 and file.v3, where file.v1 and file.v3 belong to file version set objects on the branch, but where file.v2 does not, can be said to be INTRAPOLATIONS of the branch. Logically, such intrapolated file versions belong to the directory tree's history, but they are not first class. It may be useful to know abot them, so that you can go back to them while trying to figure out where a bug was introduced. But they do not necessarily belong to a consistent set or contour, as reflected in the file version set object.

Directory tree branches or lines may be bound to different file object branches or lines at different times. Such varying tree/file branch or line bindings should be recorded historically. Notice that there is a difference beween current bindings, the bindings at any particular point in time, and the set of all such.

Branches or lines are necessarily history objects. There is a distinction between the branch, and the branch head.

In git, branches seem mainly to be poiners between objects or tgree versions. I want branches to be first class. git gives examples of rebasing, followed by garbage collection. I want the excursions and backtracking to be recorded in the tree history. Of course, it should be possible to change history - to go v1 -> v2 -> v3 -> v4 / backtrack, and get a pruned history v1 -> v2 -> v3 -> v5. But I want also to include
  • v1 (time t1)
  • v1->v2 (time t2)
  • v2->v3 (time t3)
  • v3 was backed out, and reverted to v2 (time t4)
  • v3->v4 (time t5)
Now that if the file object branch or directory tree object branch is itself a file, then the slightest change in history will create a new version. (A new file object branch version file.) We can expect that they will pack nicely.

What does it mean to have a subtree branch? The subtree branch is just a directry ree branch, that does not contan all of the (current) file objects. Because checkins to subtree branches automatically are made to individual file object braches, they automatically become candidates for inclusion in other directory tree branches that refer to that file object branch.

Branching is hard or soft. Hard branches actually break the default, implicit, connection with other directory tree objects sharing a file object branch. (Paradoxically, this is amost exactly the opposite meaning as hard links.) Directory tree branches may share file object branches in a soft manner: changes to soft shared branches are candidates for inclusion , i.e. they are extrapolated, in other directory tree branches sharing that file object branch.

Directory tree branches or lines will probably be named.

File object branches or lines probably usually do not have names. They are made as side effects of the user visible directory tre branches or lines. E.g. if a directory tree branch declines to accet a file version object that was checked in on a soft shaed file object branch, then, implicitly, a new branch is probaby going to be created for that file iobject. We don't want the user to have to create such names. Creating names is hard work. We probably want to annotate the file version objecs and file branch objects with the directory tree branch or lines that led to their creation - also, for that matter, which diectory tree branches incorporate tyhem - we do nolt want to use this in the naming of the file object branch, since it would incorrectly imply that the file object branch belongs exclusively to that directory tree iobject branch. Which is not the case. A file object branch belongs to many directory tree branches or lines, especially when there is soft sharing (extrapolation or intrapolation).

Directory tree objects may be logically assembled out of other drectory tree objects; and/or directory tree object branches may be logically assembled out of other directory tree object branches. Probaby with simple assembly rules: no conflicts, or, override on a per-file-object basis, or, override on a per subtree basis, or ...

Although directory trees may be assembled this way, it may be wisest to flatten this informaion for recording into a directory tree file version set. This way we avoid revising history on one branch affecting the history recorded on others that include that branch. However, we probaby want to record the overlaying history.

This flattening is rather like AMD's build-list.

Tags are normally applied to directory tree file version set objects.

Tags may also be applied to subtree objects or file version objects. e should distinguih "whole" and "partial" tags.

Tags applied to branch objects are, naturally enough, good candiates for branch names.

Tags themselves are versioned. E.g. "This tag XYZZY applied to ... at time t0, and ... at time t1".

--

What data structure? Files? I keep coming back to relational tables - albeit ones with many conraints. But my usual objects to RDBMSes apply.