The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Monday, September 18, 2006

Ideal Communication Tool

On 9/17/06, Dean Kent wrote:
So - let me ask you a question: Do you have an idea of what would be your
ideal tool for all your communication needs? Without sticking to what
exists today. For example, if everyone on Usenet would instantly convert to
this new tool as well as everyone else you currently communicate with, or
download information from, etc. What kind of things would such a tool need
to do in order to satisfy you (Yes, I read your post all the way through,
but it seems like much of it was based upon 'what is', rather than 'what
would be ideal'.)

Just curious... (of course, you can post the answer to comp.arch if you
wish, rather than directly to me - if you desire to answer it at all).

I may post it to my blog, but it's off-topic for comp.arch. Not sure what USEnet group it would be on-topic for.

First off, "without sticking to what exists today" is impractical. The purpose of a communication tool is communication - the best tool is useless if there is nobody to talk to. So, a first requirement is that the tool talk all of the standard protocols - email, nntp, IM, RSS, etc.

Let's start off with the text oriented tools:
I/O: email, USEnet news, blogging, IM'ing
I was about to put RSS down as input only, but that is not true - I may have a wiki site that generates an RSS feed to others.

Output: one of my key desires is multimode multicast: you compose a block of text, maybe with graphics - your favorite format, e.g. XML with graphics as SVG, bitmaps, etc. You want to be able to send it or post it to any subset of selected services - NOT cut and paste it from one service to another, e.g. write an email and then cut and paste it into your wiki. Instead, you write the text once, and then you can set one or all of the following:
Email: To, CC, and Bcc
USEnet newsgroups to post to
Blog sites to post to
e.g. I often want to post the same item to my inside-the-company internal blog and my external blogger blog
Wiki sites to post to
wikis are a bit odd in that they are context sensitive - typically you click on a page.
but nevertheless I find that I often need the same stuff posted in more than one wiki
Similarly, many wiki pages contain text boxes and the like, such as TWiki's comment.
All of these are really separate potential output streams
IM conversations to post to
Yes, occasionally I am in 3 or 4 IM conversations at once - not the same conversation, but several separate
- and I want to post something to more than one
FAX: send to fax
SMS: send to cell phone text

I want to be able to create a "list" which consists of any or all of the above.
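Here is a minimal sketch of what such a write-once, post-everywhere "list" might look like. The Message, Channel, and Multicast names are my own invention for illustration, not any existing tool's API:

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// One composed message, fanned out to any subset of output channels.
struct Message {
    std::string subject;
    std::string body;   // e.g. XML with embedded SVG, bitmaps, etc.
};

// A channel is anything that can deliver a Message: email, news,
// blog, wiki, IM, fax, SMS - all treated uniformly.
using Channel = std::function<void(const Message&)>;

struct Multicast {
    std::vector<Channel> targets;
    void add(Channel c) { targets.push_back(std::move(c)); }
    void post(const Message& m) {
        for (auto& t : targets) t(m);   // write once, post everywhere
    }
};
```

The point is only that the composition step is decoupled from the delivery step, so adding one more destination never means composing again.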

Note: above I am just talking about a user interface for media. I am not designing new media. Let's do a media mode taxonomy, and see if anything is missing:
Email: transient, deferred, disconnected, point to point
Blog: transient, semi-permanent, but temporal; deferred, disconnected, broadcast, one to many
IM: immediate, point to point (with multicast); transient, temporal
Wiki: permanent, reference - not intended to be temporal

Out of all of these, I think that what is missing are tools to assist you in creating both the transient, temporally contextual stuff, and the current latest reference stuff. E.g. I was in a meeting this AM. Rajeev wrote up minutes; I updated our POR (Plan of Record) document. It would be nice if we could get both at once.

I guess that I am really vociferously arguing against having to do things more than once.

Also... I think that everybody, every individual, needs a personal issue tracking system. Something to keep track of the highest priority thing to do next.

Note that YOU said "communications system". I said "communications and information management system". Communication is just one form of information. Memory is another. I want my communications, both incoming and outgoing, to be prioritized in the same list as my To-do items.

Most issue tracking systems are too heavyweight. You have to create an issue report. I want to be able to just say "Action Required" in a document, e.g. in an outgoing email, and have an entry in the issue tracker be automatically created (e.g. with a pointer back to the original document, possibly with the surrounding paragraphs extracted to explain the issue).

Similarly "Action Required By YYYY MM DD" should automatically do the date extraction thing. As should PORs, etc.
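A minimal sketch of that automatic extraction, taking the marker syntax above literally; the TrackerEntry type and the extract_actions name are assumptions for illustration:

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// A hypothetical tracker entry: the matched marker text serves as
// a pointer back to the original document.
struct TrackerEntry {
    std::string context;
    std::optional<std::string> due;   // "YYYY-MM-DD" if a date was given
};

// Scan outgoing text for "Action Required" markers, with an
// optional "By YYYY MM DD" date, and create one entry per marker.
std::vector<TrackerEntry> extract_actions(const std::string& text) {
    std::vector<TrackerEntry> entries;
    std::regex marker(R"(Action Required(?: By (\d{4}) (\d{2}) (\d{2}))?)");
    for (auto it = std::sregex_iterator(text.begin(), text.end(), marker);
         it != std::sregex_iterator(); ++it) {
        TrackerEntry e;
        e.context = it->str();
        if ((*it)[1].matched)
            e.due = (*it)[1].str() + "-" + (*it)[2].str() + "-" + (*it)[3].str();
        entries.push_back(e);
    }
    return entries;
}
```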

The personal information management system should track resources. So, e.g., if you commit to 3 days of work here, and 4 there, in the next 5 days, it should warn you of an overcommit.
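The overcommit check itself is trivial; a sketch, with illustrative names:

```cpp
#include <numeric>
#include <vector>

// Given commitments (in days) falling inside some window of
// available days, warn when the total exceeds capacity.
// E.g. committing 3 days here and 4 days there against a
// 5-day window is an overcommit.
bool overcommitted(const std::vector<double>& committed_days,
                   double available_days) {
    double total = std::accumulate(committed_days.begin(),
                                   committed_days.end(), 0.0);
    return total > available_days;
}
```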

... But you asked about communications, not calendaring, so I will not go too far down this path in this email.

Let's return to communication. We have already talked about multimodal multicast output. Now let's talk about input:

Basically, I see all of these separate input streams - email, RSS, IM, phone, etc.

I want my filtering going on, prioritizing. E.g. any communication from my wife or boss takes priority, no matter the input channel.
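A sketch of that channel-independent prioritization - every incoming item, whatever its transport, is ranked by sender first. The Item fields and the VIP list are assumptions for illustration:

```cpp
#include <algorithm>
#include <set>
#include <string>
#include <vector>

struct Item {
    std::string sender;
    std::string channel;   // "email", "rss", "im", "phone", ...
    std::string body;
};

struct Inbox {
    std::set<std::string> vips;   // e.g. wife, boss
    std::vector<Item> items;

    // VIP senders sort ahead of everything, regardless of channel;
    // stable_sort preserves arrival order within each priority class.
    std::vector<Item> prioritized() const {
        std::vector<Item> out = items;
        std::stable_sort(out.begin(), out.end(),
            [&](const Item& a, const Item& b) {
                return vips.count(a.sender) > vips.count(b.sender);
            });
        return out;
    }
};
```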

I want the same reader interface to be used for all inputs. Certainly easy enough to see for textual. For recorded voice or video, similarly easy.

For live voice or video, may be harder - you snooze, you lose. Although that's actually part of what I want to avoid. Record everything - all text, all voice, all visual inputs. If you are listening in on 2 phone meetings simultaneously while listening to email, have the ability to play back the last 2 minutes when something catches your ear. Compress... it's fairly easy to compress voice into 1/3rd the elapsed time. So, do small scale time shifting and scaling - listen in a minute or so behind. With this, you may actually be able to listen to several meetings all at once, without losing anything.

Speech to text - it's a help. I can manage 4 IM conversations, but only 2 phone meetings.

Enough for now. I will now cut and paste this to my blog.

Saturday, September 16, 2006

Refactoring returns by introducing booleans

Consider code that looks like:
  void foo() {
    if( some-condition ) return;
    if( some-other-condition ) { do-something; return; }
    ...
  }

Say this code is replicated in several places. Obviously we want to extract it into a common subroutine or class - the extract method refactoring.

If you can jump right to the end of the refactoring, great. But, I admit that I have occasionally tripped myself up doing this. (The example is, of course, oversimplified.)

So, here's a micro-refactoring that can help:

Extract to a function that returns a bool, named check_and_do_some_things_and_maybe_return()

  void foo() {
    if( check_and_do_some_things_and_maybe_return() ) return;
    ...
  }

  bool check_and_do_some_things_and_maybe_return() {
    if( some-condition ) return true;
    if( some-other-condition ) { do-something; return true; }
    return false;
  }

This requires editing each return.

If you want to avoid that, maybe add a boolean temporary to the original code - instead of returning immediately, delay all of the returns to

  void foo() {
    bool return_early = false;
    if( some-condition ) return_early = true;
    if( !return_early ) {
      if( some-other-condition ) { do-something; return_early = true; }
      if( !return_early ) ...
    }
    if( return_early ) return;
  }
THEN you can safely extract the method a line or statement at a time.
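For completeness, here is roughly where the micro-refactoring ends up once the guarded block is extracted: the flag travels into the new function and becomes its return value. The conditions and actions are stubbed out so the sketch is self-contained; in real code they are the original predicates and statements:

```cpp
// Placeholder predicates and action, purely so this compiles.
static bool cond1 = false, cond2 = false, did_something = false;
bool some_condition() { return cond1; }
bool some_other_condition() { return cond2; }
void do_something() { did_something = true; }

// The extracted method: return_early, formerly a local of foo(),
// is now the return value.
bool check_and_do_some_things_and_maybe_return() {
    bool return_early = false;
    if (some_condition()) return_early = true;
    if (!return_early) {
        if (some_other_condition()) { do_something(); return_early = true; }
    }
    return return_early;
}

void foo() {
    if (check_and_do_some_things_and_maybe_return()) return;
    // ... rest of foo ...
}
```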

Company abandons internal news servers; thoughts about PC versus Google

Minorly off-topic, but I feel impelled to note that Intel has just ZBB'ed its internal NNTP news servers. Actually, they were ZBB'ed many years ago, but volunteers kept them going. Those volunteers may now be ZBB'ed. New volunteers may arise; heck, I may; but the path of least resistance is to give up on getting USEnet news inside the company, and go to some external service. E.g. today I am posting from Google Groups.

Personal relevance to comp.arch: my employer's internal news servers have been my main connection to comp.arch since 1991. Brief exceptions while I was in Madison and at AMD. Prior to joining Intel I participated in comp.arch and its predecessor net.arch on news servers from the University of Illinois and from Gould and Motorola. I still maintain that I learned more computer architecture from comp.arch than I did in any school; moreover, I am fairly confident that I would never have gotten my job with Intel without my comp.arch mediated acquaintance with Bob Colwell.

Generic relevance to comp.arch: this is a trend. Actually two trends.

Trend #1 is that less and less personal computing can be done at work, and that more and more work related computing is "freeloading" on personally paid for computing.

Most people used to have only one email address that they used for both work and personal matters. You can still do this, but it is becoming increasingly hard to do so because companies like my employer do not allow you to access the corporate network from your own computers; you can only do so from a company owned device.
So you have a personal mail service, as well as your work mail service. Maybe your personal mail service is from an ISP, and changes whenever your ISP changes, you move, or when Qwest gets bought out by Verizon. Maybe you have your own domain.
But your company doesn't allow you to run POP across the firewall. Similarly for newsgroup access: your company doesn't allow you to run NNTP across the firewall.

This leads to Trend #2: Google. More generically, the rebirth of "Big Iron", centralized, computer service companies.
Google *is* "Big Iron". Maybe not in the IBM mainframe sense, but anyone who has seen a Google machine room knows that it is a completely different scale than a desktop or laptop PC.
For many years I tried to keep my personal computing environment PC based. I ran my mail reader on my laptop or desktop PC, sometimes via a client technology such as POP, IMAP, sometimes peer-to-peer stuff like SMTP. Similarly, up until now I have read news on my laptop or desktop PC. When I saved a file, it was saved on my PC's hard disk. I could not access my environment of saved files and email without being on my PC. Maybe I could read my email from other computers, but I did not have my mailreading environment on those other computers, so I tried to avoid doing so.
But, not being able to access my personal email from work - no POP, no ssh - was the last straw. I switched to Google mail. Now I can access my personal email from any computer - at work, at home, from my wife's computer. From my relatives' computers. I no longer need to drag my personal laptop around with me.

Downside: I cannot access my Google email when I am not connected to the net. For many years this was the biggest reason that I stayed PC based. Broadband took a long time to get to many of the places where I spend time, like Oceanside, Oregon, and the Ottawa river valley in Canada. Broadband is still not available in many of my favorite places, such as Eastern Oregon. Heck, cell phone service is not available. (I am waiting for reports of the Microsoft/KVH mobile broadband with interest.)
Perhaps most important for business folk, I cannot access Google email on a plane, when I am not connected to the Internet.
Yes, I know: you can access Google mail via POP, downloading it to a mobile PC where you can read it disconnected. But that just puts you back in the "your mailreading environment lives on only one PC" mode. So far as I know, there is no way to download Google mail to your PC, and then upload back to Google any annotations, tags, classifications, and spam markings you have made to your email while disconnected.
I hope that Google will soon remedy this, and provide disconnected operation, not just for email, but also for other Google services such as Google groups.

Interestingly, moving to Google mail has provided more freedom from the point of view of form factor. In my "my mailreading environment lives on a single laptop PC" days, I needed to have a laptop that met my minimum needs for all common situations. E.g. it had to have a big enough screen, enough disk, and a keyboard. But now that I am Google based I can seriously consider reading email on a keyboardless tablet in my living room, or a PDA, or... since I can always go to another device. I.e. I am more likely to buy a "widget" specialized computer now that I am using Google mail than I was when I used a PC.
I hypothesize that this is true not just of me, but also of other users. Perhaps the long awaited flowering of specialized devices for ubiquitous computing is now about to begin.

Terminology change: I used to read my mail on my laptop PC. Now I read it on Google, via a web browser that happens to run on a PC, but which could run elsewhere. I used to be a PC user. Now I am a Google user.

USEnet news is just another information service, like email. Same considerations apply. Since I have switched to Google mail, I might as well switch to Google groups. Ditto RSS, and other information services.
What I really want is to receive all of my information inputs in a common environment, that can seamlessly prioritize and sort email, USEnet news, RSS, regular news, IM, and telephony. Google is the most likely company to achieve this.

Interestingly, I have been forced into schizophrenia. My work information feeds are in one place, my personal feeds in another. At the moment it appears that the personal feeds on Google are more integrated, have better search abilities, etc., although far less storage.
Will this keep up? Or will the quality of information management at work play leapfrog with Google? I do not know... but I predict that at least some fraction of companies will just outsource their employees email, etc., to Google. I.e. I predict that Google will be able to provide a single stop for both work and personal information management. And that because of this, it will have a larger critical mass than companies that are stuck just supporting an individual's work computing and information needs.

Returning to trend #1: not only will less and less personal computing be done at work, but more and more work computing will be done personally... because the personal computing environment, whether Google based or whatever, is pulling ahead of the work environment. (Unless Google takes over the work environment, as predicted above.)
The item that sparked this post is just an example: reading USEnet newsgroups such as comp.arch is recreation, but I also fairly regularly post queries to newsgroups such as comp.lang.c, etc., for work related questions. Closing down the company's internal news service, of course, means that I will be now doing this using my personal computing resources. I.e. Google groups.
More evidence of this trend: back in the old days companies paid for 2nd phone lines for computer access. Nowadays you are expected to pay for your own broadband access, and to use it for after hours work. I keep meaning to take an income deduction for my broadband for tax purposes, since the logs plainly show that it is mostly used for work, not pleasure.

Summing up:
Prompted by: my company abandoning internal news servers.
Hypothesis: there is a trend away from PC based computing and information services, towards centralized computer services like Google.

The computer industry battle is not Intel versus AMD, or Microsoft versus Linux. It is the PC versus Google.

(Here, I use "Google" as representative of web based computing services, ubiquitously accessible so long as you have Internet access.)

Saturday, September 02, 2006

Lifestyle computers

Labor Day weekend, sitting on the beach - I'm not surfing because I am watching my daughter play in the sand. No book to read. It would be nice to be able to use my computer on the beach - but it would have to be a laptop with no airholes for cooling, no PC-Card slots. Sealed, no entry for sand or water. Plus a daylight visible screen.

Merging my diverged CVS trees

I hope that I can use Git's content based features to merge my diverged CVS trees.

Git, Renaming, etc

Git does not track file renames.

Linus's email, http://permalink.gmane.org/gmane.comp.version-control.git/217, is essentially correct, but shortsighted. Linus argues that file renames are just a special case of moving content around. E.g. how does file rename tracking help you if you take a source file foo.h, move all of its content to bar.h, and then put #include "bar.h" inside foo.h? Linus argues that a content based tool that can detect this sort of movement, is more useful than tracking renames in Git.

Linus is right. But, Linus is shortsighted. Linus is considering only one usage model: that of a programmer studying code history. There is a human in the loop.

I want to use a version control system such as Git to perform such actions when a human is not in the loop.

E.g. I want to use Git, instead of RCS, as the version control system in twiki. Then, if a user requests http://mytwikisite.net/OldTopic, I can tell him or her that OldTopic has moved to NewTopic. To do this, the content tracking system must be able to make a single suggestion.

Of course, a content tracking system can make a single suggestion. But, it may get confused - e.g. two different wiki pages may start off from a common ancestor. Some human may have to decide which is the most logical primary choice. If that choice has been made, why not record it? Which amounts to recording the "primary" renaming.

Similarly, even a wiki may want to have a content tracking system that has a human in the loop. But it is nice to have a fully automated system with no human in the loop, that takes advantage of recorded rename tracking.
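A sketch of the recorded-rename lookup for the no-human-in-the-loop case. The RenameLog name and interface are invented for illustration, and this sketch assumes no rename cycles:

```cpp
#include <map>
#include <optional>
#include <string>

struct RenameLog {
    std::map<std::string, std::string> renamed_to;

    // Record the human-chosen "primary" renaming, once.
    void record(const std::string& from, const std::string& to) {
        renamed_to[from] = to;
    }

    // Follow OldTopic -> ... -> NewTopic, chaining through any
    // subsequent renames; nullopt if the topic was never renamed.
    std::optional<std::string> current_name(std::string topic) const {
        bool moved = false;
        auto it = renamed_to.find(topic);
        while (it != renamed_to.end()) {
            topic = it->second;
            moved = true;
            it = renamed_to.find(topic);
        }
        return moved ? std::optional<std::string>(topic) : std::nullopt;
    }
};
```

With this, a wiki serving a request for OldTopic can answer "moved to NewTopic" deterministically, with no content analysis and no human in the loop.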

And, I hazard, even an SCM will need the fully automated version.


Rename tracking is a convenience.

Content tracking, with or without a human in the loop, is of additional value.

Content based filesystems and VC

It's time that I switched to Git for version control of my home directory - which has been CVS'ed for the last, what, 10 years, maybe longer - and which has also been RCS'ed, SCCS'ed, and, briefly, BK'ed.

A few months ago Linus told me that Git was ready for folks like me to use it. At the same time he also discussed the following issue:

Git is, essentially, a content based filesystem or object store. It uses the SHA-1 cryptographic hash of a file as that file's name.

When last Linus and I talked (which was also the first and only time we have ever talked in person, at a restaurant - we've only briefly talked via email and newsgroups otherwise) Linus assured me that Git could handle the case where two objects had the same hash. I haven't seen how it is handled in the code. The Git discussion groups seem to be populated by people who cannot imagine that a collision would ever occur. I'm willing to believe that Linus has handled it properly. I'm going to use this blog to discuss what properly is.

First off, the basics: I'm with Val Henson, http://infohost.nmt.edu/~val/review/hash.pdf: using compare-by-hash without being prepared to handle collisions is bad, Bad, BAD. Given a large enough collection of files, a hash collision will occur. Worse if one can be constructed by a malicious user seeking to corrupt a repository.

But I really like the convenience and efficiency of compare-by-hash --- I call it "content based" addressing. So I want a content based scheme that handles collisions.

I can't understand why everyone doesn't do it: it's pretty simple.

E.g. in the object store, you create an index that maps hashes to object names. The object names could still be the cryptographic hash, perhaps with an instance number appended: i.e. the first object in the repository that has hash F00 would be named F00-0, the next F00-1, etc.
If you are so confident that there will never be collisions, you will only see F00-0 names, right? What does it harm you to append the instance number?

Or, you could use some other scheme for creating a unique name within the repository, such as a sequence number. In fact, I would probably suggest using both such designation schemes: sequence numbers inside the repository are more likely to need to be changed when two different repositories are merged. Hashes are more likely to be unique - although, if you merge two repositories that both have F00-0 objects, which are different, one is going to be named F00-1 after the merge. But, again, for those who are confident that collisions will never occur, you will never see this _slight_ renaming.

The only way in which this deviates from compare-by-hash content based systems such as Monotone is that, when you are adding an object to the repository, you don't just stop when you have computed the hash, and have seen that the hash is already in the repository. You have to compare the new object to the object that is already in the database that has the same hash. I.e. you use the hash to determine who to compare to, but you still do a full comparison.

Note: to compute the hash of the new object, you have to scan all of its bytes. That's O(n) work. To do the full comparison is still O(n) work - albeit with a bigger constant multiplier, circa 3X (one scan for the hash, then a scan of both the new and old object for the compare).

This suggests that you may not want to use a cryptographic hash of the entire file's n bytes. You may want a sampled hash. E.g. a hash of the first kilobyte, or a random selection of bytes. The purpose of the hash is now just to reduce the probability of collision, not eliminate it.
(Of course, the real issue is that no hash ever eliminates the possibility of collision:
people just fool themselves that it does.)

Another trivial optimization is to use the object size as well as its hash. If two objects that have the same hash differ in size, you can avoid the full comparison. If you are one of those people who believe that SHA-1 hash collisions can never occur, then certainly they will never occur for objects of the same size. So, by your own standard of belief, the full file comparison will never be done.
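Putting these pieces together - hash as a hint, instance-numbered names, the size short-cut, and the full comparison - here is a sketch. The weak_hash is deliberately weak, standing in for SHA-1 purely so that collisions can actually be exercised; all other names are illustrative:

```cpp
#include <map>
#include <string>
#include <vector>

using Bytes = std::string;

// Deliberately weak hash (sum of bytes mod 16), so collisions occur.
unsigned weak_hash(const Bytes& b) {
    unsigned h = 0;
    for (unsigned char c : b) h = (h + c) % 16;
    return h;
}

struct ObjectStore {
    // hash -> all objects sharing that hash, in arrival order.
    std::map<unsigned, std::vector<Bytes>> buckets;

    // Returns the instance name "HASH-N", adding the object if new.
    // The full byte comparison only runs on a genuine hash collision
    // between objects of equal size.
    std::string add(const Bytes& obj) {
        unsigned h = weak_hash(obj);
        auto& bucket = buckets[h];
        for (size_t i = 0; i < bucket.size(); ++i) {
            if (bucket[i].size() != obj.size()) continue;  // size short-cut
            if (bucket[i] == obj)                          // full compare
                return std::to_string(h) + "-" + std::to_string(i);
        }
        bucket.push_back(obj);
        return std::to_string(h) + "-" + std::to_string(bucket.size() - 1);
    }
};
```

If collisions never occur, every name ends in "-0" and this behaves exactly like naming objects by hash; if they do occur, distinct colliding objects simply get distinct instance numbers instead of corrupting each other.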

So, let's compare what I propose - using the hash as a hint, but handling collisions - to what Monotone did, using hash as a name, not providing for collisions. Let's call them collision-handling-content-based, and collision-neglecting-content-based:

If you believe that hash collisions will never occur:

  • when naming objects in the repository, my collision-handling and Monotone's collision-ignoring approach will work the same.
  • when adding objects to the repository, my approach *might* require a full comparison - but the simple expedient of including the object size in the hash comparison avoids this in all of the cases where no hash collision occurs.

I.e. my collision handling approach and Monotone's collision ignoring approach are equivalent, so long as no collisions occur. And, of course, if collisions occur, Monotone breaks, but my approach still works.

The only reason that I can possibly imagine for not using my approach is code complexity. Why add code that will never be exercised? How do you test it? When I started writing rcs-file-merge - before I learned that Linus was writing Git - I planned to use the content-based approach. But I deliberately used a really weak hash - the UNIX checksum - to create collisions, so that I could test that I could handle them. I found collisions in my home directory. Of course, a stronger hash would be used in actual production.

I really, really, hope that Git handles hash collisions. As I describe above, it is very straightforward, and costs nothing in performance.