A few months ago Linus told me that Git was ready for folks like me to use it. At the same time he also discussed the following issue:
Git is, essentially, a content based filesystem or object store. It uses the SHA-1 cryptographic hash of a file as that file's name.
When last Linus and I talked (which was also the first and only time we have ever talked in person, at a restaurant - we've only briefly talked via email and newsgroups otherwise) Linus assured me that Git could handle the case where two objects had the same hash. I haven't seen how it is handled in the code. The Git discussion groups seem to be populated by people who cannot imagine that a collision would ever occur. I'm willing to believe that Linus has handled it properly. I'm going to use this blog to discuss what properly is.
First off, the basics: I'm with Val Henson, http://infohost.nmt.edu/~val/review/hash.pdf: using compare-by-hash without being prepared to handle collisions is bad, Bad, BAD. Given a large enough collection of files, a hash collision will occur. Worse if one can be constructed by a malicious user seeking to corrupt a repository.
But I really like the convenience and efficiency of compare-by-hash --- I call it "content based" addressing. So I want a content based scheme that handles collisions.
I can't understand why everyone doesn't do it: it's pretty simple.
E.g. in the object store, you create an index that maps hashes to object names. The object names could still be the cryptographic hash, perhaps with an instance number appended: i.e. the first object in the repository that has hash F00 would be named F00-0, the next F00-1, etc.
If you are so confident that there will never be collisions, you will only see F00-0 names, right? What does it harm you to append the instance number?
Or, you could use some other scheme for creating a unique name for the repository, such as a sequece number. In fact, I would probably suggest using both such designation schemes: sequence numbers inside the repository are more likely to need to be changed when two different repositories are merged. Hashes are more likely to be unique - although, if you merge two repositories that both have F00-0 objects, which are different, one is going to be named F00-1 after the merge. But, again, for those who are confident that collisions will never occur, you will never see this _slight_ renaming.
The only way in which this deviates from compare-by-hash content based systems such as Monotone is that, when you are adding an object to the repository, you don't just stop when you have computed the hash, and have seen that the hash is already in the repository. You have to compare the new object to the object that is already in the database that has the same hash. I.e. you use the hash to determine who to compare to, but you still do a full comparison.
Note: to compute the hash of the new object, you have to scan all of its bytes. That's O(n) work. To do the full comparison is still O(n) work - albeit with a bigger constant multiplier, circa 3X (one scan for the hash, then a scan of both the new and old object for the compare).
This suggests that you may not want to use a cryptographic hash of the entire file's n bytes. You may want a sampled hash. E.g. a hash of the first kilobyte, or a random selection of bytes. The purpose of the hash is now just to reduce the probability of collision, not eliminate it.
(Of course, the real issue is that no hash ever eliminates the possibility of collision:
people just fool themselves that it does.)
Another trivial optimization is to use the object size as well as its hash. If two objects that have the same hash differ in size, you can avoid the full comparison. If you are one of those people who believe that SHA-1 hash collisions can never occur, then certainly they will never occur for objects of the same size. So, by your own standard of belief, the full file comparison will never be done.
So, let's compare what I propose - using the hash as a hint, but handling collisions - to what Monotone did, using hash as a name, not providing for collisions. Let's call them collision-handling-content-based, and collision-neglecting-content-based:
If you believe that hash collisions will never occur:
- when naming objects in the repository, my collision-handling and Monotine's collision ignoing approach will work the same.
- when adding objects to the repository, my approach *might* require a full comparison - but the simple expediet of including the object size in the hash comparison avoids this -- in all of the cases where no hash collision occurs.
I.e. my collision handling approach and Monotone's collision ignoring approach are equivalent, so long as no collisions occur. And, of course, if collisions occur, Monotone breaks, but my approach still works.
The only reason that I can possibly imagine for not using my approach is code complexity. Why add code that will never be exercised? How do you test it? When I started writing rcs-file-merge -- until I learned that Linus was writing Git - I planned to use the content-based approach. But, I deliberately used a really weak hash - UNIX checksum - to deliberately create collisions, so that I could test that I could handle them. Found collisions in my home directory. Of course, a stronger hash would be used in actual production.
I really, really, hope that Git handles hash collisions. As I describe above, it is very straightforward, and costs nothing in performance.