ZFS - Wikipedia, the free encyclopedia: "With Oracle Solaris, the encryption capability in ZFS is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order."
But per-user encryption loses the ability to deduplicate across users. Cross-user deduplication is certainly desirable to save storage when there is replication of data, and possibly also to just plain observe such replication, as in spam. Consider an email server like Gmail.
I am going to make the simplifying assumption that deduplication only applies to data encrypted with the same key. It is possible that different plaintexts may encrypt to the same ciphertext, and therefore share storage via deduplication. But I will mostly ignore this possibility, and certainly not optimize for it. (Although - I am one of those belt-and-suspenders guys who prefers to actually verify that deduplicated data matches at both the hash and the raw level, rather than just hoping that the hash does not collide. In my deduplication work (as part of DVCS, merge-rcs-files) I have deliberately weakened the hash to increase collisions in order to test this. I would do the same thing for encrypted deduplicated data.)
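A minimal sketch of that belt-and-suspenders discipline (the 8-bit truncation below is my illustrative way of deliberately weakening the hash to force collisions; `dedup_store` and its dict layout are hypothetical):

```python
import hashlib

def weak_hash(data: bytes, bits: int = 8) -> int:
    """Deliberately weakened hash: keep only the low `bits` bits of SHA-256,
    so collisions are frequent and the raw-comparison path gets exercised."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full & ((1 << bits) - 1)

def dedup_store(store: dict, data: bytes) -> bool:
    """Store `data` in `store` (weak-hash -> list of blocks), deduplicating
    only when the hash matches AND a byte-for-byte comparison confirms it.
    Returns True if the data deduplicated against an existing copy."""
    bucket = store.setdefault(weak_hash(data), [])
    for existing in bucket:
        if existing == data:        # belt-and-suspenders raw verification
            return True             # genuine duplicate: share storage
    bucket.append(data)             # hash collision or new data: store it
    return False
```

With only 256 distinct weak-hash values, collisions between unrelated blocks happen constantly, so the raw-comparison branch is exercised rather than merely hoped past.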
I am also going to ignore systems where the deduplication service has access to all keys. Single point of failure.
Obviously: we want cross-user deduplication for less-sensitive stuff, but not for sensitive stuff. It is tempting to say no-encryption for non-sensitive stuff.
But this nests: we may want cross-team encryption and deduplication, but not global.
A-priori identification is OK --- e.g. "any email from a .COM domain should not be encrypted and should be deduplicated". But I want these things to be as transparent as possible.
I.e. I want to just plain store data, a message, a block, a file, and have the opportunity for deduplication with shared encryption observed. With rules, but with as much without rules as possible.
One basic approach is to compute the deduplication hashes, compare them to whatever hashes are stored, detect matches and conflicts, and then decide whether to use the keys already stored.
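That flow can be sketched as follows, under my assumption that the index maps a plaintext dedup hash to the identifier of the key already used for that data (all names hypothetical):

```python
import hashlib

def propose_storage(index: dict, plaintext: bytes):
    """index: dedup-hash -> key identifier (hypothetical layout).
    Decide whether to reuse the key already stored for this data, so the
    resulting ciphertext can deduplicate, or to mint a fresh key."""
    h = hashlib.sha256(plaintext).hexdigest()
    if h in index:
        return ("reuse-key", index[h])     # dedup opportunity detected
    key_id = "key-%d" % len(index)         # stand-in for real key generation
    index[h] = key_id
    return ("new-key", key_id)
```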
Computing such storage hashes from plaintext, however, might allow an observer - the storage service - to break user privacy. If the storage service knows the plaintext and observes a user request for a hash, then the storage service can infer the user's plaintext. Even if the storage service doesn't know the plaintext, it could infer that groups of users are sharing a message - e.g. exposing a samizdat network, or a TOR file sharing service.
Steganography... False traffic... Requesting large blocks of hashes (and keys).
Note: while it might seem that the storage service will always be able to detect such sharing if it stores all data, we can imagine systems where a storage service is only storing some of the data. E.g. a user might want to store a file; the user may ask a storage service if it can cheaply deduplicate with that service; but if there is no cost reduction for storing with that service, the user may store the data elsewhere. Hence, the user might ask "How much to store data of size SZ with deduplication hash H?", and the service may respond with a price quote, and, eventually, possibly, keys (if not already implied). (The storage service may not have the private key.) Therefore, the user may want to hide the needle of its single request in a haystack of other requests. And similarly cache the server's hashes, possibly via Bloom filters at various precisions.
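The needle-in-a-haystack request might look something like this (a sketch; every name is illustrative, and the decoys here are just random hashes with plausible sizes):

```python
import hashlib, os, random

def quote_request(data: bytes, decoy_count: int = 7):
    """Build a price-quote batch -- "how much to store size SZ with dedup
    hash H?" -- hiding the one real (size, hash) pair among random decoys,
    so the service cannot tell which request the user actually cares about."""
    real = (len(data), hashlib.sha256(data).hexdigest())
    decoys = [(random.randint(1, 1 << 20),
               hashlib.sha256(os.urandom(32)).hexdigest())
              for _ in range(decoy_count)]
    batch = decoys + [real]
    random.shuffle(batch)    # server sees 8 indistinguishable queries
    return real, batch
```

Only sizes and hashes leave the client; the plaintext never does.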
Let's make progress by assuming that everything is local - either that we trust the storage service, or that we are caching large parts of its hash and key index.
The user has access to a large number of storage pools, each with an encrypt-key.
The user computes deduplication hashes for each of these pools. Detects matches.
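Concretely - under my assumption that each pool's dedup hash is keyed with that pool's encrypt-key via HMAC, so identical plaintext hashes differently per pool and one pool's index leaks nothing about another's:

```python
import hashlib, hmac

def pool_hash(key: bytes, plaintext: bytes) -> str:
    """Per-pool deduplication hash, keyed by the pool's encrypt-key."""
    return hmac.new(key, plaintext, hashlib.sha256).hexdigest()

def find_dedup_matches(pools: dict, plaintext: bytes) -> list:
    """pools: pool-name -> {"key": bytes, "index": set of hex hashes}
    (hypothetical layout). Compute the per-pool dedup hash and return
    the names of the pools that already hold this plaintext."""
    matches = []
    for name, pool in pools.items():
        if pool_hash(pool["key"], plaintext) in pool["index"]:
            matches.append(name)
    return matches
```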
This can detect unanticipated replication. At this point an interactive program might say "Your supposedly top-secret private email is available in the encrypted but widely shared storage pool of Scientology.com, and also on skeptics.net. Do you still want to encrypt it privately so that no deduplication is possible, or do you want to save money and encrypt-deduplicate-store it in one of these shared pools?"
Such interaction is not always possible in synchronous real time. Hence rules, ranging from "share always" to "initially private, but ask me later to deduplicate".
We still want to preserve as much privacy as possible when we ask storage services for deduplication hashes.
IDEA: perhaps we might start by asking for hashes that are deliberately weak and collision-full. These are, after all, a Bloom filter - if there is no match with a weak hash, there will not be a match with a stronger hash, so we would only progress to more exact hashes when there is a chance of a match (and a savings).
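A sketch of that escalation ladder, using truncated SHA-256 as the deliberately collision-full weak hash (the `server_lookup` callback is a hypothetical query interface):

```python
import hashlib

def hash_prefix(data: bytes, bits: int) -> int:
    """Low `bits` bits of SHA-256: a deliberately collision-full hash.
    If two blocks differ at few bits, they also differ at more bits."""
    full = int.from_bytes(hashlib.sha256(data).digest(), "big")
    return full & ((1 << bits) - 1)

def progressive_probe(server_lookup, data: bytes, ladder=(8, 16, 32)) -> bool:
    """Escalate from weak to strong hashes, stopping at the first rung
    with no match. server_lookup(bits, value) -> bool asks whether the
    server's index holds any entry whose `bits`-bit hash equals `value`."""
    for bits in ladder:
        if not server_lookup(bits, hash_prefix(data, bits)):
            return False   # Bloom-filter property: no match now, none later
    return True            # candidate match; still verify full hash / raw bytes
```

Because each rung is a truncation of the same hash, a miss at 8 bits really does rule out a match at 16 or 32, so most non-matching probes stop at the cheapest, least revealing rung.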
Bit-vector Bloom filters with multiple bits set for a match are a nice example. It might not be unreasonable for a user to locally cache megabyte-scale bitvectors of a storage server's hash index. (Although if it becomes really sparse, then this approaches multiple hash functions.)
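Such a locally cached bit vector might look like this (a sketch; the 1 MB size and k=3 are placeholders):

```python
import hashlib

class BloomCache:
    """Client-side mirror of a storage server's hash index as a bit vector.
    False positives are possible (and just trigger a higher-resolution
    query); false negatives are not, so a miss here costs no round-trip."""
    def __init__(self, size_bits: int = 1 << 23, k: int = 3):  # 1 MB vector
        self.size = size_bits
        self.k = k
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: bytes):
        digest = hashlib.sha256(item).digest()
        for i in range(self.k):   # carve k positions out of one hash
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_contains(self, item: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```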
A storage server could game this, reporting false positives at the lower resolutions, just to get the user to request the higher resolution hashes.
We might also filter deduplication requests. E.g. only investigate the possibility of deduplication for longer messages. For messages that resemble human-uttered text, or x86 machine language, or Java byte codes, or for messages that are or are not compressible, that have certain measures of entropy, etc... Yes, I am back to talking about rules - but I am brainstorming as to what rules may be useful.
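As one brainstormed example of such a rule filter (the thresholds are arbitrary placeholders): only probe for deduplication when the message is long enough, not already at compression-like entropy, and actually compressible:

```python
import math, zlib

def worth_dedup_probe(msg: bytes,
                      min_len: int = 256,
                      max_entropy: float = 7.5) -> bool:
    """Rule filter: skip the (expensive) dedup investigation for short
    messages and for data that already looks compressed or encrypted."""
    if len(msg) < min_len:
        return False
    counts = {}
    for b in msg:
        counts[b] = counts.get(b, 0) + 1
    # Shannon entropy in bits per byte; near 8 suggests compressed/encrypted
    entropy = -sum(c / len(msg) * math.log2(c / len(msg))
                   for c in counts.values())
    if entropy > max_entropy:
        return False
    # Compressibility as a second, rough "human-produced data" signal
    return len(zlib.compress(msg)) < len(msg)
```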
Since computing strong hashes for deduplication and other purposes can be expensive, computing multiple such hashes just to decide which encrypted-deduplicated storage pool to use may be very expensive. Back to rules, and possibly cheaper low-resolution hashes to start with.
Good crypto usually doesn't parallelize well. There is usually a serial dependency - although that may not be needed so much for deduplication as it is for encryption and attestation.
But it may be possible to use SIMD-style parallelization when computing multiple such hashes at the same time.
In any case: I like crypto and dedup. I suspect that they will be more and more important in future computer systems.
I wonder what quantum computing does to deduplication hashing, beyond Google's supposedly quantum-proof lattice-based PK.
(All the more reason why
These latent and long-held thoughts were prompted for blogging by the ZFS article saying "compressed, encrypted, checksummed and then deduplicated, in that order" - since what I am talking about here amounts to speculatively investigating the possibility of deduplication before encryption.
Ah, there's the rub - one may use the coarse-grain hashes before encryption, but ultimately one must encrypt and then deduplicate.