The content of this blog is my personal opinion only. Although I am an employee - currently of Imagination Technologies's MIPS group, in the past of other companies such as Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Friday, July 22, 2016

Perl prototypes quite useless

I already knew this - but to emphasize the point.

perlsub - perldoc.perl.org: "prototypes have no influence on subroutine references like \&foo or on indirect subroutine calls like &{$subref} or $subref->()"

"Method calls are not influenced by prototypes either, because the function to be called is indeterminate at compile time"

"the intent of this feature is primarily to let you define subroutines that work like built-in functions"


Apple Reminders - at place, on day, but not before time (NOT)

Apple's iPhone / MacOS reminders with time and place sound like a good idea - but they have been so damned annoying, firing at the wrong time, that I have been disabling them.

It's the usual "AND/OR" confusion.  The usual UI/UX designer hubris: "Users want to keep it simple, so we won't give them the feature that I, the omniscient UI/UX designer, can't imagine needing in my limited imagination".  Self-fulfilling prophecy.

Use Reminders on your iPhone, iPad, or iPod touch - Apple Support: "With Reminders, you can set notifications that alert you when reminders are due, or when you arrive or leave a location."
Investigating the annoying notifications that occur when I do not want them:

E.g. I want to be reminded to pay bills on the weekend: typically any time after 6pm in the evening, or whenever I arrive home on the weekend.

But Apple keeps reminding me at the wrong time - for example, I get reminded to pay bills while I am working during the day on Friday.

I would insert a screen clipping, but Blogger does not make it easy to upload images.  (In fact, I think that Blogger only uploads URLs of images, i.e. by reference, not by value.)

The MacOS reminder dialog says

remind me
[x] on a day (with day and time, which I have set to 6pm)
[x] at a location
  [ ] Leaving [x] Arriving
On DATE you will be reminded when you arrive at this location, or by 6pm that day at the latest.
repeat  Every Week

So I can't complain too much.

  • The documentation correctly says "or", both in the MacOS dialog and in the web page quoted above.
  • But in the iPhone screen, there is no "OR".

The iPhone Reminder app screen looks like

Pay bills

Remind me on a day YES
Alarm Fri, date, time
Repeat WEEKLY 
Remind me at a location YES
Location  Arriving: Home

I think that I assumed that these separate controls, on a day and at a location, were ANDed together, because such ANDing is, I think, more common in user interfaces.  E.g. most email rule systems AND the conditions, at least in the first generation.

Of course, what I really want is dependency-based and event-based.  (Hmm, look at some temporal and event-based notations.)

  • AT or AFTER date-time
    • create new reminder one week later
    • DO: remind me at that date-time
      • WHEN arriving home DO: remind me

Of course, I might have wanted the slightly different structure

  • AT or AFTER date-time
    • create new reminder one week later
    • DO: remind me at that date-time IF I am at home
      • WHEN arriving home DO: remind me

We really need standard languages and notations for such compound constraints.

Even at the level of code.
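
At the level of code, one possibility is a handful of composable predicates. This is only a minimal sketch: the names Context, after_hour, on_weekend, arriving_home, all_of and any_of are all made up for illustration, and nothing here reflects how Apple's Reminders actually evaluates its rules. It just shows the AND reading, the OR reading, and the rule I wanted, side by side.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Callable


@dataclass(frozen=True)
class Context:
    """What a (hypothetical) reminder engine knows when it evaluates a rule."""
    now: datetime
    arriving_home: bool


Condition = Callable[[Context], bool]


def after_hour(hour: int) -> Condition:
    return lambda ctx: ctx.now.hour >= hour


def on_weekend() -> Condition:
    # datetime.weekday(): Monday is 0, so Saturday is 5 and Sunday is 6.
    return lambda ctx: ctx.now.weekday() >= 5


def arriving_home() -> Condition:
    return lambda ctx: ctx.arriving_home


def all_of(*conds: Condition) -> Condition:
    return lambda ctx: all(c(ctx) for c in conds)


def any_of(*conds: Condition) -> Condition:
    return lambda ctx: any(c(ctx) for c in conds)


# The two readings of "on a day" combined with "at a location".
and_reading = all_of(after_hour(18), arriving_home())
or_reading = any_of(after_hour(18), arriving_home())

# What I wanted: weekend only, and then either trigger will do.
wanted = all_of(on_weekend(), any_of(after_hour(18), arriving_home()))

friday_noon_arriving = Context(datetime(2016, 7, 22, 12, 0), arriving_home=True)
saturday_evening = Context(datetime(2016, 7, 23, 19, 0), arriving_home=False)
```

With these definitions, or_reading fires for friday_noon_arriving while wanted does not, which is the kind of unwanted reminder I was complaining about; the nesting of all_of and any_of makes the grouping explicit in a way two independent switches cannot.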

But also at the level of UI.

Possibly graphical - e.g. nested boxes.

ZFS - Wikipedia, the free encyclopedia

ZFS - Wikipedia, the free encyclopedia: "With Oracle Solaris, the encryption capability in ZFS[32] is embedded into the I/O pipeline. During writes, a block may be compressed, encrypted, checksummed and then deduplicated, in that order." 
Everything should be encrypted by default.  For individuals, on a per-user basis, perhaps at finer granularity for particular roles and sensitive data.  For companies, perhaps team- or organization-wide.

But per-user encryption loses the ability to deduplicate across users.  Cross-user deduplication is certainly desirable to save storage when there is replication of data, and possibly also to just plain observe such replication, as in spam. Consider an email server like Gmail.

I am going to make the simplifying assumption that deduplication only applies to data encrypted with the same key.  It is possible that different plaintexts, under different keys, may encrypt to the same ciphertext, and therefore share storage via deduplication.  But I will mostly ignore this possibility, and certainly not optimize for it. (Although - I am one of those belt-and-suspenders guys who prefers to actually verify that deduplicated data matches at both the hash and the raw level, rather than just hoping that the hash does not collide.  In my deduplication work (as part of DVCS, merge-rcs-files) I have deliberately weakened the hash to increase collisions in order to test this.  I would do the same thing for encrypted deduplicated data.)
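
To illustrate the belt-and-suspenders point, here is a minimal sketch of a deduplicating store that compares raw bytes whenever hashes match, with a knob to weaken the hash. It is not the merge-rcs-files code; the class name VerifyingDedupStore and the hash_bytes parameter are hypothetical, and SHA-256 truncation is just one convenient way to force collisions.

```python
import hashlib


class VerifyingDedupStore:
    """Dedup store that never trusts the hash alone (illustrative sketch)."""

    def __init__(self, hash_bytes: int = 32):
        # hash_bytes < 32 deliberately weakens the hash to provoke collisions.
        self.hash_bytes = hash_bytes
        self.buckets = {}  # truncated digest -> list of distinct blocks

    def _digest(self, block: bytes) -> bytes:
        return hashlib.sha256(block).digest()[: self.hash_bytes]

    def put(self, block: bytes):
        """Store a block; return (handle, deduplicated?)."""
        digest = self._digest(block)
        bucket = self.buckets.setdefault(digest, [])
        for i, existing in enumerate(bucket):
            if existing == block:  # raw-level verification
                return (digest, i), True
        # Same hash but different bytes (or a new hash): keep a separate copy.
        bucket.append(block)
        return (digest, len(bucket) - 1), False

    def get(self, handle) -> bytes:
        digest, i = handle
        return self.buckets[digest][i]
```

With hash_bytes=1 there are only 256 possible digests, so storing a thousand distinct blocks is guaranteed to collide, and every block should still read back intact. That is the test I would want to pass before trusting the full-strength hash.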

I am also going to ignore systems where the deduplication service has access to all keys.  Single point of failure.

Obviously: we want cross-user deduplication for less-sensitive stuff, but not for sensitive stuff.  It is tempting to say no-encryption for non-sensitive stuff.

But this nests: we may want cross-team encryption and deduplication, but not global.

A-priori identification is OK, e.g. "any email from a .COM domain should not be encrypted and should be deduplicated".  But I want these things to be as transparent as possible.

I.e. I want to just plain store data (a message, a block, a file) and have any opportunity for deduplication with shared encryption be noticed.  With rules where needed, but with as much as possible happening without rules.

One basic approach is to compute the deduplication hashes, compare them to whatever hashes are already stored, detect matches, and then decide whether to use the keys already stored.

Computing such storage hashes from plaintext, however, might allow an observer, such as the storage service, to break user privacy.  If the storage service knows the plaintext and observes a user request for its hash, then the storage service can infer the user's plaintext.  Even if the storage service doesn't know the plaintext, it could infer that groups of users are sharing a message - e.g. exposing a samizdat network, or a Tor file sharing service.

Possible countermeasures: steganography...   false traffic...  requesting large blocks of hashes (and keys).

Note: while it might seem that the storage service will always be able to detect such sharing, if it stores all data, we can imagine systems where a storage service is only storing some of the data.  E.g. a user might want to store a file; the user may ask a storage service if it can cheaply deduplicate with that service; but if there is no cost reduction for storing with that service, the user may store the data elsewhere.   Hence, the user might ask "How much to store data of size SZ with deduplication hash H?", and the service may respond with a price quote, and, eventually, possibly, keys (if not already implied). (The storage service may not have the private key.)   Therefore, the user may want to hide the needle of its single request in a haystack of other requests. And similarly cache the server's hashes, possibly via Bloom filters at various precisions.
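
The needle-in-a-haystack idea might look something like the following. This is a rough sketch under loud assumptions: build_query_batch is a hypothetical helper, the decoys are just random bytes, and a real service could probably tell random decoys from hashes of plausible content, so genuine cover traffic would need more thought than this.

```python
import hashlib
import random
import secrets


def build_query_batch(real_hash: bytes, decoys: int) -> list:
    """Mix the one hash we care about into a shuffled batch of random decoys."""
    batch = [secrets.token_bytes(len(real_hash)) for _ in range(decoys)]
    batch.append(real_hash)
    # SystemRandom draws from the OS entropy source rather than a seeded PRNG.
    random.SystemRandom().shuffle(batch)
    return batch


# Example: ask about one message, hidden among 99 decoy hashes.
real = hashlib.sha256(b"the message I actually want to store").digest()
batch = build_query_batch(real, decoys=99)
```

The client then sends the whole batch as price-quote requests and only acts on the answer for the real hash. The cost is bandwidth and server work proportional to the number of decoys.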

Let's make progress by assuming that everything is local - either that we trust the storage service, or that we are caching large parts of its hash and key index.

The user has access to a large number of storage pools, each with an encrypt-key.

The user computes deduplication hashes for each of these pools.  Detects matches.
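
A minimal sketch of that step, assuming (my assumption, not something settled above) that each pool's deduplication hash is a keyed hash of the plaintext, here HMAC-SHA256 under a per-pool key, and that the pool's hash index is cached locally as a set. The names pool_hash and find_matching_pools are hypothetical.

```python
import hashlib
import hmac


def pool_hash(pool_key: bytes, data: bytes) -> bytes:
    """Per-pool deduplication hash: HMAC-SHA256 keyed with the pool's key."""
    return hmac.new(pool_key, data, hashlib.sha256).digest()


def find_matching_pools(data: bytes, pools: dict) -> list:
    """Return names of pools whose cached index already contains this data.

    pools maps pool name -> (pool key, set of hashes known to be stored).
    """
    return [
        name
        for name, (key, index) in pools.items()
        if pool_hash(key, data) in index
    ]


# Example: two pools, one of which already holds the message.
message = b"supposedly top-secret private email"
team_key, public_key = b"team-pool-key", b"public-pool-key"
pools = {
    "team": (team_key, {pool_hash(team_key, b"something else")}),
    "public": (public_key, {pool_hash(public_key, message)}),
}
matches = find_matching_pools(message, pools)
```

Because the hash is keyed per pool, the same plaintext hashes differently in each pool, which is also why this costs one hash computation per pool.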

This can detect unanticipated replication.  At this point an interactive program might say "Your supposedly top-secret private email is available in the encrypted but widely shared storage pool of Scientology.com, and also on skeptics.net.  Do you still want to encrypt it privately so that no deduplication is possible, or do you want to save money and encrypt-deduplicate-store it in one of these shared pools?"

Such interaction is not always possible in synchronous real time.  Hence rules, ranging from "share always" to "initially private, but ask me later to deduplicate".

We still want to preserve as much privacy as possible when we ask storage services for deduplication hashes.

IDEA:  perhaps we might start by asking for hashes that are deliberately weak and collision-prone.   These behave, after all, like a Bloom filter - if there is no match with a weak hash, there will not be a match with a stronger hash, so we would only progress to more exact hashes when there is a chance of a match (and a saving).
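
One way to get a family of weak-to-strong hashes is simply to truncate a strong one. The sketch below assumes that, and models the server's index as a local set of full SHA-256 digests; progressive_lookup and its steps parameter are hypothetical names. It stops at the first prefix length with no match, so a miss reveals only a short prefix of the hash.

```python
import hashlib


def progressive_lookup(data: bytes, index: set, steps=(1, 2, 4, 32)):
    """Query with longer and longer digest prefixes; stop at the first miss.

    Returns (matched, prefix_len), where prefix_len is the number of digest
    bytes that had to be revealed to reach the answer.
    """
    digest = hashlib.sha256(data).digest()
    for n in steps:
        prefix = digest[:n]
        if not any(stored.startswith(prefix) for stored in index):
            return False, n  # no candidate left: a stronger hash cannot match
    return True, steps[-1]


index = {hashlib.sha256(b"stored message").digest()}
hit = progressive_lookup(b"stored message", index)
miss = progressive_lookup(b"some other message", index)
```

The linear scan over the index is only for clarity; a real index would be organized by prefix.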

Bit-vector Bloom filters with multiple bits set for a match are a nice example.  It might not be unreasonable for a user to locally cache megabyte-scale bitvectors of a storage server's hash index.   (Although if it becomes really sparse, then this approaches multiple hash functions.)
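
For concreteness, a minimal bit-vector Bloom filter of the kind a client might cache. This is a textbook-style sketch, not any particular storage service's format: the class name, the use of salted SHA-256 to derive bit positions, and the sizes in the example are all my own choices.

```python
import hashlib


class BloomFilter:
    """Bit-vector Bloom filter: false positives possible, false negatives not."""

    def __init__(self, num_bits: int, num_hashes: int):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Derive num_hashes bit positions by salting SHA-256 with a counter.
        for i in range(self.num_hashes):
            d = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
            yield int.from_bytes(d[:8], "big") % self.num_bits

    def add(self, item: bytes) -> None:
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


# A one-megabyte bit vector (8 * 2**20 bits), four bits set per entry.
cache = BloomFilter(num_bits=8 * 2**20, num_hashes=4)
server_hashes = [hashlib.sha256(b"msg %d" % i).digest() for i in range(1000)]
for h in server_hashes:
    cache.add(h)
```

A "no" from might_contain is definitive, so the client can skip asking the server at all; only a "maybe" is worth a more precise (and more revealing) query.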

A storage server could game this, reporting false positives at the lower resolutions, just to get the user to request the higher resolution hashes.

We might also filter deduplication requests.  E.g. only investigate the possibility of deduplication for longer messages.   Or for messages that resemble human-written text, or x86 machine language, or Java byte codes, or for messages that are or are not compressible, that have certain measures of entropy, etc...  Yes, I am back to talking about rules - but I am brainstorming as to what rules may be useful.
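
As one example of such a rule, here is a sketch of a length-plus-entropy filter. The function names, the 4096-byte minimum and the 7.5 bits-per-byte ceiling are arbitrary placeholders, and whether high-entropy data should be excluded or preferred is itself a policy choice; the only solid part is the Shannon entropy calculation.

```python
import math
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte distribution, between 0.0 and 8.0."""
    if not data:
        return 0.0
    n = len(data)
    return -sum((c / n) * math.log2(c / n) for c in Counter(data).values())


def worth_checking(data: bytes, min_len: int = 4096, max_entropy: float = 7.5) -> bool:
    """Hypothetical rule: only long, not-too-random messages are looked up."""
    return len(data) >= min_len and entropy_bits_per_byte(data) <= max_entropy
```

Entropy over single bytes is a crude measure; it says nothing about longer-range structure, so it is a cheap first filter rather than a compressibility test.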

Since computing strong hashes for deduplication and other purposes can be expensive, computing multiple such hashes to decide which encrypted-deduplicated storage pool to use may be very expensive.  Back to rules, and possibly cheaper low-resolution hashes to start with.

Good crypto usually doesn't parallelize well.   There is usually a serial dependency - although that may not be needed so much for deduplication as it is for encryption and attestation.

But it may be possible to use SIMD-style parallelization when computing multiple such hashes at the same time.


In any case: I like crypto and dedup.   I suspect that they will be more and more important in future computer systems.

I wonder what quantum computing does to deduplication hashing, beyond Google's supposedly quantum-proof lattice-based PK.

(All the more reason why


These latent and long held thoughts were prompted for blogging by the ZFS article saying "compressed, encrypted, checksummed and then deduplicated, in that order".

What I am talking about here amounts to speculatively investigating the possibility of deduplication before encryption.

Ah, there's the rub - we may use the coarse-grain hashes before encryption, but ultimately must encrypt and then deduplicate.