Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, November 16, 2011

Reagan Cattledog Links - reverifiable COW BOW links

Thinking about updating shared libraries:

Shared libraries' true advantage is their true disadvantage: multiple users can be updated, whether for good or ill.

Perhaps what we need are shared library linkages that are not automatically updated: an update is marked pending, encouraging the user of the library to update as soon as possible, but the change is not made until the user acts on it.

I am calling this a reverifiable COW link: a link that is broken when somebody else writes to the linked object (hence COW, Copy on Write, or BOW, Break on Write), but which is reverifiable. Retestable. (As one of my friends says, "If you really believe in unit tests..." - I do, he doesn't.)

I would like very much to be able to use the acronym COWBOY instead of COW BOW. But I am humour deprived.

In the meantime I can call them Reagan Cattledog links.  Get it? BOW, as in bowwow, dog.  Reagan, as in "trust, but verify."
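
To make this concrete, here is roughly what a BOW link might look like at the level of files. This is a sketch only: the helper names, the .bow sidecar file, and the choice of sha1sum as the verification step are all mine, not any existing tool.

    #!/bin/sh
    # Sketch of a "break on write" link: remember a checksum of the
    # linked-to object when the link is made, and verify it before use.
    # If the object has been rewritten, the link counts as broken until
    # somebody re-verifies (re-tests) against the new version.

    bowlink_make() {                # bowlink_make TARGET LINKNAME
        ln -s "$1" "$2"
        sha1sum "$1" > "$2.bow"     # remember what we linked against
    }

    bowlink_check() {               # bowlink_check LINKNAME
        if sha1sum -c --quiet "$1.bow"; then
            echo "$1: target unchanged, link still good"
        else
            echo "$1: target changed - update pending, re-verify before trusting" >&2
            return 1
        fi
    }

The "re-verify" step is where the unit tests come in: rerun them against the new version of the target, and if they pass, refresh the recorded checksum.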

---

This is not just for shared libraries. Any sharing. Web pages. Like the "cached link" I have described elsewhere. Cached links are really just COW BOW links that are assumed to be updated when the linkee comes back online.

Shared libraries and data deduplication

People have talked about the advantages of shared libraries: reducing virtual memory requirements, reducing disk space requirements, etc, because of sharing.

Here's a thought: Q: if we had truly ubiquitous data deduplication, what would be the advantages of shared libraries?

A: none of the performance wins through sharing need apply.  Deduplication beats them in a more flexible, more abstract, way.

(Of course, we do not have truly ubiquitous deduplication. And it usually requires things to be block or file aligned.)

This leaves the only fundamental advantage of shared libraries:

  • the fact that you can effect a ubiquitous change by updating a shared library.
Which is also their fundamental disadvantage.  You can propagate a bug fix.  But you can also propagate bugs.

Modules(1)

So I read Furlani et al.'s papers on "Modules".

Modules is not so bad - it is a nice way of dealing with an annoying problem.

Or, rather, Modules may be the best possible way of dealing with environment dependent code.  But it might be better to discourage environment dependent code in the first place.  See my earlier post about environment dependent code being a dominant meme.

--


Minor observation: I would like to bring some non-Modules source scripts "kicking and screaming into the '90s with Modules". I would like to simply wrapperize some existing legacy code that requires you to "set some environment variables and then source foo". I.e., I don't want to rewrite foo - I would just like to wrap it in a module.

Modules does not seem to be able to do this.

Although it looks as if it would only be a minor extension to modules to handle it.
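
What I have in mind is only a little machinery: source the legacy script once in a throwaway shell, diff the environment before and after, and turn the difference into modulefile setenv commands. A rough sketch, assuming the legacy script is sh-compatible (a csh source-script would need a csh subshell instead), with made-up file names; note that it only captures plain environment variables, not aliases or shell functions.

    #!/bin/sh
    # Generate modulefile commands from a legacy "set some environment
    # variables and then source foo" script, by diffing the environment
    # before and after sourcing it.

    LEGACY=/path/to/legacy-setup.sh     # hypothetical legacy source-script

    env | sort > /tmp/env.before
    . "$LEGACY"                         # let the legacy script mutate this shell
    env | sort > /tmp/env.after

    # Every variable that was added or changed becomes a Tcl "setenv"
    # line that a modulefile could carry instead of "source foo".
    comm -13 /tmp/env.before /tmp/env.after |
    while IFS='=' read -r name value; do
        printf 'setenv %s {%s}\n' "$name" "$value"
    done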

To do foo, start off in a new window

How many times have you seen "How to" directions begin:

To do foo

  1. Start in a fresh xterm
  2. Start in a fresh shell (typically csh)
  3. Log out and log in again so that you get a clean environment
etc.

While this may be good advice - certainly good advice for debugging brokenness and/or avoiding bugs in the first place - it is basically an admission that something is not right.

Some tool depends on the environment in weird ways. Possibly not just the standard UNIX environment string; possibly also the extended shell environment.

Tools should be callable from almost arbitrary environments.  They should not DEPEND on environment variables. It may be acceptable to USE environment variables to change some behaviors, but, consider: if you had a setuid script, it would probably be unwise to depend on environment variables.  Environment variables should be considered tainted.

I suppose my version of the above is to say

To do foo

  1. Empty all of your environment variables, and start off with a minimum environment
  2. Type the absolute path to the tool, /a/b/.../c/d/tool
IMHO tools should work when invoked like this. If they are using the equivalent of Perl FindBin, they should be able to locate all of the relevant library files, etc., they need. Or else they should have, embedded in them, the paths to same.
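
Concretely, I mean an invocation along these lines (the tool path here is made up):

    # Invoke the tool by absolute path, with a nearly empty environment -
    # just a minimal PATH.  A well-behaved tool should still work.
    env -i PATH=/usr/bin:/bin /opt/foo/bin/tool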

GLEW OPINION: much of the reason for environment abuse is the broken, non-object oriented, UNIX installation model, where a tool may be put in /bin, its libraries in /usr/lib, etc. - where the directories are not known at build time. PATH, LIBPATH, MANPATH. FindBin can live with this - a FindBin script can be relocated by copying - so long as the relative locations of what it depends on are maintained.
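
For what it's worth, the FindBin pattern translates to plain shell roughly as follows (the libexec layout and the helper library name are hypothetical):

    #!/bin/sh
    # Locate this script's own directory at run time and find its support
    # files relative to it - rather than via PATH, LIBPATH, MANPATH, or
    # other environment variables.  The tree can then be relocated by
    # copying, as long as the relative layout of bin/ and libexec/ is kept.
    BIN_DIR=$(CDPATH= cd -- "$(dirname -- "$0")" && pwd)
    . "$BIN_DIR/../libexec/mylib.sh"    # hypothetical helper library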

source scripts

I am going to try calling the sort of shell command file that must be source'd, i.e. read into the user's shell so that it can modify the environment, a "source-script".

As opposed to a "shell-script" or other command which is usually executed as a subprocess.

UNIX-style subprocess execution of commands provides isolation.  The parent is unaffected by the child, except for interactions through the filesystem.  (Although with the /proc filesystem, or the child applying a debugger to the parent, that could be significant.)

Whereas, consider a csh style source-script. It can be changing directories all over the place to get its work done. And it may terminate with an error before finishing properly. So the "caller" of the source-script may not know what directory he is in after the source-script terminates.

Q:  how many people do:

set saved_dir=`pwd`
source srcscript.csh
cd $saved_dir

And, of course, even that has obvious bugs: it only restores the working directory, not anything else the source-script changed, and the lack of quoting mishandles directory names containing spaces.

environment setting a dominant meme?

Thinking about why I go through this paroxysm of disgust whenever I encounter a toolchain that depends on environment variables.  Like Modules or modulefiles(1). Like many CAD tools.

This morning it struck me: they are a dominant meme.  An evolutionarily stable strategy.

Not because environment based tools are better.

But because cleanly written stuff, like I try to write, can be called pretty safely from anywhere. Whereas stuff that does what I consider unclean environment modifications cannot be called so easily from other code. It can call other code, but it is hard to be called from other code. So there is a tendency for users to just give in, and write in csh (since csh is so often the language associated with such environment dependent tools).

Sure, you can try to write code that prints the environment and which then gets called. However, this only catches the UNIX environment - modulefiles(1) rely on side effects in the extended shell environment: shell functions and aliases. You could print these, but you would have to parse them to pass them to a different language, or at least reread them if passing them to a later, compatible shell.
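
In sketch form (the dump file names are arbitrary, and the alias dump is shown as comments because it has to happen inside the shell that owns the aliases):

    # The easy part: the plain UNIX environment can be dumped and parsed
    # from any language.
    env | sort > /tmp/unix-env.dump

    # The hard part: aliases and shell functions live in the extended
    # shell environment.  From inside the csh session that defined them
    # you can print them:
    #     alias > /tmp/csh-aliases.dump
    # but to reuse that dump you must either parse csh syntax in the
    # other language, or re-read it from a later, compatible csh.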

Bandaid.

The best way to work with such tools is to start a persistent subprocess, pass it commands, and interpret the results. Expect style. Coroutines. Which is doable, but is more complex than ordinary function calls / UNIX style subprocess invocations.
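
For the record, the bare-bones version of that persistent-subprocess approach, without Expect, looks something like this. The setup-script path and the SOME_TOOL_HOME variable are made up, and a real implementation would want Expect's pattern matching and timeouts rather than this blind READY handshake.

    #!/bin/sh
    # Keep one environment-dependent csh session alive, feed it commands
    # through one FIFO, and read its answers back through another.

    req=$(mktemp -u) ; resp=$(mktemp -u)
    mkfifo "$req" "$resp"

    # The long-lived csh session.
    csh -f < "$req" > "$resp" 2>&1 &

    exec 3> "$req"      # keep the command pipe open for the whole session
    exec 4< "$resp"     # likewise the result pipe

    # Have the session set up its environment, then signal readiness.
    # (Assumes the setup script itself prints nothing.)
    echo 'source /path/to/env-setup.csh' >&3
    echo 'echo READY'                    >&3
    read -r line <&4

    # Now query state from inside that environment.
    echo 'printenv SOME_TOOL_HOME' >&3
    read -r tool_home <&4
    echo "tool home is: $tool_home"

    exec 3>&- 4<&-      # closing the pipes lets the csh session exit
    rm -f "$req" "$resp"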