Disclaimer

The content of this blog is my personal opinion only. Although I am an employee - currently of Nvidia, in the past of other companies such as Imagination Technologies, MIPS, Intellectual Ventures, Intel, AMD, Motorola, and Gould - I reveal this only so that the reader may account for any possible bias I may have towards my employer's products. The statements I make here in no way represent my employer's position, nor am I authorized to speak on behalf of my employer. In fact, this posting may not even represent my personal opinion, since occasionally I play devil's advocate.

See http://docs.google.com/View?id=dcxddbtr_23cg5thdfj for photo credits.

Wednesday, November 16, 2011

Reagan Cattledog Links - reverifiable COW BOW links

Thinking about updating shared libraries:

Shared libraries' true advantage is their true disadvantage: multiple users can be updated, whether for good or ill.

Perhaps what we need are shared library linkages that are not automatically updated: an update is marked pending, encouraging the user of the library to update as soon as possible, but the change is not applied automatically.

I am calling this a reverifiable COW link. A link that is broken when somebody else writes to the linked object (hence COW, Copy on Write, or BOW, Break on Write). But which is reverifiable. Retestable. (As one of my friends says, "If you really believe in unit tests..." - I do, he doesn't.)

I would like very much to be able to have the acronym COWBOY instead of COW BOW. But I am humour deprived.

In the meantime I can call them Reagan Cattledog links.  Get it? BOW, as in bowwow, dog.  Reagan, as in "trust, but verify."

---

This is not just for shared libraries. Any sharing. Web pages. Like the "cached link" I have described elsewhere. Cached links are really just COW BOW links which are assumed to be updated when the linkee comes back online.

Shared libraries and data deduplication

People have talked about the advantages of shared libraries: reducing virtual memory requirements, reducing disk space requirements, etc., because of sharing.

Here's a thought: Q: if we had truly ubiquitous data deduplication, what would be the advantages of shared libraries?

A: none of the performance wins through sharing need apply. Deduplication beats them in a more flexible, more abstract way.

(Of course, we do not have truly ubiquitous deduplication. And it usually requires things to be block or file aligned.)

This leaves the only fundamental advantage of shared libraries:

  • the fact that you can effect a ubiquitous change by updating a shared library.
Which is also their fundamental disadvantage.  You can propagate a bug fix.  But you can also propagate bugs.

Modules(1)

So I read Furlani et al.'s papers on "Modules".

Modules is not so bad - it is a nice way of dealing with an annoying problem.

Or, rather, Modules may be the best possible way of dealing with environment dependent code.  But it might be better to discourage environment dependent code in the first place.  See my earlier post about environment dependent code being a dominant meme.

--


Minor observation: I would like to bring some non-Modules source scripts "kicking and screaming into the '90s with Modules". I would like to simply wrapperize some existing legacy code that requires you to "set some environment variables and then source foo". I.e. I don't want to rewrite foo - I would just like to wrap it in a module.

Modules does not seem to be able to do this.

Although it looks as if it would be only a minor extension to Modules to handle it.
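
For what it's worth, here is roughly the wrapper I have in mind, improvised outside of Modules: source the legacy setup in a throwaway csh, diff the environment, and emit modulefile-style setenv lines. The path to foo is a placeholder, and this is my sketch, not an existing Modules feature.

#!/bin/bash
# Sketch only: capture what "source foo" does to the environment,
# and emit modulefile-style setenv lines. /path/to/foo is a placeholder.
legacy=/path/to/foo
before=$(csh -f -c 'env' | sort)
after=$(csh -f -c "source $legacy >/dev/null; env" | sort)
# comm -13 keeps lines unique to the second listing: new or changed variables
comm -13 <(printf '%s\n' "$before") <(printf '%s\n' "$after") |
while IFS='=' read -r name value; do
    printf 'setenv %s "%s"\n' "$name" "$value"
done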

To do foo, start off in a new window

How many times have you seen "How to" directions begin:

To do foo

  1. Start in a fresh xterm
  2. Start in a fresh shell (typically csh)
  3. Log out and log in again so that you get a clean environment
etc.

While this may be good advice - certainly good advice for debugging brokenness and/or avoiding bugs in the first place - it is basically an admission that something is not right.

Some tool depends on the environment in weird ways. Possibly not just the standard UNIX environment strings; possibly also the extended shell environment.

Tools should be callable from almost arbitrary environments.  They should not DEPEND on environment variables. It may be acceptable to USE environment variables to change some behaviors, but, consider: if you had a setuid script, it would probably be unwise to depend on environment variables.  Environment variables should be considered tainted.
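
A sketch of the difference between USING and DEPENDING, with a hypothetical TOOL_TMPDIR override that is validated before being trusted:

# Sketch: treat the environment as tainted. TOOL_TMPDIR is a hypothetical
# override - used if sane, ignored otherwise, never simply trusted.
tmpdir=${TOOL_TMPDIR:-/tmp}
case $tmpdir in
    /*) ;;                                  # require an absolute path
    *)  echo "ignoring suspicious TOOL_TMPDIR=$tmpdir" >&2
        tmpdir=/tmp ;;
esac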

I suppose my version of the above is to say

To do foo

  1. Empty all of your environment variables, and start off with a minimum environment
  2. Type the absolute path to the tool, /a/b/.../c/d/tool
IMHO tools should work when invoked like this. If they are using the equivalent of Perl FindBin, they should be able to locate all of the relevant library files, etc., that they need. Or else they should have, embedded in them, the paths to same.

GLEW OPINION: much of the reason for environment abuse is the broken, non-object-oriented UNIX installation model, where a tool may be put in /bin, its libraries in /usr/lib, etc. - where the directories are not known at build time. PATH, LIBPATH, MANPATH. FindBin can live with this - a FindBin script can be relocated by copying - so long as the relative locations of what it depends on are maintained.
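
A bash tool can get the same property in a couple of lines; a minimal sketch, assuming the libraries live in a lib/ directory next to the tool:

#!/bin/bash
# FindBin-style self-location: the tool finds its own directory and loads
# libraries relative to itself, so the whole tree can be relocated by
# copying. No PATH or LIBPATH needed. The lib/ layout is an assumption.
bindir=$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)
. "$bindir/../lib/common.sh"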

source scripts

I am going to try calling the sort of shell command file that must be source'd, i.e. read into the user's current shell so that it can modify the environment, a "source-script".

As opposed to a "shell-script" or other command which is usually executed as a subprocess.

UNIX-style subprocess execution of commands provides isolation.  The parent is unaffected by the child, except for interactions through the filesystem.  (Although with the /proc filesystem, or the child applying a debugger to the parent, that could be significant.)

Whereas, consider a csh style source-script. It can be changing directories all over the place to get its work done. And it may terminate with an error before finishing properly. So the "caller" of the source-script may not know what directory he is in after the source-script terminates.

Q:  how many people do:

set saved_dir=`pwd`
source srcscript.csh
cd $saved_dir

And, of course, even that has obvious bugs.
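
Spelling a few of them out, in bash for concreteness:

saved_dir="$(pwd)"     # the source-script may clobber saved_dir itself
source srcscript.sh    # may fail halfway, leaving a partial environment
cd "$saved_dir"        # never runs if the script exits the shell, or is interrupted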

environment setting a dominant meme?

Thinking about why I go through this paroxysm of disgust whenever I encounter a toolchain that depends on environment variables.  Like Modules or modulefiles(1). Like many CAD tools.

This morning it struck me: they are a dominant meme.  An evolutionarily stable strategy.

Not because environment based tools are better.

But because cleanly written stuff, like I try to write, can be called pretty safely from anywhere. Whereas stuff that does what I consider unclean environment modifications cannot be called so easily from other code. It can call other code, but it is hard to be called from other code. So there is a tendency for users to just give in, and write in csh (since csh is so often the language associated with such environment dependent tools).

Sure, you can try to write code that prints the environment and which then gets called. However, this only catches the UNIX environment - modulefiles(1) rely on side effects in the extended shell environment, shell functions and aliases. You could print these, but would have to parse them to pass to a different language, or at least reread them if passing to a later compatible shell.

Bandaid.

The best way to work with such tools is to start a persistent subprocess, pass it commands, and interpret the results. Expect style. Coroutines. Which is doable, but is more complex than ordinary function calls / UNIX style subprocess invocations.
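
In bash, a coprocess gets most of the way there. A sketch; the sentinel protocol and the setup.csh path are my own inventions, and it assumes csh is installed:

#!/bin/bash
# Keep one persistent csh alive, Expect-style: send it commands, read its
# output up to a sentinel, then interrogate its environment in place.
coproc CSH { csh -f 2>&1; }
send() { printf '%s\n' "$*" >&"${CSH[1]}"; }
expect_done() {
    local line
    while IFS= read -r line <&"${CSH[0]}"; do
        [ "$line" = __DONE__ ] && return 0
        printf '%s\n' "$line"
    done
}
send 'source /path/to/setup.csh'   # placeholder source-script
send 'echo __DONE__'
expect_done                        # any output from the source-script
send 'printenv'                    # the child's environment, post-source
send 'echo __DONE__'
expect_done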

Sunday, November 13, 2011

calling a function to change the environment

I think that what I really object to in tools that depend on environment variables is that it is hard to put a wrapper around environment variables. I.e. it is hard to "call a function" to set environment variables.

However, I have to parse out what I mean by the terms above.
      Remember: I am talking about scripting, in shell and Perl, etc.
      "Calling a function" in a shell script usually means executing a child process. And, by definition, a child process dos not pass environment variables back to its parent.
      Now, of course, in every scripting language worth its salt, it is possible to write a function that sets environment variables. But that's a function in the language you are running in. It's a lot harder to have, e.g. perl or bash call a csh function, and have that child csh set environment variables in the parent perl or bash.
      Similarly, you can source a file in csh, or "." it in bash, and hae some degree of modularization. But again it is same language.

Why do I care? Mainly because I am writing stuff in bash or perl or python, and I want to get whatever environment variables legacy csh scripts set up.
      But, in general, you lose abstraction if you are constrained to call functions only written in your current language. Or, even then, if only callable via a special syntax, e.g. csh's "source file" rather than just executing file.
      Loss of abstraction. But, requiring a special syntax provides a bit of security. You can tell who is side effectful, and who is not. Pure vs impure functions.

My clear-env script does things oppositely - it allows me to call a script in a subshell with a clean environment, but I don't necessarily see what it set up in the environment.
      Similarly, my friend's trick where bash can get a csh script's environment by doing something like
csh -c "source module; exec bash"
is NOT a "function call". It's more like continuations.

Part of the trouble is that the whole point of processes is to isolate the parent from the child.
    Except that here, the whole point is to get access to the child's side effects.

I think that I may need to create a family of scripts in various languages that execute or source or whatever, and then, at the end, printenv into a file or stdout - so that the parent can parse the printenv.
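
For the bash-calling-csh case the sketch is short. The function name and the path are mine; env -0 is GNU coreutils, used so that multi-line values survive the round trip:

#!/bin/bash
# Import into bash whatever environment variables a csh source-script sets:
# source it in a child csh, dump the environment NUL-delimited, re-export.
env_from_csh() {
    local src=$1 pair
    while IFS= read -r -d '' pair; do
        export "$pair"
    done < <(csh -f -c "source $src >/dev/null; env -0")
}
env_from_csh /path/to/legacy-setup.csh    # placeholder path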

A better way would be to do something like stop the child process in a debugger - and then have the parent look through the child's environment with the debugger.

---
I haven't even touched on other side effects.  Not just the environment, but file descriptors.

E.g. "I have a bash script that wants to call a csh script that redirects stdout and stderr - and then have the parent bash script use those settings".

Saturday, November 12, 2011

Why depend on UNIX environment variables?

Previously I have posted about how I find tools that depend on sourcing csh files to get environment variables set a real pain to deal with. Largely because I don't have good ways of automating such procedures, both for use and test. And also because of environment variable interference.

So, I ask myself: Q: why depend on environment variables?

Then I remember how excited I was to learn of shell environment variables.

By setting an environment variable I could arrange for a parameter to be passed from the outside world right into the middle of a tool. Without having to modify any of the intervening levels.

No need to create argument parsing code.

In fact, in Bell Labs type shells, argument parsers are implicitly created for environment variables:
VAR1=val1 VAR2=val2 program options...
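
Concretely, and with a hypothetical variable name:

# The parameter tunnels through every intervening layer untouched:
DEBUG_LEVEL=3 make test        # no new flags in make or any script it runs
# ...and deep inside some leaf script, with a default:
: "${DEBUG_LEVEL:=0}"
[ "$DEBUG_LEVEL" -ge 2 ] && set -x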

I hate (csh) environment based tools

I hate tools that have heavy, undocumented, environment dependencies.

csh scripts seem to be the classic example. Beware of anything that says
source file1
source file2
do this
source file3
do that

where the csh files you are source'ing mainly act by setting up environment variables, but also may act by side effects such as cd'ing.

---

Why do I hate these things? They are hard to automate. Especially to automatically test.

Really, to automate I need to insert checks into the script above after each step. At least if it is flakey. (If it is not flakey and is all working, I don't care what it does. As long as I can put it in a black box, and don't have to let its side effects escape.)

---


Why do I hate these things?

So often they are designed for interactive use. And interfere with other stuff you may be using interactively.

Oftentimes I need to fall back to removing all of my interactive customizations to get something like this working in a clean environment.

---

I have a script I call clear-env that deletes environment variables, and starts up new subshells. It has saved my bacon many times.
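
Roughly, the shape of clear-env (a reconstruction, not the script itself):

#!/bin/bash
# clear-env, approximately: start a fresh login shell with a minimal
# whitelist of variables instead of whatever the session has accumulated.
exec env -i \
    HOME="$HOME" TERM="$TERM" USER="$USER" \
    PATH=/usr/bin:/bin \
    "${SHELL:-/bin/sh}" -l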

However, today I am running into problems that depend on running exactly the site-standard initialization files, .login and .cshrc, before running any other csh-&**^&**^&^-source modules.

Testing: just (start) doing it

Trying to test something that I don't know how to automate. In fact, my goal is to automate it - but I can't automate it enough even to run the smallest self-checking automated test. The blinking procedure only works, at all, interactively.

So, do what I can:

I can't run the program under test automatically. (Yet: I hope to change this.)

But the program under test does leave some output files, etc.

Therefore, I can automate CHECKING the output of the manual test.
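
Even that much is worth making a real script. A sketch, where every file name is a placeholder:

#!/bin/bash
# The test itself is run by hand; the verdict is computed mechanically.
fail() { echo "FAIL: $*" >&2; exit 1; }
[ -s output/run.log ]                        || fail "run.log missing or empty"
grep -q 'run completed' output/run.log       || fail "no completion marker"
cmp -s output/result.dat expected/result.dat || fail "result differs from golden copy"
echo PASS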

Perhaps in a while I will be able to automate the whole thing.

--


It is amazing the sense of ease that even this small degree of automation brings.

Tuesday, November 08, 2011

hgignore

hgignore:


Most version control tools have an ignore file or ignore command - .cvsignore, .hgignore, svn ignore, bzr ignore, etc.

All those I am aware of ignore files based on pattern matching on the filename. E.g. in Mercurial:
An untracked file is ignored if its path relative to the repository root directory, or any prefix path of that path, is matched against any pattern in .hgignore.
I would like to extend this to be able to do simple queries.

E.g. I usually have an ignore rule that looks something like
*.novc/
I.e. I ignore directories that are suffixed .novc (for no version control).

This works fine, but is somewhat verbose. Plus, it gets in the way when certain tools have other conventions about names.

I should like to get the .novc directive out of the filename, and into a placeholder file in the directory.

E.g. if .../path/.novc exists, then ignore .../path
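
Absent such support in the tool, it can be approximated by regenerating the ignore list from the tree; a sketch using GNU find's -printf and Mercurial's glob syntax:

#!/bin/bash
# Emit a glob-syntax ignore entry for every directory that contains a
# .novc placeholder file; redirect into .hgignore as appropriate.
echo 'syntax: glob'
find . -name .novc -printf '%h/**\n' | sed 's|^\./||'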

Q: is it there already, and I just do not know?